CAMI tutorial 1

1. Introduction

This exercise is designed to familiarize you with simulating community assembly data in CAMI, working with the randomForest R package, and building a community assembly model classifier.

We have already installed CAMI in the R environment available on Abel, so we will not need to deal with installation today; installation instructions are available if you need them another time. To work with CAMI on the cluster, we are going to use the same jupyter notebook job we began running this morning. We will begin by logging in to Abel with the ssh command, using your username at abel.uio.no. Once logged in, make a new directory to hold all the scripts we will create using CAMI.

a) abel and jupyter notebooks

Check that the jupyter notebook job is still running by running the following command in a terminal connected to Abel.

squeue -u <username>

You should see the information associated with the job you submitted earlier. If not, go here for help.

In another terminal on your local device, make sure the ssh tunnel is still running between your local device and the cluster.

ps -ef | grep ssh | grep abel

If it is running, you should see a line containing ssh -N -f -L. If not, re-connect the ssh tunnel.

ssh -N -f -L <my_port_num>:<compute_node>:<my_port_num> <username>@abel.uio.no

Navigate to http://localhost:<my_port_num> in a web browser.

If you’re having issues with any part of this, see the troubleshooting documentation.

b) R in jupyter notebooks

Once inside your notebook directory, navigate to the CAMI directory you created earlier. In here, create a new jupyter notebook using the New drop-down list in the upper-right side of the screen. Select a Python 3 notebook.

To run R through jupyter notebooks, we will use rpy2, which is an interface to R running embedded in a Python process (see more on rpy2). For R to work in this notebook, the first cell must load rpy2.

%load_ext rpy2.ipython

Every cell thereafter must begin with %%R; otherwise the code in that cell will not be run as R and an error will occur. You could also run CAMI in a terminal within jupyter notebooks (select Terminal in the drop-down menu instead of Python 3). However, unlike the terminal, jupyter notebooks enable interactive plotting.

Load CAMI and review the available functions along with their help documentation.

%%R
library(CAMI)

To see the functions available in CAMI, or any R package for that matter, you can type the package name followed by :: and press tab to auto-complete. If we do this for CAMI, CAMI::, we can see that there are three functions. Use ? before the name of each function to see its help documentation.
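For example (assuming CAMI has already been loaded with library(CAMI)), base R's ls() will also list a package's exported functions, and ? opens a help page:

```r
%%R
## list the functions exported by the CAMI package
ls("package:CAMI")

## open the help page for the simulation function
?SimCommunityAssembly
```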

For full instructions on how to install CAMI, see CAMI installation.

2. Simulate Data

To simulate community assembly data we will be using the SimCommunityAssembly() function. This function simulates phylogenetic and phenotypic community assembly data in a species-based model. The function takes several parameters, which can be seen in the help documentation using ?SimCommunityAssembly. This function can be used to simulate 1 to many community assembly datasets, set using the sims argument.

For a single simulation, first, a regional community phylogeny is simulated under a constant birth-death process. CAMI uses the function sim.bd.taxa in the R package TreeSim (Stadler 2011).

  • N determines the size of the regional species pools
  • lambda is the speciation rate in species/million years
  • eps is the extinction fraction, which is used to calculate the extinction rate from the speciation rate

Trait evolution is then modeled along this phylogeny under one of two models of trait evolution, Brownian Motion (BM; Felsenstein 1985) and Ornstein-Uhlenbeck (OU; Hansen 1997, Butler & King 2004), determined by the traitsim argument.

  • traitsim either “BM” or “OU”
  • sig2 is the instantaneous rate of character change for either model
  • alpha controls the strength of pull to the trait optimum and is for OU models only

Once the regional community pool exists, the species are assembled into local communities under one of three assembly models (neutral, habitat filtering, or competitive exclusion), set using the comsim argument. The non-neutral assembly models use the phenotypic variation in the regional community, together with phenotypic matching and repulsion equations (for the filtering and competition models, respectively), to determine which species will survive in the local community.

  • comsim either “neutral”, “filtering”, “competition”, or “all”
  • tau controls the strength of the non-neutral assembly processes

For filtering models, the lower tau is, the stronger the environmental filter. For competition models, the higher tau, the stronger the effects of competition.
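To build intuition for the direction of the tau effect, here is a hypothetical phenotypic-matching sketch (an illustrative Gaussian form, not necessarily CAMI's exact internal equation; see ?SimCommunityAssembly for the real one): survival probability decays with a species' trait distance from the environmental optimum, and a smaller tau makes that decay steeper.

```r
## hypothetical matching rule for habitat filtering (illustrative only;
## CAMI's actual equations are given in its documentation)
matching_prob <- function(trait, optimum, tau) {
  exp(-(trait - optimum)^2 / tau)
}

## a species 2 trait units from the optimum:
matching_prob(2, 0, tau = 1)   # low tau: strong filter, low survival probability
matching_prob(2, 0, tau = 30)  # high tau: weak filter, survival probability near 1
```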

For all of these parameters, a uniform prior distribution of parameter values or a fixed value can be designated. Many of the parameters have default uniform prior distributions, see the help page for details on each parameter.

We will start with one simulation under a neutral model of assembly and BM trait evolution. The regional community pool will consist of 100 species and the local community pool will only have 50. All other parameters will be kept under their default uniform prior ranges.

%%R
## simulate a single community dataset under a neutral assembly model
## leave lambda, eps, sig2, alpha, and tau with their default prior ranges
Assembly.Data <- SimCommunityAssembly(sims = 1, 
                                     N = 100, 
                                     local = 50,
                                     traitsim = "BM",
                                     comsim = "neutral",
                                     ## do you want summary stats calculated?
                                     output.sum.stats = TRUE,
                                     ## do you want dispersion stats calculated?
                                     output.phydisp.stats = FALSE)

Note the additional flag options. If dispersion metrics should be calculated for each simulated community, that flag can be turned on. In addition to that, the summary statistics flag can be turned off.

The output is a list of two data frames: the first contains all parameters and the second contains all summary statistics. In both, each row is a unique simulation, and rows correspond between the two data frames. The $ operator can be used to look at each data frame within the list.

%%R
Assembly.Data
## $params
##   sim  comsim traitsim   N local lambda     mu   sig2  alpha    tau
## 1   1 neutral       BM 100    50  0.059 0.0261 2.6355 0.0116 13.191
## 
## $summary.stats
##       Mean.BL   Var.BL  Mean.Tr   Var.Tr    Moran.I      Age   Colless
## [1,] 15.97664 224.7605 15.42303 173.8518 -0.0149003 102.6825 0.1496599
##      Sackin       nLTT     Msig      Cvar         Svar         Shgt
## [1,]    384 0.09249637 2.761319 -7.202783 -0.001419038 0.0004673834
##           Dcdf        Skw      Krts  MnRegTr VrRegTr  MnTrDif   VrTrDif
## [1,] 0.2073469 -0.8818318 0.9240592 18.16064 149.538 2.737606 -24.31383
##       MnRegBl VarRegBl   MnBlDif  VarBlDif      Amp1     Amp2    BiCoef
## [1,] 11.01841 161.7632 -4.958229 -62.99729 -17.97943 16.63985 0.4317901
##       BiRatio Mode1 Mode2
## [1,] 8.190638    15     5

We will look at this output more in depth momentarily. First, let’s simulate much more community data.

Simulate 500 community datasets under all assembly models. This time we will put prior ranges on the sizes of both the regional and local species pools, and designate explicit prior ranges for the remaining parameters.

%%R
## simulate 500 community datasets under all models of assembly
## set explicit prior ranges for lambda, eps, sig2, alpha, and tau
Assembly.Data <- SimCommunityAssembly(sims = 500, 
                                     N = c(100,200), 
                                     local = c(20, 70),
                                     traitsim = "BM",
                                     comsim = "all",
                                     lambda = c(0.05, 2.0),
                                     eps = c(0.2, 0.8),
                                     sig2 = c(.05, 3.0),
                                     alpha = c(0.01, 0.2),
                                     tau = c(1, 30))

a) parameters

The parameters are important to record for each simulation because, much like in ABC, randomForest relies on the known parameter values of the simulations to estimate these parameters for empirical data.

%%R
## the first matrix is the parameters for each simulation
head(Assembly.Data$params)
##   sim      comsim traitsim   N local lambda     mu   sig2  alpha     tau
## 1   1     neutral       BM 157    28 0.2929 0.0944 0.1642 0.1269 21.8167
## 2   2     neutral       BM 101    36 0.8318 0.2871 1.9829 0.0960 15.1303
## 3   3     neutral       BM 117    64 1.9354 0.9437 0.6315 0.1870 15.3789
## 4   4 competition       BM 193    47 1.1988 0.5286 1.0185 0.1268 14.0501
## 5   5 competition       BM 190    54 1.1928 0.2570 2.7769 0.0674 11.0032
## 6   6     neutral       BM 118    60 1.2546 0.3420 0.2423 0.0404  3.3445

Note that we record the extinction rate as mu rather than the extinction fraction eps.

As a sanity check, let’s make sure that we are actually simulating under each model and verify that the parameters drawn match the prior distributions that were designated.

%%R
## this line lists the models used for the first 50 simulations
Assembly.Data$params$comsim[1:50]

## see how many of each model were simulated; change "neutral" to "filtering" and "competition"
sum(Assembly.Data$params$comsim=="neutral")
##  [1] "neutral"     "neutral"     "neutral"     "competition" "competition"
##  [6] "neutral"     "filtering"   "neutral"     "neutral"     "neutral"    
## [11] "filtering"   "neutral"     "neutral"     "neutral"     "neutral"    
## [16] "filtering"   "filtering"   "neutral"     "filtering"   "filtering"  
## [21] "neutral"     "competition" "competition" "neutral"     "competition"
## [26] "competition" "filtering"   "neutral"     "competition" "filtering"  
## [31] "competition" "filtering"   "competition" "competition" "filtering"  
## [36] "competition" "filtering"   "competition" "filtering"   "neutral"    
## [41] "neutral"     "competition" "neutral"     "competition" "filtering"  
## [46] "filtering"   "neutral"     "competition" "neutral"     "neutral"
## [1] 172

The number of simulations per model will differ for everyone, but hopefully each model has been simulated roughly 1/3 of the time. It is important that each model is equally represented in the training dataset in order to avoid a biased training dataset and, ultimately, a biased forest.
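Rather than summing each model separately, base R's table() tallies all three at once (assuming the Assembly.Data object simulated earlier):

```r
%%R
## counts and proportions of simulations per assembly model
counts <- table(Assembly.Data$params$comsim)
counts
round(counts / sum(counts), 2)  # each should be roughly 1/3
```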

To verify the prior distributions, we can plot a histogram of the values used for a given parameter across all simulations and check that it matches the designated prior. Let’s look at the sig2 parameter, which we gave a uniform prior between 0.05 and 3.0.

%%R
library(ggplot2)
ggplot(data=Assembly.Data$params, aes(x=sig2)) + 
  geom_histogram(binwidth = 0.1) +
  xlab("sig2")

We should see a uniform distribution between the upper and lower bound set. Have a look at some of the other parameters and verify the values simulated under match what was set.

b) summary statistics

The summary statistics capture information about the phylogenetic and phenotypic variation across the regional and local community, as well as the interactions between the phylogenetic and phenotypic information in the local community. Go here for a short description of each summary statistic.

%%R
head(Assembly.Data$summary.stats)
##        Mean.BL     Var.BL    Mean.Tr     Var.Tr      Moran.I       Age
## [1,] 4.1829967 10.9563572  0.4459810  3.4079690 -0.032986215 15.355239
## [2,] 1.4724120  1.5286311 -4.2764583 12.3738968 -0.050420048  8.056567
## [3,] 0.5354148  0.3037981 -0.6082282  2.2198399 -0.022676439  4.421203
## [4,] 1.2314329  1.5305009  1.4365781 13.7181282 -0.006361592  7.681965
## [5,] 0.8282950  0.5285885  0.2022670 19.1669278  0.006264184  4.683508
## [6,] 0.6656100  0.4884092 -0.8134269  0.7118276  0.004340900  4.312135
##         Colless Sackin       nLTT      Msig       Cvar        Svar
## [1,] 0.13960114    153 0.37966532 0.2336396   6.493188  0.02772688
## [2,] 0.13949580    219 0.15453967 2.0555128   3.718787  0.02085285
## [3,] 0.08602151    462 0.05794164 0.9365303   7.869348 -0.01494577
## [4,] 0.11111111    311 0.16051472 1.4078031 -16.743998  0.03997067
## [5,] 0.11320755    388 0.15439233 2.8824283   6.854969  0.27807433
## [6,] 0.12682642    471 0.08493627 0.2320015 -10.529286  0.02402632
##             Shgt       Dcdf        Skw       Krts    MnRegTr    VrRegTr
## [1,] -0.06438150 0.24735450 -0.6532330 -0.7789171  0.7400729  2.2568161
## [2,]  0.02050337 0.27460317  0.4536860 -0.5511573 -4.0098554 13.8183082
## [3,] -0.04848929 0.12375992  0.1439697 -0.7312021 -0.5982307  2.1947452
## [4,] -0.02514935 0.20259019 -0.3066266 -1.5823521  1.4978095  6.0637442
## [5,] -0.21436576 0.13731656  0.2925661 -1.4253231  0.3973314 11.9544410
## [6,] -0.07647524 0.08220339  0.7637142  0.1069851 -0.8314698  0.8119077
##           MnTrDif     VrTrDif   MnRegBl  VarRegBl    MnBlDif   VarBlDif
## [1,]  0.294091903 -1.15115294 1.9310234 5.2139365 -2.2519733 -5.7424206
## [2,]  0.266602944  1.44441142 0.8170718 0.8040395 -0.6553402 -0.7245916
## [3,]  0.009997534 -0.02509467 0.4024024 0.2025323 -0.1330124 -0.1012658
## [4,]  0.061231379 -7.65438395 0.5917640 0.5037069 -0.6396688 -1.0267940
## [5,]  0.195064442 -7.21248673 0.4537958 0.2146349 -0.3744992 -0.3139535
## [6,] -0.018042915  0.10008018 0.4754172 0.2653841 -0.1901929 -0.2230252
##            Amp1        Amp2    BiCoef   BiRatio Mode1 Mode2
## [1,] -2.1123160 0.373064303 0.5517710 0.5600485     0    10
## [2,] -3.1506458 0.114780635 0.4426296 0.0000000    -2     8
## [3,] -1.9167600 0.001533011 0.4222886 1.3351484     0    30
## [4,] -2.6587590 4.022132903 0.6737785 1.4928994     5     7
## [5,] -3.1182512 4.093522492 0.6195332 0.8292886     4     8
## [6,] -0.9308271 0.460414630 0.4848018 0.0000000    -1    28

We can look at the distribution of values for each summary statistic and partition them by model.

%%R
## have a look at all of the summary statistics
colnames(Assembly.Data$summary.stats) 

## replace "Shgt" below with the name of any other summary statistic
stats.by.model <- data.frame(model=Assembly.Data$params$comsim, 
                  stat = Assembly.Data$summary.stats[,"Shgt"])

ggplot(stats.by.model, aes(x=stat, fill=model)) +
  geom_density(alpha=0.8) +
  scale_fill_manual(values=c("#005679", "#00B7C4", "#FF8800")) 
##  [1] "Mean.BL"  "Var.BL"   "Mean.Tr"  "Var.Tr"   "Moran.I"  "Age"     
##  [7] "Colless"  "Sackin"   "nLTT"     "Msig"     "Cvar"     "Svar"    
## [13] "Shgt"     "Dcdf"     "Skw"      "Krts"     "MnRegTr"  "VrRegTr" 
## [19] "MnTrDif"  "VrTrDif"  "MnRegBl"  "VarRegBl" "MnBlDif"  "VarBlDif"
## [25] "Amp1"     "Amp2"     "BiCoef"   "BiRatio"  "Mode1"    "Mode2"

3. Random Forest

The randomForest package implements Breiman’s ensemble random forest algorithm for classification and regression (Breiman 2001). Ensemble-ing is a “type of supervised learning technique where multiple models are trained on a training dataset and their individual outputs are combined by some rule to derive the final output.” Essentially, training data are used to build many classification trees and the predictions from each tree are combined for a final inference. The “rule” that governs how outputs are combined is most commonly averaging (for numerical data) or votes (for categorical). In our case, the predictions are model identities.
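For a categorical response like ours, the combining rule is a majority vote across trees. A toy sketch of the idea (purely illustrative, not randomForest internals):

```r
## three hypothetical "trees" each vote for an assembly model;
## the ensemble prediction is the modal vote
votes <- c("neutral", "filtering", "neutral")
names(which.max(table(votes)))  # "neutral"
```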

For more information on the randomForest R package and uses go here.

Load the randomForest package.

%%R
library(randomForest)

To construct a classifier, the summary statistics are the predictor variables of the model and are what is used to construct the decision trees. The response variables are the model identities the summary statistics correspond to, and every decision tree ends in a model classification.

Let’s now build a random forest classifier with the community assembly data we have simulated, and then dive into more properties of the classifier.

%%R
## The response variables
modelIndex <- Assembly.Data$params[,"comsim"]

## Organize the predictor variables with the response
ReferenceTable <- data.frame(Assembly.Data$summary.stats, modelIndex)

## build the classifier using all predictors
RF.model <- randomForest(modelIndex ~., data=ReferenceTable, ntree=1000, importance=T)

a) error rates

As mentioned before, RF is a supervised machine learning algorithm, meaning that we rely on training data to define our model in order to make accurate predictions. A major concern when building these classifiers is that the data are not sufficient to distinguish between the models. RF internally validates the accuracy of any given classifier by withholding some of the provided data to cross-validate the decision trees while they are being built. Essentially, we can see how often each model is incorrectly classified, known as the out-of-bag (OOB) error rate.

%%R
RF.model
## 
## Call:
##  randomForest(formula = modelIndex ~ ., data = ReferenceTable,      ntree = 1000, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 18.6%
## Confusion matrix:
##             competition filtering neutral class.error
## competition         151         0      12  0.07361963
## filtering             3       129      33  0.21818182
## neutral              12        33     127  0.26162791

The error rates for a classifier are not always final. Classifiers can be tuned to improve classification accuracy; we will learn more about this tomorrow. For now, though, we can plot the forest object and see how the error rates change as trees are added to the forest. One of the biggest advantages of the ensemble approach is that it adds predictive power to the forest, because individual decision trees are weak predictors on their own.

%%R
plot(RF.model, col=c( "gray50", "#005679","#00B7C4", "#FF8800"), lwd=2)
legend("topright", legend=c("filtering", "competition", "neutral", "average"), fill=c("#00B7C4", "#005679", "#FF8800", "gray50"), bty="n", cex=1.2)
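Although we will work with empirical data later, the classifier can already be used to classify new data. As a sketch (assuming a fresh dataset simulated with SimCommunityAssembly() as above), randomForest's predict() returns the inferred assembly model, and type = "prob" returns the fraction of trees voting for each model:

```r
%%R
## simulate one new "pseudo-observed" community and classify it
new.data <- SimCommunityAssembly(sims = 1, N = 100, local = 50,
                                 traitsim = "BM", comsim = "filtering")

## predicted model identity and per-model vote proportions
predict(RF.model, newdata = data.frame(new.data$summary.stats))
predict(RF.model, newdata = data.frame(new.data$summary.stats), type = "prob")
```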

4. References

Breiman, L. 2001. Random Forests. Machine Learning, 45, 5-32. link

Butler, M.A. & King, A.A. 2004. Phylogenetic Comparative Analysis: A Modeling Approach for Adaptive Evolution. Am. Nat.,164, 683-695. link

Felsenstein, J. 1985. Phylogenies and the Comparative Method. Am. Nat. 125, 1-15. link

Hansen, T.F. 1997. Stabilizing Selection and the Comparative Analysis of Adaptation. Evolution. 51, 1341-1351. link

Stadler, T. 2011. Simulating trees with a fixed number of extant species. Syst. Biol., 60, 676–684. link