This exercise is designed to familiarize you with simulating community assembly data in CAMI, the randomForest
R package, and building a community assembly model classifier.
We have already installed CAMI in the R environment available on Abel, so we will not need to mess with installation today; installation instructions are available if needed for another time. To work with CAMI on the cluster, we are going to use the same jupyter notebook job we began running this morning. We will begin by logging in to Abel using the ssh
command and your username @ abel.uio.no
. Once logged in, make a new directory to hold all the scripts we will create using CAMI.
Check that the jupyter notebook job is still running by running the following in a terminal connected to Abel.
squeue -u <username>
You should see the information associated with the job you submitted earlier. If not, go here for help.
In another terminal on your local device, make sure the ssh tunnel is still running between your local device and the cluster.
ps -ef | grep ssh | grep abel
If it is running you should see a line with ssh -N -f -L
. If not, re-connect the ssh tunnel.
ssh -N -f -L <my_port_num>:<compute_node>:<my_port_num> <username>@abel.uio.no
Navigate to http://localhost:<my_port_num>
in a web browser.
If you’re having issues with any part of this, see the troubleshooting documentation.
Once inside your notebook directory, navigate to the CAMI directory you created earlier. In here, create a new jupyter notebook using the new
drop-down list in the upper-right side of the screen. Select a Python 3
notebook.
To run R through jupyter notebooks, we will use rpy2
, which is an interface to R running embedded in a Python process (see more on rpy2). For R to work in this notebook, the first cell must load rpy2.
%load_ext rpy2.ipython
And every cell thereafter must begin with %%R
, otherwise the code in that cell will not be run as R and an error will occur. You could also run CAMI in a terminal in jupyter notebooks (select terminal
in the drop-down menu instead of Python 3
). However, unlike the terminal, jupyter notebooks enable interactive plotting.
Load CAMI and review the available functions along with their help documentation.
%%R
library(CAMI)
To see the functions available in CAMI, or any R package for that matter, you can type the package name followed by ::
and use tab completion. If we do this for CAMI, typing CAMI::
, we can see that there are three functions. Use ?
before the name of each function to see its help documentation.
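The same listing can be done programmatically: base R's ls() on a package's search-path entry reports the exported functions of any attached package (shown here for CAMI, which must already be loaded).

```r
## in a new %%R cell: list the functions exported by the attached CAMI package
ls("package:CAMI")
```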
For full instructions on how to install CAMI, see CAMI installation.
To simulate community assembly data we will be using the SimCommunityAssembly()
function. This function simulates phylogenetic and phenotypic community assembly data in a species-based model. The function takes several parameters, which can be seen in the help documentation using ?SimCommunityAssembly
. This function can be used to simulate 1 to many community assembly datasets, set using the sims
argument.
For a single simulation, first, a regional community phylogeny is simulated under a constant birth-death process. CAMI
uses the function sim.bd.taxa
in the R package TreeSim
(Stadler 2011).
- N determines the size of the regional species pool
- lambda is the speciation rate in species/million years
- eps is the extinction fraction, which is used to calculate the extinction rate from the speciation rate

Trait evolution is then modeled along this phylogeny under one of two models of trait evolution, Brownian Motion (BM; Felsenstein 1985) and Ornstein-Uhlenbeck (OU; Hansen 1997, Butler & King 2004), determined by the traitsim
argument.
- traitsim is either “BM” or “OU”
- sig2 is the instantaneous rate of character change for either model
- alpha controls the strength of pull to the trait optimum and is used for OU models only

Once the regional community pool exists, the species are assembled into local communities under one of three assembly models, neutral, habitat filtering, and competitive exclusion, set using the comsim
argument. The non-neutral assembly models use the phenotypic variation in the regional community and phenotypic matching and repulsion equations, for the filtering and competition models respectively, to determine which species will survive in the local community.
- comsim is either “neutral”, “filtering”, “competition”, or “all”
- tau controls the strength of the non-neutral assembly processes

For filtering models, the lower tau
is, the stronger the environmental filter. For competition models, the higher tau
is, the stronger the effects of competition.
For all of these parameters, a uniform prior distribution of parameter values or a fixed value can be designated. Many of the parameters have default uniform prior distributions, see the help page for details on each parameter.
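For instance (an illustrative sketch using the arguments described above), passing a single value fixes a parameter across all simulations, while passing a length-two vector sets the lower and upper bounds of a uniform prior:

```r
## in a new %%R cell:
## fixed value -- every simulation uses tau = 5
fixed.sims <- SimCommunityAssembly(sims = 2, N = 100, local = 50,
                                   traitsim = "BM", comsim = "filtering",
                                   tau = 5)

## uniform prior -- tau is drawn from U(1, 30) for each simulation
prior.sims <- SimCommunityAssembly(sims = 2, N = 100, local = 50,
                                   traitsim = "BM", comsim = "filtering",
                                   tau = c(1, 30))
```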
We will start with one simulation under a neutral model of assembly and BM trait evolution. The regional community pool will consist of 100 species and the local community pool will only have 50. All other parameters will be kept under their default uniform prior ranges.
## simulate 1 community dataset under a neutral assembly model
## leave lambda, eps, sig2, and tau with their default prior ranges
%%R
Assembly.Data <- SimCommunityAssembly(sims = 1,
N = 100,
local = 50,
traitsim = "BM",
comsim = "neutral",
## do you want summary stats calculated?
output.sum.stats = TRUE,
## do you want dispersion stats calculated?
output.phydisp.stats = FALSE)
Note the additional flag options: dispersion metrics can be calculated for each simulated community by turning that flag on, and the summary statistics can likewise be turned off.
The output is a list of two data frames. The first contains all parameters and the second contains all summary statistics. In both cases, each row is a unique simulation, and the rows of the two data frames correspond. The $
operator can be used to look at each data frame within the list.
%%R
Assembly.Data
## $params
## sim comsim traitsim N local lambda mu sig2 alpha tau
## 1 1 neutral BM 100 50 0.059 0.0261 2.6355 0.0116 13.191
##
## $summary.stats
## Mean.BL Var.BL Mean.Tr Var.Tr Moran.I Age Colless
## [1,] 15.97664 224.7605 15.42303 173.8518 -0.0149003 102.6825 0.1496599
## Sackin nLTT Msig Cvar Svar Shgt
## [1,] 384 0.09249637 2.761319 -7.202783 -0.001419038 0.0004673834
## Dcdf Skw Krts MnRegTr VrRegTr MnTrDif VrTrDif
## [1,] 0.2073469 -0.8818318 0.9240592 18.16064 149.538 2.737606 -24.31383
## MnRegBl VarRegBl MnBlDif VarBlDif Amp1 Amp2 BiCoef
## [1,] 11.01841 161.7632 -4.958229 -62.99729 -17.97943 16.63985 0.4317901
## BiRatio Mode1 Mode2
## [1,] 8.190638 15 5
We will look at this output more in depth momentarily. First, let’s simulate much more community data.
Simulate 500 community datasets under all assembly models. This time we will put prior ranges on the sizes of both the regional and the local species pools, and set explicit prior ranges for the remaining parameters.
## simulate 500 community datasets under all models of assembly
## set explicit prior ranges for lambda, eps, sig2, alpha, and tau
%%R
Assembly.Data <- SimCommunityAssembly(sims = 500,
N = c(100,200),
local = c(20, 70),
traitsim = "BM",
comsim = "all",
lambda = c(0.05, 2.0),
eps = c(0.2, 0.8),
sig2 = c(.05, 3.0),
alpha = c(0.01, 0.2),
tau = c(1, 30))
The parameters are important to record for each simulation because, much like in ABC, randomForest
relies on the known parameter values of the simulations to estimate these parameters for empirical data.
%%R
## the first matrix is the parameters for each simulation
head(Assembly.Data$params)
## sim comsim traitsim N local lambda mu sig2 alpha tau
## 1 1 neutral BM 157 28 0.2929 0.0944 0.1642 0.1269 21.8167
## 2 2 neutral BM 101 36 0.8318 0.2871 1.9829 0.0960 15.1303
## 3 3 neutral BM 117 64 1.9354 0.9437 0.6315 0.1870 15.3789
## 4 4 competition BM 193 47 1.1988 0.5286 1.0185 0.1268 14.0501
## 5 5 competition BM 190 54 1.1928 0.2570 2.7769 0.0674 11.0032
## 6 6 neutral BM 118 60 1.2546 0.3420 0.2423 0.0404 3.3445
Note that we record the extinction rate mu
rather than the extinction fraction eps
.
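Assuming the extinction fraction is defined as eps = mu / lambda (the standard definition; check the CAMI documentation), we can recover eps from the recorded rates and confirm it falls inside the prior bounds we set:

```r
## in a new %%R cell: recover eps = mu / lambda from the recorded parameters
eps.recovered <- Assembly.Data$params$mu / Assembly.Data$params$lambda
## should lie within the prior bounds c(0.2, 0.8) set above
range(eps.recovered)
```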
As a sanity check, let’s make sure that we are actually simulating under each model and verify that the parameters drawn match the prior distributions that were designated.
%%R
## this line lists the first 50 models simulated under
Assembly.Data$params$comsim[1:50]
## see how many of each model were simulated; change "neutral" to "filtering" and "competition"
sum(Assembly.Data$params$comsim=="neutral")
## [1] "neutral" "neutral" "neutral" "competition" "competition"
## [6] "neutral" "filtering" "neutral" "neutral" "neutral"
## [11] "filtering" "neutral" "neutral" "neutral" "neutral"
## [16] "filtering" "filtering" "neutral" "filtering" "filtering"
## [21] "neutral" "competition" "competition" "neutral" "competition"
## [26] "competition" "filtering" "neutral" "competition" "filtering"
## [31] "competition" "filtering" "competition" "competition" "filtering"
## [36] "competition" "filtering" "competition" "filtering" "neutral"
## [41] "neutral" "competition" "neutral" "competition" "filtering"
## [46] "filtering" "neutral" "competition" "neutral" "neutral"
## [1] 172
The number of simulations per model will be different for everyone, but each model should have been simulated roughly one third of the time. It is important that each model is equally represented in the training dataset in order to avoid a biased training dataset and, ultimately, a biased forest.
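Rather than calling sum() once per model, base R's table() tallies all three models in a single call:

```r
## in a new %%R cell: count the simulations per assembly model in one call
table(Assembly.Data$params$comsim)
```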
To verify the prior distributions, we can plot a histogram of all of the values used for the simulations for a given parameter and check that it matches the prior distribution designated. Let’s look at the sig2
parameter, which we gave a uniform prior between 0.05 and 3.0.
%%R
library(ggplot2)
ggplot(data = Assembly.Data$params, aes(x = sig2)) +
  geom_histogram(binwidth = 0.1) +
  xlab("sig2")
We should see a uniform distribution between the upper and lower bound set. Have a look at some of the other parameters and verify the values simulated under match what was set.
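To check several parameters at once, one option (a sketch using only base-R reshaping plus the ggplot2 already loaded; column names follow the params table shown above) is to stack the continuous parameters into long format and facet the histograms:

```r
## in a new %%R cell: stack the continuous parameters into long format
par.cols <- c("lambda", "mu", "sig2", "alpha", "tau")
long <- data.frame(
  parameter = rep(par.cols, each = nrow(Assembly.Data$params)),
  value = unlist(Assembly.Data$params[par.cols], use.names = FALSE)
)
## one histogram panel per parameter, each with its own x-axis range
ggplot(long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ parameter, scales = "free_x")
```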
The summary statistics capture information about the phylogenetic and phenotypic variation across the regional and local community, as well as the interactions between the phylogenetic and phenotypic information in the local community. Go here for a short description of each summary statistic.
%%R
head(Assembly.Data$summary.stats)
## Mean.BL Var.BL Mean.Tr Var.Tr Moran.I Age
## [1,] 4.1829967 10.9563572 0.4459810 3.4079690 -0.032986215 15.355239
## [2,] 1.4724120 1.5286311 -4.2764583 12.3738968 -0.050420048 8.056567
## [3,] 0.5354148 0.3037981 -0.6082282 2.2198399 -0.022676439 4.421203
## [4,] 1.2314329 1.5305009 1.4365781 13.7181282 -0.006361592 7.681965
## [5,] 0.8282950 0.5285885 0.2022670 19.1669278 0.006264184 4.683508
## [6,] 0.6656100 0.4884092 -0.8134269 0.7118276 0.004340900 4.312135
## Colless Sackin nLTT Msig Cvar Svar
## [1,] 0.13960114 153 0.37966532 0.2336396 6.493188 0.02772688
## [2,] 0.13949580 219 0.15453967 2.0555128 3.718787 0.02085285
## [3,] 0.08602151 462 0.05794164 0.9365303 7.869348 -0.01494577
## [4,] 0.11111111 311 0.16051472 1.4078031 -16.743998 0.03997067
## [5,] 0.11320755 388 0.15439233 2.8824283 6.854969 0.27807433
## [6,] 0.12682642 471 0.08493627 0.2320015 -10.529286 0.02402632
## Shgt Dcdf Skw Krts MnRegTr VrRegTr
## [1,] -0.06438150 0.24735450 -0.6532330 -0.7789171 0.7400729 2.2568161
## [2,] 0.02050337 0.27460317 0.4536860 -0.5511573 -4.0098554 13.8183082
## [3,] -0.04848929 0.12375992 0.1439697 -0.7312021 -0.5982307 2.1947452
## [4,] -0.02514935 0.20259019 -0.3066266 -1.5823521 1.4978095 6.0637442
## [5,] -0.21436576 0.13731656 0.2925661 -1.4253231 0.3973314 11.9544410
## [6,] -0.07647524 0.08220339 0.7637142 0.1069851 -0.8314698 0.8119077
## MnTrDif VrTrDif MnRegBl VarRegBl MnBlDif VarBlDif
## [1,] 0.294091903 -1.15115294 1.9310234 5.2139365 -2.2519733 -5.7424206
## [2,] 0.266602944 1.44441142 0.8170718 0.8040395 -0.6553402 -0.7245916
## [3,] 0.009997534 -0.02509467 0.4024024 0.2025323 -0.1330124 -0.1012658
## [4,] 0.061231379 -7.65438395 0.5917640 0.5037069 -0.6396688 -1.0267940
## [5,] 0.195064442 -7.21248673 0.4537958 0.2146349 -0.3744992 -0.3139535
## [6,] -0.018042915 0.10008018 0.4754172 0.2653841 -0.1901929 -0.2230252
## Amp1 Amp2 BiCoef BiRatio Mode1 Mode2
## [1,] -2.1123160 0.373064303 0.5517710 0.5600485 0 10
## [2,] -3.1506458 0.114780635 0.4426296 0.0000000 -2 8
## [3,] -1.9167600 0.001533011 0.4222886 1.3351484 0 30
## [4,] -2.6587590 4.022132903 0.6737785 1.4928994 5 7
## [5,] -3.1182512 4.093522492 0.6195332 0.8292886 4 8
## [6,] -0.9308271 0.460414630 0.4848018 0.0000000 -1 28
We can look at the distribution of values for each summary statistic and partition them by model.
%%R
## have a look at all of the summary statistics
colnames(Assembly.Data$summary.stats)
## replace "Shgt" below with the name of another summary statistic
stats.by.model <- data.frame(model=Assembly.Data$params$comsim,
stat = Assembly.Data$summary.stats[,"Shgt"])
ggplot(stats.by.model, aes(x=stat, fill=model)) +
geom_density(alpha=0.8) +
scale_fill_manual(values=c("#005679", "#00B7C4", "#FF8800"))
## [1] "Mean.BL" "Var.BL" "Mean.Tr" "Var.Tr" "Moran.I" "Age"
## [7] "Colless" "Sackin" "nLTT" "Msig" "Cvar" "Svar"
## [13] "Shgt" "Dcdf" "Skw" "Krts" "MnRegTr" "VrRegTr"
## [19] "MnTrDif" "VrTrDif" "MnRegBl" "VarRegBl" "MnBlDif" "VarBlDif"
## [25] "Amp1" "Amp2" "BiCoef" "BiRatio" "Mode1" "Mode2"
The randomForest
package implements Breiman’s ensemble random forest algorithm for classification and regression (Breiman 2001). Ensembling is a “type of supervised learning technique where multiple models are trained on a training dataset and their individual outputs are combined by some rule to derive the final output.” Essentially, training data are used to build many classification trees and the predictions from each tree are combined for a final inference. The “rule” that governs how outputs are combined is most commonly averaging (for numerical data) or voting (for categorical data). In our case, the predictions are model identities.
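As a toy illustration of the voting rule (plain base R, not part of randomForest), suppose five trees each cast a prediction for one community; the forest reports the majority class:

```r
## hypothetical predictions from five individual trees for one community
tree.votes <- c("neutral", "filtering", "neutral", "neutral", "competition")
## the forest's final prediction is the class with the most votes
names(which.max(table(tree.votes)))
## [1] "neutral"
```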
For more information on the randomForest
R package and its uses, go here.
First, load the randomForest package.
%%R
library(randomForest)
To construct a classifier, the summary statistics are the predictor variables of the model and are what the decision trees are built from. The response variables are the model identities the summary statistics correspond to, and every decision tree ends in a predicted model identity.
Let’s now build a random forest classifier with the community assembly data we have simulated, and then dive into more properties of the classifier.
%%R
## The response variables
modelIndex <- Assembly.Data$params[,"comsim"]
## Organize the predictor variables with the response
ReferenceTable <- data.frame(Assembly.Data$summary.stats, modelIndex)
## build the classifier using all predictors,
RF.model <- randomForest(modelIndex ~., data=ReferenceTable, ntree=1000, importance=T)
As mentioned before, RF is a supervised machine learning algorithm, meaning that we rely on training data to define our model in order to make accurate predictions. A major concern when building these classifiers is that the data may not be sufficient to distinguish between the models. RF internally validates the accuracy of any given classifier by withholding some of the provided data in order to cross-validate the decision trees while they are being built. Essentially, we can see how often each model is incorrectly classified, known as the out-of-bag (OOB) error rate.
%%R
RF.model
##
## Call:
## randomForest(formula = modelIndex ~ ., data = ReferenceTable, ntree = 1000, importance = T)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 18.6%
## Confusion matrix:
## competition filtering neutral class.error
## competition 151 0 12 0.07361963
## filtering 3 129 33 0.21818182
## neutral 12 33 127 0.26162791
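These same quantities live inside the forest object (components documented in ?randomForest): err.rate holds the OOB and per-class error rates after each successive tree, and confusion holds the confusion matrix.

```r
## in a new %%R cell: OOB error rate after the final tree
RF.model$err.rate[nrow(RF.model$err.rate), "OOB"]
## confusion matrix with per-class error rates in the last column
RF.model$confusion
```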
The error rates for a classifier are not always final. Classifiers can be tuned to improve classification accuracy; we will learn more about this tomorrow. For now, though, we can plot the forest object and see how the error rates change as trees are added to the forest. One of the biggest advantages of the ensemble approach is that it adds predictive power to the forest, because individual decision trees are weak predictors on their own.
%%R
plot(RF.model, col=c( "gray50", "#005679","#00B7C4", "#FF8800"), lwd=2)
legend("topright", legend=c("filtering", "competition", "neutral", "average"), fill=c("#00B7C4", "#005679", "#FF8800", "gray50"), bty="n", cex=1.2)
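Since we built the forest with importance = T, we can also ask which summary statistics the classifier relies on most; randomForest's importance() and varImpPlot() report, for each predictor, the mean decrease in accuracy and in Gini index:

```r
## in a new %%R cell: variable importance of each summary statistic
importance(RF.model)
## dotchart of the most informative predictors
varImpPlot(RF.model)
```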
Breiman, L. 2001. Random Forests. Machine Learning, 45, 5-32. link
Butler, M.A. & King, A.A. 2004. Phylogenetic Comparative Analysis: A Modeling Approach for Adaptive Evolution. Am. Nat.,164, 683-695. link
Felsenstein, J. 1985. Phylogenies and the Comparative Method. Am. Nat. 125, 1-15. link
Hansen, T.F. 1997. Stabilizing Selection and the Comparative Analysis of Adaptation. Evolution. 51, 1341-1351. link
Stadler, T. 2011. Simulating trees with a fixed number of extant species. Syst. Biol., 60, 676–684. link