CAMI tutorial 2

1. Background

Marx et al. (2015) surveyed 442 vascular plants species across 80 islands in the San Juan archipelago, which is a series of islands in between Washington state and Vancouver Island, BC, Canada.

Islands that were sampled for this study are colored in red. The three largest unsampled island and the three largest sampled islands are labeled.

Islands that were sampled for this study are colored in red. The three largest unsampled island and the three largest sampled islands are labeled.

In this study, authors also reconstructed the community phylogeny for all of the plant species living across the islands, using a supertree approach.

Trait data were also collected for most species. The coverage of trait data across the phylogeny is indicated by the colors blocks surrounding the tips of the tree.

Trait data were also collected for most species. The coverage of trait data across the phylogeny is indicated by the colors blocks surrounding the tips of the tree.

In their community assembly analyses, Marx et al. (2015) investigated mean pairwise distance (MPD) and mean-nearest taxon distance (MNTD) using phylogenetic information, and then also using the functional trait information. They used these standardized metrics as test statistics for significant over and under-dispersion compared to randomly assembly communities in statistical hypothesis testing.

We advocate that using CAMI is an alternative to statistical hypothesis testing, as inference here is based on simulating data under each of the assembly models and performing model-selection with random forests (RF). In this, we can simultaneously compare the model support for each assembly process.

Below, we go through analyzing Marx et al.’s (2015) data in CAMI. Much like the first tutorial, we will be doing this analysis in R through jupyter notebooks. Login to abel, ensure the ssh connection, and navigate to your notebooks in a web browser. See jupyter notebook help, if stuck. Go to your CAMI directory and make a new python 3 file. Remember to load rpy2 in the first cell of your notebook.

%load_ext rpy2.ipython

2. San Juan Data

a) download data

The data are available on the CompPhylo website on the CAMI [homepage], to download directly to your local device. To download them onto you cluster account, use the linux wget command. We can run the following code to download the data directly in the terminal on abel, or we can run the command in our jupyter notebook. If we run the command in our notebook, we need to add ! to the beginning. This tells the notebook to run the command in the terminal.

## download SanJuan_Data.tar.gz
!wget https://github.com/compphylo/compphylo.github.io/raw/master/Oslo2019/CAMI_files/assets/SanJuan_Data.tar.gz

## open the file
!tar xvzf SanJuan_Data.tar.gz

Try listing the files in the directory through jupyter notebooks by running !ls SanJuan_Data/. You should see three data files; Marx.tre, CommunityDataMatrix.csv, and FunctionalTraitData.csv.

b) load data into R

Phylogenetic Tree

We will first load the community phylogenetic tree constructed by Marx et al. (2015). This tree has 367 of the species identified on the San Juan Islands. The tree is a result of combining publicly available sequence data from GenBank for five gene regions (atpB, rbcL, matK, trnTLF and ITS) using a modified supermatrix method, termed ‘mega-phylogeny’ (Smith et al., 2009) . The phylogenetic tree was inferred using Maximum Likelihood.

## we need this to run R in jupyter notebooks
%%R
library(CAMI)
## phytools has the read.tree() function we need
library(phytools)

regionalTree <-  read.tree("SanJuan_Data/Marx.tre")
regionalTree
## 
## Phylogenetic tree with 367 tips and 366 internal nodes.
## 
## Tip labels:
##  Achillea_millefolium, Anthemis_arvensis, Artemisia_campestris, Artemisia_suksdorfii, Soliva_sessilis, Grindelia_hirsutula, ...
## Node labels:
##  NA, 96, 97, 100, 100, 100, ...
## 
## Rooted; includes branch lengths.

The tree is an object of class phylo. These objects have several features that can be extracted from them; use the attributes() function to see all of them. Then use the $ operator to extract some of the attributes.

%%R
attributes(regionalTree)
## $names
## [1] "edge"        "edge.length" "Nnode"       "node.label"  "tip.label"  
## 
## $class
## [1] "phylo"
## 
## $order
## [1] "cladewise"
%%R
regionalTree$tip.label

We can also plot the object easily using plotTree() from the phytools R package.

%%R
plotTree(regionalTree, type = "fan", fsize=0.1)

Community Data Matrix

Next, we will load the community data matrix which identifies which species are located on which San Juan islands. In this matrix, the rows will the total species pool present on all islands and the columns each represent a different island. Inside the cells is then presence/absence data in the form of 0s and 1s.

%%R
## the community matrix is imported as a data frame object
communityMatrix <- read.csv(file="SanJuan_Data/CommunityDataMatrix.csv", sep=",", header=T, row.names = 1)

## check out how many species are in each community 
## the -1 just says to ignore the first column becuase we already know it is the "all" community
colSums(communityMatrix[,-1])

We see that there are some communities that have very few species, somtimes less than 20. We are going to use all communities with 20 or more species.

%%R
## only keeo communities with 20 or more species
communityMatrix <- communityMatrix[,colSums(communityMatrix) >= 20]

## get rid of the first column that is the "all" community
communityMatrix <- communityMatrix[,-1]

Functional Traits

Finally, we will load the functional trait data frame. Again, in this matrix, each row is a species and each of the columns correspond to a trait. The traits data includes native/invasive status, seed mass (mg), maximum height (m), specific leaf area (SLA, cm2), leaf size (cm2), and leaf N content.

%%R
## the functional trait data is also imported as a data frame object
traitData <-  read.csv(file="SanJuan_Data/FunctionalTraitData.csv", sep=",", header=T, row.names = 1)

head(traitData)
##                       Status Seed.Mass..mg. Maximum.Height..m.
## Abies_grandis              n     17.2154046         68.5714286
## Acer_glabrum               n     36.5803144         10.0000000
## Acer_macrophyllum          n    133.3852804         19.6972850
## Achillea_millefolium       n      0.1462084          0.4615385
## Achnatherum_lemmonii       n      5.6272923          0.9000000
## Acmispon_denticulatus      n      6.0500000                 NA
##                       SLA..cm.2.g. Leaf.Size..cm.2. Leaf.Nitrogen....
## Abies_grandis                   NA        0.3652396                NA
## Acer_glabrum              268.0890       22.5730028                NA
## Acer_macrophyllum         252.1901      128.0844641          2.640000
## Achillea_millefolium      187.2335        5.6710225          2.498042
## Achnatherum_lemmonii            NA               NA                NA
## Acmispon_denticulatus           NA               NA                NA

Note the missing data present as NAs. Use the code below to see which traits have the most missing data.

%%R
colSums(is.na(traitData))
##             Status     Seed.Mass..mg. Maximum.Height..m. 
##                  0                 93                108 
##       SLA..cm.2.g.   Leaf.Size..cm.2.  Leaf.Nitrogen.... 
##                215                225                305

c) curate data

In the rest of this tutorial, we will be investigating how Maximum height has influenced the structure of the San Juan plant communities. In the code below, we will extract the height data and continue on. If you are interested in investigating another trait other than height, replace the Maximum.Height..m. indicator in the code below to the name of another trait. If using a differnt trait, you can still continue on with the tutorial but you will need to adjust parameters in the simulations differently than is done in the tutorial. I will note when to do so. I suggest following along with height for convenience, however if you are feeling adventurous, try investigating a different trait.

%%R
## PICK TRAIT HERE (replace "Maximum.Height..m.")
regionalTraits <- traitData$Maximum.Height..m.
names(regionalTraits) <- rownames(traitData)

## check how many species we have trait information for
length(regionalTraits)

## have a look at the data
regionalTraits
## [1] 415

One thing we notice immediately is that the number of species we have trait information for is great than the number of species in the phylogenetic tree. This will be a problem for us as the species in the phylogeny must exactly match all of the species in the trait data and vice versa.

We can remedy this problem by dropping all of the species in the trait data vector that are not present in the phylogeny.

%%R
regionalTraits <- regionalTraits[names(regionalTraits) %in% regionalTree$tip.label]

We still have a problem though because there is a lot of missing trait data, and CAMI cannot handle missing trait information.

%%R
## identify the names of the species that have missing trait data
missingSpecies <- names(regionalTraits)[is.na(regionalTraits)]

## load ape for drop.tip()
library(ape)
## drop species from the phylogeny that have missing trait data
regionalTree <- drop.tip(phy = regionalTree, tip = missingSpecies)

## keep only trait data that are not NAs
regionalTraits <- regionalTraits[!is.na(regionalTraits)]

In the R package geiger there is a handy function called name.check(). This function will check whether the trait data and the phylogeny exactly match. For this, the tip labels must match the names of the trait vector.

%%R
library(geiger)
name.check(phy = regionalTree, data = regionalTraits)
## $tree_not_data
## [1] "Arctostaphylos_media"
## 
## $data_not_tree
## character(0)

Looks like there is one species in the tree that is not in the trait data. This makes sense because we never checked the tree for species that weren’t in the trait data, just the other way around. Remove this lone species now and re-check the names.

%%R
regionalTree <- drop.tip(phy = regionalTree, tip = "Arctostaphylos_media")
name.check(phy = regionalTree, data = regionalTraits)
## [1] "OK"

We got the official “OK”, so we are good to proceed.

3. Trait Evolution

First things first, since we are dealing with height data, we almost always need to log-transform our data. Have a look at the data and see why.

%%R
hist(regionalTraits, breaks=100)

%%R
hist(log10(regionalTraits), breaks=100)

Note that we don’t just log-transform for statistical convienience, especially in the case of height. This is because the differences in height between the species on the ground and herbaceous level are less dramatic, but arguably just as important as the incredibly large differences in height that occurr because of the few very tall trees and shrubs in the community. If we do not log-transform we are not weighting the importance of the small height differences on the ground level where there are many more species.

Before we can simulate community assembly data, we need to see what model of trait evolution is most appropriate for our data, Brownian Motion (BM; Felsensten 1985) or Ornstein-Uhlenbeck (OU; Hansen 1997, Butler & King 2004). For this, we will fit both models of trait evolution to our phylogenetic and trait data using the fitContinuous() function in geiger. Then we will perform model selection using an information theoric approach, AIC. Briefly, AIC estimates the quality of each model, relative to each of the other models, while dealing with the trade-off between the goodness of fit of the model and the simplicity of the model.

%%R
## fit BM and OU models
BM.mod <- fitContinuous(phy = regionalTree, dat = log10(regionalTraits)*10, model = "BM", control=list(niter=100))
OU.mod <- fitContinuous(phy = regionalTree, dat = log10(regionalTraits)*10, model = "OU", bounds=list(alpha=c(0.001, 0.02)), control=list(niter=100))

## look at one model fit object
BM.mod
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
##  sigsq = 0.852007
##  z0 = 3.420735
## 
##  model summary:
##  log-likelihood = -909.940515
##  AIC = 1823.881029
##  AICc = 1823.920503
##  free parameters = 2
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 0
##  frequency of best fit = 1.00
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates

Notice all of the information associated with fitting the BM model to our data, including the parameter estimates. What we are interested in looking at now is the AIC score and specifically, which model has the lower AIC score.

%%R
## BM AIC
BM.mod$opt$aic

## OU AIC
OU.mod$opt$aic
## [1] 1823.881
## [1] 1767.436

The Ornstein-Uhlenbeck model has a lower AIC score by about 60 units. In information theory, the model is thought to be better than another if its’ AIC score is smaller by > 2 units, so I would say the OU is a much better fit to the data. If you opted to use a differnt trait other than height, you may be seeing support for the BM model.

4. Simulate Community Assembly

Get all of the parameters set to run community assembly simulations modeled after the empirical data sample sizes and trait evolution parameters.

%%R
## set the size of the regional community equal to the size of our regional community
N.sj <- length(regionalTree$tip.label)

## this line tells use the number of species in each local community
## because we have so many local communities, we will simulate under a range of local community sizes
local.sj <- range(colSums(communityMatrix))

## We will use the parameter estimates from the OU model fit to our data
sig2.sj <- OU.mod$opt$sigsq
alpha.sj = OU.mod$opt$alpha

Now we can simulate the assembly data. Simulating this much data would maybe only take ~20 minutes on abel. But for each of us, this is a bit much. Instead of running this, let’s load some data that has already been run. If you are investigating another trait other than height, you will need to run new simulations. You can lower the number of sims if need be.

%%R
Assembly.Data <- SimCommunityAssembly(sims = 5000, 
                                     N = N.sj, 
                                     local = local.sj,
                                     traitsim = "OU",
                                     comsim = "all",
                                     lambda = c(0.05, 2.0),
                                     eps = c(0.2, 0.8),
                                     sig2 = sig2.sj,
                                     alpha = alpha.sj,
                                     tau = c(1, 30))

Load the data instead.

%%R
load(file="SanJuan_Data/Assembly.Data.Rdata")
head(Assembly.Data)

STOP HERE

5. Model Predictions

In order to use the simulated data to make predictions about the plant communities in the San Juan Island, we first need to do two things.

a) RF classifier

First, we will build our randomForest classifier and we will do this in the same way we built the first one.

%%R
library(randomForest)

## The response variables
modelIndex <- Assembly.Data$params[,"comsim"]

## Organize the predictor variables with the response
ReferenceTable <- data.frame(Assembly.Data$summary.stats, modelIndex)

## build the classifier using all predictors, 
RF.model <- randomForest(modelIndex ~., data=ReferenceTable, ntree=1000, importance=T)

## check out the OOB error rates for each model
RF.model
## 
## Call:
##  randomForest(formula = modelIndex ~ ., data = ReferenceTable,      ntree = 1000, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 4.72%
## Confusion matrix:
##             competition filtering neutral class.error
## competition        1659         0      48  0.02811951
## filtering             0      1568      83  0.05027256
## neutral              36        69    1537  0.06394641

We can see that, with this data, the error rates for classification are very low (<5%), meaning RF is doing a good job at distinguishing between which simulated data are from which model. As mentioned before, there are ways to tune your classifier if the error rates are unsatisfying. We will learn more about these techniques throughout the week, but for now, our classifier is working fine.

b) summary statistics

In order to use our classifier, we must have the same summary statistics for the empirical data as we do for the simulated data. To do this, we will use the code below to go through and create all of the phylogenetic trees and trait vectors for each of the San Juan communities. We will need this information to calculate the summary statistics.

%%R
## first, remove all of the species from the community matrix for which there are no traits
communityMatrix <- communityMatrix[rownames(communityMatrix) %in%  names(regionalTraits),]

## use a for look to go through each local community and do 2 things
## 1. create  a local phylogeny for each community
## 2. isolate the traits for each local community

allLocalTraits <- list()

for (i in 1:ncol(communityMatrix)) {
  
  ## local phylogeny
  species.to.drop <- rownames(communityMatrix)[communityMatrix[,i]==0]
  localTree <- drop.tip(regionalTree, species.to.drop)
  if (i == 1){ 
    allLocalTrees <- localTree } ## if this is the first tree, create tree list
  else{  allLocalTrees <- c(allLocalTrees, localTree)
  }
  ## local traits
  localTraits <- regionalTraits[match(localTree$tip.label, names(regionalTraits))]
  allLocalTraits[[i]] <- localTraits
}

We can name.check() to check some of the the phylogeny - trait pairings to make sure they match up. Change the index and check some pairings to make sure they match.

%%R
index = 1
name.check(phy = allLocalTrees[[i]], data = allLocalTraits[[i]])

Now that the data is ready, we can calculate the summary statistics for each community and make model predictions. CAMI has a function CalcSummaryStats() that takes a regional and local phylogeny and regional and local trait vectors.

## calculate summary statistics
allSummaryStats <- matrix(NA, ncol(communityMatrix), 30)
for (i in 1:ncol(communityMatrix)){
  stats<-  CalcSummaryStats(regional.tree = regionalTree,
                                      local.tree = allLocalTrees[[i]],
                                      regional.traits = log10(regionalTraits)*10,
                                      local.traits = log10(allLocalTraits[[i]])*10)
  allSummaryStats[i,] <- stats
  colnames(allSummaryStats) <- names(stats)
}

## add rownames of communities
rownames(allSummaryStats) <- colnames(communityMatrix)

## have a look at all of the summary statistics
allSummaryStats

Make RF predicitons for each community and plot the results.

%%R
## use the predict() function from randomForest
modelPredictions <- predict(RF.model, allSummaryStats, type="vote")

## transpose the predictions for plotting
barplot(t(modelPredictions), col=c("#005679", "#00B7C4", "#FF8800") , border="white", xlab="San Juan Communities", cex.names = 0.1)
legend("topleft", legend = c("competition", "filtering", "neutral"), fill = c("#005679", "#00B7C4", "#FF8800"), cex=.8)

## if desired, you can save the figure in your folder 
pdf('SJModelPredictions_MaxHght_Barplot.pdf', width=5, height=4)

Here we see that a majority of communities support the neutral model of assembly the most. However, there are several communities that show strong support for the filtering model, and other communities that show moderate support for competition.

Time and data permitting, we might dive into these results more and ask how the support for different models is related to other island metadata, likesize, area, and age of island.

6. Parameter Estimation

Finally, we will perform parameter estimation using randomForest, except instead of using a RF for classification of categorical variables, we will use it for regression and the prediction of continuous variables. In a regression tree, since the target variable is a real valued number, we fit a regression model to the target variable using each of the independent, or predictor variables. Then for each predictor variable, the data is split at several split points. We calculate Sum of Squared Error(SSE) at each split point between the predicted value and the actual values. The variable resulting in minimum SSE is selected for the node. Then this process is recursively continued till the entire data is covered. We will cover regression trees more indepth tomorrow.

We will only estimate the parameters for all of the communities that have high support for the filtering model.

$$R
## we will predict for any community with >70% model support for filtering
modelPredictions[modelPredictions[, "filtering"] > .70,]

## isolate the summary statistics for these communities
filtComSumStats <- allSummaryStats[rownames(modelPredictions[modelPredictions[, "filtering"] > .70,]), ]
##                             competition filtering neutral
## Dotlet_Island                     0.004     0.805   0.191
## East_Sucia_2_Island               0.022     0.724   0.254
## East_Sucia_3_Island               0.025     0.780   0.195
## Little_Cactus                     0.011     0.840   0.149
## Mackaye_Harbor_Island             0.023     0.738   0.239
## Massacre_Bay__1_Island            0.030     0.723   0.247
## Swirl_Rock_Central                0.031     0.781   0.188
## Unnamed__1_near_Long_Island       0.016     0.794   0.190
## Unnamed__2_near_Long_Island       0.009     0.819   0.172

To estimate tau as the strength of filtering, we will need to isolate only the simulations under the filtering model to perform paramter estimation, as the other models would be innappropriate. As a reminder, the lower tau is, the stronger the filtering effect.

%%R
## which simulation are filtering
filtIndex <- Assembly.Data$params[,2]=="filtering"
## isolate tau and sum.stats only for those simualtions
tauIndex <- Assembly.Data$params[filtIndex,"tau"]
summaryStatsTau <- Assembly.Data$summary.stats[filtIndex, ]

## constuct reference table for regression forest and build
refTableRegress <- data.frame(summaryStatsTau, tauIndex)
RF.regress <- randomForest(tauIndex ~., data=refTableRegress, ntree=1000, importance=T)

RF.regress
## 
## Call:
##  randomForest(formula = tauIndex ~ ., data = refTableRegress,      ntree = 1000, importance = T) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 25.26987
##                     % Var explained: 64.31

Estimate tau for the San Juan communities isolated using this RF regression forest.

%%R
## isolate summary stats for 
paramPredictions <- predict(RF.regress, filtComSumStats, type="response", predict.all = TRUE)
paramPredictions

If we want a prediction interval around the mean estimates of each parameter for each community, we need to make some assumptions about the distribution of the individual points around the predicted means. IF we do, we can take the predictions from the individual trees (the bootstrapped confidence interval piece) and generate a random value from the assumed distribution with that center. The quantiles for those generated pieces may form the prediction interval. Here is an example of doing this by adding normal (since we know the original data used a normal) deviations to the predictions with the standard deviation based on the estimated MSE from that forest.

%%R
## get prediciton intervals for the parameter estimates
paramIntervals <- sapply(1:nrow(filtComSumStats), function(i) {
  tmp <- paramPredictions$individual[i, ] + rnorm(RF.regress$ntree, 0, sqrt(RF.regress$mse))
  quantile(tmp, c(0.025, 0.975))
})

t(paramIntervals)
##             2.5%    97.5%
##  [1,]  4.6552870 36.88803
##  [2,]  5.7318160 36.23809
##  [3,]  5.6082080 36.89689
##  [4,]  6.0088076 36.75739
##  [5,]  4.8581990 36.57023
##  [6,] -1.4725130 32.88139
##  [7,] -0.5096655 32.52613
##  [8,]  4.9960451 36.09195
##  [9,]  3.5885372 34.18871

We see here some problems immediately, for one, the intervals extend beyound the upper bound of the prior distribution (30). We should note that these could be for several reasons. The first being that we have not conducted nearly enough simulations to be able to perform parameter estimation accuratly. We have ~1300 simulations under the filtering model that we are using for parameter estimation, when really, we should be using a much larger database of simulations (>10000). Additionally, this is not the only way to infer prediction intervals around the mean of RF estiamtes, there are a plethora of ways to try to do this and a lot of the appraoches currently being developed.

7. References

Butler, M.A. & King, A.A. 2004. Phylogenetic Comparative Analysis: A Modeling Approach for Adaptive Evolution. Am. Nat.,164, 683-695. link

Felsenstein, J. 1985. Phylogenies and the Comparative Method. Am. Nat. 125, 1-15. link

Hansen, T.F. 1997. Stabilizing Selection and the Comparative Analysis of Adaptation. Evolution. 51, 1341-1351. link

Marx HE, Giblin DE, Dunwiddie PW, Tank DC. 2015 Deconstructing Darwin’s Naturalization Conundrum in the San Juan Islands using community phylogenetics and functional traits. Diversity and Distributions, 22, 318-331. link

Smith SA, Beaulieu JM, Donoghue MJ. 2009 Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches. BMC Evolutionary Biology, 9, 37. link