For my PhD project I am trying to use copy number variations (CNVs) to deduce phylogeny between 9 cancer cell lines... and it seem to work quite well. It is not hard to get to a 9x9 distance matrix showing the copy number variations between each of them (400-2000 CNVs out of 20000 genes in the genome). But do any of you have experience with how to draw a consensus tree with bootstrapping from this (and not just a tree based on the one distance matrix)? I have looked into the R-packages ape, phangorn and phyclust but all of them require you to start with sequences that you then align. They are not very happy about accepting a distance matrix alone (because bootstrapping require taking sub-samples from your dataset).
It is not hard to make e.g. 1000 new distance matrices based on extraction of e.g. 1/4 the dataset by random
coll[sample(nrow(coll),size=20000/4,replace=FALSE),]
So I believe the question is: Would it be possible to use 1000 distance matrices to create one consensus phylogenetic tree? R would be preferred but any input is appreciated.
Kind regards
-Christian
Hi David
Thank you for answering my question. The method works... but the resulting consensus tree loose most of the information seen in the original tree. If we use your example and run your code and follow it with
You get a plot of what a straightforward nj tree (named "original) showing clusters of t3,4 another wih t6-8 and one with the rest. This is tree made from the entire dataset.
If you plot a consensus tree based on one single permutation of the data (B=1) all this is lost and you have a plot where the t3-4 and the t6-8 are still there but we cannot anymore conclude that they are distinct clusters from the rest of the dataset. I see the same result with my data: after running B=1 or B=1000 i end up with a tree that has no distinctive features.
Is it because my dataset is not really showing me anything (i severely doubt this) or is boot.phylo taking out a very very small subsample => the clusters are still there but the distance between them are now much much shorter than before? Can you explain to me why plotting after one round of bootstrapping is so fundamentally different from the original plot?
Hi Christian,
I can't talk about specifics because we are looking a different trees, for me, setting the seed and running the code gives the point estimate tree as
I'm not sure how you are defining clusters (on unrooted trees), but you should be aware consensus trees have no branch lengths.
In answer to some of your other questions, boot.phylo is re sampling the matrix with replacements. Using the variables above each replicate is equivalent to
With B=1 you have the result of a single bootstrap sample, it will be different than the tree estimated from the whole data-set but no less resolved (though, again, the 'consensus' tree will have no branch lengths, if you want these just plot
bs$trees[[1]]
)Dear David
Thank your that that little comment. that made all the difference. Now the (B=1) is similar to the one from the entire dataset and B=100 is slight different. My dataset had no changes at all after B=1000 so it must be pretty stable. Thanks a lot for your help. Have a nice day