Hello Biostars!
Can anybody recommend phylogenetic analysis tools for building trees from RFLP/AFLP/RAPD-type data, apart from Treecon? (My datasets seem to be too big for that software.)
Thanks in advance
JL
You could use R in combination with the proxy package:
Given your data as a tab-delimited file "dataset.txt" (markers in rows, samples in columns):
SNP St1 St2 St3 St4 St5 St6
1284995 0 0 0 1 0 0
1285001 1 1 1 0 1 1
1285017 0 0 0 0 0 0
1285034 0 0 1 0 0 0
1285040 0 1 0 0 0 0
1285070 0 0 0 0 1 1
Then do this in R (you may want to look up the Jaccard similarity; I am not entirely sure that it is the best measure to use):
## Install the proxy package (only needed once):
install.packages("proxy")
library(proxy)
## Load the dataset:
dataset <- read.table(file="dataset.txt", sep="\t", header=TRUE, row.names=1)
## Calculate the distances between samples using the Jaccard method.
## Note that I transpose the dataset, otherwise the markers would be clustered:
d <- dist(t(dataset), method="Jaccard")
## Hierarchically cluster the data:
hc <- hclust(d)
## Plot the data:
plot(hc)
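If you want to check what the Jaccard distance actually does on the toy table above, here is a minimal sketch that uses only base R. It assumes that `dist(..., method = "binary")` is an acceptable stand-in for the proxy call: for 0/1 data it gives the proportion of markers at which two samples differ, counted only over the markers where at least one of them has a 1, which is one minus the Jaccard similarity. The choice of average linkage (UPGMA) is also my assumption, not something the data dictates.

```r
## Toy data from above, typed in so the example is self-contained
## (markers in rows, samples in columns)
dataset <- data.frame(
  St1 = c(0, 1, 0, 0, 0, 0),
  St2 = c(0, 1, 0, 0, 1, 0),
  St3 = c(0, 1, 0, 1, 0, 0),
  St4 = c(1, 0, 0, 0, 0, 0),
  St5 = c(0, 1, 0, 0, 0, 1),
  St6 = c(0, 1, 0, 0, 0, 1),
  row.names = c("1284995", "1285001", "1285017",
                "1285034", "1285040", "1285070")
)

## Transpose so that samples are rows, then compute Jaccard distances
## with base R ("binary" ignores markers that are 0 in both samples)
d  <- dist(t(as.matrix(dataset)), method = "binary")
dm <- as.matrix(d)
print(round(dm, 2))

## Average linkage (UPGMA); plain hclust(d) would use complete linkage
hc <- hclust(d, method = "average")
```

On this table St5 and St6 have identical profiles (distance 0), St1 and St2 differ at one of the two markers present in either (distance 0.5), and St1 and St4 share no marker at all (distance 1). Calling `plot(hc)` afterwards draws the tree as before.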
@ Burnedthumb,
Thank you very much for your insight. I will try this right away. I just have one extra question: do you think that R's graphical devices will be able to handle a dataset with hundreds or thousands of rows and columns? I ask because, in my experience, representing such big datasets is not an easy task for R...
Thanks in advance for your kind help!
R itself will handle data up to a couple of gigabytes just fine. However, if you plot hundreds or thousands of samples, the image becomes unreadable. Instead of plotting to the screen, you could write the dendrogram to a file like this:
## Plot the data to image with 1000 pixels width and height:
png(file="dendrogram.png", width=1000, height=1000)
plot(hc)
dev.off()
What are the dimensions of your data?
Depending on whether I transpose the table (that is, whether I want to inspect the clustering of strains or of SNPs), I will have about 5000 strains and up to 3000 SNPs in some cases. So let's say 3000 x 5000 (rows x columns). Do you think this is a viable dataset for this task?
Thank you very much again!
You can make the plot as big as you want; however, at that size it will be unreadable and you won't get any information out of it. You may want to filter the data down to the most interesting samples first. Or you can cut[1] the tree at a specific height and plot the resulting subtrees separately.
[1] https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cutree.html
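Cutting the tree could look roughly like the sketch below. The data are simulated stand-ins, and the names (`x`, `groups`, `subtrees`, the output file) and the choice of 4 groups are purely illustrative; base R's `dist(..., method = "binary")` is used for the Jaccard distance so that no extra package is needed.

```r
## Simulated stand-in data: 60 samples (rows) x 200 binary markers
set.seed(42)
x <- matrix(rbinom(60 * 200, 1, 0.3), nrow = 60,
            dimnames = list(paste0("St", 1:60), NULL))
d  <- dist(x, method = "binary")      # Jaccard distance in base R
hc <- hclust(d, method = "average")

## Option 1: label every sample with one of k groups
groups <- cutree(hc, k = 4)
print(table(groups))

## Option 2: cut at a height and keep the subtrees.
## The height is chosen between two merges so that 4 subtrees remain.
n <- length(hc$order)
h <- mean(hc$height[c(n - 4, n - 3)])
subtrees <- cut(as.dendrogram(hc), h = h)$lower

## Write one subtree per PDF page; single-sample subtrees are skipped
out_file <- file.path(tempdir(), "subtrees.pdf")
pdf(out_file, width = 10, height = 6)
for (st in subtrees) {
  if (attr(st, "members") > 1) plot(st)
}
dev.off()
```

As for size: with 5000 strains the distance object holds 5000 * 4999 / 2, or about 12.5 million, values, which is roughly 100 MB as doubles. Computing it should therefore be feasible on an ordinary machine; it is the readability of a single plot that becomes the limiting factor.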
You can use MATLAB or Python to compute the dissimilarity matrix of the data first, and then draw the phylogenetic trees. I am not sure which method you can use to get the dissimilarity matrix of the data set.
Sorry I tried to post my answer to Biostar, but the message has not been successfully updated.
You may need to define a distance between two SNPs, for example the Hamming distance. See Wang, C., Kao, W. H., & Hsiao, C. K. (2015). Using Hamming distance as information for SNP-sets clustering and testing in disease association studies. PLoS ONE, 10(8), e0135918. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135918. Some programming is needed.
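As a rough illustration of the Hamming distance on 0/1 marker data, here is a minimal base R sketch. It is not the method from the paper above, just the plain count of markers at which two samples differ; the variable names are made up for the example, and the data are the toy table from earlier in the thread with samples as rows.

```r
## Samples in rows, 0/1 markers in columns (toy data from the thread)
x <- rbind(
  St1 = c(0, 1, 0, 0, 0, 0),
  St2 = c(0, 1, 0, 0, 1, 0),
  St3 = c(0, 1, 0, 1, 0, 0),
  St4 = c(1, 0, 0, 0, 0, 0),
  St5 = c(0, 1, 0, 0, 0, 1),
  St6 = c(0, 1, 0, 0, 0, 1)
)

## Hamming distance = number of markers at which two samples differ.
## For 0/1 data this equals the Manhattan distance.
d_hamming <- dist(x, method = "manhattan")

## The same counts via matrix products: 1/0 mismatches plus 0/1 mismatches
hamming_matrix <- x %*% t(1 - x) + (1 - x) %*% t(x)
print(hamming_matrix)

## The distance object can be clustered like any other
hc_hamming <- hclust(d_hamming, method = "average")
```

Unlike the Jaccard distance, the Hamming distance treats a shared 0 as agreement, so which one is appropriate depends on whether a shared absence of a band is informative for your marker type.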
Just in case it helps readers to figure out the kind of data I need to analyze/cluster, here is a sample:
Thanks once more