Question

Recommendations about phylogenetic analysis tools for RFLP/AFLP/RAPD data

0

7.7 years ago

JL • 0

Hello Biostars!

Can anybody recommend phylogenetic analysis tools for building trees from RFLP/AFLP/RAPD-type data, apart from Treecon? (My datasets seem too big for that software.)

Thanks in advance
JL

Phylogenetic-tree • 2.5k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 7.7 years ago by JL • 0

0

Just in case it helps readers to figure out the kind of data I need to analyze/cluster, here is a sample:

##SNP   St1 St2 St3 St4 St5
1284995 0   0   0   1   0
1285001 1   1   1   0   1
1285017 0   0   0   0   0
1285034 0   0   1   0   0
1285040 0   1   0   0   0
1285070 0   0   0   0   1

Thanks once more

ADD REPLY • link 7.7 years ago by JL • 0

score 3 · Answer 1 · 2017-03-16

3

7.7 years ago

Burnedthumb ▴ 90

You could use R in combination with the proxy package:

Given your data as a tab-delimited file "dataset.txt":

SNP St1 St2 St3 St4 St5 St6
1284995 0   0   0   1   0   0
1285001 1   1   1   0   1   1
1285017 0   0   0   0   0   0
1285034 0   0   1   0   0   0
1285040 0   1   0   0   0   0
1285070 0   0   0   0   1   1

Then do this in R (you may want to read up on the Jaccard similarity; I am not entirely sure it is the best measure to use):

## Install once, then load the package (its dist() supports Jaccard):
install.packages("proxy")
library(proxy)

## Load the dataset (markers in rows, strains in columns):
dataset <- read.table(file="dataset.txt", sep="\t", header=TRUE, row.names=1)

## Calculate the distance between strains using the Jaccard method.
## Note that I transpose the dataset, otherwise I would cluster the markers:
d <- proxy::dist(t(dataset), method="Jaccard")

## Hierarchically cluster the data:
hc <- hclust(d)

## Plot the dendrogram:
plot(hc)

The result is a dendrogram of the strains (image not included).
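
For 0/1 presence/absence markers, base R's `dist()` with `method = "binary"` should give the same Jaccard distance without an extra package. The following is only a minimal sketch: it assumes the six-strain example table from this answer, typed in directly instead of being read from "dataset.txt", and the object names are made up for illustration.

```r
# Example markers (rows) x strains (columns), copied from the table above.
markers <- matrix(
  c(0, 0, 0, 1, 0, 0,
    1, 1, 1, 0, 1, 1,
    0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0,
    0, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("1284995", "1285001", "1285017", "1285034", "1285040", "1285070"),
    paste0("St", 1:6)
  )
)

# Transpose so that strains are rows. "binary" is the Jaccard distance:
# markers present in only one strain divided by markers present in at
# least one of the two strains.
d_jaccard <- dist(t(markers), method = "binary")
jaccard_matrix <- as.matrix(d_jaccard)

print(round(jaccard_matrix, 2))
```

In this small example St5 and St6 have identical profiles, so their distance is 0, while St4 shares no marker with any other strain, so its distance to all of them is 1.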

ADD COMMENT • link 7.7 years ago by Burnedthumb ▴ 90

0

@ Burnedthumb,

Thank you very much for your insight. I will try this right away. I just have one extra question: do you think that R graphics devices will be able to handle a dataset with hundreds or thousands of rows and columns? I ask because, in my experience, plotting such big datasets is not an easy task for R...

Thanks in advance for your kind help!

ADD REPLY • link 7.7 years ago by JL • 0

1

R itself will handle data up to a couple of gigabytes just fine. However, if you plot hundreds or thousands of samples, the image becomes unreadable. Instead of plotting to the screen, you could write the dendrogram to a file like this:

## Plot the data to image with 1000 pixels width and height:
png(file="dendrogram.png", width=1000, height=1000)
plot(hc)
dev.off()

What are the dimensions of your data?
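
If a fixed 1000 x 1000 pixel image is too cramped, one option is to let the page height grow with the number of leaves and draw the tree horizontally. This is a rough sketch on simulated data: the matrix sizes, the 0.15 inch per leaf, and the file name are all assumptions chosen for illustration, and base R's `dist(method = "binary")` stands in for the Jaccard distance.

```r
# Simulated presence/absence data: 200 markers x 300 strains (assumed sizes).
set.seed(1)
n_markers <- 200
n_strains <- 300
markers <- matrix(rbinom(n_markers * n_strains, size = 1, prob = 0.3),
                  nrow = n_markers,
                  dimnames = list(NULL, paste0("St", seq_len(n_strains))))

# Cluster the strains (columns), hence the transpose.
hc <- hclust(dist(t(markers), method = "binary"))

# Give every leaf roughly 0.15 inch of vertical space so labels do not overlap.
out_file <- file.path(tempdir(), "dendrogram.pdf")
pdf(out_file, width = 8, height = max(7, 0.15 * n_strains))
par(mar = c(4, 1, 1, 6), cex = 0.6)  # wide right margin for the leaf labels
plot(as.dendrogram(hc), horiz = TRUE)
dev.off()
```

A vector format such as PDF lets you zoom in on individual branches, which a bitmap of the same tree does not.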

ADD REPLY • link 7.7 years ago by Burnedthumb ▴ 90

0

Depending on whether I transpose the table (that is, whether I want to inspect the clustering of strains or of SNPs), I will have about 5000 strains and up to 3000 SNPs in some cases. So let's say 3000 x 5000 (rows x columns). Do you think this is a viable dataset for the task?

Thank you very much again!

ADD REPLY • link 7.7 years ago by JL • 0

1

You can make the plot as big as you want; however, at that size it will be unreadable and you won't get any information out of it. You may want to filter the data first to keep the most interesting samples. Or you can cut[1] the tree at a specific height and plot the resulting subtrees separately.

[1] https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cutree.html
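
Cutting the tree could look roughly like the sketch below. It reuses the six-strain example from the answer, with base R's `dist(method = "binary")` as a stand-in for the Jaccard distance; the cut height of 0.9, the choice of k = 2, and the output file name are arbitrary assumptions for illustration.

```r
# Six-strain example from the answer above (markers in rows, strains in columns).
markers <- matrix(
  c(0, 0, 0, 1, 0, 0,
    1, 1, 1, 0, 1, 1,
    0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0,
    0, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, paste0("St", 1:6))
)
hc <- hclust(dist(t(markers), method = "binary"))

# Option 1: ask cutree() for a fixed number of groups.
groups <- cutree(hc, k = 2)
print(table(groups))

# Option 2: cut the dendrogram at a height and keep the subtrees below the cut.
subtrees <- cut(as.dendrogram(hc), h = 0.9)$lower
sizes <- sapply(subtrees, function(node) attr(node, "members"))
biggest <- subtrees[[which.max(sizes)]]

# Plot only the largest subtree (here to a PDF in the temporary directory).
out_file <- file.path(tempdir(), "subtree.pdf")
pdf(out_file)
plot(biggest)
dev.off()
```

With the example data, St4 ends up in a group of its own and the other five strains form the second group, so `biggest` holds the five-strain subtree.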

ADD REPLY • link 7.7 years ago by Burnedthumb ▴ 90

score 0 · Answer 2 · 2017-03-15

0

7.7 years ago

Charles Yin ▴ 180

You can use MATLAB or Python to compute the dissimilarity matrix of the data first, and then draw the phylogenetic tree from it. I am not sure which dissimilarity measure is the right one for this data set.

ADD COMMENT • link 7.7 years ago by Charles Yin ▴ 180

0

@ Channgchuan Yin, could you elaborate on your suggestion a little (do you use custom scripts, or is there a module/package/software you would recommend)?

ADD REPLY • link 7.7 years ago by JL • 0

1

Sorry, I tried to post my answer to Biostars, but the message was not successfully updated.

You may need to define a distance between two SNPs, for example the Hamming distance. You may refer to this paper: Wang, C., Kao, W. H., & Hsiao, C. K. (2015). Using Hamming distance as information for SNP-sets clustering and testing in disease association studies. PLoS ONE, 10(8), e0135918 (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135918). Some programming is needed.
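
As an illustration of the Hamming distance only (not the method of the cited paper), here is a small sketch in R, the language used elsewhere in this thread. On 0/1 data the Manhattan distance simply counts mismatching positions, which is the Hamming distance. The matrix is the six-strain example from the first answer, and the object names are hypothetical.

```r
# Six-strain example (markers in rows, strains in columns).
markers <- matrix(
  c(0, 0, 0, 1, 0, 0,
    1, 1, 1, 0, 1, 1,
    0, 0, 0, 0, 0, 0,
    0, 0, 1, 0, 0, 0,
    0, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("1284995", "1285001", "1285017", "1285034", "1285040", "1285070"),
    paste0("St", 1:6)
  )
)

# Hamming distance between strains: number of markers at which they differ.
hamming_strains <- as.matrix(dist(t(markers), method = "manhattan"))

# Hamming distance between SNPs/markers: number of strains at which they differ.
hamming_snps <- as.matrix(dist(markers, method = "manhattan"))

# Either matrix can be passed on to hierarchical clustering.
hc_hamming <- hclust(as.dist(hamming_strains), method = "average")

print(hamming_strains)
```

Unlike the Jaccard distance, the Hamming distance treats a shared absence (0/0) as agreement, so which of the two suits dominant markers such as AFLP or RAPD is a judgment call.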

ADD REPLY • link 7.7 years ago by Charles Yin ▴ 180