Clustering Large Dataset In Terms Of Sequence Similarity Using Eg... Blastclust
12.4 years ago
kajendiran56 ▴ 120

Dear All, thank you for your time. I have a dataset containing 15,000 sequences. I wish to build a tree, so my plan was to cluster the sequences with BlastClust, a module in the BLAST package, and then use a reference sequence from each cluster to build a crude tree. BlastClust has been running for some time now, but I have no idea whether this is going to work or how long it will take.

I was wondering if there are any other ways of going about this with such a large set of sequences?

Ideally, I wanted to do a sequence alignment, use that alignment to build a tree (which I agree will be complex with that number of sequences), and then look at the evolution of those sequences.

I tried MAFFT for the sequence alignment; it did not report any errors, but it also produced no output.

Any suggestions would be appreciated.

clustering sequence alignment tree • 7.1k views
12.4 years ago
Andreas ★ 2.5k

The classical approach for creating a tree would be to compute an alignment and then a tree from that. However, for such a large number of sequences you have to use some tricks. I guess BlastClust is one of them, but it really depends on what you need this tree for. Depending on the application, CD-HIT (see Chris' post) or UCLUST/USEARCH are alternatives.

If you want to stick to the classical approach, which needs an alignment, then your only options are MAFFT (make sure to use its PartTree option!) and Clustal Omega, and they will only work with sequences of reasonably small size. Once you have an alignment, the tree building needs to be done with something really fast as well; one option for computing trees at this scale is FastTree.
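
As a rough sketch of that alignment-then-tree route, the two command lines could be assembled from Python as below. This is only an illustration under some assumptions: the file names are placeholders, `mafft` and `FastTree` are assumed to be on the PATH under those names (some installs call the latter `fasttree`), and `--retree 2` is just one reasonable setting to pair with `--parttree`. Both tools print their result to standard output, so it has to be redirected to a file; that may be worth checking if a run seems to finish without producing anything.

```python
import shutil
import subprocess
from pathlib import Path


def mafft_command(fasta):
    """Command line for a fast MAFFT run using the PartTree guide-tree option."""
    return ["mafft", "--parttree", "--retree", "2", str(fasta)]


def fasttree_command(alignment, nucleotide=False):
    """Command line for FastTree; -nt switches from protein to nucleotide input."""
    cmd = ["FastTree"]
    if nucleotide:
        cmd.append("-nt")
    cmd.append(str(alignment))
    return cmd


def run_to_file(cmd, out_path):
    """Run cmd and save its standard output to out_path.

    Both tools write their result to stdout, so nothing is kept
    unless stdout is redirected to a file.
    """
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    with open(out_path, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)


if __name__ == "__main__":
    # Placeholder file names, not taken from the thread.
    sequences = Path("sequences.fasta")
    if sequences.exists():
        run_to_file(mafft_command(sequences), "aligned.fasta")
        run_to_file(fasttree_command("aligned.fasta"), "tree.nwk")
```

Pass `nucleotide=True` to `fasttree_command` for DNA or RNA alignments; without it FastTree treats the input as protein.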

Andreas

PS: You might want to have a look at the just-published paper "Ultrafast clustering algorithms for metagenomic sequence analysis" by Li et al. if you're dealing with NGS sequencing data, especially from metagenomics.

Thank you for your extensive suggestions. I have managed to use CD-HIT, but you are correct that this is not ideal. I managed to build an alignment with Clustal Omega, and I will use FastTree as you suggested. Although I am not using NGS data, I will look at the paper you mentioned as well. Thank you once again.

12.4 years ago
Chris ★ 1.6k

Have a look at CD-HIT [1]. It should take only minutes to cluster that many sequences.

[1] http://weizhong-lab.ucsd.edu/cd-hit/
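
For the plan of taking one reference sequence per cluster, CD-HIT marks the representative of each cluster in its `.clstr` file with a trailing `*`. Below is a minimal sketch of reading that file, assuming the usual layout (`>Cluster N` header lines followed by one line per member); the function name `parse_clstr` and the file name `clusters.clstr` are made up for illustration.

```python
import re
from pathlib import Path

# Member lines look like "0\t2799aa, >seq_id... *" (the representative)
# or "1\t2214aa, >other_id... at 80.00%" (a sequence assigned to the cluster).
MEMBER = re.compile(r">(?P<id>.+?)\.\.\.\s*(?P<rest>.*)$")


def parse_clstr(lines):
    """Return {representative_id: [member_ids]} from CD-HIT .clstr lines."""
    clusters = {}
    members, representative = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            # A new cluster starts: store the one just finished, if any.
            if representative is not None:
                clusters[representative] = members
            members, representative = [], None
            continue
        match = MEMBER.search(line)
        if match is None:
            continue  # skip blank or unrecognised lines
        members.append(match.group("id"))
        if match.group("rest").strip() == "*":
            representative = match.group("id")
    # Store the last cluster in the file.
    if representative is not None:
        clusters[representative] = members
    return clusters


if __name__ == "__main__":
    path = Path("clusters.clstr")  # hypothetical file name
    if path.exists():
        with path.open() as handle:
            for rep, members in parse_clstr(handle).items():
                print(rep, len(members))
```

The representative IDs returned here could then be used to pull the matching records out of the original FASTA file before aligning them.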

Thank you for your suggestion. I have used this effectively to build a crude tree, and I am amazed at how quickly it does this. Thank you.
