Question

Clustering Large Dataset In Terms Of Sequence Similarity Using Eg... Blastclust


12.4 years ago

kajendiran56 ▴ 120

Dear All, thank you for your time. I have a dataset containing 15,000 sequences. I wish to build a tree, so my plan was to cluster them with BlastClust, a module in the Blast application, and then use a reference sequence from each cluster to build a crude tree. BlastClust has been running for some time now, but I have no idea whether this is going to work or how long it will take.

I was wondering if there are any other ways of going about this with such a large set of sequences?

Ideally, I wanted to be able to do a sequence alignment and then use that alignment to build a tree (which I agree will be complex with that number of sequences) and then look at the evolution of those sequences.

I tried something called MAFFT to do the sequence alignment, which did not give me any errors but gave me no output.

Any suggestions would be appreciated.

clustering sequence alignment tree • 7.1k views

ADD COMMENT • link updated 10.7 years ago by Biostar 20 • written 12.4 years ago by kajendiran56 ▴ 120

score 3 · Answer 1 · 2012-07-15

The classical approach for creating a tree would be to compute an alignment and then a tree from that. However, for such a large number of sequences you have to use some tricks. I guess BlastClust is one of them, but it really depends on what you need this tree for. Depending on the application, CD-HIT (see Chris' post) or UCLUST/USEARCH are alternatives.

If you want to stick to the classical approach, which needs an alignment, then your only options are MAFFT (make sure to use its PartTree option!) and Clustal Omega, and they will only work with sequences of reasonably small size. Once you have an alignment, the tree building needs to be done with something really fast as well. One option is FastTree, which starts from a heuristic neighbor-joining tree and refines it into an approximately maximum-likelihood tree.
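A minimal sketch of the align-then-tree route described above, not a tested pipeline. It assumes `mafft` and `FastTree` are installed and on the PATH under exactly those names (some distributions install the latter as `fasttree`), that the input is nucleotide FASTA, and that the file names `aligned.fasta` and `tree.nwk` are placeholders. The helper names are made up for illustration. MAFFT prints the alignment to standard output, so it has to be redirected into a file, which may be worth checking if a run seems to produce no output.

```python
import subprocess
from pathlib import Path


def mafft_command(fasta, use_parttree=True):
    """Build a MAFFT command line; --parttree selects the PartTree mode for large inputs."""
    cmd = ["mafft"]
    if use_parttree:
        cmd.append("--parttree")
    cmd.append(str(fasta))
    return cmd


def fasttree_command(alignment, nucleotide=True):
    """Build a FastTree command line; -nt tells FastTree the alignment is nucleotide."""
    cmd = ["FastTree"]  # assumption: binary name, may be 'fasttree' on some systems
    if nucleotide:
        cmd.append("-nt")
    cmd.append(str(alignment))
    return cmd


def run_to_file(cmd, out_path):
    """Run a command and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


def align_and_build_tree(fasta, workdir="."):
    """Align with MAFFT, then build a Newick tree with FastTree."""
    workdir = Path(workdir)
    alignment = workdir / "aligned.fasta"  # placeholder file name
    tree = workdir / "tree.nwk"            # placeholder file name
    run_to_file(mafft_command(fasta), alignment)
    run_to_file(fasttree_command(alignment), tree)
    return tree
```

Calling `align_and_build_tree("sequences.fasta")` would run both tools in turn. For protein sequences, pass `nucleotide=False` to `fasttree_command` so that `-nt` is left out.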

Andreas

PS: You might want to have a look at the just-published paper "Ultrafast clustering algorithms for metagenomic sequence analysis" by Li et al. if you're dealing with NGS data, especially from metagenomics.

score 1 · Answer 2 · 2012-07-14


12.4 years ago

Chris ★ 1.6k

Have a look at CD-HIT [1]. It should take only minutes to cluster that many sequences.

[1] http://weizhong-lab.ucsd.edu/cd-hit/
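A rough sketch of how a CD-HIT run and its cluster report could be handled from Python, under a few assumptions: the program is on the PATH as `cd-hit` (use `cd-hit-est` for nucleotide sequences), a 90% identity threshold suits the data, and the helper names here are hypothetical. CD-HIT writes the representative sequences to the file given with `-o` and a cluster listing to the same name with `.clstr` appended, in which the representative of each cluster is marked with `*`.

```python
import re
import subprocess


def cdhit_command(fasta, out_prefix, identity=0.9, program="cd-hit"):
    """Build a CD-HIT command line: -i input, -o output, -c identity threshold."""
    return [program, "-i", str(fasta), "-o", str(out_prefix), "-c", str(identity)]


# A member line looks like: "0<TAB>350aa, >seqA... *"
_MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Map each representative ID to the list of sequence IDs in its cluster."""
    clusters = {}
    members, rep = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster starts: store the previous one, if any.
            if rep is not None:
                clusters[rep] = members
            members, rep = [], None
            continue
        match = _MEMBER.search(line)
        if match is None:
            continue
        seq_id = match.group(1)
        members.append(seq_id)
        if line.endswith("*"):
            rep = seq_id  # the representative is flagged with '*'
    if rep is not None:
        clusters[rep] = members
    return clusters


def cluster(fasta, out_prefix):
    """Run CD-HIT, then read back the clusters it reports."""
    subprocess.run(cdhit_command(fasta, out_prefix), check=True)
    with open(f"{out_prefix}.clstr") as handle:
        return parse_clstr(handle)
```

The keys of the returned dictionary are the representatives, so its length gives the number of clusters, and the `-o` output file can be passed straight to an aligner.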



Thank you for your suggestion. I have used this effectively to build a crude tree, and I am amazed at how quickly it does this. Thank you.

ADD REPLY • link 12.4 years ago by kajendiran56 ▴ 120