Question

Reducing Number Of Sequences For Phylogenetic Tree Construction

1

11.5 years ago

Pappu ★ 2.1k

I got several thousand sequences from a blastp search. I removed the sequences with >90% identity using cd-hit before building the MSA, and did the same again after constructing the MSA. The assumption was that sequences with >90% identity will end up in closely related branches. I am wondering whether this cutoff makes sense.

updated 11.5 years ago by DG 7.3k • written 11.5 years ago by Pappu ★ 2.1k

3

A probably more justified way of reducing the number of sequences would be to first build a distance-based tree (NJ, UPGMA) for the whole set of sequences. You could then use Dendroscope 3 or iTOL to auto-collapse clades containing very closely related sequences. During auto-collapsing, the average branch length to all leaves is calculated for every internal node, and clades where this value falls below your threshold are collapsed. You can also specify your own support value or a certain node length.
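To make the collapsing rule concrete, here is a rough Python sketch of the idea. It is not what Dendroscope or iTOL actually run internally; the nested-dict tree format, the function names, and the 0.05 threshold are all made up for illustration. It walks the tree from the root and replaces any clade whose mean distance to its leaves is below the threshold with a single representative leaf.

```python
# Hypothetical tree format: {"name": str or None, "length": float, "children": [...]}
def leaf(name, length):
    return {"name": name, "length": length, "children": []}

def leaf_distances(node):
    """Distances from `node` down to every leaf below it."""
    if not node["children"]:
        return [0.0]
    dists = []
    for child in node["children"]:
        dists.extend(child["length"] + d for d in leaf_distances(child))
    return dists

def leaf_names(node):
    """Names of all leaves below `node`, left to right."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names

def collapse(node, threshold):
    """Return a copy of the tree with shallow clades collapsed.

    A clade is collapsed when the mean distance from its root to its
    leaves is below `threshold`; the first leaf is kept as representative.
    """
    if not node["children"]:
        return dict(node, children=[])
    dists = leaf_distances(node)
    if sum(dists) / len(dists) < threshold:
        names = leaf_names(node)
        label = f"{names[0]} (+{len(names) - 1} collapsed)"
        return {"name": label, "length": node["length"], "children": []}
    return {
        "name": node["name"],
        "length": node["length"],
        "children": [collapse(c, threshold) for c in node["children"]],
    }

# Toy tree: A and B are near-identical, C is distant.
tree = {"name": None, "length": 0.0, "children": [
    {"name": None, "length": 0.30, "children": [leaf("A", 0.01), leaf("B", 0.02)]},
    leaf("C", 0.40),
]}

print(leaf_names(collapse(tree, threshold=0.05)))
```

Working top-down means the largest clade that satisfies the threshold is collapsed as a whole rather than piece by piece.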

11.5 years ago by Andrzej Zielezinski 11k

score 1 · Answer 1 · 2014-02-03

I'll preface my answer with "it depends." If you were looking at strains of bacteria, for instance, the 90% cut-off might be too low for the question you are trying to answer. But for most applications of phylogenetics, collapsing at 90% sequence identity is generally considered fairly routine. If you need to prune down your number of taxa further, the suggestion by @a.zielezinski is worth looking into. Generally, what you want to do is prune taxa when you need to make the dataset more manageable in size for alignment and phylogeny estimation, while retaining as much real sequence diversity as possible.
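For a feel of what a 90% identity cut-off does to an alignment, here is a small Python sketch of a greedy filter. It is not a reimplementation of cd-hit, which works on unaligned sequences with its own identity definition. The function names, the gap handling, and the toy sequences are assumptions made for illustration only.

```python
def identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def greedy_filter(aligned, cutoff=0.9):
    """Keep a sequence only if it is at most `cutoff` identical to every kept one.

    `aligned` maps names to equal-length aligned sequences. Sequences with
    fewer gaps are considered first, so they become the representatives.
    """
    order = sorted(aligned, key=lambda name: aligned[name].count("-"))
    kept = []
    for name in order:
        if all(identity(aligned[name], aligned[k]) <= cutoff for k in kept):
            kept.append(name)
    return kept

# Toy alignment: s2 differs from s1 at one of 20 positions (95% identity).
aligned = {
    "s1": "MKTAYIAKQRQISFVKSHFS",
    "s2": "MKTAYIAKQRQISFVKSHFT",
    "s3": "MKSGYLAKHRELSFIRSHWS",
}

print(greedy_filter(aligned, cutoff=0.9))
```

Raising the cutoff (say to 0.97 for closely related bacterial strains) keeps more near-duplicates, which is the "it depends" trade-off between dataset size and retained diversity.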