Reduce redundancy of alignment based on taxonomy and identity
8.9 years ago
fhsantanna ▴ 620

I would like to reduce the redundancy of my alignment based on taxonomy and identity.

I know that CD-HIT can do this based on identity, but is there a way to also take taxonomy into account? My concern is to preserve very similar sequences that may have been transferred between different taxa. If you know of a script that does this, I would appreciate it.

Here is my idea, although I suspect it is not very clever. There should be a simpler way...

First, split the sequences into separate files based on taxonomy, let's say by genus.

For each file, run CD-HIT with a chosen identity threshold.

Then concatenate the CD-HIT output files.
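The three steps above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: it assumes the FASTA headers end with an NCBI-style `[Genus species]` tag, that `cd-hit` is installed and on the PATH, and that a 0.9 identity threshold is wanted. The function names (`read_fasta`, `genus_of`, `split_by_genus`, `cd_hit_command`, `reduce_redundancy`) are hypothetical helpers made up for illustration; only the `-i`, `-o` and `-c` options belong to CD-HIT itself.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def genus_of(header):
    """Assumption: the header ends with an NCBI-style '[Genus species]' tag."""
    match = re.search(r"\[([^\]\s]+)[^\]]*\]\s*$", header)
    return match.group(1) if match else "unknown"


def split_by_genus(fasta_path, out_dir):
    """Step 1: write one FASTA file per genus and return {genus: path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    groups = defaultdict(list)
    for header, seq in read_fasta(fasta_path):
        groups[genus_of(header)].append((header, seq))
    paths = {}
    for genus, records in groups.items():
        safe_name = re.sub(r"\W", "_", genus)  # keep file names harmless
        paths[genus] = out_dir / f"{safe_name}.fasta"
        with open(paths[genus], "w") as out:
            for header, seq in records:
                out.write(f">{header}\n{seq}\n")
    return paths


def cd_hit_command(in_path, out_path, identity=0.9):
    """Build the cd-hit call: -i input, -o output, -c identity threshold."""
    return ["cd-hit", "-i", str(in_path), "-o", str(out_path), "-c", str(identity)]


def reduce_redundancy(fasta_path, work_dir, final_path, identity=0.9):
    """Steps 2 and 3: cluster each genus separately, then concatenate."""
    with open(final_path, "w") as final:
        for genus, path in split_by_genus(fasta_path, work_dir).items():
            clustered = path.with_name(path.stem + ".nr.fasta")
            subprocess.run(cd_hit_command(path, clustered, identity), check=True)
            final.write(clustered.read_text())
```

A call such as `reduce_redundancy("hits.fasta", "by_genus", "nonredundant.fasta")` (file names are placeholders) would then leave the per-genus representatives in one file. Note that CD-HIT thresholds below 0.7 also need a smaller word size via `-n`, which this sketch does not set.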

Any suggestions?

CD-HIT alignment redundancy • 2.5k views

I think it's reasonable to go about it the way you described: separate on taxonomy first and combine the cluster outputs afterwards. This makes sense because you want to keep sequences that may have been transferred between species; otherwise they would be collapsed into the same cluster.


"I would like to reduce the redundancy of my alignment based on taxonomy and identity."

You lost me at the first sentence. Can you clarify this, by perhaps expanding that sentence into a paragraph?


I want to do a comprehensive phylogenetic reconstruction of a particular protein family.

Suppose I have blasted my protein of interest against an NCBI database (RefSeq or nr). Even after filtering the results by E-value, I would still obtain thousands of proteins. Many of them would have little phylogenetic value because they are too similar to each other, and they could interfere with the phylogenetic reconstruction and its interpretation. For example, imagine I have ten proteins from the genus Escherichia (or from a higher-level taxon) that share more than 90% identity with each other. To decrease the complexity of the input data for phylogenetic reconstruction, I would pare down these sequences, keeping only an "archetype ortholog" of Escherichia. This way, I would expect to reduce my input data to hundreds of proteins, not only to speed up computation but also to ease my interpretation.
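To see which sequences each retained "archetype" stands in for, the `.clstr` file that CD-HIT writes next to its output can be read back. The following is a minimal sketch, assuming the usual `.clstr` layout in which each cluster starts with a `>Cluster N` line and the representative is marked with `*`; the function name `parse_clstr` is hypothetical.

```python
import re

# A member line looks like "0<TAB>120aa, >seqA... *" (representative)
# or "1<TAB>118aa, >seqB... at 95.00%" (redundant member).
MEMBER = re.compile(r">(.+?)\.\.\. (\*|at .+)$")


def parse_clstr(lines):
    """Return {representative_id: [ids of the members it replaces]}."""
    clusters = {}
    rep, members = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            # A new cluster begins: store the previous one, if any.
            if rep is not None:
                clusters[rep] = members
            rep, members = None, []
            continue
        match = MEMBER.search(line)
        if not match:
            continue
        if match.group(2) == "*":
            rep = match.group(1)
        else:
            members.append(match.group(1))
    if rep is not None:
        clusters[rep] = members
    return clusters
```

Passing an open file handle, for example `parse_clstr(open("Escherichia.nr.fasta.clstr"))` (a placeholder file name), gives one entry per kept sequence. Keep in mind that CD-HIT may truncate sequence names in the `.clstr` file, so the IDs might need to be matched back to the full headers.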
