Question

Cleaning up .fasta files by removing redundant sequences pre-alignment?

5.0 years ago

Tbr • 0

I have a selection of 16S sequences derived from different species, grouped into separate FASTA files according to the genus each sequence came from. I would like to align the sequences in each file to look for regions that are conserved within each genus. However, some of the shorter sequences are contained entirely within longer ones. Is there any software that can clean up these data by removing such redundant sequences before alignment? I do not currently have access to much computational memory, so removing any extraneous data would be of great benefit.

Any advice would be greatly appreciated!

alignment • 1.4k views

score 2 · Accepted Answer · 2019-12-04

5.0 years ago

Mensur Dlakic ★ 28k

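CD-HIT is designed for exactly this purpose. It clusters sequences that exceed a chosen identity threshold and keeps only the longest sequence of each cluster as its representative. Since 16S sequences are nucleotide data, use cd-hit-est, the nucleotide program in the CD-HIT package (plain cd-hit is meant for proteins):

cd-hit-est -i input.fas -o input.99 -c 0.99 -n 10

This will remove sequences at 99% identity - you may need to adjust that threshold. The word size -n has to suit the threshold; for cd-hit-est, 10 is appropriate for thresholds of 0.95 and above. The representative sequences are written to input.99 and the cluster membership to input.99.clstr.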
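If only the exact case from the question matters (a shorter sequence that appears verbatim inside a longer one), a small script can do the filtering without any identity threshold. The following is a minimal sketch in Python, not part of CD-HIT; the function names `read_fasta`, `drop_contained`, and `write_fasta` are made up for this illustration. It assumes plain-text FASTA input, compares the forward strand only, ignores case, and uses a simple all-against-kept comparison that should be adequate for a few thousand 16S sequences but is not tuned for large datasets.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_contained(records):
    """Drop every sequence that is an exact substring of an already kept one.

    Exact duplicates are also reduced to a single copy.
    """
    kept = []
    # Process longest first, so anything that could contain the current
    # sequence has already been examined.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        query = seq.upper()
        if not any(query in other.upper() for _, other in kept):
            kept.append((header, seq))
    return kept


def write_fasta(records, path, width=80):
    """Write (header, sequence) tuples as FASTA, wrapping sequence lines."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f">{header}\n")
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
```

A call such as `write_fasta(drop_contained(read_fasta("input.fas")), "input.nr.fas")` would write the filtered set, where both file names are placeholders. Unlike clustering at 99% identity, this keeps near-identical sequences that differ by even one base, so it removes less but never merges distinct variants.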

Oh yes, that is exactly what I was looking for. I knew it must exist somewhere! Thank you.

Tbr • 5.0 years ago