8.5 years ago
sudarshan1993
Hi,
I have a FASTA file with about 500 protein sequences compiled from the results of BLAST searches. Many of these sequences are quite similar to each other.
I would like to trim this FASTA file so that only one copy of each group of highly similar sequences is left behind.
My approach so far has been to use the pairwise alignment tool in Biopython, but this quickly becomes intractable, as comparing every sequence against every other would take about 250,000 alignments (500 x 500).
Are there any alternatives/better methods to go about this?
Thanks!
CD-HIT is a popular choice to do this clustering.
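A minimal sketch of how such a run could be driven from Python, assuming the cd-hit executable is installed and on the PATH. The file names and the 0.9 identity threshold are illustrative assumptions, not values from this thread; -i, -o, -c and -n are cd-hit's input, output, identity-threshold and word-size options.

```python
import shutil
import subprocess


def build_cdhit_command(fasta_in, fasta_out, identity=0.9, word_size=5):
    """Build the cd-hit argument list; nothing is executed here.

    identity  -> cd-hit -c (sequence identity threshold, 0.9 = 90%)
    word_size -> cd-hit -n (5 is the usual choice for thresholds of 0.7 and above)
    """
    return [
        "cd-hit",
        "-i", str(fasta_in),    # input FASTA (hypothetical file name)
        "-o", str(fasta_out),   # output FASTA of cluster representatives
        "-c", str(identity),
        "-n", str(word_size),
    ]


def run_cdhit(fasta_in, fasta_out, identity=0.9):
    """Run cd-hit, failing early if the executable cannot be found."""
    if shutil.which("cd-hit") is None:
        raise RuntimeError("cd-hit not found on PATH")
    subprocess.run(build_cdhit_command(fasta_in, fasta_out, identity), check=True)


# Example (assumed file names):
# run_cdhit("blast_hits.fasta", "blast_hits_nr.fasta", identity=0.9)
```

The output FASTA keeps one representative per cluster, and cd-hit also writes a .clstr file next to it listing which sequences were grouped together.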
CD-HIT worked great, thanks!
How did you compile your sequences from blast?
I just used the BLAST tools in Biopython.
Are you looking for the sequence most similar to each database hit? If you used tabular output, you can go back to your BLAST results and group all the sequences that hit the same database subject sequence. Then, within each group, pick the one with the highest alignment percentage, the fewest mismatches, the lowest e-value, etc. Also, -max_target_seqs 1 should help you here to get one hit per query.
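The grouping described above could look roughly like this. It is a sketch that assumes the default 12-column tabular layout (-outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the function name and the tie-breaking order are my own choices.

```python
def best_query_per_subject(lines):
    """Map each subject ID to the query that matches it best.

    Ranking: highest percent identity, then fewest mismatches,
    then lowest e-value.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines (as in -outfmt 7)
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])
        mismatch = int(fields[4])
        evalue = float(fields[10])
        # Smaller key means a better hit; identity is negated so higher wins.
        key = (-pident, mismatch, evalue)
        if sseqid not in best or key < best[sseqid][0]:
            best[sseqid] = (key, qseqid)
    return {subject: query for subject, (_, query) in best.items()}


# Example (assumed file name):
# with open("blast_results.tsv") as handle:
#     keep = best_query_per_subject(handle)
```

The values of the returned dictionary are the query IDs to keep, one per database subject.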
You don't have to iterate 250000 times; the maximum number of comparisons is 124750 (you compare the first sequence against the remaining 499, then the second against the remaining 498, and so on, which adds up to 500 x 499 / 2).
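To check the count, and to loop over each pair exactly once, something like the following should work. It is a small sketch using only the standard library; the helper names are made up for illustration.

```python
from itertools import combinations
from math import comb


def count_pairs(n):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return comb(n, 2)


def unique_pairs(ids):
    """Yield each unordered pair of IDs once, never pairing an ID with itself."""
    return combinations(ids, 2)


print(count_pairs(500))  # 124750

# Each pair appears once, so (a, b) is visited but (b, a) is not:
for first, second in unique_pairs(["seqA", "seqB", "seqC"]):
    print(first, second)
```

The pairwise alignment call would go inside the loop, so each pair of sequences is aligned only once.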