I am trying to reduce the size of a FASTA file that I got from the BLAST database archive. Some of the FASTA files they post already have identical sequences removed, but that still leaves a lot of very similar sequences. For example, I am working with "nt" and there are a lot of sequences in there that are very minor variations of each other or are overlapping. Is there a good way to combine those and eliminate "duplicate" entries?
Yes, the standard tools for this are CD-HIT (cd-hit-est for nucleotide data), USEARCH, and VSEARCH: they cluster sequences at a chosen identity threshold and keep one representative per cluster. The catch is that none of them copes well with an input the size of nt; memory use and runtime become prohibitive on a single machine.
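Just so the baseline is clear, a single-pass run looks roughly like this. This is only a sketch: the file names, identity threshold, and resource limits below are placeholder assumptions, not recommendations, and it assumes cd-hit-est is installed and on your PATH.

```python
# Minimal single-pass clustering sketch (placeholder paths and parameters).
import subprocess

subprocess.run(
    [
        "cd-hit-est",
        "-i", "nt.fasta",        # input FASTA (placeholder path)
        "-o", "nt.nr95.fasta",   # output with one representative per cluster
        "-c", "0.95",            # collapse sequences that are >= 95% identical
        "-n", "10",              # word size appropriate for thresholds near 0.95
        "-M", "16000",           # memory cap in MB; nt will likely exceed this
        "-T", "8",               # worker threads
    ],
    check=True,
)
```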
Maybe you could use some kind of iterative strategy to make it more manageable? For example, break the database into 100 parts, remove the redundancy within each part, concatenate the reduced parts, and then run one more redundancy-removal pass over the combined output, roughly like the sketch below.
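Here is a rough sketch of that split/cluster/merge idea, again driving cd-hit-est; the chunk count, identity threshold, and file names are assumptions you would tune to your data and hardware.

```python
# Split -> cluster each chunk -> merge representatives -> final clustering pass.
import shutil
import subprocess
from pathlib import Path

INPUT = Path("nt.fasta")   # placeholder input
N_CHUNKS = 100             # placeholder chunk count
IDENTITY = "0.95"          # placeholder identity threshold
WORKDIR = Path("chunks")
WORKDIR.mkdir(exist_ok=True)


def split_fasta(path, n_chunks):
    """Distribute FASTA records round-robin into n_chunks smaller files."""
    parts = [WORKDIR / f"part_{i:03d}.fasta" for i in range(n_chunks)]
    handles = [open(p, "w") for p in parts]
    try:
        idx = -1
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    idx += 1  # new record: rotate to the next chunk
                handles[idx % n_chunks].write(line)
    finally:
        for h in handles:
            h.close()
    return parts


def cdhit(infile, outfile):
    """Run one cd-hit-est pass (placeholder memory/thread settings)."""
    subprocess.run(
        ["cd-hit-est", "-i", str(infile), "-o", str(outfile),
         "-c", IDENTITY, "-n", "10", "-M", "16000", "-T", "8"],
        check=True,
    )


# 1) Cluster each chunk independently.
chunk_outputs = []
for part in split_fasta(INPUT, N_CHUNKS):
    out = part.with_suffix(".nr.fasta")
    cdhit(part, out)
    chunk_outputs.append(out)

# 2) Concatenate the per-chunk representatives.
merged = Path("merged.fasta")
with open(merged, "w") as out_fh:
    for f in chunk_outputs:
        with open(f) as in_fh:
            shutil.copyfileobj(in_fh, out_fh)

# 3) One final clustering pass over the merged representatives.
cdhit(merged, Path("nt.nr.final.fasta"))
```

One caveat: near-identical sequences that land in different chunks only get collapsed in the final pass, so that last run still has to fit in memory. If it doesn't, you could repeat the split-and-merge step hierarchically until it does.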
It really is a huge database, though, so there may not be a computationally feasible way to do this on a single machine; you might simply need a compute cluster. I'm not sure.