I am trying to reduce the size of a FASTA file that I got from the BLAST database archive. Some of the FASTA files they post already have identical sequences removed, but that still leaves a lot of very similar sequences. For example, I am working with "nt" and there are a lot of sequences in there that are very minor variations of each other or are overlapping. Is there a good way to combine those and eliminate "duplicate" entries?
Yes, the standard tools for this are CD-HIT (cd-hit-est for nucleotide data), USEARCH, and VSEARCH: they cluster sequences at a chosen identity threshold and keep one representative per cluster. The catch is that none of them copes well with an input the size of nt; memory use and runtime become prohibitive on a single machine.
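Just so the baseline is clear, a single-pass run looks roughly like this. This is only a sketch: the file names, identity threshold, and resource limits below are placeholder assumptions, not recommendations, and it assumes cd-hit-est is installed and on your PATH.

```python
# Minimal single-pass clustering sketch (placeholder paths and parameters).
import subprocess

subprocess.run(
    [
        "cd-hit-est",
        "-i", "nt.fasta",        # input FASTA (placeholder path)
        "-o", "nt.nr95.fasta",   # output with one representative per cluster
        "-c", "0.95",            # collapse sequences that are >= 95% identical
        "-n", "10",              # word size appropriate for thresholds near 0.95
        "-M", "16000",           # memory cap in MB; nt will likely exceed this
        "-T", "8",               # worker threads
    ],
    check=True,
)
```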
Maybe you could use some kind of iterative strategy to make it more manageable? For example, break the database into 100 parts, remove the redundancy within each part, concatenate the reduced parts, and then run one more redundancy-removal pass over the combined output, roughly like the sketch below.
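Here is a rough sketch of that split/cluster/merge idea, again driving cd-hit-est; the chunk count, identity threshold, and file names are assumptions you would tune to your data and hardware.

```python
# Split -> cluster each chunk -> merge representatives -> final clustering pass.
import shutil
import subprocess
from pathlib import Path

INPUT = Path("nt.fasta")   # placeholder input
N_CHUNKS = 100             # placeholder chunk count
IDENTITY = "0.95"          # placeholder identity threshold
WORKDIR = Path("chunks")
WORKDIR.mkdir(exist_ok=True)


def split_fasta(path, n_chunks):
    """Distribute FASTA records round-robin into n_chunks smaller files."""
    parts = [WORKDIR / f"part_{i:03d}.fasta" for i in range(n_chunks)]
    handles = [open(p, "w") for p in parts]
    try:
        idx = -1
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    idx += 1  # new record: rotate to the next chunk
                handles[idx % n_chunks].write(line)
    finally:
        for h in handles:
            h.close()
    return parts


def cdhit(infile, outfile):
    """Run one cd-hit-est pass (placeholder memory/thread settings)."""
    subprocess.run(
        ["cd-hit-est", "-i", str(infile), "-o", str(outfile),
         "-c", IDENTITY, "-n", "10", "-M", "16000", "-T", "8"],
        check=True,
    )


# 1) Cluster each chunk independently.
chunk_outputs = []
for part in split_fasta(INPUT, N_CHUNKS):
    out = part.with_suffix(".nr.fasta")
    cdhit(part, out)
    chunk_outputs.append(out)

# 2) Concatenate the per-chunk representatives.
merged = Path("merged.fasta")
with open(merged, "w") as out_fh:
    for f in chunk_outputs:
        with open(f) as in_fh:
            shutil.copyfileobj(in_fh, out_fh)

# 3) One final clustering pass over the merged representatives.
cdhit(merged, Path("nt.nr.final.fasta"))
```

One caveat: near-identical sequences that land in different chunks only get collapsed in the final pass, so that last run still has to fit in memory. If it doesn't, you could repeat the split-and-merge step hierarchically until it does.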
It really is a huge database, though, so there may not be a computationally feasible way to do this on a single machine; you might simply need a compute cluster. I'm not sure.