Regarding sequence dereplication with vsearch, I have seen the following statement:
"During dereplication, strictly identical sequences are grouped and receive the name of the first sequence of the group."
Now, I'm not exactly an expert on hash tables, so how do I know which sequence counts as the first of the group? Is it the one that occurs first in the input FASTA file? If so, that would make things easy for me, because some of the sequences have important designations in their headers that must not get lost, so that they show up in BLAST results. Or is it more complicated? I ask because I am creating a custom database composed of FASTA files originating from different sources.
If you want to retain the descriptions in the headers (whether the sequences are duplicates or not), you will have to keep them explicitly. It sounds like you need to merge the headers of identical sequences, so that only one copy of each sequence is kept but it carries all of the headers?
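To make the header-merging idea concrete, here is a minimal sketch in plain Python rather than VSEARCH. The function names `parse_fasta` and `merge_identical` and the `;` separator are my own choices for illustration, not part of any tool. It treats sequences as identical when they match exactly after upper-casing, and it keeps groups in order of first appearance.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_identical(records, sep=";"):
    """Group identical sequences and join all of their headers.

    `sep` is an arbitrary choice (assumption); pick one that does not
    occur in your headers and that your downstream tools tolerate.
    """
    groups = {}  # sequence -> list of headers, in order of first appearance
    for header, seq in records:
        groups.setdefault(seq.upper(), []).append(header)
    return [(sep.join(headers), seq) for seq, headers in groups.items()]


if __name__ == "__main__":
    example = [">a special", "ACGT", ">b", "acgt", ">c", "TT", "GG"]
    for header, seq in merge_identical(parse_fasta(example)):
        print(f">{header}\n{seq}")
```

Keep in mind that BLAST databases usually treat only the text up to the first whitespace as the sequence ID, so a merged header with spaces in it will end up mostly in the description. Whether that is acceptable depends on how you read the results.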
Yes, that would be a solution. But if that is not (easily) possible, I can arrange the input FASTA so that all the headers I need to keep are at the top.
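For what it's worth, the reordering could be sketched like this. It only helps under the assumption that the dereplicator really does keep the header of the first occurrence in input order, which is exactly the open question here. The `TYPE_STRAIN` marker and the `put_flagged_first` name are made up for illustration; records are plain (header, sequence) tuples.

```python
def put_flagged_first(records, is_flagged):
    """Stable reorder: flagged records first, original order kept otherwise."""
    # sorted() is stable, and False sorts before True, so flagged records
    # (key False) move to the front without being shuffled among themselves.
    return sorted(records, key=lambda rec: not is_flagged(rec[0]))


records = [
    ("seq1", "ACGT"),
    ("seq2 TYPE_STRAIN", "ACGT"),  # hypothetical important designation
    ("seq3", "GGCC"),
]
reordered = put_flagged_first(records, lambda header: "TYPE_STRAIN" in header)

for header, seq in reordered:
    print(f">{header}\n{seq}")
```

Here the flagged copy of `ACGT` now precedes the unflagged one, so a first-occurrence rule would keep its header.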
Are those headers attached to unique sequences, though? A deduplicating program pays no attention to the headers, so you may still lose some of them.
I'm not sure how many duplicates there are, but perhaps you could just run the search first and then handle the duplicates in post-processing/parsing?
The sequences with these special designations in the headers could be duplicates of other sequences that don't have them, so no, they're not necessarily for unique sequences. So then, how do I implement your original suggestion? (And if it's not possible with VSEARCH, please feel free to suggest another dereplicating program with which it is.)
If you mean to deal with the duplicates after BLASTing, I'll consider it if I can't find an easier way. Some BLAST programs only keep the top hit.
You should be able to keep as many hits as you want.