Question

Dereplication of FASTA file with VSEARCH

0

7.4 years ago

lvogel ▴ 30

Regarding sequence dereplication with vsearch, I have seen the following statement:

"During dereplication, strictly identical sequences are grouped and receive the name of the first sequence of the group."

Now, I'm not exactly an expert on hash tables, so how do I know which sequence is the first one in the group? Is it the one that occurs first in the input FASTA file? If so, that would make things easy for me, because some of the sequences have important designations in their headers that must not get lost, so that they show up in BLAST results. Or is it more complicated? I ask because I am creating a custom database composed of FASTA files originating from different sources.

vsearch • 4.4k views

ADD COMMENT • link updated 6.0 years ago by Biostar 20 • written 7.4 years ago by lvogel ▴ 30

0

If you want to retain the descriptions in the headers (whether the sequences are duplicates or not), you will have to keep them. It sounds like you need to merge the headers from multiple sequences (where the sequence is identical), so that only one copy of the sequence (but with multiple headers) is kept?

ADD REPLY • link 7.4 years ago by GenoMax 148k

0

Yes, that would be a solution. But if that is not (easily) possible, I could arrange the input FASTA so that all the headers I need to not lose are at the top.

ADD REPLY • link 7.4 years ago by lvogel ▴ 30

0

"all the headers I need to not lose are at the top"

Are those headers for unique sequences, though? A deduplicating program is not going to pay attention to the headers, so you may still lose some of them.

Not sure how many duplicates there are but perhaps you could just do the search first and then handle the duplicates in post-processing/parsing?

ADD REPLY • link 7.4 years ago by GenoMax 148k

0

The sequences with these special designations in the headers could be duplicates of other sequences that don't have them, so no, they're not necessarily for unique sequences. So then, how do I implement your original suggestion? (And if it's not possible with VSEARCH, please feel free to suggest another dereplicating program with which it is.)
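For reference, the header-merging idea can be sketched in plain Python, independent of VSEARCH. This is only a minimal illustration, not how VSEARCH itself works: the function names, the "|||" separator, and the case-insensitive comparison are all assumptions made for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def derep_merge_headers(lines, sep="|||"):
    """Group strictly identical sequences and join all of their headers.

    Sequences are compared after upper-casing (an assumption of this sketch).
    Groups come out in order of first appearance in the input, and within
    a group the headers keep their input order.
    """
    groups = {}  # dicts preserve insertion order in Python 3.7+
    for header, seq in read_fasta(lines):
        groups.setdefault(seq.upper(), []).append(header)
    return [(sep.join(headers), seq) for seq, headers in groups.items()]


def write_fasta(records, handle):
    """Write (header, sequence) pairs as single-line FASTA records."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")
```

With an open input file, `derep_merge_headers(open("input.fasta"))` would return one record per distinct sequence, with every original header preserved in the merged header line ("input.fasta" is a placeholder name).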

ADD REPLY • link 7.4 years ago by lvogel ▴ 30

0

"perhaps you could just do the search first and then handle the duplicates in post-processing/parsing?"

If you mean to deal with the duplicates after BLASTing, I'll consider it if I can't find an easier way. Some BLAST programs only keep the top hit.

ADD REPLY • link 7.4 years ago by lvogel ▴ 30

1

"Some BLAST programs only keep the top hit."

You should be able to keep as many as you want.

ADD REPLY • link 7.4 years ago by GenoMax 148k

score 2 · Accepted Answer · 2017-08-25

2

7.4 years ago

lelle ▴ 830

You can use the --uc filename option to get a detailed table that tells you which sequences are represented by which sequence. Of course, mapping that back to your BLAST results will take some work...
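As a rough sketch of that mapping step, the .uc table can be read with a few lines of Python. This assumes the usual tab-separated, 10-column UCLUST-style layout, where "S" rows name a representative in column 9 and "H" rows give the duplicate in column 9 and its representative in column 10. The function names are made up for the example, and whether labels are truncated at the first space depends on how vsearch was run, so check against your own file.

```python
def parse_uc(lines):
    """Return {representative_label: [duplicate_labels]} from .uc lines."""
    clusters = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or fields[0].startswith("#"):
            continue  # skip comments and malformed rows
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "S":
            # Representative (first-seen) sequence of a group
            clusters.setdefault(query, [])
        elif record_type == "H":
            # Duplicate of an earlier representative
            clusters.setdefault(target, []).append(query)
        # "C" rows are per-cluster summaries and are not needed here
    return clusters


def labels_for_hit(clusters, hit_label):
    """Expand a BLAST hit label to every label it stands for."""
    return [hit_label] + clusters.get(hit_label, [])
```

A BLAST hit against a representative can then be expanded with `labels_for_hit(parse_uc(open("derep.uc")), hit)`, where "derep.uc" is a placeholder file name.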

EDIT: removed the suggestion to use --relabel_keep, because it actually has nothing to do with the problem.

ADD COMMENT • link 7.4 years ago by lelle ▴ 830

0

Thanks. But the --relabel_keep option appears to not be available in version 2.4.3. It's not letting me use it. ??

ADD REPLY • link 7.4 years ago by lvogel ▴ 30

0

So now, it allows "--relabel keep" (with space instead of underscore) but doesn't actually keep the labels besides the first one. I'm still trying to figure out how to get it to do what I want.

ADD REPLY • link 7.4 years ago by lvogel ▴ 30

1

The --relabel_keep option is working fine for me in version 2.4.3. "--relabel keep" should rename all your sequences to keep1, keep2, keep3 and so on. Anyway, I think I completely misinterpreted your original question. I don't know why I was thinking you were using hashes. Sorry. If you just run "vsearch --derep_fulllength", the sequences should not be renamed. But you will only have one name in the header (as you described) and will not know which sequences it represents. If you want to know which sequences each sequence in your output represents, you will have to use the file you get from the --uc option (as far as I know).
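A small Python wrapper for that run might look like the sketch below. It only uses the --derep_fulllength, --output and --uc options mentioned in this thread; the file names and function names are placeholders, and it assumes a vsearch executable is on the PATH.

```python
import shutil
import subprocess


def build_derep_command(fasta_in, fasta_out, uc_out, vsearch="vsearch"):
    """Build the argument list for a plain full-length dereplication run."""
    return [
        vsearch,
        "--derep_fulllength", fasta_in,  # input FASTA to dereplicate
        "--output", fasta_out,           # dereplicated FASTA
        "--uc", uc_out,                  # table of which sequence represents which
    ]


def run_derep(fasta_in, fasta_out, uc_out, vsearch="vsearch"):
    """Run vsearch, failing early if the executable cannot be found."""
    if shutil.which(vsearch) is None:
        raise FileNotFoundError(f"{vsearch!r} not found on PATH")
    subprocess.run(
        build_derep_command(fasta_in, fasta_out, uc_out, vsearch),
        check=True,  # raise CalledProcessError on a non-zero exit status
    )


if __name__ == "__main__":
    # Placeholder file names; adjust to your own data.
    run_derep("combined.fasta", "derep.fasta", "derep.uc")
```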

ADD REPLY • link 7.4 years ago by lelle ▴ 830

0

Thanks for the reply. You're right on all counts, as far as I now know, too. Before, I was confused too, and using the wrong version. I'll keep putting all the sequences with headers I want to keep at the top of the fasta, and I'll be able to tell from the uc table if it's not using the names I expected. I'll accept your answer now. :)
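The check against the uc table can be automated with something like the following sketch. It assumes the same 10-column .uc layout ("H" rows list a duplicate in column 9 and its representative in column 10) and a set of must-keep labels written exactly as they appear in the .uc file; the function name and example labels are hypothetical.

```python
def demoted_labels(uc_lines, important):
    """Return {important_label: representative_label} for every important
    label that the .uc table records as a duplicate ("H" row) of another
    sequence, i.e. whose name did not end up in the dereplicated output.
    """
    important = set(important)
    demoted = {}
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or fields[0] != "H":
            continue  # only duplicate rows matter for this check
        query, target = fields[8], fields[9]
        if query in important:
            demoted[query] = target
    return demoted
```

An empty result means every must-keep label that has duplicates was itself used as the representative name; anything returned shows which name replaced it.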

ADD REPLY • link 7.4 years ago by lvogel ▴ 30