I'm using Salmon's selective alignment mode (--validateMappings) with a decoy-aware index generated from gencode.transcriptome.fa and decoys.txt. I want to confirm that Salmon does not output quantifications for these decoy sequences; is that correct? The decoy sequences have names beginning with "GL....". As I understand it, they are only there to improve the quantifications for the sequences that are actually in the reference. Thanks in advance!
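For anyone wanting to check this on their own run: decoys are there to absorb reads rather than to be quantified, so none of the names in decoys.txt should show up in quant.sf. Below is a minimal sketch of that sanity check in Python. It assumes quant.sf is the usual tab-separated file with a "Name" column and that decoys.txt holds one decoy name per line; the function name is made up for illustration.

```python
import csv


def decoys_in_quant(quant_sf, decoys_txt):
    """Return the decoy names that appear in the Name column of quant.sf.

    An empty list means no decoy sequence was reported in the output.
    Assumes quant.sf is tab-separated with a header row containing "Name".
    """
    # One decoy name per line; blank lines are ignored.
    with open(decoys_txt) as fh:
        decoys = {line.strip() for line in fh if line.strip()}

    with open(quant_sf, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return sorted(row["Name"] for row in reader if row["Name"] in decoys)
```

Calling decoys_in_quant("quant.sf", "decoys.txt") (paths are placeholders) should give back an empty list if the index was built as decoy-aware.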
Can someone please explain to me what these decoy sequences are? What are the advantages of using salmon with the --decoys parameter? Can it also be used without them?
Hi Assa,
Sure. The decoy sequences are regions of the target genome that are similar in sequence to annotated transcripts. These are the regions of the genome most likely to cause mismapping (e.g. transcribed pseudogenes, etc.). There are 3 ways to run salmon: (a) with just the annotated transcriptome being indexed, (b) with the annotated transcriptome and a small set of decoys computed using MashMap to search transcripts against the genome, and (c) with the annotated transcriptome and the entire genome as decoy sequence.
Method (a) requires the fewest resources. Method (b) requires a good deal of resources to run the MashMap step, but the resulting index is similar in size to that of (a) and it avoids the most obvious cases of misalignment. Method (c) results in the largest index, but it's the most effective at avoiding potentially spurious mappings.
Salmon can be used without decoy sequences (and sometimes this is necessary; e.g. in a de novo assembly, there will likely be no possibility for decoys). It can also be run without decoys in reference organisms. It is simply the case that decoys help avoid certain cases of misalignment that can't be adjudicated with the transcriptome alone, and can therefore lead to somewhat more robust abundance estimates when unannotated sequences are expressed.
Thanks, that sounds logical and straightforward. Is this something specific to Salmon, or does Kallisto also have something similar?
This is specific to salmon.
Hi Rob, thanks for the explanation. For option (c), should the decoy genome sequences that are concatenated onto the end of the transcriptome file be masked in any way, or are they just plain old fasta scaffolds? Cheers!
Plain old fasta.
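For anyone setting up option (c), here is a rough sketch of the preparation step in Python. It only illustrates what is described above: the genome sequence names go into decoys.txt, and the unmasked genome FASTA is appended after the transcriptome. The function names and file names are made up for the example, and it assumes uncompressed FASTA files whose sequence name is the first whitespace-delimited token of each header.

```python
def fasta_names(path):
    """Return sequence names: the first whitespace-delimited token of each header."""
    names = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.append(fields[0])
    return names


def build_gentrome(transcripts_fa, genome_fa, gentrome_fa, decoys_txt):
    """Write decoys.txt and a combined FASTA with transcripts first, decoys last.

    Sequences are copied as-is (no masking). Returns the list of decoy names.
    """
    decoys = fasta_names(genome_fa)

    # A decoy must not share a name with a transcript, so fail early if one does.
    clash = set(decoys) & set(fasta_names(transcripts_fa))
    if clash:
        raise ValueError(f"names used by both transcripts and decoys: {sorted(clash)}")

    # One decoy name per line.
    with open(decoys_txt, "w") as out:
        out.writelines(f"{name}\n" for name in decoys)

    # Transcripts go first; the genome (decoy) sequences follow at the end.
    with open(gentrome_fa, "w") as out:
        for src in (transcripts_fa, genome_fa):
            last = "\n"
            with open(src) as fh:
                for line in fh:
                    out.write(line)
                    last = line
            # Keep the next header on its own line if the file lacked a final newline.
            if not last.endswith("\n"):
                out.write("\n")
    return decoys
```

With, say, build_gentrome("gencode.transcriptome.fa", "genome.fa", "gentrome.fa", "decoys.txt") (the genome and output names are placeholders), the two output files can then be given to salmon index, roughly as salmon index -t gentrome.fa -d decoys.txt -i <index_dir>.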