Question

Salmon indices differences



12 months ago

enee ▴ 20

I am trying to run Salmon locally on prostate cancer samples with this command:

salmon quant -i data/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/salmon_sa_index/default -l A -1 SRR21898893_1.fastq.gz -2 SRR21898893_2.fastq.gz --validateMappings --gcBias -o transcripts_quant

I downloaded the pre-built partial decoy (salmon_partial_sa_index) and full decoy (salmon_sa_index) indices from refgenie, following the Salmon documentation.

I first ran the command with the full decoy index and got the error: Exception : [std::bad_alloc]. This is probably a memory problem, since my MacBook has 16 GB of RAM and the salmon_sa_index is 18 GB. I then ran the command with the partial decoy index and got my counts successfully.

My question is: what exactly is the difference between the two indices, and how much do the results differ? Also, which flags should I use besides --gcBias?

salmon RNAseq index refgenie decoy • 590 views

updated 12 months ago by ATpoint 85k • written 12 months ago by enee ▴ 20

score 0 · Answer 1 · 2023-11-30

In theory, the full genome decoy is the best option since it includes the entire genome and therefore acts as a decoy for spurious mappings caused by DNA contamination. In simple terms: when you map only to the transcriptome but have DNA contamination, salmon might incorrectly map those reads to the transcriptome. With the full genome decoy, it recognizes that the read matches a genomic DNA region better and discards it rather than falsely counting it toward a transcript.

The partial decoy index is (in theory) very similar. For this index, one first runs a tool that identifies similarities between transcript sequences (exons) and genome sequences, and then includes only those genomic regions that share some sequence similarity with an exon. The idea is that salmon would only incorrectly quantify a DNA read against a transcript if the two sequences are somewhat similar. That way the index is much smaller than one that includes the entire genome, and the results should be very similar to those from the full genome decoy.
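If you ever get access to a machine with more memory, you could check the "very similar" claim on your own sample by quantifying it once with each index and comparing the two quant.sf files. Below is a minimal sketch of such a comparison. It relies only on the Name and NumReads columns that salmon writes to quant.sf. The function names and the min_diff threshold of 10 reads are my own choices for illustration, not part of salmon.

```python
import csv
import math


def read_num_reads(path):
    """Return {transcript name: NumReads} from a Salmon quant.sf file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Name"]: float(row["NumReads"]) for row in reader}


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def compare_quants(path_a, path_b, min_diff=10.0):
    """Compare NumReads of two runs on the transcripts present in both.

    min_diff is an arbitrary illustrative cutoff (in reads) for flagging
    a transcript as changed between the two runs.
    """
    a = read_num_reads(path_a)
    b = read_num_reads(path_b)
    shared = sorted(set(a) & set(b))
    log_a = [math.log1p(a[name]) for name in shared]
    log_b = [math.log1p(b[name]) for name in shared]
    changed = [name for name in shared if abs(a[name] - b[name]) >= min_diff]
    return {
        "shared": len(shared),
        "pearson_log1p": pearson(log_a, log_b),
        "changed": changed,
    }


# Example call (both output directories are hypothetical):
# compare_quants("quant_partial/quant.sf", "quant_full/quant.sf")
```

A correlation close to 1 and a short list of changed transcripts would support using the partial decoy index for your data. This only tells you how much the two runs agree with each other, not which one is closer to the truth.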

Then there is the index with only the transcriptome, which lacks the benefits described above. As always with these things, it's mainly a theoretical issue because we usually lack a solid ground truth for convincing benchmarks.

I personally always use the full genome decoy because I have enough computational resources that CPU and memory limits are never a concern. If you only have that MacBook, go with the partial decoy index. The results should be so similar that you would not notice a difference.
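To decide up front which index a given machine can handle, a rough check is to compare the on-disk size of the index directory with the physical RAM. The sketch below assumes that the on-disk size is a usable proxy for what salmon needs to load, which is only an approximation, and the 0.8 headroom factor is an arbitrary value I picked to leave room for everything else. The function names are hypothetical helpers, not part of salmon or refgenie.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def physical_memory_bytes():
    """Total physical RAM via POSIX sysconf (should work on Linux and macOS)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def index_fits_in_memory(index_dir, ram_bytes=None, headroom=0.8):
    """Return True if the index is no larger than headroom * RAM.

    headroom=0.8 is an assumed safety margin, not a salmon recommendation.
    """
    if ram_bytes is None:
        ram_bytes = physical_memory_bytes()
    return directory_size_bytes(index_dir) <= headroom * ram_bytes


# Example call (the directory name is a placeholder for wherever the index lives):
# index_fits_in_memory("salmon_sa_index/default")
```

With an 18 GB index and 16 GB of RAM this check returns False, which matches the std::bad_alloc seen with the full decoy index.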