I am analyzing differential gene expression for a plant based on 3'cDNA tag libraries generated from total RNA. This was started previously by another researcher where they used FASTX Toolkit for quality/trimming, BWA-MEM + HT-seq count, and then DESeq2 (back in 2015/2016).
I decided to go back and used the raw fastq files from the sequencing facility and instead use SortMeRNA + trimmomatic for quality/trimming, and then Salmon + tximport before using DESeq2.
What shocked me was the reads that mapped depending on the method. I get ~4 million reads that map via Salmon whereas the previous researcher had ~8 millions reads that mapped via BWA-MEM.
To troubleshoot, I used the FASTX Toolkit trimmed reads from the previous researcher and ran it through Salmon in case it was a filtering issue. I still had ~4 million reads compared to the ~8 million of these same files running through BWA-MEM.
I knew the two methods are very different and to expect some differences but by half causes much larger differences in the downstream results from DESeq2. Can someone lead me in the right direction with an explanation for this large of a difference or potentially what I am doing wrong (or maybe I am not doing something wrong?).
Not sure exactly what additional info people may need but I will be happy to elaborate if necessary!
Thank you! So to make sure I understand your suggestion correctly: I should take the output from BWA-MEM as well as the Salmon output and run them through qualimap to see if there is of target mapping?
I agree and thought the same thing about BWA and FASTX, which is why I went with SortMeRNA+trimmomatic and Salmom; however, I needed to justify my reasonings to my advisor- so thank you for the explanation about BWA not being splice aware. Is it ok to use Salmon, or would you recommend STAR or Hisat2 instead?
ExcludeEdit 02.11.19: Just realized that this is 3'-end cDNA libraries, so splicing is probably not too much of a factor, not sure about this. Sorry if causing confusion. I still would use an RNA-seq mapper to be sure. /Edit_end/.BWA mem
, it is not an RNA-seq aligner.Choice between
salmon
(or other tools likekallisto
) and traditional aligners are (from what I understand) active fields of research, but for the end-user a matter of preference (at least for me). I usesalmon
because it is fast and convenient (plus published, maintained, and accepted) and can be seamlessly integrated withtximport
to summarize the abundance estimates to the gene level. Use what you feel comfortable with, but if you choose traditional alignments, use splice-aware mappers. Qualimap (among other tools) can analyze how many reads actually overlap with exons (so on-target) or other genomic regions (off-targets, indicating contamination or library prep. issues).Edit-2: For a more detailed discussion see also this very recent preprint on exactly that matter from the group that develops
salmon
: https://www.biorxiv.org/content/10.1101/657874v2