Hi all,
We have recently conducted total RNA-seq (~200M reads) and mRNA-seq (~100M reads) in the same samples. Following the RNA-seq, we used Salmon selective alignment (SA) to align the reads to the Ensembl human transcriptome. This resulted in a comparable number of reads (~40M) available from both RNA-seq approaches for downstream DGE analyses.
It is reasonable to think that the exon parts of a pre-mRNA will be aligned to the reference transcriptome following total RNA-seq if the length of the exons is longer than the read length. However, the Salmon SA algorithms seem to have performed well, generating comparable transcript reads and strong correlations of the DGE genes identified from the total RNA-seq and the mRNA enriched sequencing.
I have been wondering what may be the reasonable explanations for these observations above. It is appreciated if you could shed some light into this.
Thanks Guan
Have you considered adding to the story with the following:
This should tell you a bit more about your library composition (adapters, sequencing quality), and specifically where they map proportionally to Exonic / Intronic / Intergenic / UTR regions. I'd expect that with your total RNA, you'd see more intronic reads from the pre-mRNA. It's interesting that only 40% of an mRNA library is mapping to transcripts though... but that could have tissue / kit specific explanations.
I'd be curious as to how Salmon's SA approach specifically deals with situations where there's intronic overhang in a read from pre-mRNA. I hope Rob could comment on that.
Downstream, Salmon has worked well in my experience when it gets to interpreting differential expression, even in noisy libraries.
Hi Andrew,
Yes, we have run FastQC, FastQ Screen, and RSeQC. Briefly, the reads quality score was >34 and no sample contamination following both RNA-seq. In the read distribution analysis, more reads were assigned to introns following total RNA-seq 37.4% vs. mRNA-seq 8.7%; %CDS_exons were 52.4% total RNA-seq and 74.8% mRNA-seq.
The HISAT2 mapping rates were 94.2% total RNA-seq (197.9M reads) and 95% mRNA-seq (110.4M reads). Salmon rates were 38.8% total RNA-seq (38.3M reads) and 81.9% mRNA-seq (45.1M reads). Please note, in total RNA-seq, rRNA and globin mRNA were depleted/reduced; while in mRNA-seq, no globin mRNA reduction conducted. I counted the globin encoding mRNAs (HBB, HBA1 and HBA2) following mRNA-seq, total reads of the 3 genes were ~7.8M; roughly coinciding with the difference between 45.1M and 38.3M.
Yes, it would be good to know how the Salmon SA handles possible pre-mRNA mapping to reference transcriptome.
Thanks Guan
As salmon maps against the provided txtome which is usually spliced you will need an extended transcriptome to keep intronic reads. Here is a tutorial on how to parse put it together. It is a single-cell velocity tutorial so only the actual index creation part is relevant: https://combine-lab.github.io/alevin-tutorial/2020/alevin-velocity/
Hisat alignes against the genome where introns are included, so these mapping rates are expected I'd say.
Thanks ATpoint. The aim here is not really to keep the intronic reads, but to understand how pre-mRNAs, where the length of exons is longer than the read length, are handled by Salmon SA algorithms during alignment; as these exon parts of pre-mRNAs may be mapped to the transcriptome along with the target mRNAs.
I'm assuming your Salmon mapping rates are for read pairs? If that's the case, then it sounds encouraging that your mRNA library is mostly coding transcript that Salmon maps, and less so for the ribo depleted total RNA.
Back to the root of your question... I think it's reasonable that pre-mRNA that isn't fully spliced should be quantified by Salmon where the read fully encompasses the exonic region. Where there's overhang between the exonic, and intronic region (which should be captured by the decoy), It's probably best for Rob to comment on the exact behaviour of Salmon, and if that uncertainty is captured in the EM algorithm.
Thanks Andrew, and yes, they were paired reads.