Hi, I have a question regarding mapping qualities and multimapping reads. I'm working with the Allen Brain Atlas TBI data, which can be found at https://aging.brain-map.org/rnaseq/search.
I'm confused about the different numbers provided. Taking the first sample as example.
According to the provided data:
- rnaseq_total_reads: 32,275,545
- rnaseq_percent_reads_aligned_to_mrna: 31.7%
- rnaseq_percent_reads_aligned_to_ncrna: 7.11%
- rnaseq_percent_reads_aligned_to_genome_only: 48.1%
These percentages add up to 86.91%, which corresponds to approximately 28,050,676 reads out of the 32,275,545 total reads. However, when I inspect the BAM file for this sample using samtools view -c, I find that there are 105,439,824 reads, significantly more than the 32 million.
I suspect that multimapping reads may be the cause of this discrepancy, so I examined the MAPQ values in the SAM file. Here are the percentages for this sample:
- MAPQ 0: 42%
- MAPQ 255: 29%
- MAPQ 100: 20%
- MAPQ 0-99: 7%
Combined, the percentages for MAPQ 100 and 0-99 almost add up to the expected 28 million reads, but not exactly. For most samples, it is close, but for others, it is off by around 10%.
Is this 105 million caused by the multi-mapping? Is this percentage of zeros not very high? Im doing differential gene expression analysis and some supervised classification. I suppose I need to filter those reads out first? Or is this common practice to leave them in?
Help and insights are much appreciated.
It says default settings for RSEM aligned to the transcriptome and then reads that did not map with Bowtie to a human reference genome "using default settings except for two mismatch parameters: bowtie-e (set to 500) and bowtie-m (set to 100).".
Im using qCount of the Quasar package for quantification. With no min and max quality set, the final count table includes 125m counts. Even more than reads contained in the bam file. So I guess most multi-mapped reads do map to genes? This is so much more than the reported 32m, that's why I'm worried.
Okay, so I believe it is likely due to Multimappers+Transcript Isoforms. I believe default RSEM mapping uses STAR with more or less default settings.
If it helps you, I use STAR to map 76 million reads, allowing reporting of up to 10 mulitmapping alignments, to the genome. I get 64 million uniquely mapped alignments, but 175 million total alignments (these will be due to mulitimappers, ie 12 million reads have been mapped to 111 million locations ). When I examine my Transcriptome aligned bam from the exact same output, I get 243 million transcript alignments. This will be caused by one alignment mapping to multiple transcript isoforms. However, keep in mind, with these numbers, the translation from genomic to transcriptomic coordinates will drop many of my multimapping reads, so it is likely less than 175 million alignments that map to 243 million transcript alignments.
If your reads are short, you may have more multimapping alignments to genes. I don't perform a lot of isoform analysis, so I don't know how these are usually handled in those downstream processes. Of note, my example above is using PE150, which is why I see very few multimappers to genic regions.
I would imagine the discrepancy between your final counts and the total alignments may be how your counting package assigns reads. To me, when using the transcriptome sam, you should maybe not assign reads to all overlapping genes, since each of these should have its own alignments already.
Also, I'm not sure this is the cleanest way to analyze this type of data.
Thank you for your insights.