Question

BAM file contains much more reads than expected

0

Entering edit mode

16 months ago

brian ▴ 10

Hi, I have a question regarding mapping qualities and multimapping reads. I'm working with the Allen Brain Atlas TBI data, which can be found at https://aging.brain-map.org/rnaseq/search.

I'm confused about the different numbers provided. Taking the first sample as example.

According to the provided data:

rnaseq_total_reads: 32,275,545
rnaseq_percent_reads_aligned_to_mrna: 31.7%
rnaseq_percent_reads_aligned_to_ncrna: 7.11%
rnaseq_percent_reads_aligned_to_genome_only: 48.1%

These percentages add up to 86.91%, which corresponds to approximately 28,050,676 reads out of the 32,275,545 total reads. However, when I inspect the BAM file for this sample using samtools view -c, I find that there are 105,439,824 reads, significantly more than the 32 million.

I suspect that multimapping reads may be the cause of this discrepancy, so I examined the MAPQ values in the SAM file. Here are the percentages for this sample:

MAPQ 0: 42%
MAPQ 255: 29%
MAPQ 100: 20%
MAPQ 0-99: 7%

Combined, the percentages for MAPQ 100 and 0-99 almost add up to the expected 28 million reads, but not exactly. For most samples, it is close, but for others, it is off by around 10%.

Is this 105 million caused by the multi-mapping? Is this percentage of zeros not very high? Im doing differential gene expression analysis and some supervised classification. I suppose I need to filter those reads out first? Or is this common practice to leave them in?

Help and insights are much appreciated.

rna-seq • 1.2k views

ADD COMMENT • link 16 months ago by brian ▴ 10

score 0 · Answer 1 · 2023-07-04

0

Entering edit mode

16 months ago

rfran010 ★ 1.3k

I am not familiar with that dataset, but it is very possibly due to multimappers. What are the mapping parameters?

Filtering them out depends on what you want to accomplish. For a basic analysis, you can include only unique reads. Sometimes you want to keep them, for example when looking at repetitive elements. However, if you are just doing DEG analysis, I imagine most of the multimapping reads don't actually map to genes anyways.

If you do keep multimapped alignments, then you would need to decide how to handle them. My preference is to fractionally assign them to aligned loci. Eg. if one read maps to five loci (or genes), then those loci would be given 0.2 reads each.

ADD COMMENT • link 16 months ago by rfran010 ★ 1.3k

0

Entering edit mode

It says default settings for RSEM aligned to the transcriptome and then reads that did not map with Bowtie to a human reference genome "using default settings except for two mismatch parameters: bowtie-e (set to 500) and bowtie-m (set to 100).".

Im using qCount of the Quasar package for quantification. With no min and max quality set, the final count table includes 125m counts. Even more than reads contained in the bam file. So I guess most multi-mapped reads do map to genes? This is so much more than the reported 32m, that's why I'm worried.

ADD REPLY • link 16 months ago by brian ▴ 10

0

Entering edit mode

Okay, so I believe it is likely due to Multimappers+Transcript Isoforms. I believe default RSEM mapping uses STAR with more or less default settings.

If it helps you, I use STAR to map 76 million reads, allowing reporting of up to 10 mulitmapping alignments, to the genome. I get 64 million uniquely mapped alignments, but 175 million total alignments (these will be due to mulitimappers, ie 12 million reads have been mapped to 111 million locations ). When I examine my Transcriptome aligned bam from the exact same output, I get 243 million transcript alignments. This will be caused by one alignment mapping to multiple transcript isoforms. However, keep in mind, with these numbers, the translation from genomic to transcriptomic coordinates will drop many of my multimapping reads, so it is likely less than 175 million alignments that map to 243 million transcript alignments.

If your reads are short, you may have more multimapping alignments to genes. I don't perform a lot of isoform analysis, so I don't know how these are usually handled in those downstream processes. Of note, my example above is using PE150, which is why I see very few multimappers to genic regions.

I would imagine the discrepancy between your final counts and the total alignments may be how your counting package assigns reads. To me, when using the transcriptome sam, you should maybe not assign reads to all overlapping genes, since each of these should have its own alignments already.

Also, I'm not sure this is the cleanest way to analyze this type of data.

ADD REPLY • link 16 months ago by rfran010 ★ 1.3k

1

Entering edit mode

Thank you for your insights.

ADD REPLY • link 16 months ago by brian ▴ 10

score 0 · Answer 2 · 2023-07-05

0

Entering edit mode

16 months ago

Buffo ★ 2.4k

However, when I inspect the BAM file for this sample using samtools view -c, I find that there are 105,439,824 reads, significantly more than the 32 million.

That means there are ~105 million alignments, not reads. So, yes, it is likely caused by multimappers.

Is this percentage of zeros not very high? Im doing differential gene expression analysis and some supervised classification. I suppose I need to filter those reads out first? Or is this common practice to leave them in?

From my experience, it depends on the species you're using and the quality of the reference genome. But yes, I would say it is high. Therefore, I wouldn't filter the BAM, I would increase the mapping quality/read length and consider only reads mapping concordantly (if you're using paired-end reads).

ADD COMMENT • link 16 months ago by Buffo ★ 2.4k

0

Entering edit mode

Thanks for your reply. The strange thing is no alignments get reported as secondary.

105439824 + 0 in total (QC-passed reads + QC-failed reads)
105439824 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
105439824 + 0 mapped (100.00% : N/A)
105439824 + 0 primary mapped (100.00% : N/A)
105439824 + 0 paired in sequencing
52719912 + 0 read1
52719912 + 0 read2
105439824 + 0 properly paired (100.00% : N/A)
105439824 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (map

And if I inspect the flags, those are only ,147, 163, 83, 99.

samtools view -F 0x4 sample_1.sorted.bam | cut -f 1 | sort | uniq | wc -l gives 29130547 reads.

How is this possible? I guess somewhere the reporting is not right. I see a lot that ~30m reads is a very common library size.

ADD REPLY • link 16 months ago by brian ▴ 10