Calculating percentage of reads mapped to a reference genome
1
2
Entering edit mode
6.2 years ago
rujiporn.sun ▴ 20

I am trying to calculate the percentage of reads aligned to a reference genome.

I used samtools flagstat and this is what I have

24887959 + 0 in total (QC-passed reads + QC-failed reads)
18203689 + 0 secondary
210060 + 0 supplementary
0 + 0 duplicates
22937586 + 0 mapped (92.16% : N/A)
6474210 + 0 paired in sequencing
3237105 + 0 read1
3237105 + 0 read2
2537654 + 0 properly paired (39.20% : N/A)
4022194 + 0 with itself and mate mapped
501643 + 0 singletons (7.75% : N/A)
1475914 + 0 with mate mapped to a different chr
736257 + 0 with mate mapped to a different chr (mapQ>=5)

However, I noticed that the total number of reads is different to the input fastq files (3237105 read 1 + 3237105 read 2 = 6474210). After some reading, the numbers reported in samtools flagstat also include multiple mapped reads.

So I try this:

samtools view -c -F 260 999.sorted.PE.bam

options

  -c  count reads and print the total number
  -F bitcode  skip reads with bits present in 'bitcode'
  -F 260  only output reads that are not unmapped (flag 4 is not set) and only primary alignment (flag 256 is not set)

4733897 is what I got so if I divide this number with the total read number from fastq files (6474210) I have 0.731 (x 100 = 73.1%).

Is this the correct number to report as a percentage of mapped read? Not the one that's reported in samtools flagstat (92.16%)?

alignment • 12k views
ADD COMMENT
0
Entering edit mode

What was the aligner used? It will probably have it's own, more accurate stats, especially about mapping (multimaped, one pair mapped etc etc).

ADD REPLY
0
Entering edit mode

Thank you for your suggestion. I used bwa mem. I will look into it.

ADD REPLY
5
Entering edit mode
6.2 years ago

I think you forgot to exclude supplementary alignments. If you do, you should find 69.9% mapping.

Number of input reads = total - secondary - supplementary 
                      = 24887959 - 18203689 - 210060 
                      = 6474210

Number of unmapped reads = total - mapped
                         = 24887959 - 22937586
                         = 1950373

Number of mapped reads (primary excluding supplementary and secondary) = input - unmapped
                         = 6474210 - 1950373
                         = 4523837

% mapping = mapped (only primary) / input*100
          = 4523837 / 6474210 *100
          = 69.9%
ADD COMMENT
0
Entering edit mode

-F 260 in samtools view should exclude unmapped reads and non-primary reads. At least, I know I am on the right direction. Thank you for your comments and calculations.

ADD REPLY

Login before adding your answer.

Traffic: 1882 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6