I am trying to calculate the percentage of reads aligned to a reference genome.
I used samtools flagstat
and this is what I have
24887959 + 0 in total (QC-passed reads + QC-failed reads)
18203689 + 0 secondary
210060 + 0 supplementary
0 + 0 duplicates
22937586 + 0 mapped (92.16% : N/A)
6474210 + 0 paired in sequencing
3237105 + 0 read1
3237105 + 0 read2
2537654 + 0 properly paired (39.20% : N/A)
4022194 + 0 with itself and mate mapped
501643 + 0 singletons (7.75% : N/A)
1475914 + 0 with mate mapped to a different chr
736257 + 0 with mate mapped to a different chr (mapQ>=5)
However, I noticed that the total number of reads is different to the input fastq files (3237105 read 1 + 3237105 read 2 = 6474210). After some reading, the numbers reported in samtools flagstat
also include multiple mapped reads.
So I try this:
samtools view -c -F 260 999.sorted.PE.bam
options
-c count reads and print the total number
-F bitcode skip reads with bits present in 'bitcode'
-F 260 only output reads that are not unmapped (flag 4 is not set) and only primary alignment (flag 256 is not set)
4733897 is what I got so if I divide this number with the total read number from fastq files (6474210) I have 0.731 (x 100 = 73.1%).
Is this the correct number to report as a percentage of mapped read? Not the one that's reported in samtools flagstat (92.16%)?
What was the aligner used? It will probably have it's own, more accurate stats, especially about mapping (multimaped, one pair mapped etc etc).
Thank you for your suggestion. I used bwa mem. I will look into it.