Hi everybody,
I want to compute the MAPQ distribution in a BAM file (ExomeSeq, paired-end reads, aligned with BWA). In order to extract only the field of interest, "mapq", from the BAM file I am using the R package Rsamtools. I have a couple of doubts about outputs:
a) Retrieving header information in the BAM file.
First of all, I used scanBamHeader(bamfile)[[1]][["targets"]]
to obtain the list of names of all references in the bam file header information.
In the retrieved list, in addition to chromosomes names (1,2,3,...X,Y), I found also labels such as GL000207.1, NC_007605 and MT. What is the meaning of these labels? Is it significative to include them when MAPQ distribution is calculated?
b) MAPQ scores.
Once setting params with: param<-ScanBamParam(what="mapq")
, the bamfile is imported in R with: bam<-ScanBam(file=bamfile,param=param)
.
the list bam[[1]]$mapq
contains MAPQ scores of reads in the BAM file.
I cannot understand the meaning of MAPQ when its value is "NA".
Many thanks for the help.
Dear Devon Ryan, thank you for your precise and very useful replies. Last thing about MAPQ of NA: As you correctly wrote, in BWA "a value 255 indicates that the mapping quality is not available". Do you know the possible read problems that cause this unavailability? Many thanks again.
That I don't know. I'm not overly familiar with the internals of BWA since I rarely use it (don't read that as a critique of BWA, which is a great aligner, I'm just more familiar with tweaking other aligners so I use them more frequently).
OK no problem, thank you.