Calculation of per-base average Quality (MAPQ, basecall quality Q20) from BAMs across multiple samples
0
0
Entering edit mode
12 weeks ago
Novice ▴ 10

Hi all, Sorry for the long question, but I am trying to be detailed for clarity. I am trying to process multiple short-read Illumina WGS BAMs (so large BAMs) and extract 3 elements summarized across all samples. The purpose of this is to evaluate regions of the human genome that are of "Good Quality" and "not-so good quality" and to see if this is consistent across samples.

  1. Average MAPQ at every base in the genome across all aligned reads, across all samples
  2. Percent of the base call exceeding Q20 (Phred-scaled quality score) at every base in the genome, across all samples
  3. Normalized depth at every base, across all samples (depth at a base/median autosomal depth).

Clearly, bp-level information across the whole human genome (hg38) across multiple samples would be computationally intensive and the output files may be large. So, I was thinking of doing this first for one sample at a time and try to learn from pattern of distribution in individual samples, before computing across multiple samples.

I am also new to this, so it took me some time to figure out that I can use samtools for this. It appears something like this can export all three information, but I am not sure how to post-processing the output.mpileup.txt

samtools mpileup -d 1000000 -q 0 -B -s -f hg38.fa input.bam > output.mpileup.txt

(-q0 because I want all reads to be considered)

or

samtools view input.bam

Given the complexity (at least to me this seems very difficult to do), I thought it might be better to start with a single sample BAM to output the following columns (first for one sample, and then also across samples)

  1. Chromosome
  2. Position
  3. Number of reads at that position
  4. Average MAPQ at that position (across all reads at that position)
  5. Number of low-quality reads at that position (MAPQ=0)
  6. Percent of the bases exceeding Q20 at that position
  7. Normalized Dept (depth at a base/median autosomal depth)

Then I want to iterate the above across multiple samples. So, all the per-base metrices (3-7) should be calculated across all samples. I probably have to access a computing cluster that can process one chromosome each, one chromosome per sample. Any help with this samtools code would be appreciated.

There is a very old Biostar question that is very similar to mine, but it appears the average is across a region and not per-base.

I also do have a question about computing averages for MAPQ and Q20 averages. Is it a sample arithmetic average or because these are probabilities on PHRED scale they have to be computed like this here.

Advanced thanks for helping with this.

MAPQ BAM coverage • 562 views
ADD COMMENT
0
Entering edit mode

Normalized depth at every base, across all samples (depth at a base/median autosomal depth).

Use tools like pandepth and mosdepth to do this.

Average MAPQ at every base in the genome across all aligned reads, across all samples

MAPQ values are for a given read alignment. They are not per base so not sure how you are going to calculate that.

ADD REPLY
0
Entering edit mode

GenoMax Thanks for the depth tools. I will take a look at them. With MAPQ, I want consider all reads that overlap a base to compute average MAPQ at that base. If there are 35 reads at a base, and all reads that overlap that base have MAPQ 60, then I want consider the average MAPQ as 60.

ADD REPLY

Login before adding your answer.

Traffic: 1855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6