Changing Quality Scores Within Bam Files
2
3
Entering edit mode
12.5 years ago
matted 7.8k

I have a set of bam files that include reads with quality scores in a mix of incompatible formats. This includes Solexa, Illumina 1.5, and Sanger encoding. The reads were aligned with bwa, so as far as I understand the misspecified quality scores didn't affect the alignment and were just copied through to the final bam file.

Now I'd like to call variants with these datasets using samtools mpileup, but I am stuck because the quality scores in all input files need to have the same encoding. That is, I am aware of the -6 flag to mpileup that would work if all the samples were in Illumina 1.5 format, but it's not applicable since I have a mix.

So my question is: can I do anything better than the brute force approaches of (1) fixing the original fastq files and realigning or (2) mucking through the bam files and changing quality scores myself with e.g. pysam?

bam samtools mpileup fastq • 10k views
ADD COMMENT
0
Entering edit mode

Maybe merging the three data sets isn't a good idea, I think that you cannot improve your variant calling because each technology has particular bias in errors and coverages. I suggest you to analyze each set independently and report the combination of high quality calls in each data set.

ADD REPLY
1
Entering edit mode

Point well taken, and I'll be on the lookout for that. In this instance, the mixing is more a function of data source (SRA with Sanger quality scores versus in-house sequencing with Illumina 1.5 quality scores) than technology (but certainly there are center-specific effects that might be at play).

ADD REPLY
1
Entering edit mode
12.5 years ago

You wouldn't need to go through pysam. Just convert the BAM to SAM then rescale the 11th column and convert back to BAM. You won't need to even sort again probably not even index, though the latter is usually not that time consuming.

ADD COMMENT
0
Entering edit mode

Thanks, this is basically what I ended up doing. Since the bams were compressed, I did need to re-index however... the changed quality scores can mildly affect compression and change file offsets.

ADD REPLY
1
Entering edit mode
12.5 years ago
Lee Katz ★ 3.2k

I'm not positive, but you can probably use samtools merge. It'd be easy to tell if it works after looking at IGV or Tablet.

But then again, is it is that hard to write a script that parses the quality scores column and adds or subtracts 33? Here's my untested/pseudo code:

samtools view file.bam | \
perl -lane 'if('/^#/){print; next;}  
  @qual=split //, $F[$qualIndex]; $_=chr(ord($_)+33) for(@qual); $F[$qualIndex]=join("",@qual); print join("\t",@F)' | \
samtools -bS > file.corrected.bam
ADD COMMENT
0
Entering edit mode

the principle is correct but it will be more than just adding 33 - they will first need to detect from the read id which format the data is in then convert this encoded value to a numerical quality value and then re-encode that into Sanger encoding.

ADD REPLY
0
Entering edit mode

Right, thank you for clarifying. The OP already knows which BAM file has which encoding, and so the one-liner will have to be run for each file that he/she wants to convert. The end result would be a set of BAM files with one encoding scheme for quality scores.

ADD REPLY
2
Entering edit mode

Thanks, that's correct. I targeted a specific re-scaling to each bam. samtools merge won't change the quality scores so it doesn't help. Just a quick clarification to your example, which is basically what I did, is to output the header so that the conversion back to bam works:

samtools view -h in.bam | rescale_quals.py | samtools view -bS - > out.bam
ADD REPLY
0
Entering edit mode

Could you share this rescale_quals.py script? I am in a similar position. Thank!

ADD REPLY
0
Entering edit mode

Hi, where could I find the rescale_quals.py script? I am in the same position. Thanks a lot.

ADD REPLY
1
Entering edit mode

Here it is (sadly I'm too lazy to post to github or anything better...):

#!/usr/bin/python
import sys

def process(fname):
    fin = open(fname)
    for line in fin:
        if line[0] == '@':
            print line,
            continue
        line = line.strip().split("\t")
        if len(line[10]) != len(line[9]):
            print >>sys.stderr, "number of quality scores does not match number of bases: [%s] vs. [%s]" % (line[9], line[10])
            line[10] = 'O' * len(line[9])
        line[10] = "".join([chr(ord(c)-31) for c in line[10]])
        print "\t".join(line)
    fin.close()

if __name__ == "__main__":
    if len(sys.argv) <= 1:
        print >>sys.stderr, "usage (converts fastq 1.3/1.5 to sanger): samtools view -h in.bam | bam_rescale_quals.py - | samtools view -bS - > out.bam"
    else:
        if sys.argv[1] == "-":
            sys.argv[1] = "/dev/stdin"
        process(sys.argv[1])
ADD REPLY
0
Entering edit mode

Cool. Thanks a lot.

ADD REPLY
0
Entering edit mode

Just to clarify, this script converts the Illumina1.5+ (PHRED64+) to Sanger encoding (Phred33+), right?

ADD REPLY
0
Entering edit mode

"usage (converts fastq 1.3/1.5 to sanger): samtools view -h in.bam | bam_rescale_quals.py - | samtools view -bS - > out.bam"

So I guess, yea.

ADD REPLY

Login before adding your answer.

Traffic: 3983 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6