I have a set of bam files that include reads with quality scores in a mix of incompatible formats. This includes Solexa, Illumina 1.5, and Sanger encoding. The reads were aligned with bwa, so as far as I understand the misspecified quality scores didn't affect the alignment and were just copied through to the final bam file.
Now I'd like to call variants with these datasets using samtools mpileup, but I am stuck because the quality scores in all input files need to have the same encoding. That is, I am aware of the -6 flag to mpileup that would work if all the samples were in Illumina 1.5 format, but it's not applicable since I have a mix.
So my question is: can I do anything better than the brute force approaches of (1) fixing the original fastq files and realigning or (2) mucking through the bam files and changing quality scores myself with e.g. pysam?
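For reference, option (2) might look roughly like the sketch below. It assumes the aligner read every quality character as Sanger (Phred+33), so a read that was really Phred+64 or Solexa+64 is stored in the bam 31 units too high. The helper names (illumina15_to_phred, solexa_to_phred, rewrite_bam) and the encoding labels are hypothetical, and the encoding of each file has to be known in advance. The pysam part is untested here and only uses AlignmentFile and the query_qualities attribute.

```python
import math

# Assumption: qualities were stored as (ASCII - 33) although the reads were
# encoded with an offset of 64, so every stored value is 64 - 33 = 31 too high.
OFFSET_SHIFT = 64 - 33


def illumina15_to_phred(stored):
    """Shift Phred+64 values that were misread as Phred+33 back to Phred."""
    return [max(q - OFFSET_SHIFT, 0) for q in stored]


def solexa_to_phred(stored):
    """Undo the offset, then map Solexa-scaled scores onto the Phred scale."""
    converted = []
    for q in stored:
        q_solexa = q - OFFSET_SHIFT
        # Q_phred = 10 * log10(10^(Q_solexa / 10) + 1)
        phred = 10 * math.log10(10 ** (q_solexa / 10) + 1)
        converted.append(int(round(phred)))
    return converted


# Hypothetical labels for the three encodings mentioned in the question.
CONVERTERS = {
    "sanger": lambda stored: list(stored),  # already correct, copy through
    "illumina1.5": illumina15_to_phred,
    "solexa": solexa_to_phred,
}


def rewrite_bam(in_path, out_path, encoding):
    """Write a copy of in_path with qualities converted to true Phred values."""
    import pysam  # imported here so the converters above work without pysam

    convert = CONVERTERS[encoding]
    with pysam.AlignmentFile(in_path, "rb") as src, \
            pysam.AlignmentFile(out_path, "wb", template=src) as dst:
        for read in src:
            quals = read.query_qualities
            if quals is not None:  # some records carry no qualities
                read.query_qualities = convert(quals)
            dst.write(read)
```

Each file would be passed through rewrite_bam once with its own encoding label, after which all outputs share the same Phred scale and can go to mpileup together without -6.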
Merging the three data sets may not be a good idea. I doubt it would improve your variant calling, because each technology has its own biases in errors and coverage. I suggest analyzing each set independently and reporting the combination of high-quality calls from each data set.
Point well taken, and I'll be on the lookout for that. In this instance, the mixing is more a function of data source (SRA with Sanger quality scores versus in-house sequencing with Illumina 1.5 quality scores) than technology (but certainly there are center-specific effects that might be at play).