GATK Error: SAMFileReader appears to be using the wrong encoding for quality scores
10.7 years ago

I am testing some variant callers on the following dataset:

http://www.ebi.ac.uk/ena/data/view/SRP019719

I've had no problem running the same scripts on some 1000 Genomes data that I downloaded, and I have no problem running VarScan on the .bam files in this dataset. However, I get the following error when I try to run GATK tools (HaplotypeCaller, UnifiedGenotyper, RealignerTargetCreator/IndelRealigner, and BaseRecalibrator) on this dataset:

ERROR MESSAGE: SAM/BAM file SAMFileReader{/path/to/SRR796877.sort.karyotype.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 61; please see the GATK --help documentation for options related to this error

When looking at the GATK forums, it looks like there are at least 2 likely solutions:

1) Add "--fix_misencoded_quality_scores -fixMisencodedQuals" to the commands (see http://gatkforums.broadinstitute.org/discussion/1991/version-highlights-for-gatk-version-2-3)

2) Add "-allowPotentiallyMisencodedQuals" to the commands (see http://gatkforums.broadinstitute.org/discussion/2500/depthofcoverage-producing-extremely-high-quality-score-error or http://gatkforums.broadinstitute.org/discussion/2335/realignment-with-high-base-qualities)

I'll see what happens with both of these strategies on one sample, but I'm wondering what other people have done when they have encountered this error. Do you have a preference? Are there additional, better options that I'm not listing?

In this specific case, there is an associated publication with the data, so I can probably ask the authors what they did. However, I'm wondering what is generally the best solution (and/or if this has been an issue for a lot of people).

gatk exome • 8.9k views

Just to clarify: is the data in one of the older encodings? If so, I would just have the system shift the data to the right encoding.


Thanks for the feedback. I think the short answer is that using the option that simply subtracts 31 from all scores (the difference between the Phred+64 and Phred+33 offsets) is probably the best solution.

I have to admit that I needed to look up the different quality score encodings (links below, in case they help others like me):

How to determine the version used to generate Solexa/Illumina fastq files?

http://en.wikipedia.org/wiki/FASTQ_format

I can see that scores 64 (') and 65 (a) are used in the .fastq file, and they at least go up to 67 (c) based upon looking at the first few lines. I also see that quality scores of 63 (_) are used, and one of the responses to the earlier Biostar post says that this means it must be using Solexa/Illumina 1.0. So, I think the answer is "yes" - it is using the older encoding.
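For anyone who wants to check this without reading the quality strings by eye, here is a rough sketch that reports the lowest and highest ASCII codes on the quality lines of a FASTQ and guesses the offset from them. The function names are made up for illustration, and the thresholds (below 59 can only be Phred+33; above 74, i.e. 'J', suggests a +64 encoding) are taken from the Wikipedia FASTQ chart. Treat the guess as a heuristic, since newer Phred+33 data can occasionally exceed 'J'.

```shell
#!/bin/sh
# Print the lowest and highest ASCII codes found on the quality lines
# (every 4th line) of an uncompressed FASTQ read from stdin.
qual_ascii_range() {
    awk 'NR % 4 == 0' |
        LC_ALL=C tr -d '\n' |
        LC_ALL=C od -An -v -tu1 |
        tr -s ' ' '\n' |
        awk 'NF { if (min == "" || $1 < min) min = $1
                  if ($1 > max) max = $1 }
             END { print min, max }'
}

# Rough guess at the quality offset from that range (heuristic only).
guess_offset() {
    set -- $(qual_ascii_range)
    if [ "$1" -lt 59 ]; then
        echo "Phred+33"
    elif [ "$2" -gt 74 ]; then
        echo "Phred+64 (or Solexa+64)"
    else
        echo "ambiguous"
    fi
}
```

Something like `zcat reads.fastq.gz | head -n 40000 | guess_offset` (file name hypothetical) is usually enough to see which range the file falls in.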

10.6 years ago

I think "--fix_misencoded_quality_scores -fixMisencodedQuals" is generally the best solution.

The only exception that I have found is that you need to use "-allowPotentiallyMisencodedQuals" for the "PrintReads" step of base recalibration (following running BaseRecalibrator with "--fix_misencoded_quality_scores -fixMisencodedQuals")
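As a sketch of where the two flags go, the snippet below prints (rather than runs) the pair of base recalibration commands for one BAM file, assuming the GATK 2.x/3.x `java -jar GenomeAnalysisTK.jar -T ...` command style. The jar path, reference, known-sites file, and output names are placeholders, and `bqsr_cmds` is just a helper name made up for this example.

```shell
#!/bin/sh
# Placeholder paths; adjust to your own setup.
GATK_JAR="GenomeAnalysisTK.jar"
REF="reference.fasta"
KNOWN="dbsnp.vcf"

# Print the two BQSR commands for one BAM so they can be inspected first;
# pipe the output to `sh` to actually execute them.
bqsr_cmds() {
    bam=$1
    base=${bam%.bam}
    # Step 1: build the recalibration table, fixing the qualities on the fly.
    echo "java -jar $GATK_JAR -T BaseRecalibrator -R $REF -I $bam -knownSites $KNOWN -fixMisencodedQuals -o $base.recal.grp"
    # Step 2: write the recalibrated BAM, using the "allow" flag as described above.
    echo "java -jar $GATK_JAR -T PrintReads -R $REF -I $bam -BQSR $base.recal.grp -allowPotentiallyMisencodedQuals -o $base.recal.bam"
}

bqsr_cmds SRR796877.sort.karyotype.bam
```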

7.9 years ago
rajbtpatidar ▴ 20

I noticed that in some of my FASTQ files, even though the encoding is Illumina 1.9, there are some bases with a quality score above 41. The best solution is to use sed or something similar to replace the quality score of all such bases with a score of 41 or below, and then rerun the pipeline.

zcat ${sample}_R1.fastq.gz |sed -e '4~4y/KLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~/""""""""""""""""""""""""""""""""""""""""""""""""""""/' |gzip >${sample}_R1.fixed.fastq.gz &
zcat ${sample}_R2.fastq.gz |sed -e  '4~4y/KLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~/""""""""""""""""""""""""""""""""""""""""""""""""""""/' |gzip >${sample}_R2.fixed.fastq.gz & 
wait

Once the fastq files are fixed, GATK will run without any issue.
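To check whether a file contains such scores in the first place (and that none remain afterwards), one option is to count the quality characters above 'J', which is Q41 in Phred+33. A minimal sketch; the function name is made up for illustration:

```shell
#!/bin/sh
# Count quality characters above 'J' (Q41 in Phred+33) on the quality
# lines (every 4th line) of an uncompressed FASTQ read from stdin.
count_high_quals() {
    awk 'NR % 4 == 0' | LC_ALL=C tr -cd 'K-~' | wc -c | tr -d ' '
}
```

For example, `zcat ${sample}_R1.fixed.fastq.gz | count_high_quals` should print 0 once the file has been fixed.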
