I received some data from a third party provider where the FASTQ files have scores encoded in the range from B (ascii 66) to i (ascii 105). This range is not described in the Wikipedia entry on the FASTQ format, so is this range valid?
EDIT 2: This is not the correct answer (see EDIT below) and it should therefore not have been upvoted. Please see Istvan's answer below.
Actually, this range is valid and is mentioned in the Wikipedia article you cite. This looks like Illumina 1.3-1.7 with an ASCII offset of 64, so B translates to 2 (a special value marking nucleotides that should be ignored) and i to 41 (EDIT 1: sorry, I initially said 39; and 41 is actually not expected). Here's the relevant part from the section "Encoding":
Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected).
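To make the arithmetic concrete, here is a minimal sketch of the decoding, assuming a plain offset of 64 as in the passage quoted above. The helper name decode_quality is made up for illustration and does not come from any library.

```python
def decode_quality(qual, offset=64):
    """Convert a FASTQ quality string to integer scores.

    Hypothetical helper: each score is the character's ASCII code
    minus the encoding offset (64 assumed here, 33 for Sanger-style).
    """
    return [ord(ch) - offset for ch in qual]


# The two extremes reported in the question
print(decode_quality("Bi"))             # [2, 41]
# The same characters read with an offset of 33 instead
print(decode_quality("Bi", offset=33))  # [33, 72]
```

With an offset of 64 the range B to i comes out as 2 to 41, which is one above the 0 to 40 expected in raw reads.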
Andreas
Capping at a maximal quality value of 40 is a convention that most instruments have adopted by now. Technically, Phred quality scores go from 0 to 93, so the use of the quality 'i' does not necessarily indicate a problem.
That being said, it is a bit suspicious when you see quality scores that are just outside the usual range. Plus, this looks like one of the older quality encodings. If so, you may have other problems: as I recall, the error-probability formula was defined slightly differently for some of these encodings, so the values are not directly comparable anyway.
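As a rough illustration of how the two definitions differ, here is a small sketch. It assumes the standard Phred definition Q = -10 log10(p) and the odds-based Solexa definition Q = -10 log10(p / (1 - p)), which, as far as I recall, applied to the early Solexa/Illumina pipelines. The function names are made up for this example.

```python
def phred_to_error_prob(q):
    """Error probability for a Phred score: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q):
    """Error probability for a Solexa score: Q = -10 * log10(p / (1 - p))."""
    return 1 / (1 + 10 ** (q / 10))


# The two scales disagree most at low quality and converge at high quality
for q in (0, 2, 10, 40):
    print(q, phred_to_error_prob(q), solexa_to_error_prob(q))
```

At a score of 0 the Phred formula gives an error probability of 1, while the Solexa formula gives 0.5. By a score of 40 the two are practically identical.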
Peter's paper has more details on the nitty-gritty: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucl. Acids Res. (2010).
As there is no actual standard for FASTQ there is no possibility to say what is a "valid" FASTQ file. It all depends on what the tools you want to use will expect and accept.
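One practical way to check what a tool would see is to look at the range of quality characters actually present in the file. The sketch below is only an illustration: it assumes simple four-line FASTQ records with no line wrapping, and the function names and sample records are invented.

```python
def quality_char_range(lines):
    """Return (lowest, highest) ASCII code found on quality lines.

    Assumes every record is exactly four lines, so the quality
    string is the fourth line of each record.
    """
    lo, hi = None, None
    for i, line in enumerate(lines):
        if i % 4 != 3:
            continue  # skip header, sequence and '+' lines
        codes = [ord(c) for c in line.rstrip("\r\n")]
        if not codes:
            continue
        lo = min(codes) if lo is None else min(lo, min(codes))
        hi = max(codes) if hi is None else max(hi, max(codes))
    return lo, hi


def implied_scores(lo, hi):
    """Score range implied by each of the two common offsets."""
    return {offset: (lo - offset, hi - offset) for offset in (33, 64)}


# Invented records using the characters mentioned in the question
records = ["@read1", "ACGT", "+", "BBhi", "@read2", "ACGT", "+", "Bghi"]
lo, hi = quality_char_range(records)
print(lo, hi)                  # 66 105
print(implied_scores(lo, hi))  # {33: (33, 72), 64: (2, 41)}
```

For a real file, passing an open file handle instead of the list should work the same way. Whether the implied range is acceptable still depends on the downstream tool.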
This is FASTQ data from what I believe is Illumina sequencing and processing with the Illumina 1.5+ pipeline (that remains to be confirmed).