In our lab we have been working with Illumina X Ten samples for quite some time.
Recently we took a more in-depth look at the sequences delivered from the Illumina X Ten runs.
We were looking into the Phred scores from several samples and found scores that exceed the bounds set in the Illumina 1.8+ spec (https://en.wikipedia.org/wiki/FASTQ_format).
The spec specifies the range 0..41, while in our samples we find something like the following:
@ST-E00294:24:H5375CCXX:5:1101:7384:1836 1:N:0
TCTATACCTATCAATTGTCCCGTANNNAGANCNTTCTCGNCTNCNNNTCTTCNNANNNNCCCNNTGTTATTCNCATCGACTTCCCCNNTTNTTNNNANNTGTAACCTNNTCNANNCCACCNNTGATTCCTTTTATTGGTCATCTTTAGTC
+
AAAF,KKAFKKFFFAKKA7F7F,,###AF,#F#7FF77,#AF#,###F<FKK##F####,7,##,A,,,,,K#KF,,,,,,,,<<,##,,#,A###<##KKA,,,77##,,#A##,7,,,##7FF7<,7<FKKFKKKK,,,<,,F<,,7,
You can see the character K in the quality string, which under the Phred+33 offset encodes a Phred score of 42.
Does anyone know which spec the Illumina X Ten follows, or am I seeing a bug in the BaseSpace software for these machines?
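To make the decoding above concrete, here is a minimal sketch of how a quality character maps to a Phred score under the Phred+33 (Illumina 1.8+) offset; the function name is illustrative:

```python
# Decode a Phred+33 quality character to its numeric Phred score.
# Under this offset, ASCII 33 ('!') encodes Q0 and ASCII 74 ('J') encodes Q41.
def phred33(ch: str) -> int:
    return ord(ch) - 33

# 'K' is ASCII 75, so it decodes to 42 -- one above the 0..41 range
# described in the Illumina 1.8+ spec.
print(phred33('K'))  # 42
print(phred33('#'))  # 2
print(phred33('!'))  # 0
```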
As @Dan points out below, scores >40 are legal: http://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf
Unfortunately, this document says nothing about the range of acceptable values.
I don't think Illumina has any sort of company-wide standard on quality scores. They have as many sets of quality score meanings as they do versions of base-calling software. I've seen many Illumina files in which bases with a quality score of 0 (but still called with ACGT rather than N) were correct 100% of the time - higher than any other quality score. Sometimes 2 is "special", sometimes it isn't. Sometimes the values are binned, and the bins will change between software versions. The only constant is that none of them are ever calibrated, so the only reliable way to determine their meaning is through observation and measurement.
It's useful to be able to deal with values outside of what I consider the normal FASTQ range of 0-41 because some programs violate that range. Read-merging and other error-correction tools are the worst offenders; they may give quality scores up to ASCII 99, or 122 (z), or 126, or whatever the programmer thought was best.
There is usually no reason to cap quality scores at any particular value (up to 126, which ends the printable range) except to solve a problem that Illumina singlehandedly created - their own inability to standardize on a quality scheme. They are the only organization to use ASCII-64 or ASCII-66 encodings (sometimes containing negative numbers, and thus dropping below 64 or 66). As a result, it will be forever difficult to auto-detect the quality-encoding format of Illumina data. The main reason for programs to act strangely upon reading quality scores over 41 is to prevent old Illumina ASCII-64/66 files from being processed as ASCII-33.
The lack of quality resolution incurred by capping things at Q41 is not overly important at present, because no platform is capable of consistently delivering raw reads at >Q41. The exception is Illumina's Q0 non-N bases, which are frankly astonishing - they should aim for more of those.
Thank you for your elaborate answer. For our specific use case, we are implementing a FASTQ validator in our analysis pipeline to check the validity of the input (and, depending on the result, either continue or halt the pipeline).
As we assumed the range is 0..41, our validation setup is not working for FASTQ files containing Q42 Phred scores. The challenge now is to write rules that properly identify the quality ranges (Solexa, Sanger, Illumina 1.3/1.5/1.8+, plus this Q42 "spec").
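One way such a rule could look: a minimal sketch of a range check for a Phred+33 quality line with a configurable ceiling, so the Q42 case can be allowed or rejected per policy. The names and defaults are illustrative, not from our actual pipeline:

```python
# Check that every quality character in a Phred+33 quality line decodes
# to a score within [0, q_max]. Raising q_max from 41 to 42 is one way
# to accommodate the Q42 scores seen in the X Ten data.
def qualities_in_range(qual_line: str, offset: int = 33, q_max: int = 41) -> bool:
    return all(0 <= ord(c) - offset <= q_max for c in qual_line)

qual = "AAAF,KKAFKKFFFAKKA7F7F,,###"
print(qualities_in_range(qual, q_max=41))  # False: 'K' decodes to Q42
print(qualities_in_range(qual, q_max=42))  # True
```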
I have the same concerns as you: sequencing companies that cannot conform to their own specifications make the jobs of software developers and researchers challenging. How can we tell that we are comparing the same information when the quality scale moves from 0..41 to 0..42 (Q42 raising the ceiling by a factor of 42/41, roughly 1.024)?
It is impossible to write a program that will always be able to correctly determine the quality-score encoding of fastq files. BBMap comes with a tool called testformat.sh that uses various heuristics to guess the quality encoding (and other things, like whether the reads are interleaved, whether they are fasta, fastq, or sam, etc), but it cannot be guaranteed to be correct, as the quality score ranges of different encodings overlap.
Sometimes you can be certain about the offset - if you encounter an N with a quality score of "!", it's ASCII-33. You still can't tell the specific software version, of course. Probably, if you scan far enough into a file, you will eventually encounter an N encoded in a way that makes the encoding certain. But I've seen Illumina files with N's getting positive quality scores, so even that isn't guaranteed! They are rare, though. Illumina usually gives Ns a quality score of 0.
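The N-based heuristic described above can be sketched roughly like this; it is a simplification (function name and threshold choice are illustrative), and as noted it stays inconclusive on reads without a telltale N:

```python
# Rough sketch of the N-based offset heuristic: an 'N' base paired with a
# quality character below ';' (ASCII 59, the Solexa minimum) cannot occur
# in any ASCII-64/66 encoding, so it pins the offset to Phred+33.
def offset_from_n(seq: str, qual: str):
    for base, q in zip(seq, qual):
        if base == 'N' and ord(q) < 59:
            return 33
    return None  # inconclusive: keep scanning further reads

seq  = "TCTATACCTATCAATTGTCCCGTANNN"
qual = "AAAF,KKAFKKFFFAKKA7F7F,,###"
print(offset_from_n(seq, qual))  # 33: 'N' is paired with '#' (ASCII 35)
```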
BBMap's TestFormat tool only looks at the first two reads, so it's very fast. But if you want to increase confidence, you could read the whole file and calculate the frequencies of quality assignments, and hopefully encounter Ns which uniquely identify the file's quality encoding. Actually, I should add that capability as an option...
Or, if you have financial clout, you could call Illumina and tell them to start using standards.