I am looking at SRR015016 from the SRA.
I am trying to understand the encoding of the base quality used in this file.
The instrument model was Illumina Genome Analyzer II. However, the quality scheme is somewhat peculiar.
I ran the utility usearch -fastq_chars to see the distribution of quality characters across the reads:
Char ASCII Q(33) Q(64) Tails Total Freq AccFrq
---- ----- ----- ----- ---------- ---------- ------- -------
'!' 33 0 -31 0 2906 0.01% 0.01%
'"' 34 1 -30 0 0 0.00% 0.01%
'#' 35 2 -29 0 0 0.00% 0.01%
'$' 36 3 -28 0 0 0.00% 0.01%
'%' 37 4 -27 0 0 0.00% 0.01%
'&' 38 5 -26 0 0 0.00% 0.01%
''' 39 6 -25 0 0 0.00% 0.01%
'(' 40 7 -24 0 0 0.00% 0.01%
')' 41 8 -23 0 0 0.00% 0.01%
'*' 42 9 -22 0 0 0.00% 0.01%
'+' 43 10 -21 0 0 0.00% 0.01%
',' 44 11 -20 0 0 0.00% 0.01%
'-' 45 12 -19 0 0 0.00% 0.01%
'.' 46 13 -18 0 0 0.00% 0.01%
'/' 47 14 -17 0 0 0.00% 0.01%
'0' 48 15 -16 0 0 0.00% 0.01%
'1' 49 16 -15 0 0 0.00% 0.01%
'2' 50 17 -14 0 0 0.00% 0.01%
'3' 51 18 -13 0 0 0.00% 0.01%
'4' 52 19 -12 0 0 0.00% 0.01%
'5' 53 20 -11 0 0 0.00% 0.01%
'6' 54 21 -10 0 0 0.00% 0.01%
'7' 55 22 -9 0 0 0.00% 0.01%
'8' 56 23 -8 0 8 0.00% 0.01%
'9' 57 24 -7 0 0 0.00% 0.01%
':' 58 25 -6 0 745 0.00% 0.01%
';' 59 26 -5 0 0 0.00% 0.01%
'<' 60 27 -4 0 0 0.00% 0.01%
'=' 61 28 -3 0 391 0.00% 0.01%
'>' 62 29 -2 0 0 0.00% 0.01%
'?' 63 30 -1 1 15 0.00% 0.01%
'@' 64 31 0 3 2928 0.01% 0.02%
'A' 65 32 1 0 2980 0.01% 0.04%
'B' 66 33 2 0 0 0.00% 0.04%
'C' 67 34 3 144 37529 0.13% 0.17%
'D' 68 35 4 3596 351835 1.24% 1.41%
'E' 69 36 5 1460 274975 0.97% 2.38%
'F' 70 37 6 6 121914 0.43% 2.82%
'G' 71 38 7 23 312858 1.11% 3.92%
'H' 72 39 8 39 244877 0.87% 4.79%
'I' 73 40 9 30 264438 0.93% 5.72%
'J' 74 41 10 27 220404 0.78% 6.50%
'K' 75 42 11 46 306755 1.08% 7.59%
'L' 76 43 12 34 258150 0.91% 8.50%
'M' 77 44 13 92 329095 1.16% 9.66%
'N' 78 45 14 83 326684 1.16% 10.82%
'O' 79 46 15 91 365324 1.29% 12.11%
'P' 80 47 16 87 423488 1.50% 13.61%
'Q' 81 48 17 76 442600 1.56% 15.17%
'R' 82 49 18 160 403789 1.43% 16.60%
'S' 83 50 19 220 541710 1.92% 18.51%
'T' 84 51 20 137 594089 2.10% 20.61%
'U' 85 52 21 44 615082 2.17% 22.79%
'V' 86 53 22 208 568834 2.01% 24.80%
'W' 87 54 23 3535 298227 1.05% 25.85%
'X' 88 55 24 694 136779 0.48% 26.34%
'Y' 89 56 25 9816 784561 2.77% 29.11%
'Z' 90 57 26 66100 16468153 58.22% 87.34%
'[' 91 58 27 1137 3517684 12.44% 99.77%
'\' 92 59 28 0 0 0.00% 99.77%
']' 93 60 29 0 0 0.00% 99.77%
'^' 94 61 30 0 0 0.00% 99.77%
'_' 95 62 31 0 64281 0.23% 100.00%
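For readers without usearch, a tally like the one above can be approximated with a few lines of Python. This is only a minimal sketch, not what usearch does internally: it assumes an unwrapped FASTQ file with exactly four lines per record, the function names are made up for illustration, and the two-read input is invented test data rather than reads from SRR015016.

```python
from collections import Counter
import io


def count_quality_chars(handle):
    """Count quality characters, assuming 4 lines per FASTQ record."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 3:  # the fourth line of each record is the quality string
            counts.update(line.rstrip("\r\n"))
    return counts


def print_table(counts):
    """Print one row per observed character, in ASCII order."""
    total = sum(counts.values())
    acc = 0
    print("Char  ASCII  Q(33)  Q(64)  Total  Freq  AccFrq")
    for ch in sorted(counts):
        acc += counts[ch]
        code = ord(ch)
        print(f"'{ch}'  {code}  {code - 33}  {code - 64}  {counts[ch]}  "
              f"{counts[ch] / total:.2%}  {acc / total:.2%}")


# Invented two-read example; for a real file use open("reads.fastq") instead.
example = io.StringIO("@r1\nACGT\n+\nZZ[!\n@r2\nACGT\n+\nZYZZ\n")
counts = count_quality_chars(example)
print_table(counts)
```

The Q(33) and Q(64) columns are simply the ASCII code minus the respective offset, which is why every character gets a value under both schemes regardless of which one the file actually uses.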
I see that the majority of the quality characters have ASCII values of 90-91, with the distribution beginning at about ASCII 61. This seems to correspond roughly to the Solexa/early Illumina encoding:
Description    ASCII range    ASCII offset    Quality score
fastq-solexa   59–126         64              −5 to 62
However, there are two differences. The first is the '!' character, which is the lowest score in Phred+33. I don't see why it would appear in a Solexa-encoded file.
The second difference is a few occurrences of '8', which would correspond to a Solexa quality of -8.
A Solexa score can be negative. However, the occurrence of scores of -8 and -31 (the score of '!') makes me wonder: is this really a Solexa score, and if not, what is it?
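The arithmetic behind those two odd values can be checked directly. This is a small sketch (the `decode` helper is a made-up name) that subtracts each offset from the ASCII code of a few characters taken from the table:

```python
def decode(ch):
    """Return the ASCII code of ch and its value under both offsets."""
    code = ord(ch)
    return {"ascii": code, "offset33": code - 33, "offset64": code - 64}


# Characters picked from the table: the two outliers plus the common ones.
for ch in "!8=Z[_":
    print(repr(ch), decode(ch))
```

Under an offset of 64, '!' comes out as -31 and '8' as -8, both below the -5 floor listed for fastq-solexa, while 'Z' and '[' come out as 26 and 27.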
You can find the valid ranges of FASTQ scores in this Wikipedia article. Solexa-encoded scores are typically between -5 and 40.
The file I looked at has a range which does not match any of the Illumina encodings in the article.
Can you run testformat.sh from the BBMap suite on this file and post the result?
Edit: testformat.sh seems to think that this is Illumina-encoded data with a Phred+64 offset, but it could be either Illumina 1.3 or 1.5.
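To illustrate why a handful of stray characters need not change the verdict, here is a crude range-based guess in Python. It is not how testformat.sh works; it is a sketch that assumes the commonly cited cutoffs (characters below ASCII 59 imply Phred+33, 59 to 63 imply Solexa+64, otherwise Phred+64) and ignores characters rarer than a hypothetical `min_fraction` threshold. The counts are a subset copied from the table in the question.

```python
def guess_offset(counts, min_fraction=0.0005):
    """Guess the encoding from the lowest quality character that is not rare.

    counts: mapping of character -> number of occurrences.
    min_fraction: hypothetical cutoff below which a character is ignored.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no quality characters counted")
    common = [ord(c) for c, n in counts.items() if n / total >= min_fraction]
    lowest = min(common)
    if lowest < 59:
        return "Phred+33 (Sanger / Illumina 1.8+)"
    if lowest < 64:
        return "Solexa+64"
    return "Phred+64 (Illumina 1.3 to 1.7)"


# Subset of the counts reported by usearch -fastq_chars in the question.
table_counts = {
    "!": 2906, "8": 8, ":": 745, "=": 391, "@": 2928, "A": 2980,
    "C": 37529, "D": 351835, "Z": 16468153, "[": 3517684, "_": 64281,
}
print(guess_offset(table_counts))
```

With the rare characters filtered out, the lowest remaining character is 'C' (ASCII 67), so the sketch lands on Phred+64. Setting `min_fraction=0` makes '!' count and flips the guess to Phred+33, which shows how sensitive a pure min/max rule is to a few outliers.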