What is the precise range that Solexa quality scores can take?
3.1 years ago
Aspire ▴ 370

I am looking at SRR015016 from the SRA.

I am trying to understand the encoding of the base quality used in this file.

The instrument model was Illumina Genome Analyzer II. However, the quality scheme is somewhat peculiar.

I ran usearch -fastq_chars to see the distribution of quality characters in the reads:

Char  ASCII  Q(33)  Q(64)       Tails       Total     Freq   AccFrq
----  -----  -----  -----  ----------  ----------  -------  -------
 '!'     33      0    -31           0        2906    0.01%    0.01%
 '"'     34      1    -30           0           0    0.00%    0.01%
 '#'     35      2    -29           0           0    0.00%    0.01%
 '$'     36      3    -28           0           0    0.00%    0.01%
 '%'     37      4    -27           0           0    0.00%    0.01%
 '&'     38      5    -26           0           0    0.00%    0.01%
 '''     39      6    -25           0           0    0.00%    0.01%
 '('     40      7    -24           0           0    0.00%    0.01%
 ')'     41      8    -23           0           0    0.00%    0.01%
 '*'     42      9    -22           0           0    0.00%    0.01%
 '+'     43     10    -21           0           0    0.00%    0.01%
 ','     44     11    -20           0           0    0.00%    0.01%
 '-'     45     12    -19           0           0    0.00%    0.01%
 '.'     46     13    -18           0           0    0.00%    0.01%
 '/'     47     14    -17           0           0    0.00%    0.01%
 '0'     48     15    -16           0           0    0.00%    0.01%
 '1'     49     16    -15           0           0    0.00%    0.01%
 '2'     50     17    -14           0           0    0.00%    0.01%
 '3'     51     18    -13           0           0    0.00%    0.01%
 '4'     52     19    -12           0           0    0.00%    0.01%
 '5'     53     20    -11           0           0    0.00%    0.01%
 '6'     54     21    -10           0           0    0.00%    0.01%
 '7'     55     22     -9           0           0    0.00%    0.01%
 '8'     56     23     -8           0           8    0.00%    0.01%
 '9'     57     24     -7           0           0    0.00%    0.01%
 ':'     58     25     -6           0         745    0.00%    0.01%
 ';'     59     26     -5           0           0    0.00%    0.01%
 '<'     60     27     -4           0           0    0.00%    0.01%
 '='     61     28     -3           0         391    0.00%    0.01%
 '>'     62     29     -2           0           0    0.00%    0.01%
 '?'     63     30     -1           1          15    0.00%    0.01%
 '@'     64     31      0           3        2928    0.01%    0.02%
 'A'     65     32      1           0        2980    0.01%    0.04%
 'B'     66     33      2           0           0    0.00%    0.04%
 'C'     67     34      3         144       37529    0.13%    0.17%
 'D'     68     35      4        3596      351835    1.24%    1.41%
 'E'     69     36      5        1460      274975    0.97%    2.38%
 'F'     70     37      6           6      121914    0.43%    2.82%
 'G'     71     38      7          23      312858    1.11%    3.92%
 'H'     72     39      8          39      244877    0.87%    4.79%
 'I'     73     40      9          30      264438    0.93%    5.72%
 'J'     74     41     10          27      220404    0.78%    6.50%
 'K'     75     42     11          46      306755    1.08%    7.59%
 'L'     76     43     12          34      258150    0.91%    8.50%
 'M'     77     44     13          92      329095    1.16%    9.66%
 'N'     78     45     14          83      326684    1.16%   10.82%
 'O'     79     46     15          91      365324    1.29%   12.11%
 'P'     80     47     16          87      423488    1.50%   13.61%
 'Q'     81     48     17          76      442600    1.56%   15.17%
 'R'     82     49     18         160      403789    1.43%   16.60%
 'S'     83     50     19         220      541710    1.92%   18.51%
 'T'     84     51     20         137      594089    2.10%   20.61%
 'U'     85     52     21          44      615082    2.17%   22.79%
 'V'     86     53     22         208      568834    2.01%   24.80%
 'W'     87     54     23        3535      298227    1.05%   25.85%
 'X'     88     55     24         694      136779    0.48%   26.34%
 'Y'     89     56     25        9816      784561    2.77%   29.11%
 'Z'     90     57     26       66100    16468153   58.22%   87.34%
 '['     91     58     27        1137     3517684   12.44%   99.77%
 '\'     92     59     28           0           0    0.00%   99.77%
 ']'     93     60     29           0           0    0.00%   99.77%
 '^'     94     61     30           0           0    0.00%   99.77%
 '_'     95     62     31           0       64281    0.23%  100.00%
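For anyone without usearch, a comparable tally can be sketched in a few lines of Python. This is only a minimal sketch, not a reimplementation of usearch -fastq_chars: it assumes strict four-line FASTQ records, and the names count_quality_chars and report are made up for illustration.

```python
from collections import Counter

def count_quality_chars(lines):
    """Count quality characters, assuming strict 4-line FASTQ records."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 3:  # the fourth line of each record holds the qualities
            counts.update(line.rstrip("\r\n"))
    return counts

def report(counts):
    """Return (char, ASCII, Q under offset 33, Q under offset 64, count) rows."""
    return [(c, ord(c), ord(c) - 33, ord(c) - 64, n)
            for c, n in sorted(counts.items())]

# Tiny made-up example, not data from SRR015016
example = ["@r1\n", "ACGT\n", "+\n", "!8Z[\n",
           "@r2\n", "ACGT\n", "+\n", "ZZ[_\n"]
for row in report(count_quality_chars(example)):
    print(row)
```

On a real file one would pass an open file handle instead of the example list, e.g. count_quality_chars(open("SRR015016.fastq")).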

I see that most of the quality characters fall at ASCII values 89-91 ('Y', 'Z' and '['), and that the distribution starts at around ASCII 61. This seems to correspond roughly to the Solexa/early Illumina encoding:

Description     ASCII range   ASCII offset   Quality score
fastq-solexa    59–126        64             −5 to 62

However, there are two differences. The first is the '!' character, which is the lowest score (Q0) in Phred+33. I don't see why it would appear in a Solexa-encoded file.

The second is a few occurrences of '8', which would correspond to a Solexa quality of -8.

A Solexa score can be negative. However, scores of -8 and -31 (the score of '!' with offset 64) make me wonder: is this really a Solexa encoding, and if not, what is it?
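To make the doubt concrete, here is a small sketch using the standard definitions of the two scales (Phred: Q = -10 log10(p); Solexa: Q = -10 log10(p / (1 - p)), where p is the error probability). These formulas are not stated in this thread, and the function names are only illustrative. It shows which error probability a given Solexa score would imply.

```python
import math

def phred_from_error(p):
    """Phred quality for error probability p."""
    return -10 * math.log10(p)

def solexa_from_error(p):
    """Solexa quality for error probability p (uses the odds p / (1 - p))."""
    return -10 * math.log10(p / (1 - p))

def error_from_solexa(q):
    """Invert the Solexa definition to recover the error probability."""
    return 1 / (1 + 10 ** (q / 10))

def solexa_to_phred(q):
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q / 10) + 1)

# A random base call (3 wrong choices out of 4) has p = 0.75
print(round(solexa_from_error(0.75), 2))
# Error probabilities implied by the odd scores seen in the table
print(round(error_from_solexa(-8), 3), round(error_from_solexa(-31), 4))
```

Under these definitions p = 0.75 gives a Solexa score of about -4.77, which is usually given as the reason the scale bottoms out at -5. A score of -8 would imply an error probability of about 0.86, worse than guessing at random, which is why such values look out of place in a Solexa file.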

phred solexa quality-score fastq • 1.5k views

You can find the valid ranges of FASTQ quality scores in this Wikipedia article. Solexa-encoded scores in raw reads typically range from -5 to 40.


The file I looked at has a range that does not fit any of the Illumina encodings in the article.


Can you run testformat.sh from the BBMap suite on this file and post the result?

Edit:

$ testformat.sh SRR015016.fastq 
illumina        fastq   raw     single-ended    28bp

testformat.sh seems to think that this is Illumina-encoded data: Phred+64, though it could be either Illumina 1.3 or 1.5.

3.1 years ago

The observed range of characters is compatible with:

P - PacBio        Phred+33,  HiFi reads typically (0, 93)

See the Wikipedia page that GenoMax also linked. However, the sequencing run is from 2010 and was not done on a PacBio instrument.

I think what happened is that the upper limit was not truncated to 40, as is typical. So this is probably still a Sanger (Phred+33) encoding that simply goes past the normal limit.

In my opinion, assuming a Sanger encoding won't cause any problems. The meaning of the scores stays the same.
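As a rough illustration of why the meaning stays the same, here is a sketch that decodes a quality string with offset 33 and converts each score to an error probability with the usual Phred relation p = 10^(-Q/10). The function names are made up for the example. Scores above 40 just map to smaller error probabilities.

```python
def phred33_to_scores(quality):
    """Decode a quality string, assuming a Sanger-style offset of 33."""
    return [ord(c) - 33 for c in quality]

def error_probability(q):
    """Error probability implied by Phred score q."""
    return 10 ** (-q / 10)

# '!', 'Z' and '_' are the lowest, most common and highest characters in the table
for q in phred33_to_scores("!Z_"):
    print(q, error_probability(q))
```

Nothing in the formula depends on the score being capped at 40, so a tool that reads the file as Phred+33 should see 'Z' as Q57 and interpret it consistently.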

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ.....................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126
  0........................26...31.......40                                
                           -5....0........9.............................40 
                                 0........9.............................40 
                                    3.....9..............................41 
  0.2......................26...31........41                              
  0..................20........30........40........50..........................................93

Legend:

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 41)
     with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
     (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
 P - PacBio        Phred+33,  HiFi reads typically (0, 93)
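The comparison against the legend can be sketched as a simple range check. This is a minimal sketch: the offsets and typical score ranges are copied from the legend above, the name compatible_encodings is hypothetical, and the check only uses the lowest and highest observed characters, not how often they occur.

```python
# name: (ASCII offset, typical min Q, typical max Q), copied from the legend
ENCODINGS = {
    "Sanger": (33, 0, 40),
    "Solexa": (64, -5, 40),
    "Illumina 1.3+": (64, 0, 40),
    "Illumina 1.5+": (64, 2, 41),  # 2 is the 'B' indicator, real scores start at 3
    "Illumina 1.8+": (33, 0, 41),
    "PacBio": (33, 0, 93),
}

def compatible_encodings(min_ascii, max_ascii):
    """List encodings whose typical ASCII range covers the observed range."""
    hits = []
    for name, (offset, q_min, q_max) in ENCODINGS.items():
        if offset + q_min <= min_ascii and max_ascii <= offset + q_max:
            hits.append(name)
    return hits

# Observed range in the usearch table: '!' (33) up to '_' (95)
print(compatible_encodings(33, 95))
# Same check if the rare characters below '=' (61) are ignored
print(compatible_encodings(61, 95))
```

With the full observed range only the wide Phred+33 range (0, 93) fits. If the rare low characters are ignored, the Solexa range fits as well, which matches the impression in the question.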