Question

What is the precise range that Solexa quality scores can take?

3.4 years ago

Aspire ▴ 370

I am looking at SRR015016 from the SRA.

I am trying to understand the encoding of the base quality used in this file.

The instrument model was Illumina Genome Analyzer II. However, the quality scheme is somewhat peculiar.

I have run the useful utility usearch -fastq_chars to see the read quality distribution.

Char  ASCII  Q(33)  Q(64)       Tails       Total     Freq   AccFrq
----  -----  -----  -----  ----------  ----------  -------  -------
 '!'     33      0    -31           0        2906    0.01%    0.01%
 '"'     34      1    -30           0           0    0.00%    0.01%
 '#'     35      2    -29           0           0    0.00%    0.01%
 '$'     36      3    -28           0           0    0.00%    0.01%
 '%'     37      4    -27           0           0    0.00%    0.01%
 '&'     38      5    -26           0           0    0.00%    0.01%
 '''     39      6    -25           0           0    0.00%    0.01%
 '('     40      7    -24           0           0    0.00%    0.01%
 ')'     41      8    -23           0           0    0.00%    0.01%
 '*'     42      9    -22           0           0    0.00%    0.01%
 '+'     43     10    -21           0           0    0.00%    0.01%
 ','     44     11    -20           0           0    0.00%    0.01%
 '-'     45     12    -19           0           0    0.00%    0.01%
 '.'     46     13    -18           0           0    0.00%    0.01%
 '/'     47     14    -17           0           0    0.00%    0.01%
 '0'     48     15    -16           0           0    0.00%    0.01%
 '1'     49     16    -15           0           0    0.00%    0.01%
 '2'     50     17    -14           0           0    0.00%    0.01%
 '3'     51     18    -13           0           0    0.00%    0.01%
 '4'     52     19    -12           0           0    0.00%    0.01%
 '5'     53     20    -11           0           0    0.00%    0.01%
 '6'     54     21    -10           0           0    0.00%    0.01%
 '7'     55     22     -9           0           0    0.00%    0.01%
 '8'     56     23     -8           0           8    0.00%    0.01%
 '9'     57     24     -7           0           0    0.00%    0.01%
 ':'     58     25     -6           0         745    0.00%    0.01%
 ';'     59     26     -5           0           0    0.00%    0.01%
 '<'     60     27     -4           0           0    0.00%    0.01%
 '='     61     28     -3           0         391    0.00%    0.01%
 '>'     62     29     -2           0           0    0.00%    0.01%
 '?'     63     30     -1           1          15    0.00%    0.01%
 '@'     64     31      0           3        2928    0.01%    0.02%
 'A'     65     32      1           0        2980    0.01%    0.04%
 'B'     66     33      2           0           0    0.00%    0.04%
 'C'     67     34      3         144       37529    0.13%    0.17%
 'D'     68     35      4        3596      351835    1.24%    1.41%
 'E'     69     36      5        1460      274975    0.97%    2.38%
 'F'     70     37      6           6      121914    0.43%    2.82%
 'G'     71     38      7          23      312858    1.11%    3.92%
 'H'     72     39      8          39      244877    0.87%    4.79%
 'I'     73     40      9          30      264438    0.93%    5.72%
 'J'     74     41     10          27      220404    0.78%    6.50%
 'K'     75     42     11          46      306755    1.08%    7.59%
 'L'     76     43     12          34      258150    0.91%    8.50%
 'M'     77     44     13          92      329095    1.16%    9.66%
 'N'     78     45     14          83      326684    1.16%   10.82%
 'O'     79     46     15          91      365324    1.29%   12.11%
 'P'     80     47     16          87      423488    1.50%   13.61%
 'Q'     81     48     17          76      442600    1.56%   15.17%
 'R'     82     49     18         160      403789    1.43%   16.60%
 'S'     83     50     19         220      541710    1.92%   18.51%
 'T'     84     51     20         137      594089    2.10%   20.61%
 'U'     85     52     21          44      615082    2.17%   22.79%
 'V'     86     53     22         208      568834    2.01%   24.80%
 'W'     87     54     23        3535      298227    1.05%   25.85%
 'X'     88     55     24         694      136779    0.48%   26.34%
 'Y'     89     56     25        9816      784561    2.77%   29.11%
 'Z'     90     57     26       66100    16468153   58.22%   87.34%
 '['     91     58     27        1137     3517684   12.44%   99.77%
 '\'     92     59     28           0           0    0.00%   99.77%
 ']'     93     60     29           0           0    0.00%   99.77%
 '^'     94     61     30           0           0    0.00%   99.77%
 '_'     95     62     31           0       64281    0.23%  100.00%
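For readers without usearch, a tally like the one above can be approximated in a few lines of Python. This is only a sketch: it assumes a strict four-lines-per-record FASTQ file, the function names `tally_quality_chars` and `describe` are made up for illustration, and the example reads are invented rather than taken from SRR015016.

```python
from collections import Counter
from itertools import islice


def tally_quality_chars(lines):
    """Count quality characters, assuming strict 4-line FASTQ records."""
    counts = Counter()
    # Every 4th line, starting at index 3, is a quality string.
    for qual in islice(lines, 3, None, 4):
        counts.update(qual.rstrip("\n"))
    return counts


def describe(counts):
    """Return (char, ASCII, Q(33), Q(64), count) rows sorted by ASCII value."""
    return [(c, ord(c), ord(c) - 33, ord(c) - 64, n)
            for c, n in sorted(counts.items())]


# Tiny invented example (hypothetical reads, not real data).
example = [
    "@read1\n", "ACGT\n", "+\n", "ZZ[!\n",
    "@read2\n", "ACGT\n", "+\n", "8ZY_\n",
]

for row in describe(tally_quality_chars(example)):
    print(row)
```

Passing an open file handle instead of the `example` list should work the same way, since the function only iterates over lines.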

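I see that the majority of the quality characters have ASCII values of 89-91 ('Y', 'Z' and '[', together about 73% of all bases), and that, apart from a few outliers, the distribution begins at ASCII value 61. This seems to correspond roughly to the Solexa / early Illumina encoding: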

Description     ASCII Range   ASCII Offset   Quality score
------------    -----------   ------------   -------------
fastq-solexa    59–126        64             −5 to 62

However, there are two differences. The first is the '!' character, which is the lowest score in Phred+33. I don't see why it appears in a Solexa-encoded file.

The second difference consists of a few occurrences of '8', which corresponds to a Solexa quality of -8.

A Solexa score can be negative. However, the occurrence of scores of -8 and -31 (the score of '!') makes me wonder: is this a Solexa score at all, and if not, what is it?

phred solexa quality-score fastq • 1.7k views

You can find the valid ranges of FASTQ quality scores in this Wikipedia article. Solexa-encoded scores for raw reads are typically between -5 and 40.
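As a rough illustration of why -5 is the practical floor, here is a small Python sketch using the standard definitions of the two scales (Phred: Q = -10 log10(p); Solexa: Q = -10 log10(p / (1 - p)), where p is the error probability). The function names are only illustrative.

```python
import math


def phred_score(p):
    """Phred quality for error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def solexa_score(p):
    """Solexa quality for error probability p: Q = -10 * log10(p / (1 - p))."""
    return -10 * math.log10(p / (1 - p))


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# p = 0.75 corresponds to a random guess among four bases.
print(round(solexa_score(0.75), 2))  # -4.77
```

A completely random base call therefore scores about -5 on the Solexa scale. By the same formula, a score of -31 would imply an error probability above 99.9%, far worse than random, so it is unlikely to be a genuine Solexa value.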

Reply • 3.4 years ago by GenoMax 150k

The file I looked at has a range that does not fit any of the Illumina encodings in the article.

Reply • 3.4 years ago by Aspire ▴ 370

Can you run testformat.sh from the BBMap suite on this file and post the result?

Edit:

$ testformat.sh SRR015016.fastq 
illumina        fastq   raw     single-ended    28bp

testformat.sh seems to think that this is Illumina-encoded data (Phred+64), but it could be Illumina 1.3 or 1.5.

Reply • 3.4 years ago by GenoMax 150k

score 0 · Answer 1 · 2021-11-04

The observed range of quality characters is compatible with:

P - PacBio        Phred+33,  HiFi reads typically (0, 93)

See the Wikipedia page that GenoMax also linked. However, the sequencing run is from 2010 and was not done on a PacBio instrument.

I think what happened is that the upper limit was not truncated to 40, as is typical. So this is probably still a Sanger encoding that just goes past the normal limit.

In my opinion, assuming a Sanger encoding won't cause any problems. The meaning of the scores stays the same.

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ.....................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126
  0........................26...31.......40                                
                           -5....0........9.............................40 
                                 0........9.............................40 
                                    3.....9..............................41 
  0.2......................26...31........41                              
  0..................20........30........40........50..........................................93

Legend:

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 41)
     with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
     (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
 P - PacBio        Phred+33,  HiFi reads typically (0, 93)
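The compatibility check behind this answer can be sketched in Python as follows. This is a minimal sketch, assuming the "typical raw read" ranges from the legend as hard limits (real files can exceed them, as this one apparently does); the `ENCODINGS` table and `compatible_encodings` function are illustrative names, not part of any tool mentioned in this thread.

```python
# Typical raw-read ranges from the legend: name -> (ASCII offset, min Q, max Q).
ENCODINGS = {
    "Sanger": (33, 0, 40),
    "Solexa": (64, -5, 40),
    "Illumina 1.3+": (64, 0, 40),
    "Illumina 1.5+": (64, 3, 41),
    "Illumina 1.8+": (33, 0, 41),
    "PacBio HiFi": (33, 0, 93),
}


def compatible_encodings(min_ascii, max_ascii):
    """List encodings whose typical ASCII range covers the observed range."""
    result = []
    for name, (offset, q_min, q_max) in ENCODINGS.items():
        if offset + q_min <= min_ascii and max_ascii <= offset + q_max:
            result.append(name)
    return result


# Full observed range in the question's table: '!' (33) to '_' (95).
print(compatible_encodings(33, 95))
# Bulk of the distribution only: 'C' (67) to '[' (91).
print(compatible_encodings(67, 91))
```

With the full observed range, only the wide Phred+33 range (the PacBio row) fits, which is the basis for treating the file as Phred+33 with an untruncated upper limit. With only the bulk of the distribution, the Phred+64 family also fits, which is presumably why format-guessing tools report Illumina encoding.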