Question

How To Determine The Version Used To Generate Solexa/Illumina Fastq Files?

11

14.7 years ago

Gurado ▴ 280

The GEO database contains an abundance of raw sequence tag files in FASTQ format, some of which were generated by Solexa/Illumina NGS. Solexa/Illumina changed their own standard as of version 1.3, and neither the old nor the new variant is compatible with the Sanger format, so there are currently three separate FASTQ definitions. I was wondering if there is any (easy) way to determine which version was actually used to generate a given FASTQ file (in particular for Solexa/Illumina, since the platform is usually stated). That's something you have to provide to nearly any aligner, so that piece of information seems rather valuable.

next-gen sequencing fastq • 18k views

ADD COMMENT • link updated 14.7 years ago by Dan ▴ 540 • written 14.7 years ago by Gurado ▴ 280

0

You should check, but I believe that the NCBI SRA archive, which actually hosts the FASTQ (not NCBI GEO), is claiming to have converted the FASTQ files into sanger standard FASTQ. Once the data are there, the idea is that you needn't worry about the conversion as it is supposed to have been done.

ADD REPLY • link 14.7 years ago by Sean Davis 27k

score 7 · Answer 1 · 2010-04-29

7

14.7 years ago

Phis ★ 1.1k

Apparently, there are now four different FASTQ encodings, including a new Illumina 1.5+ one, which doesn't make your task easier. Looking only at the FASTQ files themselves, without any additional information about which software generated them (and without the possibility of contacting the people who did it), I don't see a general mechanism for finding out, except in special cases:

[Edited for clarification in response to comment:]

If the quality scores contain characters in the range ASCII 33 - 58 -> can only be Sanger

If the FASTQ file is known to be from an Illumina/Solexa platform AND the quality scores contain characters in the range ASCII 59 - 63 -> can only be Solexa/Illumina 1.0

If ASCII characters 64 or 65 are used in quality scores -> cannot be Illumina 1.5+
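The three rules above can be sketched in a few lines of Python. This is only an illustration, not an established tool: the function names are made up here, and it assumes the common layout of exactly four lines per record (no wrapped sequence or quality lines).

```python
def quality_lines(lines):
    """Yield the quality string of each record, assuming 4 lines per record."""
    for index, line in enumerate(lines):
        if index % 4 == 3:
            yield line.rstrip("\n")


def apply_range_rules(lines, known_illumina=False):
    """Return the conclusions that the three ASCII-range rules allow."""
    # Collect every ASCII code that occurs in any quality string.
    codes = {ord(ch) for quality in quality_lines(lines) for ch in quality}
    conclusions = []
    if any(33 <= c <= 58 for c in codes):
        conclusions.append("can only be Sanger")
    elif known_illumina and any(59 <= c <= 63 for c in codes):
        # Only valid when Sanger has been ruled out by other means.
        conclusions.append("can only be Solexa/Illumina 1.0")
    if any(c in (64, 65) for c in codes):
        conclusions.append("cannot be Illumina 1.5+")
    return conclusions
```

An empty result means none of the rules fired, so the file stays ambiguous. To scan a file, pass an open file handle, e.g. `apply_range_rules(open("reads.fastq"), known_illumina=True)`, where `reads.fastq` is a placeholder name.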

ADD COMMENT • link 14.6 years ago by Phis ★ 1.1k

2

Oh my god ... 4 encodings and three from Illumina without them adding a header specifying the concept applied. We should seriously punish Illumina for their repeated crimes against the bioinformatic community! A simple but effective measure would be to reject anything for publication that uses Illumina platforms ;)

ADD REPLY • link 14.7 years ago by Gurado ▴ 280

0

By the same principle, you could reject any project that used Microsoft Word or Excel. It seems to me that it should be trivial to parse the first lines of the FASTQ file and determine which version was used. (I agree that they should have added a line specifying it, but I am just saying.)

ADD REPLY • link 14.6 years ago by Nico ▴ 190

0

That's not true: "If the quality scores contain characters in the range ASCII 59 - 63 -> can only be Solexa/Illumina 1.0"

It can be a Sanger FASTQ file with very good scores (e.g. a contig).

ADD REPLY • link 14.6 years ago by Peter 6.0k

0

@Peter: you're right - I didn't make it clear I was talking about the non-Sanger encodings. I edited/expanded it to make it clearer.

ADD REPLY • link 14.6 years ago by Phis ★ 1.1k

score 7 · Answer 2 · 2010-04-29

7

14.7 years ago

Casbon ★ 3.3k

See Peter Cock's work on FastQ at http://github.com/biopython/biopython/blob/master/Bio/SeqIO/QualityIO.py

Start reading from:

It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" for the more recent variant), as this cannot be detected reliably automatically.
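To illustrate why the variant has to be stated, here is a small stand-alone Python sketch. It is not Biopython's implementation; it only reuses the three variant names from the quote above. The offsets (33 for Sanger, 64 for both Illumina variants) and the Solexa-to-PHRED formula are the standard definitions, while the function names and the cap at PHRED 93 are choices made for this example.

```python
import math

# variant name -> (ASCII offset, lowest valid ASCII code)
VARIANTS = {
    "fastq-sanger": (33, 33),
    "fastq-solexa": (64, 59),    # Solexa scores start at -5
    "fastq-illumina": (64, 64),
}


def decode(quality, variant):
    """Decode a quality string to PHRED scores under the declared variant."""
    offset, lowest = VARIANTS[variant]
    scores = []
    for ch in quality:
        if ord(ch) < lowest:
            raise ValueError(f"{ch!r} is not valid in {variant}")
        q = ord(ch) - offset
        if variant == "fastq-solexa":
            # Convert a Solexa score to the nearest PHRED score.
            q = round(10 * math.log10(10 ** (q / 10) + 1))
        scores.append(q)
    return scores


def to_sanger(quality, variant):
    """Re-encode a quality string as Sanger FASTQ (PHRED + 33, capped at 93)."""
    return "".join(chr(min(q, 93) + 33) for q in decode(quality, variant))
```

The same string `"hhhh"` decodes to PHRED 40 as "fastq-illumina" but to PHRED 71 as "fastq-sanger", and both are accepted without error. Only characters below a variant's minimum can be rejected, which is why a wrong declaration often goes unnoticed.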

ADD COMMENT • link 14.7 years ago by Casbon ★ 3.3k

0

I agree it cannot be detected reliably, but does Bio.SeqIO throw errors when the values fall outside the provided scale?

ADD REPLY • link 14.7 years ago by Jeremy Leipzig 22k

Ram · Answer 3 · 2011-09-21

3

13.3 years ago

Marina Manrique ★ 1.3k

SolexaQA does exactly that (among many other fancy things). Just type

solexaqa reads.fastq

and you will get the fastq format of the file: Illumina FASTQ format, Illumina pipeline 1.3+, Sanger FASTQ format, etc.

What I don't know is whether you need R installed for this functionality.

HTH,

Marina

ADD COMMENT • link updated 6.3 years ago by Ram 44k • written 13.3 years ago by Marina Manrique ★ 1.3k

0

Thanks a lot, Marina, for that link! I used the subroutine "getformat" in my Perl script and it works great. Now my wrapper can call Stampy with the appropriate FASTQ format.

ADD REPLY • link 13.1 years ago by Ngsfan ▴ 30

Ram · Answer 4 · 2010-04-29

2

14.7 years ago

Jeremy Leipzig 22k

Someone should write a script that gives out likelihoods that a fastq file is encoded a certain way. At least that will help eliminate one of the encodings.

So if you see quality scores ranging from B to a (ASCII 66 to 97):

0% Sanger
80% Illumina (a good run)
20% Solexa (a bad run)

ADD COMMENT • link updated 6.3 years ago by Ram 44k • written 14.7 years ago by Jeremy Leipzig 22k

0

Very good idea. Can't wait for the bioinformatics paper on the posterior probability, taking all current GEO data and their creation times into account.

ADD REPLY • link 13.3 years ago by Ido Tamir 5.2k

Ram · Answer 5 · 2012-09-18

Hmm... Once I got 'get_format' working, it reports Sanger, which is what fastQValidator seems to be using (Phred).

Here is the code in case anyone else gets stuck converting it out of solexaqa:

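#!/usr/bin/perl

use strict;
use warnings;

my $format = "";

# Character classes, written as ASCII ranges so that punctuation inside
# the class cannot be confused with Perl special variables.
my $sanger_regexp = qr/[\x21-\x3A]/;  # ASCII 33-58 ('!' to ':'): Sanger only
my $solexa_regexp = qr/[\x3B-\x3F]/;  # ASCII 59-63 (';' to '?'): below the Illumina 1.3+ minimum
my $solill_regexp = qr/[\x4A-\x68]/;  # ASCII 74-104 ('J' to 'h'): Solexa or Illumina
my $all_regexp    = qr/[\x40-\x49]/;  # ASCII 64-73 ('@' to 'I'): valid in every encoding, so not tested

# Flags recording which character ranges were seen
my $sanger_counter = 0;
my $solexa_counter = 0;
my $solill_counter = 0;

my $i = 0;
while (<>) {
    $i++;

    # The quality string is every fourth line of a record
    next unless $i % 4 == 0;

    chomp;

    # Any Sanger-only character settles the question, so stop reading
    if (m/$sanger_regexp/) {
        $sanger_counter = 1;
        last;
    }
    if (m/$solexa_regexp/) {
        $solexa_counter = 1;
    }
    if (m/$solill_regexp/) {
        $solill_counter = 1;
    }
}

# Determine format, from most to least specific evidence
if ($sanger_counter) {
    $format = "sanger";
}
elsif ($solexa_counter) {
    $format = "solexa";
}
elsif ($solill_counter) {
    $format = "illumina";
}
else {
    # Only characters shared by all encodings (or no reads at all) were seen
    $format = "undetermined";
}

print "$format\n";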

Ram · Answer 6 · 2012-09-18

Assuming tool X expects version Y, what range of scores would you see given version Z?

I'm seeing results like the following from fastQValidator:

Average Phred Quality by Read Index (starts at 0):
Read Index      Average Quality
0       30.14
1       9.44
2       8.88
3       9.17
4       8.89
5       8.47
6       20.36
7       18.86
8       21.23
9       22.53
10      20.64
11      20.89
12      17.91
13      20.48
14      16.72
15      21.26
16      20.06
17      21.02
18      31.05
19      18.09
20      16.62
21      29.66
22      17.08
23      16.29
24      30.37
25      28.24
26      25.93
27      25.00
28      27.13
29      26.40
30      12.63
31      13.78
32      22.34
33      13.77
34      11.67
35      12.24
36      11.75
37      20.82
38      21.13
39      19.89
40      18.43
41      18.72

Overall Average Phred Quality = 19.19
Finished processing puke2 with 4000 lines containing 1000 sequences.
There were a total of 0 errors.
Returning: 0 : FASTQ_SUCCESS
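As a rough way to reason about the "tool X expects version Y, file is version Z" question, the apparent scores are simply shifted by the difference between the two ASCII offsets. The Python sketch below is only an illustration. The names are invented for this example, the "typical" true PHRED range of 0 to 40 is an assumption, and Solexa is left out because its scale differs from PHRED at the low end.

```python
# ASCII offsets of the two PHRED-based encodings
OFFSET = {"sanger": 33, "illumina-1.3+": 64}

# Assumed range of true PHRED scores in raw reads (an assumption, not a spec)
TYPICAL_PHRED = (0, 40)


def apparent_range(actual, expected, true_range=TYPICAL_PHRED):
    """Score range a tool reports when it decodes `actual` data as `expected`."""
    shift = OFFSET[actual] - OFFSET[expected]
    low, high = true_range
    return low + shift, high + shift


for actual in OFFSET:
    for expected in OFFSET:
        print(actual, "read as", expected, "->", apparent_range(actual, expected))
```

Under these assumptions, Illumina 1.3+ data read as Sanger would never show a score below 31, and Sanger data read as Illumina 1.3+ would show mostly negative scores. If fastQValidator decodes with offset 33, per-position averages as low as 8.47 in the report above cannot come from a purely Phred+64 file, which is consistent with the "sanger" result from the Perl script.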