Tool To Find Out If Fastq Is In Sanger Or Phred64 Encoding?
11.9 years ago

Is there a simple tool I can use to quickly find out whether a FASTQ file is in Sanger or Phred64 encoding? Ideally something that prints 'Encoding XX' somewhere in the terminal output.

fastq • 53k views

The tool FastQC has a good guesser. Or use the following Perl script: fastqFormatDetect.pl

Both base their results on the characters encountered in the quality-score line of the FASTQ file. It's well explained above or on the FASTQ wiki page.


That link is out of date and returns a 404.


I'm looking for the new URL. In the meantime, I found a GitHub repository that has a copy of it, and I modified the link accordingly.


Thanks, that worked:

gunzip -c file.fastq.gz | awk 'NR % 4 == 0' | head -n 1000000 | python ./guess-encoding.py


Note that you can just pass -n 100000 as an argument to guess-encoding.py.


guess-encoding.py needs to be updated


It seems guess-encoding.py has a misleading example, suggesting cut -f 5 instead of cut -f 11 to grab quality strings.


I used guess-encoding.py on my FASTQ file and the output looks like this:

reading qualities from STDIN

Illumina-1.8 55 74

But it is not clear to me how to work out which Phred score encoding this is. Could you please guide me?
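
One possible reading of that output, assuming (as the other examples in this thread suggest) that the two numbers are the lowest and highest ASCII codes found in the quality lines, and that Illumina-1.8 uses an offset of 33: subtracting the offset from each number gives the Phred score range. The function name below is made up for illustration.

```python
def ascii_range_to_phred(min_code, max_code, offset=33):
    """Turn a (min, max) pair of ASCII codes into a (min, max) pair of Phred scores.

    offset=33 is an assumption that matches Sanger / Illumina 1.8+ files;
    pass offset=64 for the older Phred+64 encodings.
    """
    return min_code - offset, max_code - offset


# The values reported in the output above.
print(ascii_range_to_phred(55, 74))  # prints (22, 41)
```

Under those assumptions the file would be Phred+33, with scores between 22 and 41.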

9.6 years ago
Medhat 9.8k
head -n 40 file.fastq | \
  awk '{if(NR%4==0) printf("%s",$0);}' | \
  od -A n -t u1 | \
  awk '
    # min and max track the lowest and highest ASCII codes in the quality lines.
    # The opening brace has to sit on the same line as BEGIN and END.
    BEGIN {
      min=100;
      max=0;
    }

    {
      for(i=1;i<=NF;i++) {
        if($i>max) max=$i;
        if($i<min) min=$i;
      }
    }

    END {
      if(max<=74 && min<59) print "Phred+33";
      else if(max>73 && min>=64) print "Phred+64";
      else if(min>=59 && min<64 && max>73) print "Solexa+64";
      else print "Unknown score encoding!";
    }
    '

source
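
For anyone who prefers Python, the same min/max heuristic can be sketched as below. This is only a port of the thresholds in the awk script above, not a separately validated detector; the function name is hypothetical, and the input is assumed to be an iterable of quality lines (every fourth line of the FASTQ file).

```python
def guess_offset(quality_lines):
    """Guess the quality encoding from the lowest and highest ASCII codes seen."""
    codes = [ord(c) for line in quality_lines for c in line.strip()]
    if not codes:
        return "Unknown score encoding!"
    lo, hi = min(codes), max(codes)
    # Same thresholds, in the same order, as the awk END block above.
    if hi <= 74 and lo < 59:
        return "Phred+33"
    if hi > 73 and lo >= 64:
        return "Phred+64"
    if 59 <= lo < 64 and hi > 73:
        return "Solexa+64"
    return "Unknown score encoding!"


print(guess_offset(["IIII#5J"]))  # prints Phred+33
```

As with the awk version, a small sample can be misleading: a Phred+64 file whose first reads happen to contain only high scores is still detected, but a sample that covers only a narrow range may fall through to the "Unknown" branch.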

11.9 years ago
Irsan ★ 7.8k

If the quality scores contain the character 0, the file is either Sanger Phred+33 or Illumina 1.8+ Phred+33. If they also contain the character J, it is Illumina 1.8+ Phred+33; otherwise it is Sanger Phred+33.

If the quality scores do not contain 0, the file is Solexa+64, Illumina 1.3+ Phred+64, or Illumina 1.5+ Phred+64:

It is Solexa+64 if it contains the character =.

It is Illumina 1.3+ Phred+64 if it contains A.

It is Illumina 1.5+ Phred+64 if it contains neither A nor =.

Take a look at the wiki and try to understand the table
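
The rules above can be written down as a small function. This is a literal sketch of the marker-character checks as stated in this answer, so it inherits their limits: it relies on the sample actually containing the marker characters, and it will mislabel a file that simply has no reads with those particular scores. The function name is hypothetical.

```python
def classify_by_marker_chars(quality_lines):
    """Apply the marker-character rules to an iterable of quality lines."""
    seen = set("".join(line.strip() for line in quality_lines))
    if "0" in seen:
        # "0" only shows up in the Phred+33 family; "J" separates Illumina 1.8+.
        return "Illumina 1.8+ Phred+33" if "J" in seen else "Sanger Phred+33"
    if "=" in seen:
        return "Solexa+64"
    if "A" in seen:
        return "Illumina 1.3+ Phred+64"
    return "Illumina 1.5+ Phred+64"


print(classify_by_marker_chars(["0123J", "IIII5"]))  # prints Illumina 1.8+ Phred+33
```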

11.9 years ago
toni ★ 2.2k

You can use this tool: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

It has an internal automatic guesser.

T.


Thank you so much. This is so easy for a beginner like me.

9.6 years ago

BBMap has a little tool for this:

$ testformat.sh in=N0174.fq.gz
sanger    fastq    gz    interleaved    150bp

Hello Brian,

I was running some tests with the solutions provided in this post on this file, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR527/ERR527007/ERR527007_1.fastq.gz, which apparently has an incorrect ASCII encoding according to the BBDuk report.

Surprisingly, only your solution reports the error (i.e. Exception in thread "main" java.lang.AssertionError: ASCII encoding for quality (currently ASCII-33) appears to be wrong).

The others report a valid ASCII encoding, e.g.:

  • gunzip -c ERR527007_1.fastq.gz | awk 'NR % 4 == 0' | head -n 100000 | python ./guess-encoding.py

Illumina-1.8 35 74

  • gunzip -c ERR527007_1.fastq.gz | head -n 40 | awk '{if(NR%4==0) printf("%s",$0);}' | od -A n -t u1 | awk 'BEGIN{min=100;max=0;}{for(i=1;i<=NF;i++) {if($i>max) max=$i; if($i<min) min=$i;}}END{if(max<=74 && min<59) print "Phred+33"; else if(max>73 && min>=64) print "Phred+64"; else if(min>=59 && min<64 && max>73) print "Solexa+64"; else print "Unknown score encoding\!";}'

Phred+33

Why does this happen? Should we, in general, be concerned about ASCII-encoding issues going undetected in modern Illumina-generated files? I know this is probably not worth digging into, as the sequencing file is pretty old and the issue most likely does not carry over to more recent sequencing files. I am just asking because I spent quite a bit of time on these files, and I am a bit confused about why different strategies for detecting ASCII encodings report different results.

Thank you very much in advance, and sorry for the long question (and for reviving this pretty old post).

11.9 years ago
Gvj ▴ 470

If you are looking for a quick and dirty method, just grep for any character unique to Sanger or Phred64. You can find them at http://en.wikipedia.org/wiki/FASTQ_format

grep Z filename # for Phred64 and make sure that the lines are not headers
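
To avoid matching header lines, the check can be restricted to the quality lines. A minimal sketch, assuming a plain four-line-per-record FASTQ file (no wrapped sequences); the function name and the sample records are made up for illustration.

```python
def has_char_in_quality(fastq_lines, char="Z"):
    """Return True if `char` occurs in any quality line (every 4th line)."""
    return any(
        char in line
        for number, line in enumerate(fastq_lines, start=1)
        if number % 4 == 0
    )


records = [
    "@readZ", "ACGT", "+", "IIII",  # Z only in the header, which is ignored
    "@read2", "ACGT", "+", "ZZhh",  # Z in the quality line
]
print(has_char_in_quality(records))      # prints True
print(has_char_in_quality(records[:4]))  # prints False
```

For a real file, the list can be replaced by an open file handle, since iterating over it yields lines in the same way.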

8.7 years ago
n.caillou ▴ 50

As noted by medhat above, GNU od or hexdump can be used to convert the quality scores to their decimal value, so

 cat file.fq | awk 'NR%4==0' | tr -d '\n' | hexdump -v -e'/1 "%u\n"' | sort -nu

will display which (decimal) quality scores exist in your file.

According to brentp's "guess-encoding.py" script the possible ranges are 33-93 (Sanger/Illumina1.8), 64-104 (Illumina1.3 or Illumina1.5) and 59-104 (Solexa). Similarly FastQC assumes that anything with some scores in the 33-63 range is Sanger and that the rest is Illumina1.3-1.5 (it doesn't know about Solexa scores).
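
A rough Python sketch of the same idea: list the distinct decimal values, then check which of the ranges quoted above could contain all of them. The ranges are copied from this post as stated and have not been checked against the script itself; the names RANGES, distinct_codes and compatible_encodings are hypothetical.

```python
# Ranges as quoted in this post (inclusive ASCII codes).
RANGES = {
    "Sanger/Illumina1.8": (33, 93),
    "Illumina1.3 or Illumina1.5": (64, 104),
    "Solexa": (59, 104),
}


def distinct_codes(quality_lines):
    """Sorted list of the distinct ASCII codes in the quality lines."""
    return sorted({ord(c) for line in quality_lines for c in line.strip()})


def compatible_encodings(codes):
    """Names of all ranges that contain every observed code."""
    if not codes:
        return []
    lo, hi = min(codes), max(codes)
    return [name for name, (rmin, rmax) in RANGES.items() if rmin <= lo and hi <= rmax]


codes = distinct_codes(["II#", "5#"])
print(codes)                        # prints [35, 53, 73]
print(compatible_encodings(codes))  # prints ['Sanger/Illumina1.8']
```

Because the ranges overlap, more than one name can come back; that is expected when the sampled scores all fall in the shared region.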

8.4 years ago
Shicheng Guo ★ 9.6k

Install BBMap and then use the following script:

Usage:  reformat.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2>

reformat.sh in=Indx07.read1.fq out=Indx07.read1.phred33.fq qin=64 qout=33
reformat.sh in=Indx07.read2.fq out=Indx07.read2.phred33.fq qin=64 qout=33
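
For reference, a 64-to-33 conversion amounts to shifting every quality character down by 31 ASCII positions. The sketch below shows that on a single quality string; it is not BBMap's implementation, the function name is hypothetical, and it assumes true Phred+64 scores (Solexa scores use a different scale and can be negative, which this sketch rejects).

```python
def phred64_to_phred33(quality):
    """Re-encode one Phred+64 quality string as Phred+33."""
    converted = []
    for char in quality:
        score = ord(char) - 64
        if score < 0:
            # A character below "@" cannot be a Phred+64 score.
            raise ValueError(f"{char!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))
    return "".join(converted)


print(phred64_to_phred33("hB@"))  # prints I#!
```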
7.2 years ago
ando.kelli ▴ 60

Hey there, if you run FastQC you can see the quality format in the main output screen, in the section marked "Encoding"
