Question

Tool To Find Out If Fastq Is In Sanger Or Phred64 Encoding?

16

11.8 years ago

14134125465346445 ★ 3.6k

Is there a simple tool I can use to quickly find out whether a FASTQ file is in Sanger or Phred64 encoding? Ideally something that prints 'Encoding XX' somewhere in the terminal output.

fastq • 52k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 11.8 years ago by 14134125465346445 ★ 3.6k

0

FastQC has a good guesser. Alternatively, use the following Perl script: fastqFormatDetect.pl

Both base their guess on the characters encountered in the quality line of the FASTQ file. This is well explained above and on the FASTQ wiki page.

ADD REPLY • link 8.1 years ago by Juke34 8.9k

0

That link is outdated and returns a 404.

ADD REPLY • link 8.1 years ago by Xapple ▴ 30

0

I'm still looking for the new URL. In the meantime I found a GitHub repository that has a copy of it, and I updated the link accordingly.

ADD REPLY • link 8.1 years ago by Juke34 8.9k

Ram · Answer 1 · 2013-02-08

16

11.8 years ago

Istvan Albert 101k

brentp has a nice utility that does just that; see https://github.com/brentp/bio-playground/blob/master/reads-utils/guess-encoding.py

See also this: Guessing the quality scale in FASTQ files

ADD COMMENT • link 11.8 years ago by Istvan Albert 101k

2

Thanks, that worked:

gunzip -c file.fastq.gz | awk 'NR % 4 == 0' | head -n 1000000 | python ./guess-encoding.py

ADD REPLY • link 11.8 years ago by 14134125465346445 ★ 3.6k

2

Note that you can just pass -n 100000 as an argument to guess-encoding.py instead of piping through head.

ADD REPLY • link 11.8 years ago by brentp 24k

0

guess-encoding.py needs to be updated

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.5 years ago by Medhat 9.8k

0

It seems guess-encoding.py has a misleading example, suggesting cut -f 5 instead of cut -f 11 to grab quality strings.

ADD REPLY • link 6.5 years ago by johnsenkyle13 • 0

0

I ran guess-encoding.py on my FASTQ file and the output looks like this:

reading qualities from STDIN

Illumina-1.8 55 74

However, it is not clear to me how to tell from this which Phred encoding the file uses. Could you please guide me?

ADD REPLY • link 18 months ago by Sara ▴ 30
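One way to read an output line such as `Illumina-1.8 55 74`: assuming the two numbers are the smallest and largest ASCII codes seen in the quality strings (an assumption about the script's output, not something confirmed in this thread), and that Illumina 1.8 uses an offset of 33, the score range follows by subtraction. The sketch below is illustrative only; `OFFSETS` and `score_range` are hypothetical names.

```python
# Offsets for the common FASTQ quality encodings (ASCII code = score + offset).
OFFSETS = {
    "Sanger": 33,
    "Illumina-1.8": 33,
    "Solexa": 64,
    "Illumina-1.3": 64,
    "Illumina-1.5": 64,
}


def score_range(encoding, min_ascii, max_ascii):
    """Convert an observed ASCII range into a quality-score range."""
    offset = OFFSETS[encoding]
    return min_ascii - offset, max_ascii - offset


# Values taken from the output line quoted above.
low, high = score_range("Illumina-1.8", 55, 74)
print(f"Offset {OFFSETS['Illumina-1.8']}: scores {low} to {high}")
```

Under those assumptions the file would be Phred+33 with scores from 22 to 41.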

Ram · Answer 2 · 2015-06-10

head -n 40 file.fastq | \
  awk '{if(NR%4==0) printf("%s",$0);}' | \
  od -A n -t u1 | \
  awk '
    # Track the smallest and largest ASCII code among the quality characters.
    BEGIN {
      min=100;
      max=0;
    }

    {
      for(i=1;i<=NF;i++) {
        if($i>max) max=$i;
        if($i<min) min=$i;
      }
    }

    END {
      if(max<=74 && min<59) print "Phred+33";
      else if(max>73 && min>=64) print "Phred+64";
      else if(min>=59 && min<64 && max>73) print "Solexa+64";
      else print "Unknown score encoding!";
    }
    '

source
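For anyone who prefers Python over awk, the same thresholds can be sketched as follows. This is a rough port under stated assumptions, not the original author's code: it assumes a strict four-lines-per-record FASTQ, and `quality_lines` and `guess_offset` are hypothetical helper names.

```python
import gzip
from itertools import islice


def quality_lines(path, max_records=10):
    """Yield the quality line (every 4th line) of the first max_records records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Line indices 3, 7, 11, ... hold the quality strings.
        for line in islice(handle, 3, max_records * 4, 4):
            yield line.rstrip("\r\n")


def guess_offset(lines):
    """Apply the min/max thresholds from the awk script to quality strings."""
    codes = [ord(ch) for line in lines for ch in line]
    lo, hi = min(codes), max(codes)
    if hi <= 74 and lo < 59:
        return "Phred+33"
    if hi > 73 and lo >= 64:
        return "Phred+64"
    if 59 <= lo < 64 and hi > 73:
        return "Solexa+64"
    return "Unknown score encoding"
```

A call such as `guess_offset(quality_lines("file.fastq"))` mirrors the `head -n 40` pipeline. Like the awk version, it only sees the first few records, so a larger `max_records` gives a more reliable guess.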

Ram · Answer 3 · 2013-02-08

If the quality scores contain the character 0, the encoding is either Sanger Phred+33 or Illumina 1.8+ Phred+33. If they also contain the character J, it is Illumina 1.8+ Phred+33; otherwise it is Sanger Phred+33.

If the quality scores do not contain 0, the encoding is Solexa+64, Illumina 1.3+ Phred+64, or Illumina 1.5+ Phred+64:

It is Solexa+64 if the scores contain the character =.

It is Illumina 1.3+ Phred+64 if they contain A.

It is Illumina 1.5+ Phred+64 if they contain neither A nor =.

Take a look at the wiki and try to understand the table
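The character rules above can be sketched in a few lines of Python. This is only an illustration of the heuristic as written in this answer; `classify_by_characters` is a hypothetical name, and the result is only as good as the sample (a Phred+33 file that happens to contain no 0 would be misclassified).

```python
def classify_by_characters(quality_text):
    """Apply the marker-character rules to concatenated quality strings."""
    chars = set(quality_text)
    if "0" in chars:
        # "0" (ASCII 48) lies below every +64 range, so the offset must be 33.
        if "J" in chars:
            return "Illumina 1.8+ Phred+33"
        return "Sanger Phred+33"
    if "=" in chars:
        return "Solexa+64"
    if "A" in chars:
        return "Illumina 1.3+ Phred+64"
    return "Illumina 1.5+ Phred+64"


print(classify_by_characters("IIIIHG0#J"))
```

Feeding it a large sample of quality lines joined into one string makes it less likely that a marker character is missed.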

Ram · Answer 4 · 2013-02-08

6

11.8 years ago

toni ★ 2.2k

You can use this tool: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

It has an internal automatic guesser.

T.

ADD COMMENT • link updated 23 months ago by Ram 44k • written 11.8 years ago by toni ★ 2.2k

0

thank you so much. this is so easy for a beginner like me.

ADD REPLY • link 20 months ago by Muhammad • 0

Ram · Answer 5 · 2015-06-10

5

9.5 years ago

Brian Bushnell 20k

BBMap has a little tool for this:

$ testformat.sh in=N0174.fq.gz
sanger    fastq    gz    interleaved    150bp

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

0

Hello Brian,

I was testing the various solutions provided in this post on this file: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR527/ERR527007/ERR527007_1.fastq.gz, which apparently has an incorrect ASCII encoding according to the BBDuk report.

Surprisingly, only your solution reports the error (i.e. Exception in thread "main" java.lang.AssertionError: ASCII encoding for quality (currently ASCII-33) appears to be wrong)

The others report a valid ASCII encoding, for example:

gunzip -c ERR527007_1.fastq.gz | awk 'NR % 4 == 0' | head -n 100000 | python ./guess-encoding.py

Illumina-1.8 35 74

gunzip -c ERR527007_1.fastq.gz | head -n 40 | awk '{if(NR%4==0) printf("%s",$0);}' | od -A n -t u1 | awk 'BEGIN{min=100;max=0;}{for(i=1;i<=NF;i++) {if($i>max) max=$i; if($i<min) min=$i;}}END{if(max<=74 && min<59) print "Phred+33"; else if(max>73 && min>=64) print "Phred+64"; else if(min>=59 && min<64 && max>73) print "Solexa+64"; else print "Unknown score encoding\!";}'

Phred+33

Why does this happen? Should we, in general, be concerned about ASCII-encoding issues going undetected in modern Illumina-generated files? I know this is probably not worth digging into, as the sequencing file is pretty old and the issue most likely does not carry over to more recent files. I am asking because I spent quite a bit of time on these files, and I am confused about why different strategies for detecting the ASCII encoding report different results.

Thank you very much in advance, and sorry for the long question (and for reviving this pretty old post).

ADD REPLY • link 2.4 years ago by JMMM ▴ 10

score 2 · Answer 6 · 2013-02-08

2

11.8 years ago

Gvj ▴ 470

If you are looking for a quick and dirty method, just grep for any character that is unique to Sanger or to Phred64. You can find them at http://en.wikipedia.org/wiki/FASTQ_format

grep Z filename # for Phred64 and make sure that the lines are not headers

ADD COMMENT • link 11.8 years ago by Gvj ▴ 470

score 2 · Answer 7 · 2016-05-04

As noted by medhat above, GNU od or hexdump can be used to convert the quality scores to their decimal value, so

 cat file.fq | awk 'NR%4==0' | tr -d '\n' | hexdump -v -e'/1 "%u\n"' | sort -nu

will display which (decimal) quality scores exist in your file.

According to brentp's "guess-encoding.py" script the possible ranges are 33-93 (Sanger/Illumina1.8), 64-104 (Illumina1.3 or Illumina1.5) and 59-104 (Solexa). Similarly FastQC assumes that anything with some scores in the 33-63 range is Sanger and that the rest is Illumina1.3-1.5 (it doesn't know about Solexa scores).
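To turn the observed minimum and maximum into a list of candidate encodings, the ranges quoted above can be checked directly. The sketch below only uses the three ranges stated in this answer; `RANGES`, `observed_range` and `compatible_encodings` are hypothetical names, and the function returns every encoding whose range contains the observed values, so more than one may match.

```python
# ASCII ranges as quoted above from guess-encoding.py.
RANGES = {
    "Sanger/Illumina1.8": (33, 93),
    "Solexa": (59, 104),
    "Illumina1.3/1.5": (64, 104),
}


def observed_range(quality_lines):
    """Return the smallest and largest ASCII code in the quality lines."""
    codes = [ord(ch) for line in quality_lines for ch in line.rstrip("\r\n")]
    return min(codes), max(codes)


def compatible_encodings(lo, hi):
    """List every encoding whose range fully contains [lo, hi]."""
    return [
        name
        for name, (rmin, rmax) in RANGES.items()
        if rmin <= lo and hi <= rmax
    ]


lo, hi = observed_range(["IIIIHHGG#\n", "BBBCCC5\n"])
print(lo, hi, compatible_encodings(lo, hi))
```

An empty result means the values fit none of the known ranges, which is worth treating as a warning sign rather than guessing.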

score 1 · Answer 8 · 2016-09-09

1

8.2 years ago

Shicheng Guo ★ 9.5k

Install BBMap and then use the following script:

Usage:  reformat.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2>

reformat.sh in=Indx07.read1.fq out=Indx07.read1.phred33.fq qin=64 qout=33
reformat.sh in=Indx07.read2.fq out=Indx07.read2.phred33.fq qin=64 qout=33

ADD COMMENT • link 8.2 years ago by Shicheng Guo ★ 9.5k
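Conceptually, converting from offset 64 to offset 33 means shifting every quality character down by 31 ASCII positions. The sketch below illustrates that arithmetic for true Phred+64 data only; it makes no claim about how reformat.sh is implemented, does not handle Solexa scores (which use a different scale), and `phred64_to_phred33` is a hypothetical name.

```python
def phred64_to_phred33(quality):
    """Re-encode one quality string from offset 64 to offset 33."""
    shifted = []
    for ch in quality:
        score = ord(ch) - 64
        if score < 0:
            # A character below "@" cannot be valid Phred+64.
            raise ValueError(f"{ch!r} is below the Phred+64 range")
        shifted.append(chr(score + 33))
    return "".join(shifted)


print(phred64_to_phred33("hhgfB@"))  # prints IIHG#!
```

The range check is a cheap safeguard: if it raises, the input was probably not Phred+64 in the first place.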

score 1 · Answer 9 · 2017-10-31

1

7.1 years ago

ando.kelli ▴ 60

Hey there, if you run FastQC you can see the quality format in the main output screen, in the section marked "Encoding"

ADD COMMENT • link 7.1 years ago by ando.kelli ▴ 60