Is there a simple tool I can use to quickly find out if a FASTQ file is in Sanger or Phred64 encoding? Ideally something that tells me 'Encoding XX' somewhere in the terminal output.
brentp has a nice utility that does just that; see https://github.com/brentp/bio-playground/blob/master/reads-utils/guess-encoding.py
See also this: Guessing the quality scale in FASTQ files
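# Guess the quality encoding from the first 10 reads (40 lines) of file.fastq.
head -n 40 file.fastq | \
awk '{if(NR%4==0) printf("%s",$0);}' | \
od -v -A n -t u1 | \
awk '
# Track the smallest and largest ASCII code seen among the quality characters.
BEGIN {
min=100;
max=0;
}
{
for(i=1;i<=NF;i++) {
if($i>max) max=$i;
if($i<min) min=$i;
}
}
# Classify the file by the observed range of codes.
END {
if(max<=74 && min<59) print "Phred+33";
else if(max>73 && min>=64) print "Phred+64";
else if(min>=59 && min<64 && max>73) print "Solexa+64";
else print "Unknown score encoding!";
}
'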
If the quality scores contain the character 0, the file is either Sanger Phred+33 or Illumina 1.8+ Phred+33. If they also contain the character J, it is Illumina 1.8+ Phred+33; otherwise it is Sanger Phred+33.
If the quality scores do not contain 0, the file is Solexa+64, Illumina 1.3+ Phred+64, or Illumina 1.5+ Phred+64:
It is Solexa+64 if the scores contain the character =
It is Illumina 1.3+ Phred+64 if they contain A
It is Illumina 1.5+ Phred+64 if they contain neither A nor =
Take a look at the wiki and try to understand the table
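The character rules above can be written down as a small Python sketch. The function name `guess_by_marker_chars` is made up for illustration, and the rules are only the heuristics from this post: a small sample of reads may simply not contain the marker characters, so treat the answer as a guess.

```python
def guess_by_marker_chars(quality_lines):
    """Apply the character-presence heuristics from this post to quality strings.

    quality_lines: iterable of quality strings (the 4th line of each FASTQ record).
    """
    seen = set("".join(quality_lines))
    if "0" in seen:
        # '0' only occurs in the Phred+33 encodings; 'J' is taken as the Illumina 1.8+ marker.
        return "Illumina 1.8+ Phred+33" if "J" in seen else "Sanger Phred+33"
    if "=" in seen:
        return "Solexa+64"
    if "A" in seen:
        return "Illumina 1.3+ Phred+64"
    return "Illumina 1.5+ Phred+64"


print(guess_by_marker_chars(["IIII0J", "HHHH5#"]))
```

The more quality lines you feed it, the less likely it is that a marker character is missing just by chance.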
You can use this tool: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
It has an internal automatic guesser.
T.
BBMap has a little tool for this:
$ testformat.sh in=N0174.fq.gz
sanger fastq gz interleaved 150bp
Hello Brian,
I was doing some tests with the multiple solutions provided in this post on this file, ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR527/ERR527007/ERR527007_1.fastq.gz, which apparently has an incorrect ASCII encoding according to the BBDuk report.
Surprisingly, only your solution reports the error (i.e. Exception in thread "main" java.lang.AssertionError: ASCII encoding for quality (currently ASCII-33) appears to be wrong).
The rest report a proper ASCII encoding, i.e.:
gunzip -c ERR527007_1.fastq.gz | awk 'NR % 4 == 0' | head -n 100000 | python ./guess-encoding.py
Illumina-1.8 35 74
gunzip -c ERR527007_1.fastq.gz | head -n 40 | awk '{if(NR%4==0) printf("%s",$0);}' | od -A n -t u1 | awk 'BEGIN{min=100;max=0;}{for(i=1;i<=NF;i++) {if($i>max) max=$i; if($i<min) min=$i;}}END{if(max<=74 && min<59) print "Phred+33"; else if(max>73 && min>=64) print "Phred+64"; else if(min>=59 && min<64 && max>73) print "Solexa+64"; else print "Unknown score encoding\!";}'
Phred+33
Why does this happen? Should we, in general, be concerned about ASCII-encoding issues going undetected in modern Illumina-generated files? I know this is probably not a problem worth digging into, as the sequencing file is pretty old and the issue most likely does not carry over to more recent sequencing files. I am just asking because I spent quite a bit of time on these files, and I am a bit confused about how different strategies for detecting ASCII encodings report different outputs.
Thank you very much in advance, and sorry for the long question (and for reviving this pretty old post).
If you are searching for a quick and dirty method, then just grep for any character that is unique to Sanger or to Phred64. You can find them at http://en.wikipedia.org/wiki/FASTQ_format
grep Z filename # for Phred64 and make sure that the lines are not headers
As noted by medhat above, GNU od or hexdump can be used to convert the quality scores to their decimal value, so
cat file.fq | awk 'NR%4==0' | tr -d '\n' | hexdump -v -e'/1 "%u\n"' | sort -nu
will display which (decimal) quality scores exist in your file.
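For those without hexdump at hand, here is a rough Python equivalent of that pipeline. The function name `distinct_quality_codes` is made up for illustration, and like the awk 'NR%4==0' filter it assumes strict four-line FASTQ records (no wrapped sequence lines).

```python
def distinct_quality_codes(fastq_lines):
    """Return the sorted distinct ASCII codes found on the quality lines."""
    codes = set()
    for number, line in enumerate(fastq_lines, start=1):
        if number % 4 == 0:  # every 4th line of a record is the quality string
            codes.update(ord(ch) for ch in line.rstrip("\r\n"))
    return sorted(codes)


# A single toy record; for a real file, pass an open file handle instead.
record = ["@read1\n", "ACGT\n", "+\n", "II5#\n"]
print(distinct_quality_codes(record))  # [35, 53, 73]
```

For a real file you could call it as distinct_quality_codes(open("file.fq")), or wrap the file with gzip.open(..., "rt") if it is compressed.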
According to brentp's "guess-encoding.py" script the possible ranges are 33-93 (Sanger/Illumina1.8), 64-104 (Illumina1.3 or Illumina1.5) and 59-104 (Solexa). Similarly FastQC assumes that anything with some scores in the 33-63 range is Sanger and that the rest is Illumina1.3-1.5 (it doesn't know about Solexa scores).
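To make the range comparison concrete, here is a minimal Python sketch that uses only the three ranges quoted above. It is not a copy of guess-encoding.py, and the names `RANGES` and `candidate_encodings` are made up for illustration. It returns every encoding whose range contains the observed minimum and maximum, which also shows why a narrow observed range can stay ambiguous.

```python
# Inclusive ASCII code ranges as quoted in this post.
RANGES = {
    "Sanger/Illumina1.8": (33, 93),
    "Illumina1.3 or Illumina1.5": (64, 104),
    "Solexa": (59, 104),
}


def candidate_encodings(min_code, max_code):
    """List the encodings whose range covers the observed min and max codes."""
    return [
        name
        for name, (low, high) in RANGES.items()
        if low <= min_code and max_code <= high
    ]


print(candidate_encodings(35, 74))   # ['Sanger/Illumina1.8']
print(candidate_encodings(66, 104))  # ['Illumina1.3 or Illumina1.5', 'Solexa']
```

An empty list means the observed codes fit none of the three ranges.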
Install BBMap and then use the following script:
Usage: reformat.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2>
reformat.sh in=Indx07.read1.fq out=Indx07.read1.phred33.fq qin=64 qout=33
reformat.sh in=Indx07.read2.fq out=Indx07.read2.phred33.fq qin=64 qout=33
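For intuition, the arithmetic implied by qin=64 qout=33 is just a shift of every quality character by 31 ASCII positions. The Python sketch below illustrates that shift; it is not how reformat.sh is implemented, the function name `phred64_to_phred33` is made up, and it assumes plain Phred+64 input (Solexa+64 scores below 0 are rejected rather than converted).

```python
def phred64_to_phred33(quality):
    """Re-encode one quality string from ASCII offset 64 to ASCII offset 33."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode with the Phred+64 offset
        if score < 0:
            raise ValueError(f"character {ch!r} is not valid Phred+64")
        converted.append(chr(score + 33))  # re-encode with the Phred+33 offset
    return "".join(converted)


print(phred64_to_phred33("hhB@"))  # II#!
```

The ValueError is a cheap sanity check: if it fires, the input was probably not Phred+64 in the first place.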
Hey there, if you run FastQC you can see the quality format in the main output screen, in the section marked "Encoding"
The tool FastQC has a good guesser. Or use the following perl script: fastqFormatDetect.pl
Both base their results on the characters encountered in the quality line of the FASTQ file. This is well explained above and on the FASTQ wiki page.
That link is too old and gives 404
I'm looking for the new URL ... nevertheless I found a Github that had a copy of it. I modified the link accordingly.