Is there an easy way to guess the scale, given a sufficiently large FASTQ file?
The best would be some working code that I could learn from. However, both BioPerl and BioPython appear not to contain guessing code.
Have you read the BioPython code here? It's the best explanation of the quality scores I've seen.
There's also a nice text graphic about two-thirds of the way down the Wikipedia page.
Finally, FastQC guesses the encoding of your quality scores, so you could look at its Java code.
Here is a Perl script for guessing the quality scale:
https://www.uppnex.uu.se/content/check-fastq-quality-score-format
Here is the new link for this Perl tool: http://www.uppmax.uu.se/userscript/check-fastq-quality-score-format
It has been improved recently.
-- update --
You can find it in this repository under the name fastq_guessMyFormat.pl: https://github.com/NBISweden/GAAS/tree/master/annotation/Tools/Util
Here is a link to download it directly.
Does the FASTX-Toolkit meet your needs? http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastq_quality_boxplot_usage
I wrote a Python-based FASTQ quality guesser: https://github.com/DarwinAwardWinner/fastqident It uses BioPython's FASTQ parser, so it will work on anything that is parsable by BioPython.
The placsupport module can be found at https://github.com/DarwinAwardWinner/placsupport
Isn't that solving the wrong problem? The guessing code in FastQC looks fragile: it simply looks at the smallest code used for qualities, so it depends on actually seeing low-quality bases.
I believe you should get the correct encoding from extra knowledge (i.e. knowing which version of which program generated the file, say from some log file), and then convert to a well-specified format (e.g. BAM) once. Please don't perpetuate the practice of guessing at the details of underspecified formats.
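For what it's worth, once the encoding is known from such outside information, the conversion itself is small. Here is a rough sketch, assuming the file is known to be Phred+64 (Illumina 1.3 to 1.7) and you want standard Phred+33. The function name `phred64_to_phred33` is made up for illustration, and Solexa-scaled scores are deliberately not handled because they need a log transform rather than a plain shift.

```python
def phred64_to_phred33(qual):
    """Re-encode one Phred+64 quality string as Phred+33.

    Hypothetical helper: only valid when the input is KNOWN to be
    Phred+64. Solexa scores (which can go below ASCII 64) are rejected.
    """
    out = []
    for ch in qual:
        q = ord(ch) - 64          # recover the Phred score
        if q < 0:
            raise ValueError(
                "character %r is below the Phred+64 range; "
                "this does not look like Phred+64 data" % ch
            )
        out.append(chr(q + 33))   # write it back with the Sanger offset
    return "".join(out)


if __name__ == "__main__":
    print(phred64_to_phred33("hhB@"))
```

If you already use BioPython, `Bio.SeqIO.convert` with the `fastq-illumina` and `fastq` format names should do the same job on whole files.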
In addition to Ryan's, I have a Python-based FASTQ quality guesser as well, if you would like to use it. It is just standard Python (no BioPython). PM me if interested.
Thanks. BioPython does not have guessing code, though, right? FastQC just looks at the lowest quality seen. I guess that's the most promising approach, then, maybe augmented by checking an upper limit, too.
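A minimal sketch of that idea (lowest code seen, plus an upper-limit check), in plain Python. This is not FastQC's actual code. It assumes four lines per record (no wrapped sequences) and the commonly quoted ASCII ranges: Phred+33 starts at 33, Solexa+64 at 59, Phred+64 at 64. The function names and the cut-off of 74 as the top of the usual Phred+33 range are my own choices.

```python
import itertools


def quality_range(lines, max_records=100000):
    """Return (lowest, highest) ASCII code seen on the quality lines.

    `lines` is any iterable of FASTQ lines, e.g. an open file.
    Assumes exactly four lines per record.
    """
    lo, hi = 255, 0
    # Lines 3, 7, 11, ... (0-based) hold the quality strings.
    for qual in itertools.islice(lines, 3, max_records * 4, 4):
        codes = [ord(c) for c in qual.rstrip("\r\n")]
        if codes:
            lo = min(lo, min(codes))
            hi = max(hi, max(codes))
    return lo, hi


def guess_encoding(lo, hi):
    """Guess the scale from the observed range; may return 'ambiguous'."""
    if lo > hi:
        raise ValueError("no quality values seen")
    if lo < 33 or hi > 126:
        raise ValueError("codes outside printable ASCII: not FASTQ qualities")
    if lo < 59:
        return "phred33"    # Sanger / Illumina 1.8+
    if lo < 64:
        return "solexa64"   # Solexa scores can go down to -5 (ASCII 59)
    if hi > 74:
        return "phred64"    # Illumina 1.3 to 1.7
    # Only codes 64..74 seen: could be Phred+33 data with no low-quality bases.
    return "ambiguous"


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        print(guess_encoding(*quality_range(handle)))
```

The "ambiguous" result is the fragile case mentioned above: if the file contains no low-quality bases, the lowest code alone cannot separate the scales, so the guess is only as good as the sample.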