Hi
I'm currently using samtools.pl to generate consensus and quality from a resequencing effort. Like so many other FASTQ parsers, mine assumes a four-line record format, but samtools produces FASTQ output in which the sequence and quality each run over multiple lines.
I could of course extend the parser to accommodate this format, but it seems pretty hard to get right, since the quality information may contain both '+' and '@' as valid characters.
As I see it, the correct way to do it is to read sequence lines until reaching a line that is either a single '+' or a '+' followed by the same read name that was used on the '@' line, and then to read exactly as many quality values as there were sequence characters.
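To make the idea concrete, here is a minimal sketch in Python. The function name read_fastq is hypothetical, and the sketch assumes that the '+' line, when it is not bare, repeats the entire text of the '@' line, and that no sequence line is itself exactly '+' or '+' followed by that text. It is an illustration of the approach, not a tested production parser.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a possibly multi-line FASTQ stream.

    Assumption: the separator is a bare '+' or '+' plus the full '@' line text.
    """
    lines = (line.rstrip("\r\n") for line in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        name = header[1:]

        # Collect sequence lines until the separator line is reached.
        seq_parts = []
        for line in lines:
            if line == "+" or line == "+" + name:
                break
            seq_parts.append(line)
        else:
            raise ValueError(f"no '+' separator found for record {name!r}")
        seq = "".join(seq_parts)

        # Read quality lines by length, so '@' or '+' at the start of a
        # quality line is never mistaken for a header or a separator.
        qual_parts, qual_len = [], 0
        while qual_len < len(seq):
            line = next(lines, None)
            if line is None:
                raise ValueError(f"truncated quality for record {name!r}")
            qual_parts.append(line)
            qual_len += len(line)
        if qual_len != len(seq):
            raise ValueError(f"quality longer than sequence in {name!r}")
        yield name, seq, "".join(qual_parts)


# Small made-up example: quality lines that start with '@' and '+'.
example = "@r1\nACGT\nAC\n+\n@+II\nI@\n@r2 desc\nGG\n+r2 desc\n+@\n"
records = list(read_fastq(io.StringIO(example)))
```

The key point is that quality lines are consumed by count rather than by looking for the next '@', so only the sequence block needs a sentinel.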
I think this might work, but since there are multiple other FASTQ parser implementations, I'm curious how they deal with this issue.
Pedantic note: The paper says that in the sequence data "there is no explicit limitation on the characters expected". I don't think it's possible (and certainly not practical) to parse FASTQ files unambiguously unless you at least prohibit '+'. OTOH, "a gap character" is explicitly allowed, and the only thing not allowed is whitespace other than newline.