Question

Correct, Robust Fastq Parsing?

9

Entering edit mode

13.7 years ago

Ketil 4.1k

Hi

I'm currently using samtools.pl to generate consensus and quality from a resequencing effort. Like so many other Fastq parsers, I made the assumption of a four-line format, but samtools produces a fastq format where sequence and quality runs over multiple lines.

I could of course extend the parser to accomodate this format, but this seems pretty hard to get correct, since quality information may contain both + and @ as valid characters.

As I see it, the correct way to do it is to read sequence until a line with either a single '+' OR a '+' followed by the same read name as was used in the '@' line. And then read the same number of quality values.

I think this might work, but as there are multiple other Fastq parser implementation, I'm curious how these deal with this issue?

fastq parsing • 4.9k views

ADD COMMENT • link updated 13.7 years ago by biobot 0.0.77.a.1099 6.2k • written 13.7 years ago by Ketil 4.1k

score 6 · Answer 1 · 2011-04-06

6

Entering edit mode

13.7 years ago

Peter 6.0k

In general the quality string can contain both the @ and + characters, and if you allow line wrapping this means any quality line may start with those characters. The only way to be safe is to track the length of the sequence, and thus know how many quality characters there should be. This also has implications for quickly counting the number of reads, and for indexing.

Please read this paper http://dx.doi.org/10.1093/nar/gkp1137 and test your code with the provided example files which are used in the Biopython, BioPerl, BioJava, BioRuby and EMBOSS test suites.

ADD COMMENT • link 13.7 years ago by Peter 6.0k

0

Entering edit mode

Pedantic note: The paper says that in the sequence data "there is no explicit limitation on the characters expected". I don't think it's possible (and certainly not practical) to parse FastQ files unambigously unless you at least prohibit '+'. OTOH, "a gap character" is explicitly allowed, and the only thing not allowed is whitespace other than newline.

ADD REPLY • link 13.7 years ago by Ketil 4.1k

score 3 · Answer 2 · 2011-04-06

3

Entering edit mode

13.7 years ago

brentp 24k

You can see how biopython handles this here

Check the docstring (just below the linked line) for some of the worst-cases you'd have to account for if you re-implemented yourself.

ADD COMMENT • link 13.7 years ago by brentp 24k

0

Entering edit mode

Looks fairly complete. It fails on empty reads (e.g. "@foon+n"), and nameless reads (@nACGTn+nBCDE") and doesn't treat "r" as whitespace (i.e. it is silently ignored), but these are arguably not illegal or actual limitations.

ADD REPLY • link 13.7 years ago by Ketil 4.1k

score 2 · Answer 3 · 2011-04-06

2

Entering edit mode

13.7 years ago

biobot 0.0.77.a.1099 6.2k

That's exactly how I parse Fastq; count the sequence characters, count the quality characters. I also catch pathological cases where the identity is an empty string. This is technically legal because the Fastq paper explicitly states that there is no length limit on the title field. The authors probably meant no upper length limit only.

The Fastq paper is a high-visibility source that describes the format as

@title and optional description
sequence line(s)
+optional repeat of title line
quality line(s)

which means a lot of people will expect general-purpose Fastq parsers to handle this by default.

ADD COMMENT • link 13.7 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

Good point. Leaving out the identifier is a common compression trick - probably about to be explicitly allowed for SAM/BAM non-paired data. It makes sense to allow this in FASTQ (and FASTA and QUAL) too - I'd have to check my parser copes.

ADD REPLY • link 13.7 years ago by Peter 6.0k

Ram · Answer 4 · 2011-04-06

Edit: there is a Java parser for FASTQ in the picard library:

http://picard-tools.sourcearchive.com/documentation/1.25-1/FastqReader_8java-source.html

but as far as I can see, it also expects four lines.

I wrote a fastq parser, one day. I don't even remember if I used it :-)

It works as an iterator :

import java.io.BufferedReader;
import java.io.IOException;

/**
 * FastQReader
 * @author Pierre Lindenbaum
 *
 */
public class FastQReader
    {
    private BufferedReader in;
    private String prev_line=null;
    private String name=null;
    private StringBuilder sequence=new StringBuilder();
    private StringBuilder quality=new StringBuilder();
    private long nLine=0L;

    public FastQReader(BufferedReader in)
        {
        this.in=in;
        }

    public String toFasQ()
        {
        return getName()+"\n"+getSequence()+"\n+\n"+getQuality();
        }

    private String _readLine() throws IOException
        {
        if(prev_line!=null)
            {
            String s=prev_line;
            prev_line=null;
            return s;
            }
        String s=null;
        while((s=in.readLine())!=null)
            {
            ++nLine;
            if(s.trim().isEmpty())
                {
                continue;
                }
            return s;
            }
        return null;
        }

    public boolean next() throws IOException
        {
        name=null;
        sequence.setLength(0);
        quality.setLength(0);
        String line=_readLine();
        if(line==null) return false;
        if(!line.startsWith("@") ) throw new IOException("expected a line starting with @ but found "+line);
        name=line.substring(1).trim();
        while(true)
            {
            line=_readLine();
            if(line==null) throw new IOException("bad sequence for @"+name);
            if(line.startsWith("+"))
                {
                break;
                }
            sequence.append(line);
            }
        while(true)
            {
            line=_readLine();
            if(line==null) //last line ?
                {
                if(quality.length()!=sequence.length())
                    {
                    throw new IOException("bad quality for @"+name +" next line is null and sequence:\n"+
                            sequence +"("+sequence.length()+") and quality:\n"+quality+"("+quality.length()+")");
                    }
                break;
                }
            if(line.startsWith("@") &&
                quality.length()==sequence.length()) //quality can start with '@'
                {
                this.prev_line=line;
                break;
                }
            quality.append(line);
            }
        if(quality.length()!=sequence.length())
            {
            throw new IOException(
                    "length(quality)!=length(dna) for @"+name+"\nseq:"+sequence+"("+sequence.length()+")\nqual:"+quality+"("+quality.length()+")");
            }
        return true;
        }

    public String getName() {
        return name;
        }

    public StringBuilder getSequence()
        {
        return sequence;
        }

    public StringBuilder getQuality() {
        return quality;
        }

    public long getLine() {
        return nLine;
        }

    @Override
    public boolean equals(Object obj) {
        return obj==this;
        }

    @Override
    public String toString()
        {
        return getClass().getName()+":"+getLine();
        }
    }

score 1 · Answer 5 · 2011-04-06

1

Entering edit mode

13.7 years ago

Martin A Hansen 3.0k

Wrapping of sequence, qualities, etc is a pain. I guess people insist on wrapping sequences for historical reason because in the old days they wanted a "human" readable sequence that printed nicely on paper so they could mark up their primers with a pencil.

In the days of NGS - are there ANY valid reasons for wrapping sequences and qualities?

ADD COMMENT • link 13.7 years ago by Martin A Hansen 3.0k

0

Entering edit mode

Human readability is always crucial. It is utmost important to "feel" the data rather than simply let programs to read it.

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

Human readability is always crucial. It is utmost important to "feel" the data rather than simply let programs do that for you.

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

So you think that a string of "ffedeee^eeeeedeeeeeee" is human readable? :o). I agree that it is important to get a "feel" for the data but the times of one page GenBank files is long gone ...

ADD REPLY • link 13.6 years ago by Martin A Hansen 3.0k

0

Entering edit mode

I mean the sequence should be readable, not quality.

ADD REPLY • link 13.3 years ago by lh3 33k

0

Entering edit mode

If I use cat or less on a single-line sequence my terminal will display it perfectly. If you need a to view the sequenced wrapped - then wrap the sequence for that purpose only.

ADD REPLY • link 13.3 years ago by Martin A Hansen 3.0k