Question

Reasonable Assumptions About Fastq File Integrity

0

13.1 years ago

Alex Reynolds 36k

Can I assume that the genomic sequences and quality sequences in a FASTQ file will be of the same length — not only within a read, but through the entire file, for all reads?

For example, here are a few reads from a sample file:

@IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^
@IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATAAGATCNGGTTGCGG
+IRIS:7:1:17:800#0/1
ababbaaabaaaaa`]`ba`]`aaaaYD\\_a``XT
@IRIS:7:1:17:1757#0/1
TTTTCTCGACGATTTCCACTCCTGGTCNACGAATCC
+IRIS:7:1:17:1757#0/1
aaaaaa``aaa`aaaa_^a```]][Z[DY^XYV^_Y
...

Can I assume the file (or read) is bad if a read has a shorter genomic and/or quality sequence, e.g. the second read in this example:

@IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^
@IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATA
+IRIS:7:1:17:800#0/1
ababbaaabaaaaa`]`ba`]`
@IRIS:7:1:17:1757#0/1
TTTTCTCGACGATTTCCACTCCTGGTCNACGAATCC
+IRIS:7:1:17:1757#0/1
aaaaaa``aaa`aaaa_^a```]][Z[DY^XYV^_Y
...

Or can a FASTQ file deliberately contain reads (and quality strings) of variable lengths?

fastq filter fastq data • 2.5k views

ADD COMMENT • link updated 13.1 years ago by Istvan Albert 102k • written 13.1 years ago by Alex Reynolds 36k

score 6 · Answer 1 · 2011-11-08

6

13.1 years ago

Istvan Albert 102k

The FASTQ standard requires that, for any record, the length of the sequence line (line 2) must match the length of the quality line (line 4).

While instruments usually produce identical sequence lengths for all records, this cannot be assumed for all FASTQ files. For example, quality trimming may chop bases off the beginning or end of sequences, leaving reads of different lengths in the same file.
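To make the distinction concrete, here is a minimal Python sketch that enforces the per-record rule (sequence and quality strings of equal length) while only tallying read lengths across the file, rather than requiring them to agree. It assumes the common four-line record layout with no line-wrapped sequences; the function names `parse_fastq` and `read_length_counts` are made up for this illustration and do not come from any library.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) tuples from four-line FASTQ records.

    Assumption: each record is exactly four lines (no wrapped sequences).
    Raises ValueError on a malformed or truncated record.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            seq = next(it).rstrip("\r\n")
            plus = next(it).rstrip("\r\n")
            qual = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError(f"truncated record at {header!r}") from None
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at {header!r}")
        # The per-record rule: sequence and quality must be the same length.
        if len(seq) != len(qual):
            raise ValueError(
                f"{header!r}: sequence length {len(seq)} "
                f"!= quality length {len(qual)}"
            )
        yield header, seq, qual


def read_length_counts(lines: Iterable[str]) -> Counter:
    """Count how many reads have each length; more than one key is legal."""
    return Counter(len(seq) for _, seq, _ in parse_fastq(lines))


# The first two reads of the second example in the question.
sample = """@IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^
@IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATA
+IRIS:7:1:17:800#0/1
ababbaaabaaaaa`]`ba`]`
""".splitlines()

counts = read_length_counts(sample)
print(sorted(counts.items()))  # [(22, 1), (36, 1)]
```

Both records pass validation because each one is internally consistent. Seeing more than one read length in the tally is not an error by itself; it may simply mean the file was trimmed or came from a platform that emits variable-length reads. To read from disk, pass an open file handle in place of `sample`.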

1

For example, Ion Torrent produces FASTQ files with reads of variable length.

ADD REPLY • link 13.1 years ago by Andreas ★ 2.5k

0

Darn. I knew that the sequence and quality strings need to be of identical length, but I was hoping I could get away with reads of the same length across the entire file. Thanks to you both for your answers.

ADD REPLY • link 13.1 years ago by Alex Reynolds 36k