Question

How Common Are Multi-Line Fastq Files?

4

Entering edit mode

13.6 years ago

ALchEmiXt ★ 1.9k

The general format of a fastq file is (sequence block):

@SEQHEADERandmorestuff AGCTGTGTATGTGTGTGSTGTSGTTACGTGTATCGATCGCTGCTA... +SEQHEADERandmorestuff <--optional header but required + 00300QUALITYSCORESINILLUMINAORSANGERFORMAT.....

(next sequence block)

Since the + and the @ can also be used as quality scores they are not great regexp grep patterns ^+ or ^@. As well as that the headers seem to be used freely as well. Most files seem to stick to the 4 line (defined by \n) structure though and can be easily parsed or split. In the past I used to use a perl regexp patterns that turned out to be quite unique for the header but that is not failsafe:

$count++ if /^@[A-Za-z0-9_.:#\-\/]+\n/; # the hash is a valid char!

However the fastq format is quite loose defined also regarding the different used quality scoring systems :(. Even the "official wiki" tells that wrapped/MULTIPLE line fastq formats are valid.

My question: how common are these wrapped/multi-line files and which tools would generally generate such output? Or is it safe to assume for a new tool that it sticks to the 4 line convention?

fastq parsing • 8.4k views

ADD COMMENT • link 13.6 years ago by ALchEmiXt ★ 1.9k

3

Entering edit mode

We have had this discussion earlier: Split Fastq Files Into Chunks Of 1M Reads , but after that I don't dare to give you an answer any more ;)

In short, both viewpoints are correct: given that 'official' definition, multiline fastq is valid. on the other hand, nobody has ever shown me a fastq file like this (of course I can make one). But how authoritative is the 'official' 'standard' anyway (this is not XML), given there are at least 3 standards for encoding the qualities. thus reading fastq is never safe...

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 13.6 years ago by Michael 55k

1

Entering edit mode

And your question is maybe related to parsing fastq, so look here.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 13.6 years ago by Michael 55k

0

Entering edit mode

Your expression should be okay. If ^@ matches it should never be in a quality line.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 13.6 years ago by Fabian Bull ★ 1.3k

0

Entering edit mode

Have looked over the earlier examples. Missed the one on sam tools though..... The '@' is a quality score so just using ^Q in this example will fail (or not?):

@ILLUMINAtesttest1:2:23
actgtgatcgatcgatcgatcgggtag
+ILLUMINAtesttest1:2:23
@!!!!(!!!!!!!????????BBBEEE

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 13.6 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

Why should it fail? Quality-Scores start with a '+' so ^@ will never match such lines.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 13.6 years ago by Fabian Bull ★ 1.3k

0

Entering edit mode

I think you are right when the 3rd line is starting with + and the 4th with @ you should be able to detect that. But just counting by greplines starting with @ fails. The first position of the qualityscore LINE (not the HEADER) can be @ since its a valid quality score character. Then it fails and have to fallback on either previous lines to see context, within-block line number etc...grep (as well as many oher perl parsers preventing everything getting in memory) will not do multi line...

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 13.6 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

If you're using a perl script, why not have a regexp specifying that the header column must start with an @ or +, followed by less (or more) characters than you have in read length. Something like /^@d{1,read_length - 1}/ or /^@d{read_length + 1,}/. It may be likely that a quality line starts with an @ or +, but how likely is it to also be the exact same length as the header line?

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 13.6 years ago by Mitch Bekritsky ★ 1.3k

0

Entering edit mode

Mitch I agree that is another good addition since readlength can be determined beforehand or in mixed files on basis of counts. Like the almost unique regexp I posted above. Anyway never 100% sure and the quick grep -c is still not possible. Furthermore, I can't oversee what happens in a wrapped sequence/quality score section...

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 13.6 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

Anymore tools that have this behaviour?

I marked the samtools as the answer but would like to keep the question open for more suggestions and as a future reference...

ADD REPLY • link 13.6 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

Are you asking because you're writing some sort of generic fastq parser? Then my suggestion would be to write a specific parser instead and convert your data into a sane format (BAM?) exactly once. BAM is much more versatile and very grepable (after on-the-fly conversion to SAM).

ADD REPLY • link 13.6 years ago by Marvin ▴ 900

0

Entering edit mode

@Marvin. No and yes. Actually I was forking/paralleling some tools which need large fastqs to be split. And of course multi line fastqs make it troublesome. Anyway, I now make it just a prerequisite that its not multilined. Or investigate the seqtk stuff.

ADD REPLY • link 13.6 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

The question/problem has dissolved over time as of Ilumina data casave 1.8+ is in Sanger scoring (no grooming necessary) BUT MORE IMPORTANTLY the headers have changed! They now contain a SPACE which makes them UNIQUE and easy to recognize with regular expressions starting lines ^@ or ^+ and a space in the line.

An additional method would have been the use of quality scores above what is possible. Previously the quality scores could reach up to the ASCII h (104) now in Sanger only till J (74) so another option for parsing. See the pasted table below from the wiki; This option is dangerous though since we cannot be sure a header WILL contain one of these characters... not by default.

Old:[?] [?][?][?][?][?][?][?][?][?]@HWUSI-EAS100R:6:73:941:1973#0/1

NEW:[?] [?][?][?][?][?][?][?][?][?]@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG [?] [?]

Quality scoring ref table: [?] SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126

S - Sanger Phred+33, raw reads typically (0, 40) X - Solexa Solexa+64, raw reads typically (-5, 40) I - Illumina 1.3+ Phred+64, raw reads typically (0, 40) J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) (Note: See discussion above). L - Illumina 1.8+ Phred+33, raw reads typically (0, 41) [?]

[?]

See the updated wiki http://en.wikipedia.org/wiki/FASTQ_format

ADD REPLY • link 13.5 years ago by ALchEmiXt ★ 1.9k

Ram · Answer 1 · 2011-09-08

8

Entering edit mode

13.6 years ago

Casey Bergman 18k

SAMtools generates multi-line fastq when generating a consensus sequence, eg.

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq

This is the basis of a question I asked a while ago.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 13.6 years ago by Casey Bergman 18k

1

Entering edit mode

The could be an enpoint for some uses, but we use them transiently to get fasta for downstream analysis. you can use lh3's seqtk to process multi-line fastq. See his answer here.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 13.6 years ago by Casey Bergman 18k

0

Entering edit mode

So how does the postprocessing deal with this multi-line samtools output?

ADD REPLY • link 13.6 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

Sorry, I don't fully understand what you are asking?

ADD REPLY • link 13.6 years ago by Casey Bergman 18k

0

Entering edit mode

@Casey. Sorry for not being clear; I mean the samtools pileup generates these multi-line fastq files. These fastq files aren't the endpoint right? So which tools are used subsequently since they seem to be able to deal with these wrapped fastq correctly!?