Forum:Modernising the FASTQ Format
2
0
Entering edit mode
7.9 years ago
dario.garvan ▴ 520

The FASTA and FASTQ formats were formed a long time ago when only a small number of sequences were ever available. Now that they are being used as the default format for most sequencing instruments, their limitations become apparent. To determine the number of reads in a FASTQ file, the entire file needs to be read to count the number of newline characters in it. It would be good if these formats kept pace with progress. They could allow a header section like:

# Records: 39810657

It would turn an O(n) operation into an O(1) one.

FASTA Efficiency FASTQ • 2.2k views
ADD COMMENT
1
Entering edit mode

Curious as to why knowing how many reads are in a file is important (without doing any additional operations)?

ADD REPLY
0
Entering edit mode

It's useful for calculating the percentage of mapped reads, when also calculating the same number for a BAM file.

ADD REPLY
0
Entering edit mode

I don't get this: all the reads, even the unmapped, are contained in the bam file.

ADD REPLY
0
Entering edit mode

Often, aligners have an option to output the unmapped reads to a separate FASTA and the BAM contains only mapped reads. If the only data publicly available is the set of FASTQ files and the BAM files containing only the mapped reads, then some calculation is required.

ADD REPLY
2
Entering edit mode

uBAM's (uBAM & metadata - the death of Fastq? ) are a special representation of fastq where all the original data is included as is (i.e. reads present in input fastq are directly converted to bam format, without loss of any information/alignment). Since BAM files allow for additional fields at the beginning one could potentially include any information you want (i.e. total number of reads etc).

Most aligners (bbmap being a simple example) will output stats you mention without needing any additional calculation after a standard alignment. You can also easily capture unmapped reads in a separate file.

ADD REPLY
6
Entering edit mode
7.9 years ago

Store you fastq in an unsorted BAM instead of fastq ( https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam ) and put whatever you want in the header...

ADD COMMENT
0
Entering edit mode

Once more aligners support unmapped BAM files as input, this would be useful.

ADD REPLY
2
Entering edit mode

no as converting a bam to an interleaved fastq is straightforward and many support fastq on standard input;

bam2fastq your.bam | bwa
ADD REPLY
0
Entering edit mode

Is FastqtoSam the only option to do this?

Brian Bushnell there is streamsam.sh to convert sam/bam to fastq/fasta but nothing to do the reverse, correct?

ADD REPLY
0
Entering edit mode

streamsam.sh or reformat.sh will work (streamsam is faster but only supports sam/bam input). reformat.sh will go from sam, bam, fasta, fastq, scarf, fqz, or oneline (tab-delimited header and sequence) to sam, bam, fasta, fastq, fqz, oneline, or header (sequence headers only). bam support requires samtools to be in the path, though.

ADD REPLY
0
Entering edit mode

Would it make sense to give clumpify.sh the ability to write uBAM files? A 2-for-1, compression plus bam output?

ADD REPLY
0
Entering edit mode

Clumpify can write bam files, it just can't read them (of course, I could change that without much problem). Not sure what happens when it writes temp files though; I'll have to look into that. I don't like "uBAM" as it's a lossy format; you lose the trailing " 1:N:ACATGTA" in the read headers. It also takes more space and is both slower to read and write compared to .fq.gz, when streaming.

ADD REPLY
1
Entering edit mode

. I don't like "uBAM" as it's a lossy format; you lose the trailing " 1:N:ACATGTA".

  • the '1' and the 'N' are saved in the sam flag
  • the sequence index would be saved in the read-group.
ADD REPLY
0
Entering edit mode

Oh so bam writes are already there in clumpify? Since you save the entire header in BBMap alignments can you do the same with clumpify. If that is not uBAM spec compliant then so be it.

To recover the original fastq files you could ask people to use streamsam.sh and then use the points noted by @Pierre below to reconstruct normal illumina headers (if you made the uBAM spec compliant).

BTW: @Heng Li is also against using unaligned BAM's for the same reasons (overhead) in the thread I had linked above.

ADD REPLY
3
Entering edit mode
7.9 years ago
d-cameron ★ 2.9k

It sounds like you are advocating something like storing sequencing data in HDF5 https://www.hdfgroup.org/genomics-2/

As with any "standard", moving to different ones take a lot of time and is generally only done when there is a compelling reason for it. Your particular example of:

# Records: 39810657

is fine if people only ever wanted to count reads, but it comes at a cost. In this particular case, you've just massively increased the cost and complexity of filtering your FASTQ+ file as you need to keep the header in sync with the rest of the file. You have also introduced a whole class of data inconsistency errors that weren't there before. You can't use sed/awk/grep anymore as that breaks the header. Your change also means you can't pipe the file between programs: if you are streaming the file through a pipe you have to write the header first which means you have to know how many records you have which means to you have to cache the entire file which defeats the purpose of streaming it in the first place.

Even if that was part of a FASTQ+ specifications, I wouldn't trust that every file in the pipeline actually respected the specifications so your operation which was in theory O(1) would be turn out to be O(n) in practice anyway.

The SAM/VCF specifications are good examples of the mess and proliferation of edge cases that can be inadvertently introduced by adding additional structure to a simple format. They look fine on the surface, but as soon as you start trying to write generic software you find that there's a whole lot more edge cases that need to be handled than with FASTA/Q.

ADD COMMENT
0
Entering edit mode

For example, the following are valid VCFv4.1 constructs that I'm pretty sure your favourite variant calling pipeline doesn't correctly handle:

  • U+0000 (aka \0, or C string null terminator) as part of the input file
  • Mixed CR (unix-style) and CR+LR (windows-style) line terminators (or even RS, CR, 0x9B, LF+CR)
  • A[<DEL>[, or even C[<=>:-0[ as ALT alleles (yes, your ALT alleles can contain a smiley face)
  • numerical values such as "123.456.789,00" (aka 123,456,789.00 in an EN-US locale)
  • percent encoded newlines INFO strings
  • newline character in a quote-escaped section of header line

And there's a whole lot more - https://github.com/samtools/hts-specs/issues

ADD REPLY

Login before adding your answer.

Traffic: 4433 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6