The FASTA and FASTQ formats were designed a long time ago, when only a small number of sequences were available. Now that they are used as the default format for most sequencing instruments, their limitations are becoming apparent. To determine the number of reads in a FASTQ file, the entire file has to be read in order to count its newline characters. It would be good if these formats kept pace with progress. They could allow a header section like:
# Records: 39810657
It would turn an O(n) operation into an O(1) one.
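To make the difference concrete, here is a minimal Python sketch. It assumes plain four-line FASTQ records (no wrapped sequence lines) and treats the `# Records:` line as the hypothetical header proposed in this post; no existing FASTQ specification or tool defines it. The function names `open_text` and `count_reads` are made up for illustration.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Return the number of records in a FASTQ file.

    If the first line is the hypothetical '# Records: N' header,
    the stored value is returned without reading further (O(1)).
    Otherwise every line is read and counted (O(n)).
    """
    prefix = "# Records:"  # assumption: the header proposed above, not a real standard
    with open_text(path) as handle:
        first = handle.readline()
        if first.startswith(prefix):
            return int(first[len(prefix):].strip())
        # No header: count all lines, including the one already read.
        n_lines = 1 if first else 0
        for _ in handle:
            n_lines += 1
    # Assumes exactly four lines per record.
    return n_lines // 4
```

Without the header, the cost grows with the size of the file; with it, only the first line is touched.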
I'm curious why knowing how many reads are in a file is important on its own (without doing any additional operations)?
It's useful for calculating the percentage of mapped reads, once the corresponding read count has also been calculated for a BAM file.
I don't get this: all the reads, even the unmapped ones, are contained in the BAM file.
Often, aligners have an option to output the unmapped reads to a separate FASTA and the BAM contains only mapped reads. If the only data publicly available is the set of FASTQ files and the BAM files containing only the mapped reads, then some calculation is required.
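The calculation in question is just a ratio. As a rough sketch in Python, assuming the total comes from the FASTQ files and the mapped count comes from the BAM (for example from something like `samtools view -c`, with flag filters chosen so that each read is counted once); the function name `percent_mapped` is hypothetical:

```python
def percent_mapped(total_reads, mapped_reads):
    """Return the percentage of reads that were mapped.

    total_reads:  number of reads in the original FASTQ file(s)
    mapped_reads: number of reads in the BAM that holds only mapped reads
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mapped_reads <= total_reads:
        raise ValueError("mapped_reads must be between 0 and total_reads")
    return 100.0 * mapped_reads / total_reads
```

For paired-end data, the two counts need to be in the same units (reads or pairs) before dividing.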
uBAMs (see "uBAM & metadata - the death of Fastq?") are a special representation of FASTQ in which all the original data is included as is (i.e. the reads present in the input FASTQ are converted directly to BAM format, without alignment and without loss of any information). Since BAM files allow for additional fields at the beginning, one could potentially include any information one wants (e.g. the total number of reads).
Most aligners (bbmap being a simple example) will output the stats you mention without needing any additional calculation after a standard alignment. You can also easily capture unmapped reads in a separate file.