Each read in a FASTQ file takes up 4 consecutive lines (read id, read sequence, qual id and qual string). What do you need the qual id for? Isn't the read id enough for identification? Besides, in most FASTQ files (if not all?) the qual id is just "+".
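For anyone following along, here is a minimal sketch of reading that 4-line layout in Python. The function name `read_fastq` and the sample records are made up for illustration; it assumes every record is exactly 4 lines (no wrapped sequences) and no trailing blank lines. Note that it counts lines rather than looking for "@" or "+", because both characters are also valid quality characters.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines.

    Assumes strict 4-line records; wrapped sequence lines are not handled.
    """
    it = iter(lines)
    while True:
        # Take the next 4 lines and strip Unix or Windows line endings.
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# Hypothetical example: the second record repeats the id after "+",
# uses Windows line endings, and has a quality string starting with "+".
example = [
    "@read1\n", "ACGT\n", "+\n", "II@+\n",
    "@read2\r\n", "GGCA\r\n", "+read2\r\n", "+@II\r\n",
]
records = list(read_fastq(example))
```

The third line is only checked for its leading "+"; whatever follows it (usually nothing) is ignored.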
The FASTQ format was invented at the turn of the century at the
Wellcome Trust Sanger Institute by Jim Mullikin, gradually
disseminated, but never formally documented (Antony V. Cox, Sanger
Institute, personal communication 2009).
so we can't be all that surprised that the format has some unspecified characteristics.
I think it's there to serve as a delimiter. One of the most common issues we have to deal with is line endings getting mangled when files move between operating systems. When FASTA files get messed up because of this, you can at least tell where a sequence ends because the next header starts with the greater-than sign. It would be chaotic if not for that delimiter. Likewise, I think it would be more difficult to find where the sequence ends and the quality line starts without the '+' delimiter.
Yup, it would be better to have something like:
Read_ID \t read_sequence \t qual_sequence
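If you want to try the one-line layout, the conversion can be sketched roughly like this. `fastq_to_tsv` is a hypothetical helper, not an existing tool; it assumes strict 4-line records and read ids that contain no tab characters (quality strings are printable ASCII, so they cannot contain a tab).

```python
def fastq_to_tsv(lines):
    """Convert 4-line FASTQ records to 'id<TAB>sequence<TAB>quality' lines.

    Assumes strict 4-line records and tab-free read ids.
    """
    stripped = [line.rstrip("\r\n") for line in lines]
    out = []
    # Step through complete records only; a trailing partial record is skipped.
    for i in range(0, len(stripped) - 3, 4):
        header, seq, _plus, qual = stripped[i:i + 4]
        out.append("\t".join((header[1:], seq, qual)))
    return out


# Hypothetical two-record input.
tsv = fastq_to_tsv([
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "GG\n", "+\n", "#I\n",
])
```

The "+" line is simply dropped, which is the point of the suggestion: nothing in it is needed to rebuild the record.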
I personally like that there are 4 lines, as I prefer dividing by 4 to dividing by 3. If I grab the first 100/1000/10,000 lines I immediately know how many sequences there are. Or maybe I am just so used to 4 lines that I cannot step back and admit that 3 would be better :))
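The arithmetic, spelled out as a tiny sketch (the helper name `reads_in_lines` is made up, and it assumes strict 4-line records):

```python
def reads_in_lines(n_lines):
    """Return (complete_reads, leftover_lines) for the first n_lines of a FASTQ file."""
    return divmod(n_lines, 4)


first_thousand = reads_in_lines(1000)
```

A non-zero leftover means the chunk you grabbed ends in the middle of a record.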
There's no reason I know of offhand that FASTQ files can't be completely replaced by BAM files. Put all your reads in there (marked unmapped), and you've got everything you need. Many aligners can use BAM as input these days.
alas there is always another catch
The BAM SEQ column displays sequences as they align on the forward strand, so sequences aligned on the reverse strand need to be reverse complemented to obtain the actual data. In addition, hard clipping is a valid alignment representation, but it means that some of the original information is lost. Then, if we consider spliced alignments, getting back the original data is probably even more convoluted, especially since there is also read pairing to keep track of.
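To make the strand issue concrete, here is a rough sketch that turns one SAM text line back into a FASTQ record. `sam_line_to_fastq` is a hypothetical helper, not part of samtools or Picard. It relies on the SAM FLAG bit 0x10 (read reverse complemented) and assumes single-end, primary alignments; pairing and secondary or supplementary records are not handled, and hard-clipped records are rejected because the clipped bases are not stored.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_line_to_fastq(line):
    """Rebuild a FASTQ record from one SAM alignment line (single-end sketch)."""
    fields = line.rstrip("\r\n").split("\t")
    # SAM columns: 1 QNAME, 2 FLAG, 6 CIGAR, 10 SEQ, 11 QUAL.
    name, flag, cigar = fields[0], int(fields[1]), fields[5]
    seq, qual = fields[9], fields[10]
    if "H" in cigar:
        raise ValueError("hard-clipped record: original bases are not stored")
    if flag & 0x10:
        # Stored on the forward strand: undo the reverse complement,
        # and reverse the qualities so they line up with the bases again.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Hypothetical records: one reverse-strand alignment, one unmapped read.
reverse_hit = "r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIIH#"
unmapped = "r2\t4\t*\t0\t0\t*\t*\t0\t0\tAACG\tIIH#"
restored = sam_line_to_fastq(reverse_hit)
untouched = sam_line_to_fastq(unmapped)
```

Unmapped reads come back unchanged, which is why storing raw reads as unmapped BAM records sidesteps the problem.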
Picard SamToFastq will take care of the strandedness problem if you need to recreate the original fastqs. If all you're doing is storing raw reads, the hard clipping info won't be an issue either (some would argue that you shouldn't be doing hard clipping anyway). Yes, I agree that the spliced alignments would be a pain, but not intractable - just need one brave soul to write the tool so the rest of us can use it :)