What you might be looking for is uBAM (unaligned BAM). It is used in GATK's production pipeline to attach metadata to samples as early as possible; however, the Broad analyses data at a scale far beyond most places. The biggest caveat to this approach is outlined by Brian Bushnell (author of bbmap) on that thread: gzipped fastq is much more practical in terms of (de)compression efficiency and resulting storage footprint. On top of that, by forcing your reads to the SAM spec you may lose the original read names, which can cause problems with tools that assume the original Illumina read names (i.e., paired-end data could be parsed as single-end).
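To illustrate the read-name caveat, here is a minimal sketch of how a tool might infer the mate number from a FASTQ header. The function name `mate_number` is hypothetical and not taken from any real tool. It assumes the two common Illumina conventions: the newer style, where the read number starts the comment field after the first space, and the older style, where the name ends in `/1` or `/2`. A name stripped of both would come back as `None`, which is how paired data can end up treated as single-end.

```python
def mate_number(header):
    """Return 1 or 2 for a paired-read header, or None if no mate info is found.

    Hypothetical helper for illustration only; real tools may use other rules.
    """
    # Split "@<name> <comment>" into its two parts (comment may be empty).
    name, _, comment = header.lstrip("@").partition(" ")

    # Newer Illumina style: comment looks like "<read>:<filtered>:<control>:<index>".
    if comment[:2] in ("1:", "2:"):
        return int(comment[0])

    # Older style: the read name itself ends in /1 or /2.
    if name.endswith(("/1", "/2")):
        return int(name[-1])

    # No pairing information survives in the name.
    return None
```

For example, `mate_number("@M01:23:000-ABC:1:1101:15589:1332 2:N:0:ATCACG")` gives 2, while a bare name such as `"@read42"` gives `None`.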
I was thinking along the same lines as you a few years ago, but opted against uBAM and stuck with gzipped fastq. That said, I recall a thread about lossless fastq compression here. The tool (Alapy) that @Petr Ponomarenko outlined is available here, and the thread has a fantastic discussion of benchmarking and feature requests that is worth a read.
Other approaches out there include binary fastq, as @chris.bird mentioned:
- @John's uQ tool
- @Brian Bushnell's Clumpify tool (from bbmap)
Overall, my advice is to avoid uBAM and check out lossless compression methods like Alapy, uQ, or Clumpify to see what best fits your needs, as there will be tradeoffs in memory usage and execution time per file.
That is not quite the right question. BWA produces SAM output, so the question should be "Is the sequence information in the SAM file identical to the FASTQ sequence?", to which the answer is NO. The SAM format represents sequences on the forward strand of the reference, so reads that map to the reverse strand are stored reverse complemented (and their quality strings reversed).
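As a rough sketch of what undoing this looks like, the snippet below restores the original read orientation from the SEQ, QUAL and FLAG fields of one SAM record. The function name `original_read` is made up for this example. It relies on bit 0x10 of FLAG marking a reverse-strand alignment, and it only complements A, C, G, T and N; other IUPAC codes would pass through uncomplemented, which is a simplification.

```python
# Complement table for the plain nucleotide alphabet (assumption: no other IUPAC codes).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

REVERSE_STRAND = 0x10  # SAM FLAG bit: SEQ is stored reverse complemented


def original_read(seq, qual, flag):
    """Return (sequence, quality) in the orientation the sequencer reported.

    Illustrative helper only; it does not handle hard-clipped records.
    """
    if flag & REVERSE_STRAND:
        # Undo the reverse complement and put the qualities back in read order.
        return seq.translate(_COMPLEMENT)[::-1], qual[::-1]
    return seq, qual
```

With this, a record stored as `AACG` / `IIH#` with flag 16 comes back as `CGTT` / `#HII`, while a forward-strand record is returned unchanged.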
Perhaps the question is whether the FASTQ information can always be reconstructed from the SAM file, to which the answer is again: not always. If an alignment is reported hard clipped, the clipped bases will not be in the SAM file.
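One way to check whether a given record has lost bases this way is to look for `H` operations in its CIGAR string. The sketch below counts them; the name `hard_clipped_bases` is hypothetical, and the parser assumes a well-formed CIGAR using the standard operation letters.

```python
import re

# One CIGAR element: a length followed by a single operation letter.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def hard_clipped_bases(cigar):
    """Count the bases removed from SEQ by hard clipping (CIGAR op 'H').

    Illustrative helper; a return value above zero means the full read
    cannot be rebuilt from this record alone.
    """
    if cigar == "*":
        # '*' means no CIGAR is available (e.g. an unmapped read).
        return 0
    return sum(int(length) for length, op in _CIGAR_OP.findall(cigar) if op == "H")
```

Soft clipping (`S`) is different: those bases stay in SEQ, so a CIGAR like `10S40M` loses nothing, whereas `5H45M` is missing five bases.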
So why don't you archive fastq in the first place? I don't see the reason to align and then discard the work the aligner has done. If you really want to store unaligned CRAM instead of fastq, then you can convert fastq to unaligned SAM with, e.g., Picard's FastqToSam.
In fact, he can't do what he wants, because CRAM is an alignment format, and the compression is based on the alignment to the reference genome. If he discards everything but the name, sequence and quality, he can't save as CRAM.
Yes very good, I also need to keep the alignment, and the 'which pair-end' metadata, and probably some more I have forgotten.