I am looking for an efficient storage solution for sequencing data. Based on this data, I believe that I could save a lot of space if I compress my FASTQ or BAM files into CRAM (lossless).
While CRAM is indeed much smaller than BAM, its primary benefit comes with aligned data, where it can use a reference sequence (whether external or embedded). It does, however, still work without a reference.
I wouldn't recommend creating a fake reference just to get it to swallow things. You shouldn't need -T at all as with unaligned data there are no references to compare against anyway. Also note for aligned data, if you really need to, there is a way to enable referenceless encoding using "--output-fmt-option no_ref=1", although it's not going to be hugely beneficial.
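To make the two cases above concrete, here is a minimal sketch, assuming a reasonably recent samtools on your PATH. The function names and file names are placeholders of mine; only the samtools options themselves (-C, -o, and the no_ref=1 format option quoted above) are real.

```shell
#!/bin/sh
# Hypothetical helper names; only the samtools calls are real.

# Unaligned BAM (uBAM) -> CRAM.
# No -T here: unaligned reads have no reference to be compared against.
ubam_to_cram() {
    samtools view -C -o "$2" "$1"
}

# Aligned BAM -> CRAM with referenceless encoding.
# Expect a larger file than a reference-based CRAM of the same data.
bam_to_cram_noref() {
    samtools view -C --output-fmt-option no_ref=1 -o "$2" "$1"
}

# Usage (file names are placeholders):
#   ubam_to_cram reads.unaligned.bam reads.cram
#   bam_to_cram_noref aligned.bam aligned.noref.cram
```

If you start from FASTQ rather than uBAM, newer samtools releases also ship a "samtools import" subcommand that may save you the intermediate step; check your version's manual before relying on it.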
Incidentally, FASTQ compression has hotted up in recent years and there are far better tools out there, mostly because they do mini de novo assemblies (via bloom filters, graphs, or k-mer counting strategies). They're often very CPU and memory hungry, but yield far smaller files than CRAM as they're essentially doing reference-based compression with the reference computed on the fly.
FaStore, Spring and FQSqueezer are modern tools for this process.
I just noticed that the -T reference in the manual has now been changed to:
-T FILE
A FASTA format reference FILE, optionally compressed by bgzip and ideally indexed by samtools faidx. If an index is not present, one will be generated for you
CRAM is indeed smaller than BAM thanks to its superior compression, and according to James Bonfield (https://www.sanger.ac.uk/people/directory/bonfield-james) it gets smaller still once an alignment is present; see his response to that tweet.
We typically get sequencing data in uBAM format, and the facility uses FastqToSam from Picard, if that helps you. I would check whether the output can be sent to stdout (probably it can) and then simply pipe it into samtools view, like fastqtosam (...) | samtools view -T ref.fa -o out.cram.
Still, writing CRAM is pretty slow (I have not benchmarked it, but it really takes a while, notably slower than BAM), so I personally only use it for storage purposes, mainly of the uBAM raw data.
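A rough sketch of that pipe, untested against any particular Picard release. It assumes Picard will write to /dev/stdout and still accepts the legacy KEY=value argument style (F1, F2, SM, O, QUIET); PICARD_JAR, the function name and all file names are placeholders of mine. -T is left out because the reads are unaligned.

```shell
#!/bin/sh
# Hypothetical wrapper: paired FASTQ -> unaligned CRAM without writing a uBAM to disk.
# Assumes PICARD_JAR points at your picard.jar and that java and samtools are on PATH.
#   $1 = R1 FASTQ, $2 = R2 FASTQ, $3 = sample name, $4 = output CRAM
fastq_to_cram() {
    java -jar "$PICARD_JAR" FastqToSam \
        F1="$1" F2="$2" SM="$3" O=/dev/stdout QUIET=true \
    | samtools view -C -o "$4" -
}

# Usage (placeholders):
#   PICARD_JAR=/path/to/picard.jar
#   fastq_to_cram sample_R1.fastq.gz sample_R2.fastq.gz sample1 sample1.cram
```

Picard logs to stderr, so the BAM stream on stdout should stay clean, but it is worth checking the result with samtools quickcheck before trusting it.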
Edit:
By the way because we just had this discussion in the slack, based on my short testing it is irrelevant what sequence you provide as -T, so if you don't have a reference you can just use any random fasta (even if it has just one chromosome with 1bp) and the resulting CRAM should be the exact same in terms of size.
Edit2:
As jkbonfield says, when compressing aligned BAM, the CRAM (for me) typically comes out at roughly 30% of the original BAM size, which is quite impressive and useful for storage purposes such as long-term archiving.
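If you want to check that figure on your own data, a small helper like this prints the size of one file as a percentage of another. The function name is made up and it uses only POSIX tools (wc and shell arithmetic), so it is a sketch rather than anything samtools provides.

```shell
#!/bin/sh
# Print the size of file $2 as a percentage of file $1 (integer, rounded down).
size_percent() {
    orig=$(wc -c < "$1")
    new=$(wc -c < "$2")
    echo $(( $new * 100 / $orig ))
}

# Usage (placeholders):
#   size_percent aligned.bam aligned.cram
```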
This would be a very nice tutorial. I've used the tool SPRING to convert FASTQ directly and written helper scripts for this, but I can't bring myself to commit all my institution's FASTQs to this and delete the originals. Why? Because Spring is just the work of one talented developer who will likely not be able to maintain it forever.
CRAM on the other hand is an international standard, even if it has taken ~10(?) years to start taking off. I would feel a lot safer using this, especially if reinstating the BAM/FASTQ is not completely dependent on the reference (another big risk factor ... ).
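Before deleting any originals, a round-trip check is cheap insurance. The sketch below compares two FASTQ files by read name, sequence and quality, ignoring record order and header comments. It assumes plain 4-line FASTQ records (no wrapped sequences) and that read names come back unchanged; the function names are mine.

```shell
#!/bin/sh
# Order-independent digest of a FASTQ file: name, sequence and quality per record.
fastq_digest() {
    awk 'NR%4==1 {n=$1} NR%4==2 {s=$0} NR%4==0 {print n "\t" s "\t" $0}' "$1" \
        | LC_ALL=C sort | cksum
}

# Succeeds only if both files contain the same set of records.
same_reads() {
    [ "$(fastq_digest "$1")" = "$(fastq_digest "$2")" ]
}

# Usage (placeholders):
#   samtools fastq reads.cram > restored.fastq
#   same_reads original.fastq restored.fastq && echo "round trip OK"
```

For paired data, samtools fastq may add /1 and /2 suffixes or split reads across files depending on its options, so the names may need normalising before the comparison means anything.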
While some programs require uncompressed .fastq files, many/most will accept .fastq.gz files.
Creating .fastq.gz files will save considerable space for long-term storage.
I'm not sure how this compares to CRAM, but it gives considerable storage savings with greater functionality (since very few programs will accept a .cram file as an input for an alignment).
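For completeness, a minimal sketch of the gzip route that leaves the original file untouched and verifies the archive afterwards. The function name is a placeholder; gzip -9, -c and -t are standard options.

```shell
#!/bin/sh
# Write $1.gz next to the original at maximum compression, then test its integrity.
gzip_fastq() {
    gzip -9 -c "$1" > "$1.gz" && gzip -t "$1.gz"
}

# Usage (placeholder file name):
#   gzip_fastq reads.fastq    # creates reads.fastq.gz, reads.fastq is kept
```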
Thank you for pointing out the .fastq versus .fastq.gz discussion.
.cram functionality is kind of important for both reads (which you are discussing) and alignments (since I believe there are programs that don't accept .cram as an input). However, the feedback about the reference being important for the compression is good for other people to know about (and .fastq.gz is not relevant for discussions of .sam versus .bam versus .cram).
I would say 'no' because CRAM needs a reference genome to store its bases, hence the reads need to be mapped.
What ATpoint said :-)