I would like to know if it's possible to produce uncompressed FastQ files from raw data (BCL files) using Illumina's software bcl2fastq2.
In the bcl2fastq2 v2.20 documentation, I saw the following:
--no-bgzf-compression: Turn off BGZF and use GZIP to compress FASTQ files. BGZF compression allows downstream applications to decompress in parallel. This option is available for FASTQ data consumers that cannot handle standard GZIP formats.
However, this only turns off BGZF compression but still produces *.gz files.
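To see which flavour of *.gz a given run produced, one can inspect the first bytes of the file. The sketch below is only an illustration, not part of bcl2fastq2: it assumes the BGZF header layout described in the SAM/BAM specification (a gzip header with the FEXTRA flag set and a "BC" extra subfield of length 2), and the function name is_bgzf is made up for this example.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF block header.

    Hypothetical helper: BGZF blocks are ordinary gzip members that carry
    a 'BC' extra subfield, so plain gzip files return False here.
    """
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        # Magic bytes and deflate method are shared by gzip and BGZF.
        if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
            return False
        # BGZF requires the FEXTRA flag (bit 2 of the FLG byte).
        if not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = fh.read(xlen)

    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False
```

Either way the file remains a valid gzip stream, so standard gzip readers can still open it.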
Does anyone know if there's a way to produce uncompressed FastQ files with this software?
Otherwise, could you suggest other software (other than Picard IlluminaBasecallsToFastq) that would perform such operations using the RunInfo.xml and other files found in the raw folder produced by sequencing?
The reason is that the gzipped FastQ files use the deflate algorithm, and because of that, decompressing huge data sets takes a lot of time even with software like pigz, since decompression uses only one thread. Imagine having 100 samples zipped: it would take about 4 days to unzip all of them.
The idea would be to use a better-performing compression tool like lbzip2, which allows multithreaded compression/decompression and produces files even smaller than the bgzipped ones.
But you can do this in parallel.
But most software like 'bwa' can read gzip files.
Yeah, it's possible to do it in parallel. I have a cluster that processes these files 15 at a time, but if the files are too big, even parallelizing across samples takes time.
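For readers without a cluster scheduler, the "15 at a time" idea can be sketched on a single machine with a bounded worker pool. This is a minimal illustration using only the Python standard library; the function names gunzip_one and gunzip_many and the default of 15 workers are assumptions for the example, not the poster's actual setup. Each file is still inflated by a single thread, so one very large file remains the bottleneck.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def gunzip_one(gz_path):
    """Decompress one .gz file next to the original and return the new path."""
    gz_path = Path(gz_path)
    # Drops the final suffix: sample.fastq.gz -> sample.fastq
    out_path = gz_path.with_suffix("")
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Copy in 1 MiB chunks so memory use stays flat for huge files.
        shutil.copyfileobj(src, dst, 1 << 20)
    return out_path


def gunzip_many(gz_paths, max_workers=15):
    """Decompress several files concurrently, at most max_workers at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order in the returned list.
        return list(pool.map(gunzip_one, gz_paths))
```

A call such as gunzip_many(sorted(Path("fastq_dir").glob("*.fastq.gz"))) would process a whole directory, where fastq_dir is a placeholder name.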
The issue is not about reading gzip files. When sequencing lots of data, the files take up too much space on storage, and this becomes costly in the long run. So I am looking for a better way to store the data. For this, I only want to produce uncompressed FastQ from the BCL files.
That is a different issue then. How are you planning to use uncompressed fastq files to save space?
There are other ways of compressing fastq files (e.g. storing them as unaligned BAM/CRAM or using alternate fastq compression techniques like SPRING). I am not sure it is worth your time to mess with primary data (which should really be backed up as is). If you have the desire/time to do this then you can certainly pursue that path.
Thanks for the SPRING link. That's useful!
I explained before:
I am not altering the primary data in any way ... just optimizing the compression.
I will compress the FastQ files with lbzip2, which is a more powerful tool for compression/decompression. What I am doing now is: once bcl2fastq2 outputs the Fastq.gz files, I decompress them and then compress them with lbzip2.
OK, I see.
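The decompress-then-recompress step described above does not need an intermediate uncompressed file on disk; the data can be streamed from one codec to the other. Here is a minimal sketch, assuming Python's standard bz2 module as a stand-in for lbzip2: it is single-threaded, so it shows the streaming pattern rather than lbzip2's speed, although both produce standard bzip2 output. The function name recompress_gz_to_bz2 is invented for the example.

```python
import bz2
import gzip
import shutil


def recompress_gz_to_bz2(src_gz, dst_bz2, chunk_size=1 << 20):
    """Stream a gzip (or BGZF) file into a bzip2 file without a temp file.

    Stand-in for piping 'gzip -dc' into lbzip2: the read side inflates
    chunk by chunk and the write side compresses each chunk as it arrives.
    """
    with gzip.open(src_gz, "rb") as src, bz2.open(dst_bz2, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

Only one chunk is held in memory at a time, so the same function works for files far larger than RAM.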
Why do you want this?
Most bioinformatics tools can read from (b)gzip'ed files. And all of those that provide random access expect the file to be bgzip; otherwise no random access is possible.
In the lab, we do lots of sequencing that produces very large raw data, for example 1.1T. I have a sequencing analysis pipeline that does everything and ends up with 5.6T for the same example. Using different compression methods, you could save lots of space and, mostly, time.
Consider (or convince your supervisor of) the need to purchase additional storage/backup as a cost of doing business. If you are doing a lot of sequencing (which means you are buying reagents that are rather expensive), then plan to add more backup/storage space alongside.
Thank you for your advice.
We have enough storage. But it wouldn't be wise to keep upgrading storage and spending money when we can find solutions to optimize compression and save money and time.
Storage, money and time - pick two to save. You cannot save all three. No matter how optimized your compression, it will not stop your storage requirement from ballooning. Compression algorithms can save space, not time.
I am wondering if there is a way to have bcl2fastq output uncompressed fastq files and I am not seeing that addressed in the answers and comments here.
Is it possible to go directly to uncompressed fastq files using bcl2fastq?
Some parts of my pipeline only work on uncompressed fastq files. Starting with uncompressed fastq would save hours off my process.
Yes, there is. Use the following option.