My NGS bioinformatics analysis starts with an amplicon FASTQ file (only the R1). In my workflow, I finally created a BAM file. Then, I convert this BAM in CRAM for backup
apptainer exec --bind "$ref_folder":"$ref_folder" "$samtools" samtools view \
-C -T $bwarefgenomepath \
-o ART03_FINAL.cram \
ART03_FINAL.bam
We will backup the CRAM file but I want to create a temporary FASTQ file from the CRAM to check if the original fastq and the new fastq are identical. In the conversion from CRAM to FASTQ, I have some issues.
I have this code:
samtools fastq -T $bwarefgenomepath ART03_FINAL.cram > BACKUP.fastq
gzip BACKUP.fastq
The BACKUP.fastq.gz is created but it is much smaller than the original
103M Oct 17 20:21 ART03_S8_L001_R1_001.fastq.gz
3.7M Oct 19 17:58 BACKUP.fastq.gz
# or before compression
440M Oct 19 17:58 BACKUP.fastq
To do this I use Slurn and I get this in the .log file
Homo_sapiens_assembly38.fasta
[W::fastq_state_set] Bad tag format '/ma'; skipping option
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/cc044cc2256a1141212660fb07b6171e": Network is unreachable
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 1603312 reads
Even if my BACKUP.fast is ok, I am not going to be able to compare the original and the new one via MD5, right??
Both fastq.gz have the same number of read
unless the original cram is sorted on query,
samtools
collate should be used beforesamtools fastq
Beside sorting fastq files' gzip compression ratio might differ too. You can check it with
gzip -l
Also fastq files id lines might differ too. You probably want
-n
to not write the "/1 /2" as well as some way to get the barcodes.This would not work since
-T
is forTAGLIST
parameter, you want to give reference with--reference
.You could try
samtools import
to ingest the FASTQ and output to CRAM in a single step. It also adds a header comment to indicate the appropriatesamtools fastq
command to reverse the process. Eg:Note however this isn't always completely round-trippable. That's a good idea and something we should consider, but it's complicated as FASTQ is such a bad format! It has a myriad of different vaguarities and brokenness to contend with, that it'd need a plethora of extra aux tags to document how to reverse the process. However there is still merit in doing this as a means for removing the FASTQ from local storage and reducing the number of copies of data on disk.
Thanks all for your help! I finally found SamFormatConverter and SamToFastq (both Picard) the right tools to do this. See code here Different number of reads when converting data from FASTQ to BAM and CRAM to FASTQ