This post is following up on this other question FASTQ to BAM to CRAM to FASTQ.
I have developed an NGS pipeline for calling variants from amplicon data. Regarding the backup, we want to convert the final BAM file created in a CRAM file to store it. One approach I have been told to do is to do this conversion but ensure that we don't lose data when storing the CRAM file and ensure we can convert the CRAM to a new FASTQ if need it in the future.
I am developing this bit. I have this code so far.
# ORIGINAL FASTQ
originalfastqgz="RACP2-6poolv6-P5-A_S1_L001_R1_001.fastq.gz"
echo "$originalfastqgz"
#gunzip -c "$originalfastqgz" > "/mainfs/scratch/mdb1c20/runs_Dan/230926_MN01972_0056_A000H5KYM3/RACP2-6poolv6-P5-A/RACP2-6poolv6-P5-A_S1_L001_R1_001.fastq"
#originalfastq="/mainfs/scratch/mdb1c20/runs_Dan/230926_MN01972_0056_A000H5KYM3/RACP2-6poolv6-P5-A/RACP2-6poolv6-P5-A_S1_L001_R1_001.fastq"
#echo "$originalfastq"
# The BAM FROM PIPELINE
bam="RACP2-6poolv6-P5-A_FINAL_SORTED.bam"
# BAM2CRAM
echo "######################### BAM TO CRAM ##########################################"
apptainer exec --bind "$ref_folder":"$ref_folder" "$picard" java -Xmx16G -jar /usr/picard/picard.jar SamFormatConverter \
INPUT="$bam" \
OUTPUT=RACP2-6poolv6-P5-A_FINAL_SORTED.cram \
REFERENCE_SEQUENCE=$bwarefgenomepath
echo "######################### CRAM CREATED ##########################################"
echo "######################### CRAM TO FASTQ ##########################################"
apptainer exec --bind "$ref_folder":"$ref_folder" "$picard" java -Xmx16G -jar /usr/picard/picard.jar SamToFastq \
I=RACP2-6poolv6-P5-A_FINAL_SORTED.cram \
FASTQ=RACP2-6poolv6-P5-A_S1_L001_R1_001__NEW__.fastq.gz \
REFERENCE_SEQUENCE=$bwarefgenomepath
echo "######################### NEW FASTQ CREATED ##########################################"
echo "######################### COMPARATIVE ANALYSIS ##########################################"
echo " Count the number of reads in "
echo "Original FASTQ: "
zcat "$originalfastqgz" | wc -l | awk '{print $1/4}'
echo "NEW FASTQ: "
zcat RACP2-6poolv6-P5-A_S1_L001_R1_001__NEW__.fastq.gz | wc -l | awk '{print $1/4}'
echo " Compute size in "
echo "Original FASTQ: "
du -sh "$originalfastqgz"
echo "NEW FASTQ: "
du -sh RACP2-6poolv6-P5-A_S1_L001_R1_001__NEW__.fastq.gz
echo "N reads of BAM: "
samtools view -c "$bam"
echo "N reads of CRAM: "
samtools view -c RACP2-6poolv6-P5-A_FINAL_SORTED.cram
and I have got these results
Count the number of reads in
Original FASTQ:
1753954
NEW FASTQ:
1741324
Compute size in
Original FASTQ:
129M RACP2-6poolv6-P5-A_S1_L001_R1_001.fastq.gz
NEW FASTQ:
41M RACP2-6poolv6-P5-A_S1_L001_R1_001__NEW__.fastq.gz
N reads of BAM:
1754263
N reads of CRAM:
1754263
My question is why the BAM and the CRAM have more reads than the original FASTQ file and why when converting the CRAM to FASTQ, I miss reads (comparing the original FASTQ with the newly created FASTQ)
From the Original-FASTQ to the new one I miss 12630 reads. The BAM (and CRAM) has 309 additional reads.
Reading the documentation I have found that I miss
INCLUDE_NON_PRIMARY_ALIGNMENTS=true
in theSamToFastq
. This reduces the difference to around 300 reads but still a large number of reads missing in the conversation.How can I compare the original and newly created to visualised only the missing reads??
Use
filterbyname.sh
from BBMap suite. Provide names in a file, one per line. Fastq headers without the@
at beginning.I solved this by adding
so the final code is
I took the new FASTQ file and I repeated the analysis to compare the VCF file from the original and from the new FASTQ file. Identical VCF files. :)