Different number of reads when converting data from FASTQ to BAM and CRAM to FASTQ
1
0
Entering edit mode
13 months ago
ManuelDB ▴ 110

This post is following up on this other question FASTQ to BAM to CRAM to FASTQ.

I have developed an NGS pipeline for calling variants from amplicon data. Regarding the backup, we want to convert the final BAM file created in a CRAM file to store it. One approach I have been told to do is to do this conversion but ensure that we don't lose data when storing the CRAM file and ensure we can convert the CRAM to a new FASTQ if need it in the future.

I am developing this bit. I have this code so far.

 # ORIGINAL FASTQ
originalfastqgz="RACP2-6poolv6-P5-A_S1_L001_R1_001.fastq.gz"
echo "$originalfastqgz"

#gunzip -c "$originalfastqgz" > "/mainfs/scratch/mdb1c20/runs_Dan/230926_MN01972_0056_A000H5KYM3/RACP2-6poolv6-P5-A/RACP2-6poolv6-P5-A_S1_L001_R1_001.fastq"

#originalfastq="/mainfs/scratch/mdb1c20/runs_Dan/230926_MN01972_0056_A000H5KYM3/RACP2-6poolv6-P5-A/RACP2-6poolv6-P5-A_S1_L001_R1_001.fastq"
#echo "$originalfastq"

# The BAM FROM PIPELINE
bam="RACP2-6poolv6-P5-A_FINAL_SORTED.bam"

# BAM2CRAM

echo "#########################   BAM TO CRAM   ##########################################"
apptainer exec --bind "$ref_folder":"$ref_folder" "$picard" java -Xmx16G -jar /usr/picard/picard.jar  SamFormatConverter \
  INPUT="$bam" \
  OUTPUT=RACP2-6poolv6-P5-A_FINAL_SORTED.cram \
  REFERENCE_SEQUENCE=$bwarefgenomepath
echo "#########################   CRAM CREATED   ##########################################"

echo "#########################   CRAM TO FASTQ   ##########################################"


apptainer exec --bind "$ref_folder":"$ref_folder" "$picard" java -Xmx16G -jar /usr/picard/picard.jar SamToFastq \
  I=RACP2-6poolv6-P5-A_FINAL_SORTED.cram \
  FASTQ=RACP2-6poolv6-P5-A_S1_L001_R1_001__NEW__.fastq.gz \
  REFERENCE_SEQUENCE=$bwarefgenomepath

echo "#########################   NEW FASTQ CREATED  ##########################################"


echo "#########################   COMPARATIVE ANALYSIS  ##########################################"


echo " Count the number of reads in "

echo "Original FASTQ: "
zcat "$originalfastqgz"  | wc -l | awk '{print $1/4}'


echo "NEW FASTQ: "
zcat RACP2-6poolv6-P5-A_S1_L001_R1_001__NEW__.fastq.gz  | wc -l | awk '{print $1/4}'


echo " Compute size in "
echo "Original FASTQ: "
du -sh "$originalfastqgz"

echo "NEW FASTQ: "
du -sh RACP2-6poolv6-P5-A_S1_L001_R1_001__NEW__.fastq.gz

echo "N reads of BAM: "
samtools view -c "$bam"


echo "N reads of CRAM: "
samtools view -c RACP2-6poolv6-P5-A_FINAL_SORTED.cram

and I have got these results

  Count the number of reads in
Original FASTQ:
1753954
NEW FASTQ:
1741324
 Compute size in
Original FASTQ:
129M    RACP2-6poolv6-P5-A_S1_L001_R1_001.fastq.gz
NEW FASTQ:
41M     RACP2-6poolv6-P5-A_S1_L001_R1_001__NEW__.fastq.gz
N reads of BAM:
1754263
N reads of CRAM:
1754263

My question is why the BAM and the CRAM have more reads than the original FASTQ file and why when converting the CRAM to FASTQ, I miss reads (comparing the original FASTQ with the newly created FASTQ)

From the Original-FASTQ to the new one I miss 12630 reads. The BAM (and CRAM) has 309 additional reads.

picard • 1.3k views
ADD COMMENT
0
Entering edit mode

Reading the documentation I have found that I miss INCLUDE_NON_PRIMARY_ALIGNMENTS=true in the SamToFastq. This reduces the difference to around 300 reads but still a large number of reads missing in the conversation.

How can I compare the original and newly created to visualised only the missing reads??

ADD REPLY
1
Entering edit mode

How can I compare the original and newly created to visualised only the missing reads??

Use filterbyname.sh from BBMap suite. Provide names in a file, one per line. Fastq headers without the @ at beginning.

ADD REPLY
0
Entering edit mode

I solved this by adding

INCLUDE_NON_PRIMARY_ALIGNMENTS=true \
INCLUDE_NON_PF_READS=true

so the final code is

apptainer exec --bind "$ref_folder":"$ref_folder" "$picard" java -Xmx16G -jar /usr/picard/picard.jar  SamFormatConverter \
               INPUT="${Sample_ID}"_FINAL_SORTED.bam \
               OUTPUT="${Sample_ID}".cram \
               REFERENCE_SEQUENCE=$bwarefgenomepath



apptainer exec --bind "$ref_folder":"$ref_folder" "$picard" java -Xmx16G -jar /usr/picard/picard.jar SamToFastq \
               I=${Sample_ID}".cram \
               FASTQ="${Sample_ID}"_BACKUP.fastq.gz \
               REFERENCE_SEQUENCE=$bwarefgenomepath \
               INCLUDE_NON_PRIMARY_ALIGNMENTS=true \
               INCLUDE_NON_PF_READS=true

I took the new FASTQ file and I repeated the analysis to compare the VCF file from the original and from the new FASTQ file. Identical VCF files. :)

ADD REPLY
3
Entering edit mode
13 months ago

My question is why the BAM and the CRAM have more reads than the original FASTQ file and why when converting the CRAM to FASTQ, I miss reads (comparing the original FASTQ with the newly created FASTQ)

secondary reads supplementary reads

ADD COMMENT
0
Entering edit mode

Ahh!! I know about that but what I didn't know is that secondary reads and supplementary reads create new reads. Thank you for that!

Still wondering why I missed the reads from the Original FASTQ to the new one.

ADD REPLY
1
Entering edit mode
Original FASTQ:
1753954
NEW FASTQ:
1741324

Something is still off.

Curious as to how much real world space savings are you getting from all these manipulations? Are you dealing with tens of thousands of samples where savings will actually matter?

ADD REPLY
0
Entering edit mode

Yes, I also saw that.

Yes, in one year we will reach that amount. If this is a major issue we don't figure out why this is happening, this may change our mind and store directly the original FASTQ. But at least, I would like to understand why this is happening.

ADD REPLY
0
Entering edit mode

Storing as unaligned BAM may still be useful. How much savings are you getting from storing as CRAM compared to BAM?

ADD REPLY
0
Entering edit mode

~1/2. 45487 KB for the BAM and 28089 KB for the CRAM

ADD REPLY
0
Entering edit mode

There is something in between original FASTQ and BAM/CRAM.

One can cluster similar reads i.e with 'clumpify' from BBMap and check out few block sizes/compression levels to shrink the FASTQ sizes. Also there may be some runs with a significant number of failed/non mappable/too short insert reads. Depending on the policies you may try to filter them out, trim adapters, etc. Or store good and not_so_good reads in separate FASTQ s.

ADD REPLY

Login before adding your answer.

Traffic: 2388 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6