Question

Why extracting reads from BAM to fastq produces small amount of reads?

0

Entering edit mode

2.1 years ago

artemd ▴ 20

Hi,

I got a peculiar issue with one of my datasets - when I try to extract reads from a BAM file (which has 20 mil reads) to a fastq - I get only ~400k reads.

Here is what I do:

I first sort the bam by name:

samtools sort -n -o sample1.bam.sortedByName sample1.bam

Then I tried to extract fastq's by two methods:

samtools fastq sample1.bam.sortedByName -n -1 sample1.r1.fastq -2 sample1.r2.fastq -0 /dev/null -s /dev/null

method2:

bedtools bamtofastq -i sample1.bam.sortedByNam -fq sample1.r1.fastq -fq2sample1.r2.fastq

Both methods produce only 400K reads.

While running the bedtools command it prints endless stream of errors along the lines of "*WARNING: Query XXX is marked as paired, but its mate does not occur next to it in your BAM file. Skipping."

I searched for the name of the read in the name-sorted file and it looks like this: both reads are next to each other, so I'm not sure why I'm getting this warning. Here are the reads as they appear in the bam file in the bam file:

HWI-D00792:38:C6RFNANXX:7:2316:21292:95981#AGTACAAGAGTGGTCA 0 17 67187275 60 100M 0 0 TGAACACTTACATATATTCTATGTGATTACACAGTAAGTTACCTGGAAATTTGTTC HWI-D00792:38:C6RFNANXX:7:2316:21292:95981#AGTACAAGAGTGGTCA 16 17 67187369 60 100M 0 0 ACTTTCTGTTGTTAACTTGGCATCAGGAATGTGCTGCTTAATAAGGGATGTGATTT HWI-D00

I also run samtools flagstat on the original bam and it looked like this:

21404749 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
88046 + 0 supplementary
0 + 0 duplicates
21321753 + 0 mapped (99.61% : N/A)
990948 + 0 paired in sequencing
492327 + 0 read1
498621 + 0 read2
767804 + 0 properly paired (77.48% : N/A)
904244 + 0 with itself and mate mapped
31149 + 0 singletons (3.14% : N/A)
58439 + 0 with mate mapped to a different chr
19917 + 0 with mate mapped to a different chr (mapQ>=5)

Please let me know if you spot anything that could cause the problems I'm having, and whether there is a solution that I might want to try. Thanks,

picard samtools BAM fastq • 973 views

ADD COMMENT • link updated 2.1 years ago by ATpoint 89k • written 2.1 years ago by artemd ▴ 20

score 2 · Accepted Answer · 2023-08-06

2

Entering edit mode

2.1 years ago

ATpoint 89k

As the flagstat tells, most reads are not paired. Why that is you have to sort out with the person you created the bam. In samtools fastq you have to give a filename to -0 to retain these unpaired reads.

ADD COMMENT • link 2.1 years ago by ATpoint 89k