Hi,
I got a peculiar issue with one of my datasets - when I try to extract reads from a BAM file (which has 20 mil reads) to a fastq - I get only ~400k reads.
Here is what I do:
I first sort the bam by name:
samtools sort -n -o sample1.bam.sortedByName sample1.bam
Then I tried to extract fastq's by two methods:
samtools fastq sample1.bam.sortedByName -n -1 sample1.r1.fastq -2 sample1.r2.fastq -0 /dev/null -s /dev/null
method2:
bedtools bamtofastq -i sample1.bam.sortedByNam -fq sample1.r1.fastq -fq2sample1.r2.fastq
Both methods produce only 400K reads.
While running the bedtools command it prints endless stream of errors along the lines of "*WARNING: Query XXX is marked as paired, but its mate does not occur next to it in your BAM file. Skipping."
I searched for the name of the read in the name-sorted file and it looks like this: both reads are next to each other, so I'm not sure why I'm getting this warning. Here are the reads as they appear in the bam file in the bam file:
HWI-D00792:38:C6RFNANXX:7:2316:21292:95981#AGTACAAGAGTGGTCA 0 17 67187275 60 100M 0 0 TGAACACTTACATATATTCTATGTGATTACACAGTAAGTTACCTGGAAATTTGTTC HWI-D00792:38:C6RFNANXX:7:2316:21292:95981#AGTACAAGAGTGGTCA 16 17 67187369 60 100M 0 0 ACTTTCTGTTGTTAACTTGGCATCAGGAATGTGCTGCTTAATAAGGGATGTGATTT HWI-D00
I also run samtools flagstat on the original bam and it looked like this:
21404749 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
88046 + 0 supplementary
0 + 0 duplicates
21321753 + 0 mapped (99.61% : N/A)
990948 + 0 paired in sequencing
492327 + 0 read1
498621 + 0 read2
767804 + 0 properly paired (77.48% : N/A)
904244 + 0 with itself and mate mapped
31149 + 0 singletons (3.14% : N/A)
58439 + 0 with mate mapped to a different chr
19917 + 0 with mate mapped to a different chr (mapQ>=5)
Please let me know if you spot anything that could cause the problems I'm having, and whether there is a solution that I might want to try. Thanks,