I have an interesting problem. I'm aligning some paired end reads, each read file (read1/read2) has 3235888 reads per file or 3235888 read pairs as input. My alignment command is...
bwa sampe hg19.reference.fa aln.r1.sai aln.r2.sai read1.fq read2.fq > aligned.bam
After alignment I checked my bam file by running samtools flagstat and I got this....
6471776 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
6416289 + 0 mapped (99.14% : N/A)
6471776 + 0 paired in sequencing
3235888 + 0 read1
3235888 + 0 read2
6380400 + 0 properly paired (98.59% : N/A)
6398872 + 0 with itself and mate mapped
17417 + 0 singletons (0.27% : N/A)
12428 + 0 with mate mapped to a different chr
8191 + 0 with mate mapped to a different chr (mapQ>=5)
The numbers match up perfectly with my input. Everything looks okay so far. However, a problem occurs when I try to find certain read pairs in the bam file. For example, this read pair exists in the input fastq files...
@K00123:1845:HCCV6ACX1:2:1113:5997:6622
AGTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
DDDDDIIIIIIIIIIIIHIGIIIIIIIIIIIIGHIIIIIIIIIIIGHII
@K00123:1845:HCCV6ACX1:2:1113:5997:6622
CCGAACACAATAACTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAG
+
00000//111111<1<<111111<<F<11DGE@G?11<<<0CEDH<CCE
But when I search for this read pair in the bam using samtools view, I cannot find it! My search command is...
samtools view aligned.bam | grep AGTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
samtools view aligned.bam | grep CCGAACACAATAACTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAG
These commands return nothing. At first I though that samtools view might not print all lines so I checked using this command...
samtools view aligned.bam | wc -l
This command returns 6471776, which indicates that all reads are being printed by the samtools view command and yet I still cannot see that read pair. Any ideas what might be going on here?
as these are read pairs, I would expect one of the two to map directly on the reference
Thanks! The read was reverse complemented and its mate was also reverse complemented. I was able to verify by following your suggestion of grepping the read name. Ugh, such a newbie mistake :/