I am using BWA to map PE reads from 300-400bp fragments to a reference genome. After mapping I am using SAMtools to filter out "properly mapping pairs" using the -f 0x002 option.
Here is my BWA/SAMtools syntaxt
$> bwa aln -t 12 -f <input index read 1> myoLuc2.bwa.bwtsw <read 1 fastq>
$> bwa aln -t 12 -f <input index read 2> myoLuc2.bwa.bwtsw <read 2 fastq>
$> bwa sampe -a 400 myoLuc2.bwa.bwtsw <input index read 1> <input index read 2> <read1 fastq> <read2 fastq> >mluc_sampe.sam
$> samtools view -Sb -f 0x002 mluc_sampe.sam | bedtools bamtobed -i - | bedtools sort -i - >mluc_sampe_sorted.bed
Looking at the final BED file...in some cases a read pair maps as would be expected:
AAPE02060494 12296 12318 D3NH4HQ1:150:C1FTFACXX:7:2211:19990:32392/1 15 +
AAPE02060494 12619 12680 D3NH4HQ1:150:C1FTFACXX:7:2211:19990:32392/2 15 -
In this case the ...32392/1 maps ~300 bp downstream from 32392/2 and on the opposite strand.
However the VAST MAJORITY of the reads look like the following:
AAPE02062872 9054 9148 D3NH4HQ1:150:C1FTFACXX:7:1212:7825:54903/1 60 +
AAPE02062872 9054 9148 D3NH4HQ1:150:C1FTFACXX:7:1216:8413:88366/1 60 +
AAPE02062872 9054 9148 D3NH4HQ1:150:C1FTFACXX:7:1303:1882:25192/1 60 +
AAPE02062872 9054 9148 D3NH4HQ1:150:C1FTFACXX:7:2109:3484:63930/1 60 +
AAPE02062872 9310 9371 D3NH4HQ1:150:C1FTFACXX:7:1303:18099:51689/2 60 -
AAPE02062872 9311 9371 D3NH4HQ1:150:C1FTFACXX:7:1212:4340:70026/2 60 -
AAPE02062872 9312 9371 D3NH4HQ1:150:C1FTFACXX:7:1301:19221:37959/2 60 -
AAPE02062872 9312 9371 D3NH4HQ1:150:C1FTFACXX:7:2109:14642:87311/2 60 -
In the above scenario the first read ...54903/1 is not mapped with the 54903/2 read. These are the only 8 reads mapping to the AAPE02062872 contig. In fact the 54903/2 read is not mapped period. None of the 8 reads in the above example comprise pairs from the same flowcell location (fragment) though to the eye, they "look" like properly mapping pairs. Should these have been filtered out by SAMtools? Does SAMtools or BWA do something that I am not aware of to justify calling these "properly mapping?"
I checked to see if my fastq files were out of order causing BWA to combine reads that shouldn't be.
The ...54903 reads are both in the input fastq files and at the same line number within that file.
$> grep -n 54903 mluc-FLK_sorted.fq
7861:@D3NH4HQ1:150:C1FTFACXX:7:1212:7825:54903 2:N:0:GCCTAA
$> grep -n 54903 mluc-VES_sorted.fq
7861:@D3NH4HQ1:150:C1FTFACXX:7:1212:7825:54903 1:N:0:GCCTAA
Am I missing something? Any suggestions or hints? Thanks for your help in advance.
I have included the SAM output below. It is pretty ugly as far as formatting...not sure how to make it prettier.
$> grep AAPE02062872 mluc_sampe.sam
@SQ SN:AAPE02062872 LN:18857
D3NH4HQ1:150:C1FTFACXX:7:1212:7825:54903 99 AAPE02062872 9055 60 94M = 9312 317 NTAAAAAAAATATATTTTTTAAAGATGGGGCCTAGGATGTACCAACTGATTATATGCTAATTATTGGTTGGAGGTGAGGGAATCTTAGAACTTG #2<CFHHIJJJJJJJJJJJJJJIJIJJJJJJJJJIJHIIGIGHHEEHFFFFFFFEEEEDEEEDEFDDDDDDCDB<A>BDBDDBDDCACCCACCA XT:A:U NM:i:73 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:2XO:i:0 XG:i:0 MD:Z:0T0G0T0T0T0T0G0C0C0T1T0A0G1G0G0A0A0A1G0G0A0C0A0C0C0A0T0T0G0A1A0T0T4T0G0T0T0T1A0G0G1T0C1A1T0A1T0G1T2C0A1A0T1C0T0C0A0T0T1A0A0G1A0A0C0A0T0T0T0T0A0A0A1
D3NH4HQ1:150:C1FTFACXX:7:1212:4340:70026 147 AAPE02062872 9312 60 60M = 9055 -317 ATAAGCGATTGTATAGAAGCCTCAATTCACCAGACTCTATACTAAAAATGGGTGGATTTT C>DDDDDDCEDEEDDDDDDDDDCDBDDCACDDCCDDEEEEEEDFFFFFHHHHJJJIJJJJ XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:4C55
D3NH4HQ1:150:C1FTFACXX:7:1216:8413:88366 99 AAPE02062872 9055 60 94M = 9313 317 TCAGGAAAAACCCCCACACAATTTTTAATAAACAGAATTCTGTTACATGACAGTACTGGTGTTACTGCTAAAGGCTTGCAACAATTATCTAATT 8BFHHHHJJJJJJIJJJJJJEIJJJJJJJGIIIJJIJJIJIJGIGGIIJIJIJGIIJJHEGHHHHHHFFFFCEACEEDDCDDDDDD<CCDC:A> XT:A:U NM:i:70 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1XO:i:0 XG:i:0 MD:Z:1G0T0T0T0T0G0C0C0T0T0T0A0G0T0G0G1A2G0G0A0C0A0C0C0A0T0T0G0A1A0T0T1G0T0A0T0G1T0T0T0A1G0T0T0C0A1G1A0A4T0T0C0A0T0A0T0G0C0T0C0A3A1G0A2C0A0T1T3A0G0
D3NH4HQ1:150:C1FTFACXX:7:1301:19221:37959 147 AAPE02062872 9313 60 59M = 9055 -317 TAAGCGATTGTATAGAAGCCTCAATTCACCAGACTCTATACTAAAAATGGGTGGATTTT @@B@BCCDDDEEDDDDDDCACDDC>DDDCCCC@DDEEFCAECDBCDFFHHHGIIGJIIH XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:3C55
D3NH4HQ1:150:C1FTFACXX:7:1303:1882:25192 99 AAPE02062872 9055 60 94M = 9311 317 TTTCCCGGCTCCCTTACCAATAGAGTTAGTCATGTGATATGGTTCCAGCCAATGAGTTGTAAACACAGGTTTTGAATGAGACTTCCTAGAAGTT +=CFHHHJJJJJJJJJJJJJJJJJJCFHIHJJJJIJJIJIJJGGFGHIIJJGIJGIGJJGIGIJJJHHHAHFFFDEEE@C@CDDDDCDCDCDC# XT:A:U NM:i:65 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0XO:i:0 XG:i:0 MD:Z:1G1T0T0T1C2T0T0A0G1G0G0A2A0G2C0A0C0C0A1T0G0A0A0A0T0T1G1A0T0G1T0T0T0A0G0G0T0T0C0A3A0A1G0T0T0T0T2T0A1G0C1C1T3A0G0A0A0A1A1T0T0T1A0A0G0
D3NH4HQ1:150:C1FTFACXX:7:1303:18099:51689 147 AAPE02062872 9311 60 61M = 9055 -317 GATAAGCGATTGTATAGAAGCCTCAATTCACCAGACTCTATACTAAAAATGGGTGGATTTT :DBA?D@DCDEDDEDCDDDDDDCC@DCCCCDDDEEEEEEEEDFFFFHHHFHIJJJIJJJJJ XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:5C55
D3NH4HQ1:150:C1FTFACXX:7:2109:3484:63930 99 AAPE02062872 9055 60 94M = 9313 317 NGAGTCTGATGATATTAAGTACAAGTAGATATAATAACTAGAATTCCCTTTCATTGCTACTAGAAGTACAAATGTTACAACCGTTTCGAAAAAT #4CFFHHJJJJJJJJJJJJJIJJJJHIJJJJJJJJJJJJJJJJJHIJJJJJJIJJHJJJJJJIJIJIIJIGIJJGIJIHHHHFFFEDDDDDDD5 XT:A:U NM:i:71 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1XO:i:0 XG:i:0 MD:Z:0T1T0T1T0G0C0C1T0T0A0G1G0G1A0A1G0G1C0A0C0C2T0G2A0T0T0T0G0T0A0T0G2T0T0A0G0G1T0C0A0A1T0A1T0G0T0T0T0T0C0A0T0A0T0G0C1C0A1T0G2G0A0A0A0C0A0T0T0T0T3G0
D3NH4HQ1:150:C1FTFACXX:7:2109:14642:87311 147 AAPE02062872 9313 60 59M = 9055 -317 TAAGCGATTGTATAGAAGCCTCAATTCACCAGACTCTATACTAAAAATGGGTGGATTTT #DDDDDDDDDEEDDDDDDDDDDEEDDDDDDCC@DDEFEDCDDDEEEDDDFFFFFHHHHJ XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:3C55
Have you looked at the SAM file? It will be easier to diagnose this from there.
I have updated with information from the SAM file. Thanks.