Question

BWA MEM mapping issue

0

Entering edit mode

9.7 years ago

marina.v.yurieva ▴ 580

I'm mapping two samples independently (PE reads) using bwa-mem:

bwa mem -t 3 -M ref.fasta sample1_R1.fastq sample1_R2.fastq > sample1.sam # (2121026 mapped reads)
bwa mem -t 3 -M ref.fasta sample2_R1.fastq sample2_R2.fastq > sample2.sam #(2485112 mapped reads)

But when I pool reads from two samples together and map them the number of mapped reads is higher: 5315859 reads vs 4606138. Why is this happening? Don't the reads map independently?

alignment next-gen bwa • 3.9k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by marina.v.yurieva ▴ 580

0

Entering edit mode

Usually things like this end up being due to the use of a random number generator at some point during the seeding step.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by Devon Ryan 104k

1

Entering edit mode

Don't the reads map independently?

I would think so myself... @Devon Isn't the difference 53M vs 46M a bit too large to be due to randomness?

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by dariober 15k

0

Entering edit mode

Don't the reads map independently?

Well at least, there are some calculations about the median size of the inserts in a pool of reads.

-a INT Maximum insert size for a read pair to be considered being mapped properly. Since 0.4.5, this option is only used when there are not enough good alignment to infer the distribution of insert sizes. [500]

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Pierre, but this is sampe option, not mem...

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by marina.v.yurieva ▴ 580

0

Entering edit mode

I think there is the same algorithm in mem:

[M::main_mem] read 200 sequences (20000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (42, 0, 0, 37)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (15, 23, 30)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 60)

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

Yeah, I'm pretty sure it infers the insert size (and library type) from the data. So different data could yield different results. However, it shouldn't be that different unless one of the datasets is just not being handled well and the other is helping bwa infer things (or, inversely, one dataset is just borking the inference step and then throwing everything off). Doesn't picard have an command that will give things like insert size distribution? It'd be interesting to run that on the two SAM files produced individually to see if they have very different insert size distributions.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by Devon Ryan 104k

0

Entering edit mode

I think there is something wrong with my fastq files. I subsampled them and did the same analysis and everything adds up and the distribution of insert sizes looks good. But for the original files the distribution plot shows very long insert size (~8000) and shows it as "Tandem" instead of "FR". And the distribution plot on two original samples pooled together looks good. Looks like bwa has a problem with my original fastq files...

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by marina.v.yurieva ▴ 580

0

Entering edit mode

check if you have some empty reads in your FASTQ.

check the number of reads in each FASTQ R1/R2: numbers should be the same.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Yeah, that does seem a bit over the top to me too. Perhaps marina.v.yurieva can post the file somewhere just so this can be confirmed.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.7 years ago by Devon Ryan 104k