Entering edit mode
9.7 years ago
marina.v.yurieva
▴
580
I'm mapping two samples independently (PE reads) using bwa-mem:
bwa mem -t 3 -M ref.fasta sample1_R1.fastq sample1_R2.fastq > sample1.sam # (2121026 mapped reads)
bwa mem -t 3 -M ref.fasta sample2_R1.fastq sample2_R2.fastq > sample2.sam #(2485112 mapped reads)
But when I pool reads from two samples together and map them the number of mapped reads is higher: 5315859 reads vs 4606138. Why is this happening? Don't the reads map independently?
Usually things like this end up being due to the use of a random number generator at some point during the seeding step.
I would think so myself... @Devon Isn't the difference 53M vs 46M a bit too large to be due to randomness?
Well at least, there are some calculations about the median size of the inserts in a pool of reads.
Pierre, but this is
sampe
option, notmem
...I think there is the same algorithm in mem:
Yeah, I'm pretty sure it infers the insert size (and library type) from the data. So different data could yield different results. However, it shouldn't be that different unless one of the datasets is just not being handled well and the other is helping bwa infer things (or, inversely, one dataset is just borking the inference step and then throwing everything off). Doesn't picard have an command that will give things like insert size distribution? It'd be interesting to run that on the two SAM files produced individually to see if they have very different insert size distributions.
I think there is something wrong with my fastq files. I subsampled them and did the same analysis and everything adds up and the distribution of insert sizes looks good. But for the original files the distribution plot shows very long insert size (~8000) and shows it as "Tandem" instead of "FR". And the distribution plot on two original samples pooled together looks good. Looks like bwa has a problem with my original fastq files...
check if you have some empty reads in your FASTQ.
check the number of reads in each FASTQ R1/R2: numbers should be the same.
Yeah, that does seem a bit over the top to me too. Perhaps marina.v.yurieva can post the file somewhere just so this can be confirmed.