Entering edit mode
4.5 years ago
oconnwald
▴
20
I was wondering how the casing of base calls affected BWA MEM performance.
I ran three sets of data to see if there was any difference at all:
- All base calls capitalized
- All base calls lower-cased
- Random casing for all base calls
I found that the first two sets (when casing was consistent) produced the same exact results whereas there were subtle differences in the last fastq. Here are the samtools flagstats for each:
All upper:
618526 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1242 + 0 supplementary
0 + 0 duplicates
616822 + 0 mapped (99.72% : N/A)
617284 + 0 paired in sequencing
308642 + 0 read1
308642 + 0 read2
612104 + 0 properly paired (99.16% : N/A)
614584 + 0 with itself and mate mapped
996 + 0 singletons (0.16% : N/A)
2130 + 0 with mate mapped to a different chr
1310 + 0 with mate mapped to a different chr (mapQ>=5)
All lower:
618526 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1242 + 0 supplementary
0 + 0 duplicates
616822 + 0 mapped (99.72% : N/A)
617284 + 0 paired in sequencing
308642 + 0 read1
308642 + 0 read2
612104 + 0 properly paired (99.16% : N/A)
614584 + 0 with itself and mate mapped
996 + 0 singletons (0.16% : N/A)
2130 + 0 with mate mapped to a different chr
1310 + 0 with mate mapped to a different chr (mapQ>=5)
Random-casing:
618538 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1254 + 0 supplementary
0 + 0 duplicates
616835 + 0 mapped (99.72% : N/A)
617284 + 0 paired in sequencing
308642 + 0 read1
308642 + 0 read2
612096 + 0 properly paired (99.16% : N/A)
614586 + 0 with itself and mate mapped
995 + 0 singletons (0.16% : N/A)
2140 + 0 with mate mapped to a different chr
1319 + 0 with mate mapped to a different chr (mapQ>=5)
Does anyone know why the aligner is treating the random cased fastq differently?
Here is my BWA/Samtools sort call:
${PATH_TO_BWA} mem \
-R "@RG\tID:${PREFIX}\tSM:${PREFIX}\tLB:${PREFIX}\tPL:ILLUMINA" \
-t ${BWA_THREADS} \
-Y \
${REF_PREFIX} ${R1_FASTQ} ${R2_FASTQ} \
| \
${PATH_TO_SAMTOOLS} sort - \
-@ ${SAMTOOLS_THREADS} \
-o ${OUTPUT_BAM} \
-T ${TMP_DIR}/${PREFIX}_samtools_tmp
See: A: BWA mem output inconsistent on same but re-ordered FASTQ input
I don't think
bwa
has a deterministic option to ensure that you get exactly the same result each time. If you are truly interested then you will need to dive into code to see if the case has any effect.Case has a effect when present in reference (A: Which Aligners Recognize Soft-Masked Repeats In Reference Sequences? ) for some aligners. Heng Li does not recommend using a masked reference.
That's interesting -- but the ordering of the reads isn't changing in each run (only casing), so I don't know if the first link applies. We are using a reference that has masked bases though.
As an update, I've run the random-cased fastqs alignment with our reference modified to not include masking (i.e. I capitalized all lower-case bases), and I'm getting equivalent results as I did when I ran alignment of the fastqs with the reference with masking. So, I think the casing is here making a difference?
How did you lower case and random case fastqs? My suspicion is the reads aren't identical - that is, in addition to the different casing.
The number of reads us different in your third sample. Find out why.
Total reads includes supplemental reads -- you can see the offset on both is +12. If there were variation in the number of raw reads in the same, you'd see that reflected in the read1 and read2 counts
Just a simple Python script (I'm happy to post it if you'd like). I've confirmed that the random casing reads are identical in content (other than casing) by converting the random-cased bases back to the original upper casing and running a diff.