Hi everyone,
I am currently mapping reads of different lengths (from 30 to 100 bp) and I have been using BWA-mem and BWA-aln. I know that BWA-aln works better for these read lengths, but while analysing my BAM files I realised that every read < 35 bp is flagged as "unmapped" by BWA-mem. So here is my question:
Do reads shorter than 35 bp fail to map because BWA-mem is not optimised for them? If so, why would some reads of 37-39 bp map? (I have around 50K reads < 35 bp and 100K reads < 40 bp.)
or
Is there, somewhere in the code, a read-length limit below which BWA-mem doesn't even try to align the read? It would be good to find proof in the code, but I didn't find anything about that in the source (https://github.com/lh3/bwa). My C skills are not the best, though, so I might have missed it.
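For reference, this is roughly how I checked my BAM files: a minimal pysam sketch that tallies mapped vs. unmapped reads per read length (`aln.bam` is a placeholder for my actual file).

```python
# Minimal sketch: tally mapped vs. unmapped reads per read length in a BAM.
# Assumes pysam is installed; "aln.bam" is a placeholder file name.
from collections import defaultdict

import pysam

mapped = defaultdict(int)
unmapped = defaultdict(int)

with pysam.AlignmentFile("aln.bam", "rb") as bam:
    # until_eof=True also yields unmapped records and needs no index.
    for read in bam.fetch(until_eof=True):
        # Count each read once: skip secondary/supplementary records.
        if read.is_secondary or read.is_supplementary:
            continue
        if read.is_unmapped:
            unmapped[read.query_length] += 1
        else:
            mapped[read.query_length] += 1

print("length\tmapped\tunmapped")
for length in sorted(set(mapped) | set(unmapped)):
    print(f"{length}\t{mapped[length]}\t{unmapped[length]}")
```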
Thanks a lot!
There is no hard cutoff in the code afaik. It is probably a combination of the alignment score and the other possible locations the reads could map to: the shorter the read, the higher the probability that it multimaps. Parameters in `mem` are optimized for reads > 70 bp. So you have 150K very short reads; do they even matter in the bigger picture? You typically have millions of reads, don't you? Why not just ignore them?

Thanks for the quick answer. Yeah, I don't use them in my analysis; I was just curious why some reads of 36-38 bp map when none of 35 bp and below do. I was looking at the code and couldn't find any strict cutoff value, so I was kind of lost, but yeah, I think it is probably the alignment score becoming too low for those reads to be reported. Thanks!
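For anyone who wants to test the score-threshold idea: one way would be to re-map the short reads with the minimum-score filter relaxed and see whether they align. A minimal sketch, assuming `bwa` is on the PATH and pysam is installed; `ref.fa` and `short.fq` are placeholders, and `-k 10` / `-T 0` are illustrative rather than recommended values.

```python
# Sketch: re-map short reads with bwa mem's score filter relaxed to test
# whether the default minimum output score (-T 30) is what drops them.
# Assumes bwa and pysam are available; "ref.fa" and "short.fq" are
# placeholders, and -k 10 / -T 0 are illustrative, not recommended, values.
import subprocess

import pysam

with open("relaxed.sam", "w") as out:
    subprocess.run(
        # -k lowers the minimum seed length (default 19);
        # -T 0 disables the minimum-score output filter (default 30).
        ["bwa", "mem", "-k", "10", "-T", "0", "ref.fa", "short.fq"],
        stdout=out,
        check=True,
    )

total = mapped = 0
with pysam.AlignmentFile("relaxed.sam", "r") as sam:
    for read in sam:
        if read.is_secondary or read.is_supplementary:
            continue
        total += 1
        mapped += not read.is_unmapped
print(f"{mapped}/{total} reads mapped with -k 10 -T 0")
```

If the short reads suddenly align with `-T 0`, the default score threshold is the likely reason they were flagged unmapped; they may still be low-quality multimappers, though.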
Have you checked whether the 35 bp reads are all Ns? bcl2fastq, the program most commonly used to generate FASTQ files from raw BCL data, by default resets reads that become too short after adapter trimming to a stretch of 35 Ns. This could explain the gap between 35 and 36 bp.
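A quick way to check is to count how many reads consist entirely of Ns. A minimal sketch (`reads.fastq` is a placeholder for the real file):

```python
# Sketch: count reads in a FASTQ that consist entirely of Ns, which is
# what bcl2fastq's masking would produce. "reads.fastq" is a placeholder.
all_n = total = 0
with open("reads.fastq") as fq:
    for i, line in enumerate(fq):
        if i % 4 == 1:  # in 4-line FASTQ records, the sequence is line 2
            total += 1
            seq = line.strip().upper()
            if seq and set(seq) == {"N"}:
                all_n += 1
print(f"{all_n} of {total} reads are all Ns")
```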
Hi, I didn't mention it, but the reads I am mapping are simulated and hence contain no Ns. Thanks a lot, though, that could be an explanation for empirical data.
If you can upload a representative sample of reads, say a few hundred thousand, to e.g. Dropbox, along with the reference genome (either the genome itself or a link to it), maybe some of us can take a look at the issue.
Somewhat related: there is a faster version of `bwa-mem`, `bwa-mem2` (https://github.com/bwa-mem2/bwa-mem2).