Hi everyone,
I am currently mapping reads of different lengths (from 30 to 100 bp) and I have been using BWA-mem and BWA-aln. I know that BWA-aln works better for these read lengths, but while analysing my BAM files I realised that every read < 35 bp is flagged as "unmapped" by BWA-mem. So here is my question:
Do reads shorter than 35 bp fail to map because BWA-mem is not optimised for them? If so, why would some reads of 37-39 bp map? (I have around 50K reads < 35 bp and 100K reads < 40 bp.)
or
Is there, somewhere in the code, a read-length limit below which BWA-mem doesn't even try to align the read? It would be good to find proof in the code, but I didn't find anything about that in the source (https://github.com/lh3/bwa). My C skills are not the best, though, so I might have missed it.
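For reference, this is roughly how I checked my BAM files: a minimal pysam sketch that tallies mapped vs. unmapped reads per read length (`aln.bam` is a placeholder for my actual file).

```python
# Minimal sketch: tally mapped vs. unmapped reads per read length in a BAM.
# Assumes pysam is installed; "aln.bam" is a placeholder file name.
from collections import defaultdict

import pysam

mapped = defaultdict(int)
unmapped = defaultdict(int)

with pysam.AlignmentFile("aln.bam", "rb") as bam:
    # until_eof=True also yields unmapped records and needs no index.
    for read in bam.fetch(until_eof=True):
        # Count each read once: skip secondary/supplementary records.
        if read.is_secondary or read.is_supplementary:
            continue
        if read.is_unmapped:
            unmapped[read.query_length] += 1
        else:
            mapped[read.query_length] += 1

print("length\tmapped\tunmapped")
for length in sorted(set(mapped) | set(unmapped)):
    print(f"{length}\t{mapped[length]}\t{unmapped[length]}")
```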
Thanks a lot!
There is no hard cutoff in the code afaik. It is probably a combination of the alignment score and the other possible locations the reads could map to: the shorter the read, the higher the probability that it multimaps. Parameters in `mem` are optimized for reads > 70 bp. So you have 150K very short reads; do they even matter in the bigger picture? You typically have millions of reads, don't you? Why not just ignore them?

Thanks for the quick answer. Yeah, I don't use them in my analysis; I was just curious why some reads of 36-38 bp map when none of 35 bp and below do. I was looking at the code and couldn't find any strict cutoff value, so I was kind of lost, but yeah, I think it is probably the alignment score becoming too low for those reads to be reported. Thanks!
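For anyone who wants to test the score-threshold idea: one way would be to re-map the short reads with the minimum-score filter relaxed and see whether they align. A minimal sketch, assuming `bwa` is on the PATH and pysam is installed; `ref.fa` and `short.fq` are placeholders, and `-k 10` / `-T 0` are illustrative rather than recommended values.

```python
# Sketch: re-map short reads with bwa mem's score filter relaxed to test
# whether the default minimum output score (-T 30) is what drops them.
# Assumes bwa and pysam are available; "ref.fa" and "short.fq" are
# placeholders, and -k 10 / -T 0 are illustrative, not recommended, values.
import subprocess

import pysam

with open("relaxed.sam", "w") as out:
    subprocess.run(
        # -k lowers the minimum seed length (default 19);
        # -T 0 disables the minimum-score output filter (default 30).
        ["bwa", "mem", "-k", "10", "-T", "0", "ref.fa", "short.fq"],
        stdout=out,
        check=True,
    )

total = mapped = 0
with pysam.AlignmentFile("relaxed.sam", "r") as sam:
    for read in sam:
        if read.is_secondary or read.is_supplementary:
            continue
        total += 1
        mapped += not read.is_unmapped
print(f"{mapped}/{total} reads mapped with -k 10 -T 0")
```

If the short reads suddenly align with `-T 0`, the default score threshold is the likely reason they were flagged unmapped; they may still be low-quality multimappers, though.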
Have you checked whether the 35 bp reads are all Ns? bcl2fastq, the program most commonly used to generate FASTQ files from raw BCL data, by default resets reads that become too short after adapter trimming to a stretch of 35 Ns. This could explain the gap between 35 and 36 bp.
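A quick way to check is to count how many reads consist entirely of Ns. A minimal sketch (`reads.fastq` is a placeholder for the real file):

```python
# Sketch: count reads in a FASTQ that consist entirely of Ns, which is
# what bcl2fastq's masking would produce. "reads.fastq" is a placeholder.
all_n = total = 0
with open("reads.fastq") as fq:
    for i, line in enumerate(fq):
        if i % 4 == 1:  # in 4-line FASTQ records, the sequence is line 2
            total += 1
            seq = line.strip().upper()
            if seq and set(seq) == {"N"}:
                all_n += 1
print(f"{all_n} of {total} reads are all Ns")
```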
Hi, I didn't mention it, but the reads I am mapping are simulated and hence contain no Ns. Thanks a lot, though, that could be an explanation for empirical data.
If you can upload a representative sample of reads, say a few hundred thousand, to e.g. Dropbox, along with the reference genome (either the genome itself or a link to it), maybe some of us can take a look at the issue.
Somewhat related: there is a faster version of `bwa-mem`, `bwa-mem2` (https://github.com/bwa-mem2/bwa-mem2).