Hello,
I would like to be able to align around 500 short sequences (~20 nucleotides) to the human genome in a super quick way.
I would like to authorize up to 5 mismatches but no gaps. I also would like to retrieve all possible hits for one sequence... not just the best ones. Then, I would like to be able to annotate the hits.
I will need to repeat this work very often.
I investigated the NGS tools like bowtie or BWA to see if I could tweak them but it seems that the number of mismatches are limited to 3 and 1 respectively...this is a pity because it would have been great to have the results in a SAM format to do the follow up annotation step!
Could anyone advise me one a very fast aligner to do what I would like please?
Many thanks!
did you play with the options of bwa ?
I got curious about this... Actually it seems that persuading bwa mem to align very short reads it's not easy... Here's a little test I performed on hg19 genome. One read is 20bp with no mismatches, the other has just 1 mismatch. I can't get bwa mem to align the second read!
(the second read has 1 mismatch on the second base)
I set
-B
and-T
to be very permissive but I had no luck in aligning both reads! (I played a bit also with-w
, no better)Thanks for trying Dariober!
I do not know well BWA so I do not get the out put very well... but could it be that he only returns the best alignment?
Yes, by default bwa returns the best alignment and in case of ties returns one of the best hits at random. The point here is that the second read is reported as unmapped while I expected to be mapped at the same location as the first one.
bwa is not a generic aligner, it was not designed to work for very short reads, bowtie does however work to some extent for miRNA sized reads (18-24bp)
Indeed, I just thought with some massaging of the parameters it could be done... But as I said below, an NGS aligner might not be the best choice...
Hello
Thanks for answering!
Would you have an idea of how to play with this parameter please? I do not see how I could control the number of mismatches from this...
Thanks a lot
Maybe NGS tools are not the way to go...
They just sound interesting in the first place. But apparently they are not designed to provide all possible alignment hits for a given sequence... (?)
I was going to suggest that NGS aligners are not the best choice... As a side note, keep in mind that allowing 5 mismatches in 20bp and asking for all the hits is probably going to generate millions of records in output.
One option I would try is to use a general alignment tool like vmatch.
The problem is that the more mismatches you allow in a short read the more diverse the read becomes and more locations it can align to. That is not what most fast mappers were designed to do, these aligners want to find the single best location. You can't have both fast alignments and very flexible ones. It is very unlikely that you could tweak these aligners to do what you want.
Maybe you could generate all possible mismatches with a program then look for exact matches with the aligner.
Thanks Albert! That is an idea. I would have to think about it though. It sounds that this would have a high complexity and I might gain time for the alignment but spend a lot of time generating all the possibilities. I would not know how to do this efficiently..
On a different note, I was looking at MegaBlast... would someone know this one. Could it help?
It says:
You are looking for 5 mismatches in 20 reads, that is 25% different, in sequencing that is actually a lot! plus it would be a local alignment, likely not what you are looking for.
As the manual says, bwa-mem is not intended for very short reads. With bwa-aln, you need something like "bwa aln -n5 -k5 -o1". It will be very slow, but for 500 sequences, it should be find. Beware that you are likely to get a huge pile of random hits with 5 mismatches in a 20bp sequence.
Check out the Bitap algorithm, it looks like this is what you need