Question

Ultrafast alignment of short sequences to a genome (with up to 5 mismatches!)

1

Entering edit mode

10.1 years ago

Aurelie MLB ▴ 360

Hello,

I would like to be able to align around 500 short sequences (~20 nucleotides) to the human genome in a super quick way.

I would like to authorize up to 5 mismatches but no gaps. I also would like to retrieve all possible hits for one sequence... not just the best ones. Then, I would like to be able to annotate the hits.

I will need to repeat this work very often.

I investigated the NGS tools like bowtie or BWA to see if I could tweak them but it seems that the number of mismatches are limited to 3 and 1 respectively...this is a pity because it would have been great to have the results in a SAM format to do the follow up annotation step!

Could anyone advise me one a very fast aligner to do what I would like please?

Many thanks!

genome alignment • 4.7k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Aurelie MLB ▴ 360

0

Entering edit mode

did you play with the options of bwa ?

       -B INT        penalty for a mismatch [4]
(...)

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I got curious about this... Actually it seems that persuading bwa mem to align very short reads it's not easy... Here's a little test I performed on hg19 genome. One read is 20bp with no mismatches, the other has just 1 mismatch. I can't get bwa mem to align the second read!

mem genome.fa oligos.fa -T 0 -B 1
[M::main_mem] read 2 sequences (40 bp)...
[M::mem_process_seqs] Processed 2 reads in 0.000 CPU sec, 0.003 real sec
no_mism    0    chrM    1    6    20M    *    0    0    GATCACAGGTCTATCACCCT    *    NM:i:0    MD:Z:20    AS:i:20    XS:i:0
one    4    *    0    0    *    *    0    0    GCTCACAGGTCTATCACCCT    *    AS:i:0    XS:i:0

(the second read has 1 mismatch on the second base)

I set -B and -T to be very permissive but I had no luck in aligning both reads! (I played a bit also with -w, no better)

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by dariober 15k

0

Entering edit mode

Thanks for trying Dariober!

I do not know well BWA so I do not get the out put very well... but could it be that he only returns the best alignment?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Aurelie MLB ▴ 360

0

Entering edit mode

Yes, by default bwa returns the best alignment and in case of ties returns one of the best hits at random. The point here is that the second read is reported as unmapped while I expected to be mapped at the same location as the first one.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by dariober 15k

0

Entering edit mode

bwa is not a generic aligner, it was not designed to work for very short reads, bowtie does however work to some extent for miRNA sized reads (18-24bp)

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Istvan Albert 101k

0

Entering edit mode

Indeed, I just thought with some massaging of the parameters it could be done... But as I said below, an NGS aligner might not be the best choice...

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by dariober 15k

0

Entering edit mode

Hello

Thanks for answering!

Would you have an idea of how to play with this parameter please? I do not see how I could control the number of mismatches from this...

Thanks a lot

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Aurelie MLB ▴ 360

0

Entering edit mode

Maybe NGS tools are not the way to go...

They just sound interesting in the first place. But apparently they are not designed to provide all possible alignment hits for a given sequence... (?)

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Aurelie MLB ▴ 360

0

Entering edit mode

I was going to suggest that NGS aligners are not the best choice... As a side note, keep in mind that allowing 5 mismatches in 20bp and asking for all the hits is probably going to generate millions of records in output.

One option I would try is to use a general alignment tool like vmatch.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by dariober 15k

0

Entering edit mode

The problem is that the more mismatches you allow in a short read the more diverse the read becomes and more locations it can align to. That is not what most fast mappers were designed to do, these aligners want to find the single best location. You can't have both fast alignments and very flexible ones. It is very unlikely that you could tweak these aligners to do what you want.

Maybe you could generate all possible mismatches with a program then look for exact matches with the aligner.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Istvan Albert 101k

0

Entering edit mode

Thanks Albert! That is an idea. I would have to think about it though. It sounds that this would have a high complexity and I might gain time for the alignment but spend a lot of time generating all the possibilities. I would not know how to do this efficiently..

On a different note, I was looking at MegaBlast... would someone know this one. Could it help?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Aurelie MLB ▴ 360

0

Entering edit mode

It says:

Megablast: This program is optimized for aligning sequences that differ slightly

You are looking for 5 mismatches in 20 reads, that is 25% different, in sequencing that is actually a lot! plus it would be a local alignment, likely not what you are looking for.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Istvan Albert 101k

0

Entering edit mode

Ok. Thanks for your answer! I will look for something else then. I am so surprised that nothing seems to exist to do that. It does not feel like something very special to do...

ADD REPLY • link 10.1 years ago by Aurelie MLB ▴ 360

0

Entering edit mode

As the manual says, bwa-mem is not intended for very short reads. With bwa-aln, you need something like "bwa aln -n5 -k5 -o1". It will be very slow, but for 500 sequences, it should be find. Beware that you are likely to get a huge pile of random hits with 5 mismatches in a 20bp sequence.