I have a long list of very short sequences (6-10nt long) that have to be aligned against a very small part of a mammalian genome (2-5Mb).
The region is not contiguous but is derived from a few thousand smaller fragments. The index file/database is built from these individual fragments (not concatenated).
The problem is that mismatches + indels (up to two per sequence) are expected and I want to keep all the multimapped positions.
The aligner should be quite fast and as exhaustive as possible.
Already tried bwa, novoalign, blast, gem, bowtie2 and soap + dynamic programming (which works but is very slow).
All the tested aligners have different problems with this task.
So far bowtie2 seems to be the most thorough, but it takes significant time to run in local mode.
Any suggestions? Since the sequences are quite short, the aligner does not have to come only from the NGS cosmos.
Some further info based on the received comments:
The sequences are binding conformations and the index sequences are bona fide binding sites. Therefore there are no false positives and we don't expect unique alignments for each sequence.
The optimal aligner should return results residing in the best stratum: i.e. if there are 50 matches with no mismatches then there is no need to report results with e.g. 2 mismatches and an indel.
A large number of returned results is not a problem but is actually desired. Therefore the aligner should report all multimaps belonging to the same stratum.
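If the aligner itself won't restrict output to the best stratum, this is a trivial post-filter. A minimal hypothetical sketch in Python, assuming hits arrive as (query, reference, position, edit_distance) tuples from any exhaustive aligner:

    from collections import defaultdict

    # Keep, per query, only the hits in the best stratum:
    # the minimum edit distance observed for that query.
    def best_stratum(hits):
        kept = defaultdict(list)
        for query, ref, pos, dist in hits:
            current = kept[query]
            if not current or dist < current[0][3]:
                kept[query] = [(query, ref, pos, dist)]
            elif dist == current[0][3]:
                current.append((query, ref, pos, dist))
        return kept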
The identified sites will be filtered based on site properties, and the true sites can then be identified. However, we need the candidate regions to do the filtering.
We have found aligners that can partly solve the problem but some do not report all multimaps, others have issues with indels or are too slow.
We have started coding an in-house aligner for the task but it would save time if there is something ready out there.
--Edit2 based on comments
I would like to thank fellow Biostars users for trying to point us in another direction and help us avoid a mistake or a loss of time.
However what has been asked is exactly what we're looking for: "Does anyone know of a (fast) aligner that can handle (many) multimaps and indels for short sequences".
The indexed regions are not many and quite small. Even if a sequence maps in multiple positions, it's not a problem because they can be filtered afterwards. Nevertheless, what is important is to get those positions, despite their relatively large number. We're looking for an aligner that can manage such a task (despite its uniqueness) and that can do it as fast and as exhaustively as possible.
I don't know of a fast aligner that can solve this problem, but BBDuk can sort of solve it using kmer matching. It does not produce SAM output, but it can mask all locations in the reference that match your criteria (by changing them to lowercase or to a custom symbol), and you could later use a simple script to parse those locations. I'm not sure if this addresses your problem, but here is a worked example with two small files, the command, and its output.
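A minimal hypothetical stand-in (sequences invented for this sketch; kmask=lc lowercases matched bases and mm=f disables the middle-base wildcard, both per the BBDuk documentation):

genome.fa:

    >fragment1
    ACGTTTGGCCAAGGTTACGATCGATCAG

short.fa:

    >probe1
    TTACGATC

Then you could run this:

    bbduk.sh in=genome.fa ref=short.fa out=masked.fa k=8 kmask=lc mm=f

...and the output (masked.fa) would be:

    >fragment1
    ACGTTTGGCCAAGGttacgatcGATCAG

The masked locations could then be pulled out, with coordinates, by a short script along these lines (a hypothetical sketch, not part of BBDuk):

    #!/usr/bin/env python
    # Report each lowercase (masked) run in masked.fa as: name, 0-based start, end.
    import re

    def records(path):
        """Yield (name, sequence) for each record in a FASTA file."""
        name, seq = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if name is not None:
                        yield name, "".join(seq)
                    name, seq = line[1:].split()[0], []
                else:
                    seq.append(line)
        if name is not None:
            yield name, "".join(seq)

    for name, seq in records("masked.fa"):
        for m in re.finditer("[a-z]+", seq):
            print(name, m.start(), m.end())   # prints "fragment1 14 22" here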
You can set edist=2 to allow an edit distance of 2. It will do this very quickly and absolutely exhaustively, with no heuristics applied. If you want to be able to tell which features are being masked by which sequence, though, you'd have to run it once per sequence. It's fast enough that that won't really be a problem, though, just annoying.

Brian, BBDuk seems really impressive (for this issue and also for read decontamination). I will definitely check it out.
Is the edit distance penalty the same for mismatches and indels?
Yes, there's no penalty for either. But, actually, I need to correct what I said - I forgot that it does not currently handle more than 1 deletion. So, for edit distance 2, it handles:
1 mismatch
1 insertion
1 deletion
2 mismatches
1 mismatch + 1 deletion
1 mismatch + 1 insertion
2 insertions
1 insertion + 1 deletion
but not 2 deletions.
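Concretely, for a hypothetical reference 8-mer TTACGATC, edist=2 would also mask genome occurrences such as:

    TTACCATC     (1 mismatch)
    TTACGAATC    (1 insertion: one extra base)
    TTAGATC      (1 deletion: one base missing)
    TAACGATG     (2 mismatches)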
I might add 2-deletion support in the future, though, just for completeness. Incidentally, at edit distance 3, it can do 2 deletions and 1 insertion. I generally don't use edit distance anyway, just hamming distance, which is more relevant for decontamination of Illumina reads.
After a few days of testing, BBDuk seems to be the fastest solution, and it returns tons more results than anything else we've tried (which is a good thing). We get as many hits as with a classic dynamic programming approach.
The only issue with this solution is that you don't get where, and which, kmer of your larger input sequence binds (for instance, a 10nt input vs a 30nt sequence, with a 7-mer search). In some cases you get too many lowercase results.
Therefore I have to run a local alignment algorithm on the reduced search space after BBDuk, which increases run time. Is there a way to get, directly from BBDuk, the information of where a kmer has bound (and which one)?
The only way to get more information, currently, is to do multiple runs: e.g., make a set of all your 30bp sequences and run BBDuk with k=30; then make a set of 29bp sequences and run BBDuk with k=29, etc. This will at least eliminate the problem of excessive masking.
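Sketched as a shell loop (hypothetical file naming, with one query file per length; flags as in the example above):

    # One BBDuk pass per query length, so k always equals the query length.
    for k in 30 29 28 27; do
        bbduk.sh in=genome.fa ref=short_${k}nt.fa out=masked_${k}nt.fa k=${k} kmask=lc
    done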
I could modify BBDuk (or actually its successor, Seal) to produce sam files, or something similar, indicating which reference sequence hits which location. I hadn't really thought about it because you're using it backwards from the way I normally use it, but I'll look into it.
That's what we're doing at the moment (multiple runs) but it takes longer to run.
A SAM or SAM-type output would be more than ideal.
Any luck in adding SAM support to seal/BBDuk? :)
Not yet :) It's on my list!
Even for a 10-mer, my rough estimate is that you will see ~2000 random 2-diff hits in a 2Mbp region, and tens of thousands of false hits for a 6-mer. I would first try my filtering strategy on some quick-and-dirty mapping; if I saw no signal, there would be no point in seeking the best possible mapping.
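A rough back-of-envelope check of that estimate, counting mismatch-only neighbors (indels push the count higher):

    from math import comb

    k = 10
    # 10-mers within Hamming distance 2 of a given 10-mer: 1 + 3k + 9*C(k,2)
    neighbors = 1 + 3 * k + 9 * comb(k, 2)   # = 436
    positions = 2 * 2_000_000                # both strands of a 2 Mbp region
    print(positions * neighbors / 4 ** k)    # ~1663, on the order of ~2000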
As Brian says, it'd be easier to help here if you could explain the biological motivation behind this question.