I have a long list of very short sequences (6-10nt long) that have to be aligned against a very small part of a mammalian genome (2-5Mb).
The region is not contiguous but is derived from a few thousand smaller fragments. The index file/database is built from these individual fragments (not concatenated).
The problem is that mismatches + indels (up to two per sequence) are expected and I want to keep all the multimapped positions.
The aligner should be quite fast and as exhaustive as possible.
Already tried bwa, novoalign, blast, gem, bowtie2 and soap + dynamic programming (which works but is very slow).
All the tested aligners have different problems with this task.
So far bowtie2 seems to be the most thorough, but it takes significant time to run in local mode.
Any suggestions? Since the sequences are quite short, the aligner does not have to come only from the NGS cosmos.
Some further info based on the received comments:
The sequences are binding conformations and the index sequences are bona fide binding sites. Therefore there are no false positives and we don't expect unique alignments for each sequence.
The optimal aligner should return results residing in the best stratum: i.e. if there are 50 matches with no mismatches then there is no need to report results with e.g. 2 mismatches and an indel.
A large number of returned results is not a problem but is actually desired. Therefore the aligner should report all multimaps belonging to the same stratum.
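If the aligner itself won't restrict output to the best stratum, this is a trivial post-filter. A minimal hypothetical sketch in Python, assuming hits arrive as (query, reference, position, edit_distance) tuples from any exhaustive aligner:

    from collections import defaultdict

    # Keep, per query, only the hits in the best stratum:
    # the minimum edit distance observed for that query.
    def best_stratum(hits):
        kept = defaultdict(list)
        for query, ref, pos, dist in hits:
            current = kept[query]
            if not current or dist < current[0][3]:
                kept[query] = [(query, ref, pos, dist)]
            elif dist == current[0][3]:
                current.append((query, ref, pos, dist))
        return kept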
The identified sites will be filtered based on site properties, and the true sites can then be identified. However, we need the candidate regions to do the filtering.
We have found aligners that can partly solve the problem but some do not report all multimaps, others have issues with indels or are too slow.
We have started coding an in-house aligner for the task but it would save time if there is something ready out there.
--Edit2 based on comments
I would like to thank fellow Biostars users for trying to point us in another direction and help us avoid a mistake or a loss of time.
However what has been asked is exactly what we're looking for: "Does anyone know of a (fast) aligner that can handle (many) multimaps and indels for short sequences".
The indexed regions are not many and quite small. Even if a sequence maps in multiple positions, it's not a problem because they can be filtered afterwards. Nevertheless, what is important is to get those positions, despite their relatively large number. We're looking for an aligner that can manage such a task (despite its uniqueness) and that can do it as fast and as exhaustively as possible.
I don't know of a fast aligner that can solve this problem, but BBDuk can sort of solve it using kmer matching. It does not produce SAM output, but it can mask all locations in the reference that match your criteria (by changing them to lowercase or to a custom symbol), and you could later use a simple script to parse those locations. I'm not sure if this addresses your problem, but here is a worked example with two small files, the command, and its output.
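A minimal hypothetical stand-in (sequences invented for this sketch; kmask=lc lowercases matched bases and mm=f disables the middle-base wildcard, both per the BBDuk documentation):

genome.fa:

    >fragment1
    ACGTTTGGCCAAGGTTACGATCGATCAG

short.fa:

    >probe1
    TTACGATC

Then you could run this:

    bbduk.sh in=genome.fa ref=short.fa out=masked.fa k=8 kmask=lc mm=f

...and the output (masked.fa) would be:

    >fragment1
    ACGTTTGGCCAAGGttacgatcGATCAG

The masked locations could then be pulled out, with coordinates, by a short script along these lines (a hypothetical sketch, not part of BBDuk):

    #!/usr/bin/env python
    # Report each lowercase (masked) run in masked.fa as: name, 0-based start, end.
    import re

    def records(path):
        """Yield (name, sequence) for each record in a FASTA file."""
        name, seq = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if name is not None:
                        yield name, "".join(seq)
                    name, seq = line[1:].split()[0], []
                else:
                    seq.append(line)
        if name is not None:
            yield name, "".join(seq)

    for name, seq in records("masked.fa"):
        for m in re.finditer("[a-z]+", seq):
            print(name, m.start(), m.end())   # prints "fragment1 14 22" here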
You can set edist=2 to allow an edit distance of 2. It will do this very quickly and absolutely exhaustively, with no heuristics applied. If you want to be able to tell which features are being masked by which sequence, though, you'd have to run it once per sequence. It's fast enough that that won't really be a problem, though, just annoying.

Brian, BBDuk seems really impressive (for this issue and also for read decontamination). I will definitely check it out.
Is the edit distance penalty the same for mismatches and indels?
Yes, there's no penalty for either. But, actually, I need to correct what I said - I forgot that it does not currently handle more than 1 deletion. So, for edit distance 2, it handles:
1 mismatch
1 insertion
1 deletion
2 mismatches
1 mismatch + 1 deletion
1 mismatch + 1 insertion
2 insertions
1 insertion + 1 deletion
but not 2 deletions.
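Concretely, for a hypothetical reference 8-mer TTACGATC, edist=2 would also mask genome occurrences such as:

    TTACCATC     (1 mismatch)
    TTACGAATC    (1 insertion: one extra base)
    TTAGATC      (1 deletion: one base missing)
    TAACGATG     (2 mismatches)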
I might add 2-deletion support in the future, though, just for completeness. Incidentally, at edit distance 3, it can do 2 deletions and 1 insertion. I generally don't use edit distance anyway, just hamming distance, which is more relevant for decontamination of Illumina reads.
After a few days of testing, BBDuk seems to be the fastest solution, and it returns tons more results than anything else we've tried (which is a good thing). We get as many hits as with a classic dynamic programming approach.
The only issue with this solution is that you don't get where, and which, kmer of your larger input sequence binds (for instance, a 10nt input vs a 30nt sequence, with a 7-mer search). In some cases you get too many lowercase results.
Therefore I have to run a local alignment algorithm on the reduced search space after BBDuk, which increases run time. Is there a way to get, directly from BBDuk, the information of where a kmer has bound (and which one)?
The only way to get more information, currently, is to do multiple runs: e.g., make a set of all your 30bp sequences and run BBDuk with k=30; then make a set of 29bp sequences and run BBDuk with k=29, etc. This will at least eliminate the problem of excessive masking.
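Sketched as a shell loop (hypothetical file naming, with one query file per length; flags as in the example above):

    # One BBDuk pass per query length, so k always equals the query length.
    for k in 30 29 28 27; do
        bbduk.sh in=genome.fa ref=short_${k}nt.fa out=masked_${k}nt.fa k=${k} kmask=lc
    done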
I could modify BBDuk (or actually its successor, Seal) to produce sam files, or something similar, indicating which reference sequence hits which location. I hadn't really thought about it because you're using it backwards from the way I normally use it, but I'll look into it.
That's what we're doing at the moment (multiple runs) but it takes longer to run.
A SAM or SAM-type output would be more than ideal.
Any luck in adding SAM support to seal/BBDuk? :)
Not yet :) It's on my list!
Even for a 10-mer, my rough estimate is that you will see ~2000 random 2-diff hits in a 2Mbp region, and tens of thousands of false hits for a 6-mer. I would first try my filtering strategy on some quick-and-dirty mapping; if I saw no signal, there would be no point in seeking the best possible mapping.
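A rough back-of-envelope check of that estimate, counting mismatch-only neighbors (indels push the count higher):

    from math import comb

    k = 10
    # 10-mers within Hamming distance 2 of a given 10-mer: 1 + 3k + 9*C(k,2)
    neighbors = 1 + 3 * k + 9 * comb(k, 2)   # = 436
    positions = 2 * 2_000_000                # both strands of a 2 Mbp region
    print(positions * neighbors / 4 ** k)    # ~1663, on the order of ~2000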
As Brian says, it'd be easier to help here if you could explain the biological motivation behind this question.