Hello,
I have been trying to understand how bowtie2 performs when you align a sequence against a short reference sequence (say a MID, adapter or primer). I am aware that bowtie2 is more oriented towards aligning reads against really long sequences, hence the seed heuristic. However when parameters are set right, it seems to do well for short reference sequences too except in some couple of extreme cases which I am trying to understand.
The parameters I use in all the alignments below are:
--ma 4 --mp 1,1 --rdg 2,2 --rfg 2,2 --local --score-min C,1 -N 1 -L 3 -i C,2 -a
As you can see, the seed parameters are rather extreme, length of 3 and constant steps of 2. In order get all the possible alignments I have also set -a and a very low score, though not meant for practical use.
As an example of a working case:
Case1: When I align the read
@seq173-CATAGTAGTG
AAAACAAGTAGTGCTGGAGTTTATCAATGAAGACTTCAATTGGGCTGGAGTCGCTCAGGTATGGGAACATAGTAGTG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
against reference CATAGTAGTG, I get the two alignments I look for. One is an exact match at the end, there for purposes of control. The second highest scoring is the one at the beginning with a deletion at the third position that I am looking for.
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSMMMMMMMMMM
SSSSMMDMMMMMMMSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
Case2: I remove the starting AAAA so the read is
@seq173-CATAGTAGTG
CAAGTAGTGCTGGAGTTTATCAATGAAGACTTCAATTGGGCTGGAGTCGCTCAGGTATGGGAACATAGTAGTG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
The reference sequence is the same. In this case I only get the first alignment correct. Despite having set the score to 1, I don't even get this alignment anywhere in the list of all alignments that bowtie2 produces. Despite this seeming like a seeding issue, I am having difficulty to understand how, since my seed lengths are very short and steps too. I have tried different options such as '-N 0 -L 1 -i C,1' or '-N 1 -L 2 -i C,1' non seems to help. Still since this seems like a seed issue I have also tried increasing parameters relating to seeding such as -D 100 -R 10. Although this should be an issue on how alignment penalties are set (since two reads should have the same alignment scores), I have tried setting gap open penalties to 0 as well and that does not help. I have also tried tinkering with the number buffer A's at the beginnig. It would seem like If it is 2 or less then I dont get the second alignment. If it is 3 or more I do.
What is the thing I am not understanding correctly here?
Thanks, I presume any aligner which basically does not use seeding (as in the terminology of bowtie) but instead searches the whole sequence for best possible sequence would be optimal? One such aligner that I know is edlib however it is not geared towards bioinformatics and therefore does not allow you to consider different weights for insertions, gaps or substitutions. It does however find the minimal edit distance alignment quite fast. Are there any those you could recommend which are "fast" (I guess they would somehow moving frame versions of the classical alignment algorithms such as NW etc). Thanks
I recommend framing the issue to describe the final goals you want to achieve - not how to make a tool do something.
If people understood the actual task, then would be able to comment on tools and their suitability than working backward as a troubleshooting task.
Well I want to basically identifty MIDs in sequences that come from illumina. By identify I mean, not only to tell which MIDs are in a sequence but also to get alignments.
I am aware that there are tools that already to the first part to various levels of accuracy (those that do it by exact match or those that allow subs, insertions, deletions etc). However I want a bit more finer control over this process in the sense that I want to get the alignments and not just the end result of demultiplexed files. Once I have the alignments, I generate reports and can diagnose problems which arise from time to time and requires a closer look at how well the MIDs align (for instance by doing this I had realised that the partly demultiplexed reads from the sequencer clips the first letter of some MIDs about %10 of the time which is discarded by demultiplexers that require exact matches).
Currently I am planning to use edlib, which only calculates edit distances and find alignments. However I would like to test being able to give different weights to match, substitutions, read and reference indels. That was the point why I initially went for bowtie2 which is very flexible in terms of these weights.
Thanks
I suspected that your problem is about something else. See how with Bowtie, you are aligning the entire read to a huge genome - but what you are really interested in aligning just short UMIs relative to one another. A whole different problem, for which Bowtire is not suited at all.
If you are restricting yourself to really short, similar sequences - then standard optimal methods will work really well - their problem was always scaling when mapping billions of much longer reads to whole genomes.
The exact solution would vary, but I would recommend first removing exact matches the study. That way you only need to deal with dissimilar barcodes. Then, I would say running an optimal aligner for those short sequences, I think it will operate faster than a bowtie lookup.
That was indeed my first approach, remove exact matches and then those with only substitutions (which are still relatively fast to identify). Then there remains only those with indels and gaps which are relatively few in percentage. But I have realized that Bowtie2 can identify these (exact matches or those with no indels) with complete accuracy when you set the seeding parameters precise enough. And it is still quite fast.
For the remaining, i.e those with indels, Bowtie2 actually does OK when the parameters are set correctly but then there are these occasional weird misses which can not be fixed and annoying hence why I was looking for something else. edlib is blazing fast and looks like it is optimal, though less flexibility in terms of scoring weights. I guess I will stick to that since there does not seem to be any other suggestions for now.
no there are pleny of fast optimal aligners (just that none are suited for large sequences)
pairwise2 in BioPython is fast: https://biopython.org/docs/1.76/api/Bio.pairwise2.html
Parasail: https://github.com/jeffdaily/parasail
and some I don't pop in mind at this moment. That being said edlib might be the best bet.
Thanks I did use pairwise2 before and agreed that it relatively fast. Though edlib is much faster to the extend that it would not require me to do exact matches separately to decrease the number of sequences. I will have a look at parasail too, if it is fast enough I think it can provide the flexibility that I need in terms of weights and such.
Oh I just remebered a tool I always wanted to try that might be suited for your needs:
https://seq-lang.org/
A python-like language that supposedly compiles to native-like performance:
https://docs.seq-lang.org/cookbook.html#sequence-alignment
supposedly it offers built-in parallelism as well.