Before analysing sequences from a DNA-encoded library (DEL) screen, I first need to filter out unwanted sequences that do not contain the sequences we predefined. Is there a tool that can do this?
I browsed the common alignment tools and found that most of them are designed to align a short sequence to a large reference (e.g. a human genome), which does not suit my case.
My requirement:
I have a reference sequence A, e.g. "ATCGCCG(10)CCGATTG", in which the first 7 bases are the start primer and the last 7 bases are the closing primer. The "(10)" stands for ten arbitrary bases that encode the tags.
A query sequence B, e.g. "AAAAATCGCCGTTTTTTTTTTCCGATTGAAAA", should be aligned to the reference sequence A in the following way:
A:     ATCGCCGNNNNNNNNNNCCGATTG
       ||||||||||||||||||||||||
B: AAAAATCGCCGTTTTTTTTTTCCGATTGAAAA
Since "N" is an ambiguous base that can match any base, A aligns completely to B, so we can score it 100%.
If some bases cannot be matched to the reference sequence, the score is penalized:
A:     ATCGCCGNNNNNNNNNNCCGATTG
       ||*|||||||||||||||||||||
C: AAAAATGGCCGTTTTTTTTTTCCGATTGAAAA
In this case, the score should only be 23/24 (23 of the 24 reference positions match), since the 7th base "G" of C cannot be matched to A.
Compared with existing toolkits for DEL sequence analysis that use regex-like methods, I think preprocessing the query sequences by alignment can recover more sequences and make the frequency analysis more precise. So, are there any tools that can do this?
Do you want to allow gaps? Ungapped alignment can be done really quickly since it's O(n) not O(n^2). It only takes a handful of clock cycles if you 4-bit encode each base and AVX vectorise the comparisons. If you only have a few pre-defined sequences then this may well be the fastest way.
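To make the ungapped idea concrete, here is a minimal pure-Python sketch, not a tuned implementation. It slides the reference along the read and, at each offset, counts the positions that agree, treating "N" in the reference as matching anything. The function name `best_ungapped_match` and the returned tuple are made up for illustration, and this naive version checks every offset, so its cost grows with read length times reference length.

```python
def best_ungapped_match(reference, query):
    """Slide `reference` along `query` without gaps.

    Returns (offset, matches, score) for the best-scoring offset, where
    score = matches / len(reference). "N" in the reference matches any base.
    Returns (-1, -1, 0.0) if the query is shorter than the reference.
    """
    ref_len = len(reference)
    best = (-1, -1, 0.0)
    for offset in range(len(query) - ref_len + 1):
        window = query[offset:offset + ref_len]
        # Count positions where the reference is N or equals the read base.
        matches = sum(1 for r, q in zip(reference, window) if r == "N" or r == q)
        if matches > best[1]:
            best = (offset, matches, matches / ref_len)
    return best


# Reference from the question: 7-base primer, 10 tag bases, 7-base primer.
reference = "ATCGCCG" + "N" * 10 + "CCGATTG"
query_b = "AAAA" + "ATCGCCG" + "T" * 10 + "CCGATTG" + "AAAA"
query_c = "AAAA" + "ATGGCCG" + "T" * 10 + "CCGATTG" + "AAAA"

print(best_ungapped_match(reference, query_b))
print(best_ungapped_match(reference, query_c))
```

Reads could then be kept or discarded by comparing the score against a threshold of your choosing, and the tag extracted from the best offset.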
Edit: you want to 4-bit encode your bases instead of 2-bit as you want to match with Ns. If your reads are always ACGT, then you can compare up to 128 bases (AVX-512) with just a bitwise & and a popcnt.
I've no idea about AVX. Anyhow, thank you very much for your suggestion. In the end I defined a score matrix and gave N-X (where X is A, C, T or G) the same score as X-X.
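For anyone else who has not used AVX, the 4-bit trick can be illustrated without any SIMD at all. The sketch below uses Python's arbitrary-size integers in place of vector registers: each base becomes a one-hot nibble, N sets all four bits, and a single AND followed by a bit count gives the number of matching positions. The encoding table and the names `pack` and `count_matches` are assumptions chosen for this example, and it assumes the read window contains only A, C, G and T and has the same length as the reference.

```python
# One-hot nibble per base; N sets all four bits so it overlaps every base.
CODE = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000, "N": 0b1111}


def pack(seq):
    """Pack a sequence into one integer, 4 bits per base."""
    value = 0
    for base in seq:
        value = (value << 4) | CODE[base]
    return value


def count_matches(reference, window):
    """Count matching positions between two equal-length sequences.

    Assumes `window` holds only A/C/G/T, so each position contributes
    at most one set bit after the AND.
    """
    if len(reference) != len(window):
        raise ValueError("reference and window must have the same length")
    return bin(pack(reference) & pack(window)).count("1")


reference = "ATCGCCG" + "N" * 10 + "CCGATTG"
perfect = "ATCGCCG" + "T" * 10 + "CCGATTG"
one_off = "ATGGCCG" + "T" * 10 + "CCGATTG"

print(count_matches(reference, perfect), count_matches(reference, one_off))
```

In a real pipeline the reference would be packed once and reused for every read; a vectorised version does the same AND and popcount on whole registers.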