Are there any tools that can score the alignment of two short sequences while allowing ambiguous bases?
6.5 years ago
Zealseeker • 0

Before analyzing sequences from DNA-encoded library (DEL) screening, I first need to filter out unwanted sequences that do not contain the sequences we predefined. Are there any tools that can do this?

I browsed the common alignment tools and found that most of them are designed to align a short sequence to a large one (such as a human gene), which does not suit my case.

My requirement:

I have a reference sequence, e.g. "ATCGCCG(10)CCGATTG", in which the first 7 bases are the start primer and the last 7 bases are the closing primer. The "(10)" stands for ten arbitrary bases that encode the tags.

A query sequence B, e.g. "AAAAATCGCCGTTTTTTTTTTCCGATTGAAAA", should be aligned to the reference sequence A as follows:

A:     ATCGCCGNNNNNNNNNNCCGATTG
       ||||||||||||||||||||||||
B: AAAAATCGCCGTTTTTTTTTTCCGATTGAAAA

Since "N" is an ambiguous base that can stand for any base, A aligns completely to B, and we can score the alignment as 100%.

If some bases cannot be matched to the reference sequence, the alignment is penalized:

A:     ATCGCCGNNNNNNNNNNCCGATTG
       ||*|||||||||||||||||||||
C: AAAAATGGCCGTTTTTTTTTTCCGATTGAAAA

In this case, the score should be only 23/24 (23 of the 24 aligned positions match), since the 7th base "G" of C cannot be matched to A.

Compared with existing DEL sequence analysis toolkits that use regex-like methods, I think using alignment to preprocess the query sequences would let us analyze more sequences and make the frequency analysis more precise. So, are there any tools that can do this?
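For contrast, the regex-like approach mentioned here might look like the sketch below. The primers and tag length come from the example in this post; the name `extract_tag` is hypothetical and is not taken from any existing DEL toolkit.

```python
import re

# Start primer, a 10-base tag (captured), then the closing primer.
PATTERN = re.compile(r"ATCGCCG([ACGT]{10})CCGATTG")

def extract_tag(read):
    """Return the 10-base tag if both primers match exactly, else None."""
    found = PATTERN.search(read)
    return found.group(1) if found else None
```

With this pattern, sequence B yields its tag, while sequence C is discarded entirely because of a single primer mismatch. That all-or-nothing behavior is what a scored alignment is meant to avoid.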

alignment sequence gene • 1.3k views

Do you want to allow gaps? Ungapped alignment can be done really quickly since it's O(n) not O(n^2). It only takes a handful of clock cycles if you 4-bit encode each base and AVX vectorise the comparisons. If you only have a few pre-defined sequences then this may well be the fastest way.

Edit: you want to 4-bit encode your bases instead of 2-bit, because you need to match against Ns. If your reads are always ACGT, then you can compare up to 128 bases (AVX-512) with just a bitwise & and a popcnt.
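The idea can be sketched in plain Python, without AVX, by packing each sequence into one integer at 4 bits per base. This is only an illustration of the encoding, not a fast implementation: the bit assignments and the names `pack` and `best_ungapped_matches` are assumptions, and the read is assumed to contain only A, C, G and T.

```python
# One bit per base; N has all four bits set, so it overlaps with any base.
CODE = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000, "N": 0b1111}

def pack(seq):
    """Pack a sequence into one integer, 4 bits per base (first base lowest)."""
    value = 0
    for i, base in enumerate(seq):
        value |= CODE[base] << (4 * i)
    return value

def best_ungapped_matches(reference, read):
    """Return (matches, offset) for the best ungapped placement in the read.

    Returns (-1, -1) if the read is shorter than the reference.
    """
    ref_bits = pack(reference)
    read_bits = pack(read)
    window_mask = (1 << (4 * len(reference))) - 1
    best = (-1, -1)
    for offset in range(len(read) - len(reference) + 1):
        # Cut out the part of the read that lies under the reference.
        window = (read_bits >> (4 * offset)) & window_mask
        # Each matching position leaves exactly one bit set after the AND.
        matches = bin(ref_bits & window).count("1")
        if matches > best[0]:
            best = (matches, offset)
    return best
```

For the reference "ATCGCCGNNNNNNNNNNCCGATTG", sequence B from the question gives `(24, 4)` and sequence C gives `(23, 4)`, i.e. 24 and 23 matching positions out of 24 at offset 4.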


I have no idea about AVX, but thank you very much for your suggestion. In the end, I defined a custom score matrix and gave N-X (where X is A, C, T or G) the same score as X-X.

6.5 years ago

You should be able to do this with blastn setting the window size and other parameters appropriately or maybe with the option -task blastn-short. Otherwise, you can always use either the Needleman-Wunsch or the Smith-Waterman alignment algorithms. They are available as needle and water in the EMBOSS suite.


Hi, thank you for your reply. I am going to use water, but I don't know how to set a "known gap", i.e. the placeholder for the ambiguous bases. As described in the post, the tags are arbitrary, so I don't want them to be considered during scoring. Here is the result of my attempt, which does not look good:

EMBOSS_001         1 ATCGCCGNNNNNNNNNN----------CCGATTG     24
                     |||||||                    |||||||
EMBOSS_001         5 ATCGCCG----------TTTTTTTTTTCCGATTG     28

You need to set your scoring matrix to consider A/C/G/T as a match to N or at the very least, no mismatch penalty. You'll need to consult the documentation of your alignment tools and not all of them support this.
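To show the effect of such a matrix without depending on a particular tool, here is a minimal pure-Python Smith-Waterman sketch in which N scores as a match against any base. The match, mismatch and gap values are illustrative assumptions, and the linear gap penalty is a simplification of the affine scheme that tools such as water use.

```python
def score_pair(a, b, match=5, mismatch=-4):
    """Substitution score: N matches anything, like a customised matrix."""
    if a == "N" or b == "N" or a == b:
        return match
    return mismatch

def smith_waterman_score(ref, query, gap=-10):
    """Best local alignment score, using a linear gap penalty."""
    prev = [0] * (len(query) + 1)  # previous row of the DP table
    best = 0
    for a in ref:
        curr = [0]
        for j, b in enumerate(query, start=1):
            cell = max(
                0,                               # start a new local alignment
                prev[j - 1] + score_pair(a, b),  # align a with b
                prev[j] + gap,                   # gap in the query
                curr[j - 1] + gap,               # gap in the reference
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

With these values, the 24-base reference scores 120 against sequence B (24 matches) and 111 against sequence C (23 matches and one mismatch), and the N positions no longer force the gaps seen in the water output above.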


Yes. water uses the DNAFULL matrix by default for nucleotides; a custom one can be supplied with the option -datafile matrixf.
