Question

Are there any tools that can score the alignment of two short sequences while allowing ambiguous bases?

0

7.2 years ago

Zealseeker • 0

Before analyzing sequences from a DNA-encoded library (DEL) screen, I first need to filter out unwanted sequences that do not contain the sequences we predefined. Are there any tools that can do this?

I browsed the common alignment tools and found that most of them are meant for aligning a short sequence to a large one (e.g. a human gene), which does not suit my case.

My requirement:

I have a reference sequence A, e.g. "ATCGCCG(10)CCGATTG", in which the first 7 bases are the start primer and the last 7 bases are the closing primer. The "(10)" stands for ten arbitrary bases that encode the tags.

A query sequence B, e.g. "AAAAATCGCCGTTTTTTTTTTCCGATTGAAAA", is supposed to be aligned to the reference sequence A as follows:

A:     ATCGCCGNNNNNNNNNNCCGATTG
       ||||||||||||||||||||||||
B: AAAAATCGCCGTTTTTTTTTTCCGATTGAAAA

Since "N" is an ambiguous base that can stand for any base, A aligns to B completely. We can score it 100%.

If some bases do not match the reference sequence, the score is penalized:

A:     ATCGCCGNNNNNNNNNNCCGATTG
       ||*|||||||||||||||||||||
C: AAAAATGGCCGTTTTTTTTTTCCGATTGAAAA

In this case, the score should only be 23/24, since the 7th base "G" of C does not match A and the other 23 of the 24 aligned positions do.
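The scoring described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming no gaps are allowed and that "N" in the reference matches any base; the function name `best_ungapped_match` is made up for this example. It slides the reference along the query and reports the best number of matches out of the 24 reference positions.

```python
def best_ungapped_match(reference, query):
    """Slide reference along query; return (matches, offset) of the best window.

    'N' in the reference matches any base. No gaps are allowed.
    Returns (-1, -1) if the query is shorter than the reference.
    """
    n = len(reference)
    best = (-1, -1)
    for offset in range(len(query) - n + 1):
        window = query[offset:offset + n]
        matches = sum(1 for r, q in zip(reference, window) if r == "N" or r == q)
        if matches > best[0]:
            best = (matches, offset)
    return best


# Reference and queries taken from the question
reference = "ATCGCCG" + "N" * 10 + "CCGATTG"
b = "AAAAATCGCCGTTTTTTTTTTCCGATTGAAAA"
c = "AAAAATGGCCGTTTTTTTTTTCCGATTGAAAA"

for name, query in (("B", b), ("C", c)):
    matches, offset = best_ungapped_match(reference, query)
    print(name, "offset", offset, "score", f"{matches}/{len(reference)}")
```

A threshold on the fraction of matches could then serve as the filter, which tolerates a sequencing error in a primer where an exact regex would not.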

Compared with some existing toolkits for DEL sequence analysis that use regex-like methods, I think preprocessing the query sequences by alignment would let us analyze more sequences and make the frequency analysis more precise. So, are there any tools that can do this?

alignment sequence gene • 1.6k views

ADD COMMENT • link updated 7.2 years ago by Jean-Karim Heriche 27k • written 7.2 years ago by Zealseeker • 0

0

Do you want to allow gaps? Ungapped alignment can be done really quickly since it's O(n) not O(n^2). It only takes a handful of clock cycles if you 4-bit encode each base and AVX vectorise the comparisons. If you only have a few pre-defined sequences then this may well be the fastest way.

Edit: you want to 4-bit encode your bases rather than 2-bit, because you need to match against Ns. If your reads are always ACGT, then you can compare up to 128 bases at once (AVX-512) with just a bitwise & and a popcnt.
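To make the bit trick concrete, here is a rough Python sketch of the same logic using arbitrary-precision integers instead of AVX registers. It is not vectorised and will not be fast; it only shows why one bit per base plus an all-ones N lets a single AND followed by a popcount count the matches. The names `CODE`, `pack` and `count_matches` are made up for this example.

```python
# One bit per base; N has all four bits set, so it overlaps any base.
CODE = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000, "N": 0b1111}


def pack(seq):
    """Pack a sequence into a single Python int, 4 bits per base."""
    value = 0
    for base in seq:
        value = (value << 4) | CODE[base]
    return value


def count_matches(reference, window):
    """Count matching positions of two equal-length sequences via AND + popcount.

    Assumes the window contains only A/C/G/T, so each 4-bit group of the AND
    has at most one bit set and the popcount equals the number of matches.
    """
    if len(reference) != len(window):
        raise ValueError("sequences must have the same length")
    return bin(pack(reference) & pack(window)).count("1")


reference = "ATCGCCG" + "N" * 10 + "CCGATTG"
print(count_matches(reference, "ATCGCCGTTTTTTTTTTCCGATTG"))
print(count_matches(reference, "ATGGCCGTTTTTTTTTTCCGATTG"))
```

If the reads themselves can contain N, the popcount over-counts at those positions, so they would need separate handling.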

ADD REPLY • link 7.2 years ago by d-cameron ★ 3.0k

0

I have no idea about AVX, but thank you very much for your suggestion. In the end I defined a score matrix in which N-X (X being A, C, T or G) gets the same score as X-X.

ADD REPLY • link 7.2 years ago by Zealseeker • 0

score 1 · Answer 1 · 2018-05-15

1

7.2 years ago

Jean-Karim Heriche 27k

You should be able to do this with blastn setting the window size and other parameters appropriately or maybe with the option -task blastn-short. Otherwise, you can always use either the Needleman-Wunsch or the Smith-Waterman alignment algorithms. They are available as needle and water in the EMBOSS suite.

ADD COMMENT • link 7.2 years ago by Jean-Karim Heriche 27k

0

Hi, thank you for your reply. I am going to use water, but I don't know how to set a "known gap", i.e. the placeholder for ambiguous bases. As described in the post, the tags are arbitrary, so I don't want them to be considered during scoring. Below is the result I got, which does not look right:

EMBOSS_001         1 ATCGCCGNNNNNNNNNN----------CCGATTG     24
                     |||||||                    |||||||
EMBOSS_001         5 ATCGCCG----------TTTTTTTTTTCCGATTG     28

ADD REPLY • link 7.2 years ago by Zealseeker • 0

2

You need to set your scoring matrix to treat A/C/G/T as a match to N or, at the very least, to apply no mismatch penalty. Consult the documentation of your alignment tool, as not all of them support this.
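The effect of such a matrix can be illustrated with a small Smith-Waterman scorer in Python. This is a sketch, not what water does internally: it uses a linear gap penalty, returns only the best local score (no traceback), and the values match=5, mismatch=-4 and gap=-10 are illustrative assumptions. The point is that once N scores as a match, the ungapped alignment through the tag region wins over the gapped one.

```python
def substitution_score(a, b, match=5, mismatch=-4):
    """Score a pair of bases; N on either side counts as a match."""
    if a == "N" or b == "N" or a == b:
        return match
    return mismatch


def smith_waterman_score(seq1, seq2, gap=-10):
    """Best local alignment score with a linear gap penalty (no traceback)."""
    prev = [0] * (len(seq2) + 1)
    best = 0
    for a in seq1:
        curr = [0]
        for j, b in enumerate(seq2, start=1):
            score = max(
                0,                                       # start a new local alignment
                prev[j - 1] + substitution_score(a, b),  # align a with b
                prev[j] + gap,                           # gap in seq2
                curr[j - 1] + gap,                       # gap in seq1
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


reference = "ATCGCCG" + "N" * 10 + "CCGATTG"
b = "AAAAATCGCCGTTTTTTTTTTCCGATTGAAAA"
c = "AAAAATGGCCGTTTTTTTTTTCCGATTGAAAA"
print(smith_waterman_score(reference, b))  # all 24 positions match
print(smith_waterman_score(reference, c))  # one mismatch in the start primer
```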

ADD REPLY • link 7.2 years ago by d-cameron ★ 3.0k

1

Yes. water uses the DNAFULL matrix by default for nucleotides; a custom matrix file can be supplied with the -datafile option.
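As a starting point for a custom matrix, the following Python sketch prints a small whitespace-separated table in which any pair involving N scores as a match. The function name, the score values and the restriction to the symbols A, C, G, T and N are assumptions made for this example; the layout (a header row of symbols, then one row per symbol) is modelled on the default matrix file shipped with EMBOSS, so compare the output with that file before relying on it, since water may expect the other ambiguity codes to be present as well.

```python
def build_matrix_text(match=5, mismatch=-4, symbols="ACGTN"):
    """Return a substitution matrix as text; any pair involving N scores as a match."""
    lines = ["# Hypothetical matrix: N scores as a match against every base"]
    # Header row: column symbols, right-aligned in 4-character fields
    lines.append("    " + "".join(f"{s:>4}" for s in symbols))
    for row in symbols:
        scores = []
        for col in symbols:
            is_match = row == col or "N" in (row, col)
            scores.append(match if is_match else mismatch)
        lines.append(f"{row:<4}" + "".join(f"{v:>4}" for v in scores))
    return "\n".join(lines) + "\n"


print(build_matrix_text())
```

The printed text can be saved to a file and that file passed to water through -datafile.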

ADD REPLY • link 7.2 years ago by Jean-Karim Heriche 27k