Question

DNA sequences degeneration

0

Entering edit mode

8.4 years ago

Chadi Saad ▴ 110

What's the fastest way to degenerate exact DNA sequences (motifs of length L) using IUPAC ambiguity code. In other words, how to obtain all the possible IUPAC motifs from DNA exact motifs. I search for an efficient algorithm.

For example

I have these motifs of length 3 :

AAA
ATA
TTA
TAA
GAA
CAA

And I have to obtain:

AWA (AAA + ATA)
WWA (AAA + ATA + TAA + TTA)
RAA (AAA + GAA)
WAA (AAA + TAA)
YAA (CAA + TAA)
NAA (AAA + TAA + CAA + GAA)
...

DNA motif IUAPC • 2.7k views

ADD COMMENT • link updated 8.4 years ago by Brian Bushnell 20k • written 8.4 years ago by Chadi Saad ▴ 110

score 0 · Answer 1 · 2016-06-21

0

Entering edit mode

8.4 years ago

Devon Ryan 104k

Make a lookup table. For only two possibilities per site this becomes a 4x4 symmetric table.

ADD COMMENT • link 8.4 years ago by Devon Ryan 104k

0

Entering edit mode

thank you, can you give an example please ?

ADD REPLY • link 8.4 years ago by Chadi Saad ▴ 110

0

Entering edit mode

|   | A | C | T | G |
+---+---+---+---+---+
| A | A | M | W | R |
| C | M | A | Y | S |
| T | W | Y | T | K |
| G | R | S | K | G |

It seems that the table markdown doesn't render nicely. Anyway, there are many similar approaches, depending on your exact needs.

ADD REPLY • link 8.4 years ago by Devon Ryan 104k

0

Entering edit mode

but it means that each position is studied separately (a lookup table for each position), no ?

ADD REPLY • link 8.4 years ago by Chadi Saad ▴ 110

0

Entering edit mode

There's only one lookup table. Yes, each position is set separately, but that's not a problem and you can then generate these for arbitrary lengths efficiently.

ADD REPLY • link 8.4 years ago by Devon Ryan 104k

score 0 · Answer 2 · 2016-06-21

0

Entering edit mode

8.4 years ago

Brian Bushnell 20k

I would recommend translating to 4-bit with a linear lookup table, combining with OR, then using another linear lookup table to translate from 4-bit to IUPAC.

A -> 1
C -> 2
G -> 4
T -> 8

For example, the value at location of the ASCII value of 'A' should be 1.

So, A or C or T becomes:

1|2|8 = 11

The value at index 11 in the second lookup table is whatever the IUPAC symbol is for A or C or T. Specifically:

byte[] numberToIUPAC={
    'N','A','C','M','G','R','S','V', //0-7
    'T','W','Y','H','K','D','B','N', //8-15
};

ADD COMMENT • link 8.4 years ago by Brian Bushnell 20k

0

Entering edit mode

that's mean, that i have to compare each motif to all the other motifs, if the distance of hamming is equal to 1 (the 2 compared motifs differs by 1 letter) => get the corresponding IUPAC letter using your method ?

ADD REPLY • link 8.4 years ago by Chadi Saad ▴ 110

0

Entering edit mode

No. You process the first letter of all sequences at once. For example, if you have:

AAA
ACC
ACT
GCT

Then you get a motif like this:

First letter: 
A+A+A+G = 1|1|1|4 = 5 = R
Second letter:
A+C+C+C = 1|1|1|2 = 3 = M
Third letter:
A+C+T+T = 1|2|8|8 = 11 = H

So your motif is RMH.

Edit: I think, perhaps, I am not answering your original question, though. That's mainly because I cannot imagine a scenario in which it would be useful to obtain each degenerate sequence representing every possible subset of a set of sequences. What are you trying to accomplish?

ADD REPLY • link 8.4 years ago by Brian Bushnell 20k

0

Entering edit mode

for me each every possible exact sequence of a degenerated sequence should be represent. What I try to do, is to combine some over-represented exact motifs into some degenerated motifs. So it's not just a regular expression. thanks

ADD REPLY • link 8.4 years ago by Chadi Saad ▴ 110