Question

How Should I Encode DNA Into A Piddle?

2

12.9 years ago

Flies ▴ 100

I'm going to be doing some non-linear regression (with a huge and messy residual function), and I am thinking of using PDL::Fit::LM (I had some trouble getting Levmar to install).

The explanatory variables for my fit are DNA sequences (which I'm feeding into a position-specific weight matrix). What's the easiest way to put a DNA sequence into a piddle? Given that the function I'm working with is a big mess, performance is a consideration.

Since my weight matrix is constrained so that the weights at a given position sum to zero, my current plan is to represent each nucleotide as a vector of three elements: A -> [1,0,0], C -> [0,1,0], G -> [0,0,1], T -> [-1,-1,-1]. This way I can take a subsequence of my total sequence, multiply it with my weight matrix, and get the score.
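A minimal sketch of that plan in PDL might look like the following. The helper name `encode_dna`, the motif width of 2, and the weight values are all made up for illustration, and "multiply" is read here as an elementwise product followed by a sum over the window, which is only one possible interpretation.

```perl
use strict;
use warnings;
use PDL;

# Sum-to-zero encoding from the question: T is minus the sum of A, C and G.
my %code = (
    A => [ 1,  0,  0],
    C => [ 0,  1,  0],
    G => [ 0,  0,  1],
    T => [-1, -1, -1],
);

# Hypothetical helper: DNA string -> piddle with dims (3, length).
sub encode_dna {
    my ($seq) = @_;
    my @rows;
    for my $base (split //, uc $seq) {
        die "unexpected base '$base'\n" unless exists $code{$base};
        push @rows, $code{$base};
    }
    return pdl(\@rows);
}

# Made-up weights for a motif of width 2: one row of three free weights
# (A, C, G) per position; the T weight is implied as -(A + C + G).
my $weights = pdl([ [0.5, -0.2,  0.1],
                    [0.3,  0.3, -0.1] ]);

my $seq     = 'ACGT';
my $encoded = encode_dna($seq);
my $width   = $weights->dim(1);

# Score every window: elementwise product with the weights, then sum.
my @scores;
for my $start (0 .. length($seq) - $width) {
    my $end    = $start + $width - 1;
    my $window = $encoded->slice(":,$start:$end");
    push @scores, sum($window * $weights);
}
print join(' ', @scores), "\n";
```

Because T is encoded as [-1,-1,-1], the product automatically applies the implied T weight, so only three weights per position need to be fitted.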

perl • 2.7k views

Updated 12.9 years ago by Michael 55k • written 12.9 years ago by Flies ▴ 100

0

+1 for the most amusing BioStar title to date.

12.9 years ago by Casey Bergman 18k

0

What's your question? Seems like you've answered it yourself.

12.9 years ago by Qdjm 1.9k

0

Others have successfully used PDL for encoding alignments and other DNA-related stuff. Too bad the PDL documentation is terrible.

12.9 years ago by Martin A Hansen 3.0k

0

@qdjm I'm just guessing that I'm not the first person to do this, and I'm wondering what solutions people have come up with. I mention my current idea as a point of reference.

12.9 years ago by Flies ▴ 100

score 1 · Answer 1 · 2011-12-17

1

12.9 years ago

Michael 55k

If I understand your question correctly, you want to store the nucleotide sequence in a PDL data structure. Is that correct? If not, please update your question; it is a bit confusing. I do not immediately see the advantage of doing this instead of sticking with a normal string, so the question is: why would you want to do this? Anyway, you could use PDL::Char,

something along these lines:

use strict;
use warnings;
use PDL;
use PDL::Char;
# One row per sequence, one byte per base.
my $pchar = PDL::Char->new( ['ACGT', 'ATGT', 'TGAA'] );

Since you don't have control over the storage size of a variable (one could think of using a 2-bit encoded format, but there is no bit-level PDL type), this might already be the most efficient way memory-wise.
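If the three-element vectors from the question are still needed for the calculation, one option is to store a compact byte index per base and expand it through a lookup table only when scoring. The sketch below assumes an index assignment (A=0, C=1, G=2, T=3) and variable names that are purely illustrative; it uses `dice_axis` to pick table rows.

```perl
use strict;
use warnings;
use PDL;

# One small integer per base; a byte piddle uses one byte per element.
my %index = (A => 0, C => 1, G => 2, T => 3);

my $seq = 'ACGT';
my $idx = byte([ map { $index{$_} } split //, $seq ]);

# Lookup table with dims (3, 4): row k is the 3-vector for base index k.
my $table = pdl([ [ 1,  0,  0],
                  [ 0,  1,  0],
                  [ 0,  0,  1],
                  [-1, -1, -1] ]);

# Expand on demand: pick one table row per position -> dims (3, length).
my $encoded = $table->dice_axis(1, $idx);
print $encoded;
```

Stored this way, the sequence costs one byte per base, the same as the PDL::Char version, while the expanded form only has to exist for the window being scored.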

0

As to the reason why, it's because I have a quantitative model that uses DNA sequence as input, and I want to do the calculation as efficiently as possible.

12.9 years ago by Flies ▴ 100