How Should I Encode DNA Into A Piddle?
12.9 years ago
Flies ▴ 100

I'm going to be doing some non-linear regression (with a huge and messy residual function), and I am thinking of using PDL::Fit::LM (I had some trouble getting Levmar to install).

The explanatory variables for my fit are DNA sequences, which I feed into a position-specific weight matrix. What's the easiest way to put a DNA sequence into a piddle? Since the function I'm working with is a big mess, performance is a consideration.

Since my weight matrix is constrained so that the weights at any given position sum to zero, my current plan is to represent each nucleotide as a vector of three elements: A -> [1,0,0], C -> [0,1,0], G -> [0,0,1], T -> [-1,-1,-1]. This way I can take a subsequence of my total sequence, multiply it element-wise with my weight matrix, and sum the result to get the score.
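
As a rough illustration of the encoding described above, here is a minimal sketch in PDL. The helper names `encode_dna` and `score_window` and the example weights are hypothetical and not taken from any library. The weight matrix is assumed to hold one [A, C, G] row per position, with T's weight implied as the negative sum of the other three.

```perl
use strict;
use warnings;
use PDL;

# Sum-to-zero coding: T is represented as -(A + C + G).
my %CODE = (
    A => [ 1,  0,  0],
    C => [ 0,  1,  0],
    G => [ 0,  0,  1],
    T => [-1, -1, -1],
);

# Hypothetical helper: DNA string -> piddle with dims (3, sequence length).
sub encode_dna {
    my ($seq) = @_;
    my @rows;
    for my $base (split //, uc $seq) {
        die "Unsupported base '$base'\n" unless exists $CODE{$base};
        push @rows, $CODE{$base};
    }
    return pdl(\@rows);
}

# Hypothetical helper: score the window that starts at $start.
# $weights has dims (3, window length), one [A, C, G] row per position.
sub score_window {
    my ($enc, $weights, $start) = @_;
    my $end    = $start + $weights->dim(1) - 1;
    my $window = $enc->slice(":,$start:$end");
    return sum($window * $weights);   # element-wise product, then total
}

my $enc = encode_dna('GACGTA');

# Made-up weights for a window of four positions.
my $weights = pdl([
    [1,   -2,  0.5],
    [0.5,  2, -1  ],
    [0,    1,  3  ],
    [1,    2,  0.5],
]);

my $score = score_window($enc, $weights, 1);   # scores the window 'ACGT'
print "score = $score\n";
```

Encoding the whole sequence once and taking slices for each window avoids re-encoding inside the residual function, which is probably where most of the time would otherwise go.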

perl • 2.7k views

+1 for the most amusing BioStar title to date.


What's your question? Seems like you've answered it yourself.


Others have successfully used PDL for encoding alignments and other DNA-related data. Too bad the PDL documentation is terrible.


@qdjm I'm just guessing that I'm not the first person to do this, and I'm wondering what solutions people have come up with. I mention my current idea as a point of reference.

12.9 years ago
Michael 55k

If I understood your question correctly, you want to store the nucleotide sequence in a PDL data structure. If not, please update your question, as it is a bit confusing. I do not immediately see the advantage of doing this over sticking with a normal string, so the question is why you would want to. Anyway, you could use PDL::Char,

something along these lines:

use PDL;
use PDL::Char;
my $pchar = PDL::Char->new( ['ACGT', 'ATGT', 'TGAA']);

Since you cannot control the storage size of an element (a 2-bit encoding would be ideal, but there is no bit-level piddle type), this might already be the most memory-efficient option.



As to the reason why, it's because I have a quantitative model that uses DNA sequence as input, and I want to do the calculation as efficiently as possible.
