High score DNA motif search
1
0
Entering edit mode
7.4 years ago

I have a DNA sequence (of a chromosome) where each nucleotide has a score. For example (na indicates missing scores):

T  C  G   A  T  T  A  G  G  A  T  C
0  1  10  20 5  na na 0  5  2  0  0

My goal is to select all subsequences with high energy scores. No any other a priori information is known except that subsequences are not expected to be long, approximately 5-15 nucleotides. Could someone suggest some ideas how to approach this?

Edit 1: The presented sequence of 12 positions is a snippet of a much longer nucleotide sequence, ~230K positions.

Edit 2: Answer found in "A linear time algorithm for finding all maximal scoring subsequences." Ruzzo-Tompa algorithm.

sequence genome DNA R python • 1.8k views
ADD COMMENT
0
Entering edit mode

My first question would be: high relative to what? For instance, in your example, is 5 or 10 high?

ADD REPLY
0
Entering edit mode

That's one of the things which is not defined. That's why I'm looking for ideas/pointers. One thing is definite - there is some noise in the measurements. For example, score '1' or '2' could be a noise but noise level is not known.

ADD REPLY
0
Entering edit mode

If you can just read the 2 lines in as 2 arrays, I would first get some ballpark figures. If you can get the usual 5 part summaries, you can set some estimates about what are going to be reasonable thresholds.

I'd treat the whole thing as 2 arrays, or better yet probably a dict, and loop over it finding which index positions correspond to a particular value. You'd get sequences of positions which are, say > 10, or maybe >= the mean of the data, and the key of the key-value pair of the dict could be returned to get the sequences.

ADD REPLY
0
Entering edit mode

How do you expect to deal with na values? Do you intend to ignore/skip/impute them?

ADD REPLY
0
Entering edit mode

For now I treat them the same as 0 to downplay the importance of those positions but I wonder if there are better ways to deal with them.

ADD REPLY
0
Entering edit mode
7.4 years ago

What you might to be looking for are solutions to the maximum subarray problem for subarrays between 5 and 15 elements in length. It's not really clear what data you're working with, however, and this doesn't make much sense for your 12-base sequence. Maybe you're just showing a small snippet of data. It's not really clear from your question.

In any case, you need to decide how you deal with na values. If 0 is a score with specific meaning, then applying 0 to na values may not make sense. You might impute the value by taking the mean of values over nearby windows, or replace them with mean or median values over background. Or you might skip over such regions entirely. This really depends on your score data and its meaning.

ADD COMMENT

Login before adding your answer.

Traffic: 2010 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6