I have a DNA sequence (of a chromosome) where each nucleotide has a score. For example (na indicates missing scores):
T C G A T T A G G A T C
0 1 10 20 5 na na 0 5 2 0 0
My goal is to select all subsequences with high energy scores. No any other a priori information is known except that subsequences are not expected to be long, approximately 5-15 nucleotides. Could someone suggest some ideas how to approach this?
Edit 1: The presented sequence of 12 positions is a snippet of a much longer nucleotide sequence, ~230K positions.
Edit 2: Answer found in "A linear time algorithm for finding all maximal scoring subsequences." Ruzzo-Tompa algorithm.
My first question would be: high relative to what? For instance, in your example, is 5 or 10 high?
That's one of the things which is not defined. That's why I'm looking for ideas/pointers. One thing is definite - there is some noise in the measurements. For example, score '1' or '2' could be a noise but noise level is not known.
If you can just read the 2 lines in as 2 arrays, I would first get some ballpark figures. If you can get the usual 5 part summaries, you can set some estimates about what are going to be reasonable thresholds.
I'd treat the whole thing as 2 arrays, or better yet probably a dict, and loop over it finding which index positions correspond to a particular value. You'd get sequences of positions which are, say > 10, or maybe >= the mean of the data, and the key of the key-value pair of the dict could be returned to get the sequences.
How do you expect to deal with
na
values? Do you intend to ignore/skip/impute them?For now I treat them the same as 0 to downplay the importance of those positions but I wonder if there are better ways to deal with them.