Question

Finding Orf And Corresponding Strand In A Dna Sequence

11

Entering edit mode

13.7 years ago

Eric Normandeau 11k

Hi all,

It's been too long since we've had a Code golf! Following Pierre's suggestion in this question, here is a code golf about finding ORFs :)

So, here's the problem:

Write a program that finds the longuest ORF (closed or open) in a DNA sequence and returns the beginning and ending positions of the ORF, as well as the frame shift (-1, -2, -3, +1, +2, +3). Use the Standard Genetic Code. Note: the start/end positions are for the DNA sequence, not the corresponding amino acid sequence. In case there are no stop codons, return a +1 frame shift and the begining and end positions of the sequence.

As previously, you can use anything that can be run on a Linux terminal (your favorite language, emboss, awk...). The main interests of code golf question are to:

See diversity of approaches
Take on a small challenge
Show off :)
Most importantly: have fun!

Here is an example run: find_orf("ACGTACGTACGTACGT") Should return:

frame: +1
ORF_start: 1
ORF_end: 16

Or, in another format:

"+1 1 16"

You can use this test file to test the output of your program.

EDIT: It appears that the challenge is a bit less simple than I thought it would be... I'll put a one week extension before giving the correct answer. Please don't hesitate to put anything you would use from the command line, including things like a command-line version of ORFinder or the such.

The correct answer will be given to the most popular answer on Thursday, March 3rd at 15:00 hour (US Eastern Time).

EDIT: Answer goes to Brad and the Clojure solution :) Nice to see some real life examples in Lisp! Could you plug a recursive function in there? :P It seems that this problem was not that small after all, so many thanks to all who contributed answers!

Cheers!

code orf dna codegolf • 11k views

ADD COMMENT • link updated 2.8 years ago by Michael 55k • written 13.7 years ago by Eric Normandeau 11k

0

Entering edit mode

How many different genetic codes should this support all 16+? because then already the code tables become enourmous and you are better off using an existing tool like getOrf.

ADD REPLY • link 13.7 years ago by Michael 55k

0

Entering edit mode

Hi Michael. I edited the question so that only the Standard Genetic Code be used.

ADD REPLY • link 13.7 years ago by Eric Normandeau 11k

0

Entering edit mode

What about selenocysteine/pyrrolysine?

ADD REPLY • link 13.7 years ago by Jarretinha 3.4k

0

Entering edit mode

As this is a code golf, and not an attempt to make a complete program that covers all the special cases, I'd recommend to stick with the Standard Genetic Code and the 3 standard stop codons. The aim is to show your approach in your favorite language and share it. Cheers

ADD REPLY • link 13.7 years ago by Eric Normandeau 11k

0

Entering edit mode

Any chance of another 1 week extension? I'd like to tackle this one, but have not had the time.

ADD REPLY • link 13.7 years ago by Neilfws 49k

0

Entering edit mode

Hi Neil. The extension currently in effect would still leave you one week (until March 10th). I also have an almost finished version, but don't find much time either :) Is this extension long enough for your needs? I wouldn't mind reporting it one week farther, but I don't want it to become a running joke and do it until the end of civilization by the end of 2012 :D

ADD REPLY • link 13.7 years ago by Eric Normandeau 11k

0

Entering edit mode

You want to report the 1-base coordinates of the start of the first codon, and the start of the first "non-codon", i.e. first nucleotide after the reading frame? And for reverse? And do you want to include any terminating stop codon in the ORF? I think some of the test sequences have multiple ORFs of the same length, could you also provide the expected output (or all valid outputs) for the test sequences?

ADD REPLY • link 13.7 years ago by Ketil 4.1k

0

Entering edit mode

Ketil, I came up with those test sequences as just something short to play with. Due to being toy examples, they do have multiple ORFs of the same size. In the case of ties, I just reported the first frame ordered as (1, 2, 3, -1, -2, -3).

ADD REPLY • link 13.7 years ago by Brad Chapman 9.7k

0

Entering edit mode

Ketil, I came up with those short test examples by hand. They do have multiple ORFs of the same size and I reported the first frame in (1, 2, 3, -1, -2, -3) order. They are more useful to double check everything works in all 6 coordinates than to simulate real data.

ADD REPLY • link 13.7 years ago by Brad Chapman 9.7k

0

Entering edit mode

Almost 11 years later, maybe a simple question, how come that no one used regular expression to retrieve positions of start and stop codons? this way we don't need the frame shift..right? I tested it with R and all I had to do is to construct the ORF from the obtained matrix of inedices.

ADD REPLY • link 2.8 years ago by H.Hasani ▴ 990

1

Entering edit mode

Because simply finding the potential start and stop codons is not enough. They also need to be in frame, meaning the distance between them is divisible by 3. This requirement cannot be implemented in regular expressions. To check this, however, makes the problem more computationally complex than it needs to be. Indeed, it becomes quadratically dependent on the number of start and stop codons, while it could be at most linear depending on the length of the sequence. Or in O notation, l: length of sequence: memory and run-time complexity (l) = l + l^2 = O(l^2)

The quadratic nature of this is implicit in what you wrote about the matrix:

all I had to do is to construct the ORF from the obtained matrix of indices

Using a matrix indicates that the problem is quadratic now, and "all you have to do" remains little work only in the case of few codons found. Imagine a worst-case scenario where the sequence consists only of start and stop codons.

ADD REPLY • link 2.8 years ago by Michael 55k

Ram · Answer 1 · 2011-02-28

Here's a Clojure solution using BioJava 3 for FASTA reading and translation. The code is in GitHub along with a small test file that has ORFs in all 6 frames, an internal ORF, and a longer EST sequence without a stop codon.

% lein run :findorf test.fa
t-1 [0 15 ONE]
t-2 [1 16 TWO]
t-3 [2 17 THREE]
t-r1 [0 15 REVERSED_ONE]
t-r2 [0 18 REVERSED_TWO]
t-r3 [0 15 REVERSED_THREE]
t-1-inner [3 21 ONE]
EX720612.1 [0 252 ONE]

For each of the 6 frames:

Translate the DNA to protein
Split protein translation at stops
Find the longest protein region
Remap protein coordinates back to DNA bases
Remap DNA coordinates to the correct frame

Finally, choose a longest frame out of the 6 and print the details.

Edit: For those who are interested in digging more into the code but find the functional style difficult, the tx-sqn-frame function is a good place to start. It begins by defining a function that take a DNA sequence and the frame to translate it in:

(defn tx-sqn-frame [dna frame]

The next step is to setup various functions and objects you need to the work. I think of this as the preamble to the actual code. This is done with the 'let' function which assigns objects or functions to names. The first part demonstrates Clojure's Java interoperability. Here we create a BioJava TranscriptionEngine Builder object, tell it to trim stop codons, then build a TranscriptionEngine:

  (let [tx-engine (-> (TranscriptionEngine$Builder.)
                      (.trimStop false)
                      (.build))

The next let statement creates a local function called dna-coords that converts a set of protein coordinates to DNA; just multiplying by 3. This is just a function and could be declared anywhere in the code; in this case I declare it locally since it's small and not used elsewhere:

        dna-coords (fn [coords]
                     (let [dna-start (* 3 (first coords))]
                       [dna-start (+ dna-start (* 3 (second coords)))]))]

Once the builder and local functions are setup, you get to the interesting part of the code that does the actual logic. The wonderful thing about Clojure here is that you almost completely remove any boilerplate associated with calling functions. The code reads almost like pseudocode. The main thing to know here in terms of syntax is that the -> macro passes the results of one function call to the next:

    (-> (.getRNASequence dna tx-engine frame)
        (.getProteinSequence tx-engine)
        .toString
        (str2/partition #"\*")
        longest-region
        dna-coords
        (frame-coords (.getLength dna) (str frame)))))

So this:

Transcribes the DNA using BioJava.
Translates the RNA, again using BioJava.
Converts the result to a string
Splits the string by stop codons (*)
Finds the coordinates of the longest coding region between stops.
Converts protein to DNA coordinates
Converts coordinates to the correct frame.

The result is the coordinates of the longest translatable region within this frame. You then repeat this for each of the frames and return the longest for the result.

The most important thing is the general approach to structuring and understanding the code. First setup everything you need, then actually call the functions to do it. I'm relatively new to functional programming but these sort of exercises in thinking through coding a different way have helped me a lot in organizing my Python code.

Ram · Answer 2 · 2011-03-09

OK, another in the "not so pretty" category, but I've spent a bit of time on it and have to show something for the effort!.

This time with python: uses a generator to 'chunk' the sequence, then sets to looking for start (if you want) or stop codons:

class ORFFinder:
  """Find the longest ORF in a given sequence

   "seq" is a string, if "start" is not provided any codon can be the start of 
   and ORF. If muliple ORFs have the longest length the first one encountered
   is printed 
   """
  def __init__(self, seq, start=[], stop=["TAG", "TAA", "TGA"]):
    self.seq = seq
    self.start = start
    self.stop = stop    
    self.result = ("+",0,0,0,0)
    self.winner = 0

  def _reverse_comp(self):
    swap = {"A":"T", "T":"A", "C":"G", "G":"C"}
    return "".join(swap[b] for b in self.seq)

  def _print_current(self):
    print "frame %s%s position %s:%s (%s nucleotides)" % self.result

  def codons(self, frame):
    """ A generator that yields DNA in one codon blocks

    "frame" counts for 0. This function yelids a tuple (triplet, index) with 
    index relative to the original DNA sequence 
    """
    start = frame
    while start + 3 <= len(self.seq):
      yield (self.seq[start:start+3], start)
      start += 3

  def run_one(self, frame_number, direction):
    """ Search in one reading frame """
    codon_gen = self.codons(frame_number)     
    while True:
      try: 
    c , index = codon_gen.next()
      except StopIteration:
    break
      # Lots of conditions here: checks if we care about looking for start 
      # codon then that codon is not a stop
      if c in self.start or not self.start and c not in self.stop:
    orf_start = index + 1 # we'll return the result as 1-indexed
    end = False
    while True:
      try: 
        c, index = codon_gen.next()
      except StopIteration:
        end = True
      if c in self.stop:
        end = True
      if end:
        orf_end = index + 3 # because index is realitve to start of codon
        L = (orf_end - orf_start)  + 1
        if L > self.winner:
          self.winner = L
          self.result = (direction, frame_number+1, orf_start, orf_end, L)
        break

  def run(self):
    direction = "+"
    for frame in range(3):
      self.run_one(frame, direction)
    direction = "-"
    self.seq = self._reverse_comp()
    for frame in range(3):
      self.run_one(frame, direction)
    self._print_current()

def little_test():
  test = ORFFinder("ACGTACGTACGTACGT").run()

def bigger_test():  
  from Bio import SeqIO
  a = SeqIO.parse("test.fasta", "fasta")
  for seq in a:
    print seq.id
    ORFFinder(seq.seq.tostring()).run()

if __name__ == "__main__":
  print "\nrunning original test sequence..."
  little_test()
  print "\nrunning Brad's test file..."
  bigger_test()

(as I said, it's not short, so I put it up as a gist too)

and when run:

$ python of.py

running original test sequence...
frame +1 position 1:15 (15 nucleotides)

running Brad's test file...
t-1
frame +1 position 1:15 (15 nucleotides)
t-2
frame +2 position 2:16 (15 nucleotides)
t-3
frame +3 position 3:17 (15 nucleotides)
t-r1
frame -1 position 1:15 (15 nucleotides)
t-r2
frame -2 position 2:19 (18 nucleotides)
t-r3
frame -3 position 3:17 (15 nucleotides)
t-1-inner
frame +1 position 4:21 (18 nucleotides)
EX720612.1
frame +1 position 1:252 (252 nucleotides)

Happy for comments, tips, insights (still very much learning as a programmer)

Ram · Answer 3 · 2011-03-08

Haskell version, using Ketil Malde's Biohaskell:

import Bio.Sequence
import Data.List (sortBy)
import System.Environment (getArgs)

-- Take a sequence and a frame, return the longest ORF in the frame (wrapper)
readFrame :: Sequence Nuc -> Offset -> (Offset, Offset, Offset)
readFrame seq frame = if frame > 0 then (frame, start+frame-1, end+frame-1)
                                   else (frame, len-end+frame+1, len-start+frame+1)
    where len = seqlength seq
          offset = if frame > 0 then frame-1 else -(frame+1)
          tseq = translate seq offset
          (start, end) = readFrameAux 0 0 0 0 tseq

-- Compute the longest ORF of a sequence (worker)
readFrameAux :: Offset -> Offset -> Offset -> Offset -> [Amino] -> (Offset, Offset)
readFrameAux pos len maxpos maxlen (x:xs) = case x of
    --Met -> readFrameAux (pos+len) 3 maxpos maxlen xs
    STP -> if (len+3) > maxlen then readFrameAux (pos+len+3) 0 pos    (len+3) xs
                               else readFrameAux (pos+len+3) 0 maxpos maxlen  xs
    _   -> readFrameAux pos (len+3) maxpos maxlen xs
readFrameAux pos len maxpos maxlen _ = if len > maxlen then (pos, pos+len)
                                                     else (maxpos, maxpos+maxlen)

-- Take a sequence, return the longest ORF along all frames
findOrf :: Sequence Nuc -> (Offset, Offset, Offset)
findOrf seq = head $ sortBy (\(f1,s1,e1) (f2,s2,e2) -> compare (e2-s2) (e1-s1)) allFrames
    where frame1 = readFrame seq 1
          frame2 = readFrame seq 2
          frame3 = readFrame seq 3
          rseq = revcompl seq
          frameMinus1 = readFrame rseq (-1)
          frameMinus2 = readFrame rseq (-2)
          frameMinus3 = readFrame rseq (-3)
          allFrames = [frame1, frame2, frame3, frameMinus1, frameMinus2, frameMinus3]

-- Read sequences from FASTA file given as command-line argument
main :: IO ()
main = do [file]  <- getArgs
          seqs  <- readFasta file
          mapM_ (\seq -> (putStrLn . toStr . seqlabel) seq >> (print . findOrf) seq) (map castToNuc seqs)

Build:

$ ghc --make FindOrf
[1 of 1] Compiling Main             ( FindOrf.hs, FindOrf.o )
Linking FindOrf ...

Run:

$ ./FindOrf test-in.fa
t-1
(1,0,15)
t-2
(2,1,16)
t-3
(3,2,17)
t-r1
(-1,0,15)
t-r2
(-2,0,18)
t-r3
(-3,0,15)
t-1-inner
(1,3,21)
EX720612.1
(1,0,252)

readFrame :: Sequence Nuc -> Offset -> (Offset, Offset, Offset) is used to compute the longest ORF in a sequence within a given frame (1, 2, 3, -1, -2, -3). It is a wrapper for the tail-recursive readFrameAux function. We are able to express the core algorithm in a succinct and clear way in readFrameAux thanks to Haskell's beautiful pattern-matching syntax.

Actually, readFrame is slightly more than an initialiser for readFrameAux, as it also performs the following operations:

Compute the right offset for the given frame
Translate the Seq to [Amino] with the computed offset
Adjust the result positions depending on the frame.

The second part of the computation, done in findOrf :: Sequence Nuc -> (Offset, Offset, Offset) is to gather the results for all frames, and return the longest one.

The main function handles IO.

Please share any ideas/comments and help improve the code !

Ram · Answer 4 · 2011-03-08

Here's my attempt! It used 1-based coordinates, does not include terminating stop codons in the ORF, but I believe it is otherwise correct. It doesn't care about Mets in the ORFs, and the sequence of ORFs that is searched is ordered to produce the expected output.

Note that the bulk of the code is the formatting and output :-)

-- find longest ORF in a Fasta file

import Bio.Sequence
import System.IO
import Data.List (maximumBy, tails)

main :: IO ()
main = hReadFasta stdin >>= mapM_ (putStrLn . doit . castToNuc)

doit s = format s . maximumBy comp_orflength . all_orfs $ s

format :: Sequence Nuc -> (Frame,Length,Offset) -> String
format s (frm, l, off)
  | frm > 0 = sq++"+"++show frm++" "++show (frm+off)++" "++show (frm+off+len-1)
  | otherwise = sq++show frm++" "++show (sl+frm-off-len+2)++" "++show (sl-off+frm+1)
    where len = fromIntegral l
          sl = fromIntegral (seqlength s)
          sq = toStr (seqheader s) ++ ":\t"

comp_orflength (_,l1,_) (_,l2,_) = compare l1 l2

So much for output formatting, here's the ORF searching and selection code:

type Frame = Offset -- -3..+3
type Length = Int

all_orfs :: Sequence Nuc -> [(Frame,Length,Offset)]
all_orfs s = map findlength $ truncateStp $ transtails $ translations s
  where findlength (f,o,as) = (f,3*length as,o)

truncateStp :: [(Frame,Offset,[Amino])] -> [(Frame,Offset,[Amino])]
truncateStp = map stopSTP
  where stopSTP (f,o,as) = (f,o,takeWhile (/= STP) as) -- dumb syntax highlighting /

transtails :: [(Frame,[Amino])] -> [(Frame,Offset,[Amino])]
transtails = concatMap atails
  where atails (f,as) = [(f,i,ts) | (i,ts) <- zip [0,3..] (tails as)]

translations :: Sequence Nuc -> [(Frame,[Amino])]
translations s = [(negate i,translate (revcompl s) (i-1)) | i <- [3,2,1]]
                 ++ [(i,translate s (i-1)) | i <- [3,2,1]]

Results:

*Main> rs <- map castToNuc `fmap` readFasta "/home/ketil/test-in.fa"
*Main> mapM_ (putStrLn . doit .castToNuc) rs
t-1:    +1 1 12
t-2:    +2 2 16
t-3:    +3 3 17
t-r1:   -1 4 15
t-r2:   -2 1 18
t-r3:   -3 1 15
t-1-inner:  +1 4 18
EX720612.1: +1 1 252

And the initial test case:

*Main> let x = Seq (fromStr "foo") (fromStr "ACGTACGTACGTACGT") Nothing
*Main> mapM_ (putStrLn . doit .castToNuc) [x]
foo:    +1 1 15