Finding Orf And Corresponding Strand In A Dna Sequence
4
11
Entering edit mode
13.8 years ago

Hi all,

It's been too long since we've had a Code golf! Following Pierre's suggestion in this question, here is a code golf about finding ORFs :)

So, here's the problem:

Write a program that finds the longuest ORF (closed or open) in a DNA sequence and returns the beginning and ending positions of the ORF, as well as the frame shift (-1, -2, -3, +1, +2, +3). Use the Standard Genetic Code. Note: the start/end positions are for the DNA sequence, not the corresponding amino acid sequence. In case there are no stop codons, return a +1 frame shift and the begining and end positions of the sequence.

As previously, you can use anything that can be run on a Linux terminal (your favorite language, emboss, awk...). The main interests of code golf question are to:

  • See diversity of approaches
  • Take on a small challenge
  • Show off :)
  • Most importantly: have fun!

Here is an example run: find_orf("ACGTACGTACGTACGT") Should return:

frame: +1
ORF_start: 1
ORF_end: 16

Or, in another format:

"+1 1 16"

You can use this test file to test the output of your program.

EDIT: It appears that the challenge is a bit less simple than I thought it would be... I'll put a one week extension before giving the correct answer. Please don't hesitate to put anything you would use from the command line, including things like a command-line version of ORFinder or the such.

The correct answer will be given to the most popular answer on Thursday, March 3rd at 15:00 hour (US Eastern Time).

EDIT: Answer goes to Brad and the Clojure solution :) Nice to see some real life examples in Lisp! Could you plug a recursive function in there? :P It seems that this problem was not that small after all, so many thanks to all who contributed answers!

Cheers!

code orf dna codegolf • 11k views
ADD COMMENT
0
Entering edit mode

How many different genetic codes should this support all 16+? because then already the code tables become enourmous and you are better off using an existing tool like getOrf.

ADD REPLY
0
Entering edit mode

Hi Michael. I edited the question so that only the Standard Genetic Code be used.

ADD REPLY
0
Entering edit mode

What about selenocysteine/pyrrolysine?

ADD REPLY
0
Entering edit mode

As this is a code golf, and not an attempt to make a complete program that covers all the special cases, I'd recommend to stick with the Standard Genetic Code and the 3 standard stop codons. The aim is to show your approach in your favorite language and share it. Cheers

ADD REPLY
0
Entering edit mode

Any chance of another 1 week extension? I'd like to tackle this one, but have not had the time.

ADD REPLY
0
Entering edit mode

Hi Neil. The extension currently in effect would still leave you one week (until March 10th). I also have an almost finished version, but don't find much time either :) Is this extension long enough for your needs? I wouldn't mind reporting it one week farther, but I don't want it to become a running joke and do it until the end of civilization by the end of 2012 :D

ADD REPLY
0
Entering edit mode

You want to report the 1-base coordinates of the start of the first codon, and the start of the first "non-codon", i.e. first nucleotide after the reading frame? And for reverse? And do you want to include any terminating stop codon in the ORF? I think some of the test sequences have multiple ORFs of the same length, could you also provide the expected output (or all valid outputs) for the test sequences?

ADD REPLY
0
Entering edit mode

Ketil, I came up with those test sequences as just something short to play with. Due to being toy examples, they do have multiple ORFs of the same size. In the case of ties, I just reported the first frame ordered as (1, 2, 3, -1, -2, -3).

ADD REPLY
0
Entering edit mode

Ketil, I came up with those short test examples by hand. They do have multiple ORFs of the same size and I reported the first frame in (1, 2, 3, -1, -2, -3) order. They are more useful to double check everything works in all 6 coordinates than to simulate real data.

ADD REPLY
0
Entering edit mode

Almost 11 years later, maybe a simple question, how come that no one used regular expression to retrieve positions of start and stop codons? this way we don't need the frame shift..right? I tested it with R and all I had to do is to construct the ORF from the obtained matrix of inedices.

ADD REPLY
1
Entering edit mode

Because simply finding the potential start and stop codons is not enough. They also need to be in frame, meaning the distance between them is divisible by 3. This requirement cannot be implemented in regular expressions. To check this, however, makes the problem more computationally complex than it needs to be. Indeed, it becomes quadratically dependent on the number of start and stop codons, while it could be at most linear depending on the length of the sequence. Or in O notation, l: length of sequence: memory and run-time complexity (l) = l + l^2 = O(l^2)

The quadratic nature of this is implicit in what you wrote about the matrix:

all I had to do is to construct the ORF from the obtained matrix of indices

Using a matrix indicates that the problem is quadratic now, and "all you have to do" remains little work only in the case of few codons found. Imagine a worst-case scenario where the sequence consists only of start and stop codons.

ADD REPLY
7
Entering edit mode
13.8 years ago

Here's a Clojure solution using BioJava 3 for FASTA reading and translation. The code is in GitHub along with a small test file that has ORFs in all 6 frames, an internal ORF, and a longer EST sequence without a stop codon.

% lein run :findorf test.fa
t-1 [0 15 ONE]
t-2 [1 16 TWO]
t-3 [2 17 THREE]
t-r1 [0 15 REVERSED_ONE]
t-r2 [0 18 REVERSED_TWO]
t-r3 [0 15 REVERSED_THREE]
t-1-inner [3 21 ONE]
EX720612.1 [0 252 ONE]

For each of the 6 frames:

  • Translate the DNA to protein
  • Split protein translation at stops
  • Find the longest protein region
  • Remap protein coordinates back to DNA bases
  • Remap DNA coordinates to the correct frame

Finally, choose a longest frame out of the 6 and print the details.

Edit: For those who are interested in digging more into the code but find the functional style difficult, the tx-sqn-frame function is a good place to start. It begins by defining a function that take a DNA sequence and the frame to translate it in:

(defn tx-sqn-frame [dna frame]

The next step is to setup various functions and objects you need to the work. I think of this as the preamble to the actual code. This is done with the 'let' function which assigns objects or functions to names. The first part demonstrates Clojure's Java interoperability. Here we create a BioJava TranscriptionEngine Builder object, tell it to trim stop codons, then build a TranscriptionEngine:

  (let [tx-engine (-> (TranscriptionEngine$Builder.)
                      (.trimStop false)
                      (.build))

The next let statement creates a local function called dna-coords that converts a set of protein coordinates to DNA; just multiplying by 3. This is just a function and could be declared anywhere in the code; in this case I declare it locally since it's small and not used elsewhere:

        dna-coords (fn [coords]
                     (let [dna-start (* 3 (first coords))]
                       [dna-start (+ dna-start (* 3 (second coords)))]))]

Once the builder and local functions are setup, you get to the interesting part of the code that does the actual logic. The wonderful thing about Clojure here is that you almost completely remove any boilerplate associated with calling functions. The code reads almost like pseudocode. The main thing to know here in terms of syntax is that the -> macro passes the results of one function call to the next:

    (-> (.getRNASequence dna tx-engine frame)
        (.getProteinSequence tx-engine)
        .toString
        (str2/partition #"\*")
        longest-region
        dna-coords
        (frame-coords (.getLength dna) (str frame)))))

So this:

  • Transcribes the DNA using BioJava.
  • Translates the RNA, again using BioJava.
  • Converts the result to a string
  • Splits the string by stop codons (*)
  • Finds the coordinates of the longest coding region between stops.
  • Converts protein to DNA coordinates
  • Converts coordinates to the correct frame.

The result is the coordinates of the longest translatable region within this frame. You then repeat this for each of the frames and return the longest for the result.

The most important thing is the general approach to structuring and understanding the code. First setup everything you need, then actually call the functions to do it. I'm relatively new to functional programming but these sort of exercises in thinking through coding a different way have helped me a lot in organizing my Python code.

ADD COMMENT
2
Entering edit mode

Istvan -- it must have been so impressive and impenetrable that everyone else was afraid to answer. I blame myself for killing the golf. On a serious note, I'm happy to try and help with understanding this. I don't know if I'm the best person to teach Clojure or functional programming since I'm new to it, but I broke apart one of the functions to provide more insight. Hope this helps.

ADD REPLY
0
Entering edit mode

Impressive solution though for those (like me) more used to imperative style programming its almost impenetrable - I am disappointed that I don't really understand how this works

ADD REPLY
5
Entering edit mode
13.8 years ago
David W 4.9k

OK, another in the "not so pretty" category, but I've spent a bit of time on it and have to show something for the effort!.

This time with python: uses a generator to 'chunk' the sequence, then sets to looking for start (if you want) or stop codons:

class ORFFinder:
  """Find the longest ORF in a given sequence

   "seq" is a string, if "start" is not provided any codon can be the start of 
   and ORF. If muliple ORFs have the longest length the first one encountered
   is printed 
   """
  def __init__(self, seq, start=[], stop=["TAG", "TAA", "TGA"]):
    self.seq = seq
    self.start = start
    self.stop = stop    
    self.result = ("+",0,0,0,0)
    self.winner = 0

  def _reverse_comp(self):
    swap = {"A":"T", "T":"A", "C":"G", "G":"C"}
    return "".join(swap[b] for b in self.seq)

  def _print_current(self):
    print "frame %s%s position %s:%s (%s nucleotides)" % self.result

  def codons(self, frame):
    """ A generator that yields DNA in one codon blocks

    "frame" counts for 0. This function yelids a tuple (triplet, index) with 
    index relative to the original DNA sequence 
    """
    start = frame
    while start + 3 <= len(self.seq):
      yield (self.seq[start:start+3], start)
      start += 3

  def run_one(self, frame_number, direction):
    """ Search in one reading frame """
    codon_gen = self.codons(frame_number)     
    while True:
      try: 
    c , index = codon_gen.next()
      except StopIteration:
    break
      # Lots of conditions here: checks if we care about looking for start 
      # codon then that codon is not a stop
      if c in self.start or not self.start and c not in self.stop:
    orf_start = index + 1 # we'll return the result as 1-indexed
    end = False
    while True:
      try: 
        c, index = codon_gen.next()
      except StopIteration:
        end = True
      if c in self.stop:
        end = True
      if end:
        orf_end = index + 3 # because index is realitve to start of codon
        L = (orf_end - orf_start)  + 1
        if L > self.winner:
          self.winner = L
          self.result = (direction, frame_number+1, orf_start, orf_end, L)
        break

  def run(self):
    direction = "+"
    for frame in range(3):
      self.run_one(frame, direction)
    direction = "-"
    self.seq = self._reverse_comp()
    for frame in range(3):
      self.run_one(frame, direction)
    self._print_current()

def little_test():
  test = ORFFinder("ACGTACGTACGTACGT").run()

def bigger_test():  
  from Bio import SeqIO
  a = SeqIO.parse("test.fasta", "fasta")
  for seq in a:
    print seq.id
    ORFFinder(seq.seq.tostring()).run()

if __name__ == "__main__":
  print "\nrunning original test sequence..."
  little_test()
  print "\nrunning Brad's test file..."
  bigger_test()

(as I said, it's not short, so I put it up as a gist too)

and when run:

$ python of.py

running original test sequence...
frame +1 position 1:15 (15 nucleotides)

running Brad's test file...
t-1
frame +1 position 1:15 (15 nucleotides)
t-2
frame +2 position 2:16 (15 nucleotides)
t-3
frame +3 position 3:17 (15 nucleotides)
t-r1
frame -1 position 1:15 (15 nucleotides)
t-r2
frame -2 position 2:19 (18 nucleotides)
t-r3
frame -3 position 3:17 (15 nucleotides)
t-1-inner
frame +1 position 4:21 (18 nucleotides)
EX720612.1
frame +1 position 1:252 (252 nucleotides)

Happy for comments, tips, insights (still very much learning as a programmer)

ADD COMMENT
1
Entering edit mode

David, very nice. Regarding the logic, my main suggestion would be to work on simplifying the logic in run_one, which gets deeply nested. Here is an implementation that follows your general approach but uses a single loop: https://gist.github.com/862240. In terms of organization, you might want to think about hiding the implementation details from the caller of your function. I would define run as a standalone function that takes a sequence. Then internally it would build an ORFFinder and print the resulting output so the caller never has to know about it.

ADD REPLY
0
Entering edit mode

Thanks Brad, treat these very much as a learning experience so very glad to get some feedback!

ADD REPLY
4
Entering edit mode
13.8 years ago
Pc ▴ 40

Haskell version, using Ketil Malde's Biohaskell:

import Bio.Sequence
import Data.List (sortBy)
import System.Environment (getArgs)

-- Take a sequence and a frame, return the longest ORF in the frame (wrapper)
readFrame :: Sequence Nuc -> Offset -> (Offset, Offset, Offset)
readFrame seq frame = if frame > 0 then (frame, start+frame-1, end+frame-1)
                                   else (frame, len-end+frame+1, len-start+frame+1)
    where len = seqlength seq
          offset = if frame > 0 then frame-1 else -(frame+1)
          tseq = translate seq offset
          (start, end) = readFrameAux 0 0 0 0 tseq

-- Compute the longest ORF of a sequence (worker)
readFrameAux :: Offset -> Offset -> Offset -> Offset -> [Amino] -> (Offset, Offset)
readFrameAux pos len maxpos maxlen (x:xs) = case x of
    --Met -> readFrameAux (pos+len) 3 maxpos maxlen xs
    STP -> if (len+3) > maxlen then readFrameAux (pos+len+3) 0 pos    (len+3) xs
                               else readFrameAux (pos+len+3) 0 maxpos maxlen  xs
    _   -> readFrameAux pos (len+3) maxpos maxlen xs
readFrameAux pos len maxpos maxlen _ = if len > maxlen then (pos, pos+len)
                                                     else (maxpos, maxpos+maxlen)

-- Take a sequence, return the longest ORF along all frames
findOrf :: Sequence Nuc -> (Offset, Offset, Offset)
findOrf seq = head $ sortBy (\(f1,s1,e1) (f2,s2,e2) -> compare (e2-s2) (e1-s1)) allFrames
    where frame1 = readFrame seq 1
          frame2 = readFrame seq 2
          frame3 = readFrame seq 3
          rseq = revcompl seq
          frameMinus1 = readFrame rseq (-1)
          frameMinus2 = readFrame rseq (-2)
          frameMinus3 = readFrame rseq (-3)
          allFrames = [frame1, frame2, frame3, frameMinus1, frameMinus2, frameMinus3]

-- Read sequences from FASTA file given as command-line argument
main :: IO ()
main = do [file]  <- getArgs
          seqs  <- readFasta file
          mapM_ (\seq -> (putStrLn . toStr . seqlabel) seq >> (print . findOrf) seq) (map castToNuc seqs)

Build:

$ ghc --make FindOrf
[1 of 1] Compiling Main             ( FindOrf.hs, FindOrf.o )
Linking FindOrf ...

Run:

$ ./FindOrf test-in.fa
t-1
(1,0,15)
t-2
(2,1,16)
t-3
(3,2,17)
t-r1
(-1,0,15)
t-r2
(-2,0,18)
t-r3
(-3,0,15)
t-1-inner
(1,3,21)
EX720612.1
(1,0,252)

readFrame :: Sequence Nuc -> Offset -> (Offset, Offset, Offset) is used to compute the longest ORF in a sequence within a given frame (1, 2, 3, -1, -2, -3). It is a wrapper for the tail-recursive readFrameAux function. We are able to express the core algorithm in a succinct and clear way in readFrameAux thanks to Haskell's beautiful pattern-matching syntax.

Actually, readFrame is slightly more than an initialiser for readFrameAux, as it also performs the following operations:

  • Compute the right offset for the given frame
  • Translate the Seq to [Amino] with the computed offset
  • Adjust the result positions depending on the frame.

The second part of the computation, done in findOrf :: Sequence Nuc -> (Offset, Offset, Offset) is to gather the results for all frames, and return the longest one.

The main function handles IO.

Please share any ideas/comments and help improve the code !

ADD COMMENT
0
Entering edit mode

Very nice solution. For those who asked about functional techniques, one trick is to turn everything into a sequence and then process on items sequentially; readFrame' does an especially nice job of this. Pierre, one potential issue I noticed is the Met rule in readFrame' that restarts the count. This would incorrectly reset on any internal Met residues within coding sequences; none of the test examples had these. It looks like you should be able to safely remove this case and rely on the STP and default. I tried to test but mapcastToNuc is undefined; maybe I am missing a library?

ADD REPLY
0
Entering edit mode

Nice! There is also a 'translate' function you could use (in Bio.Sequence), but it counts frames as ±0..2.

ADD REPLY
0
Entering edit mode

Appraently, the syntax highlighting code breaks on primes ('), and on /=, causing everything after to be highlighted in red. In my submission, I added a comment to stop the highlighting. On the plus side, the highlighting appears to identify the code as Haskell.

ADD REPLY
0
Entering edit mode

Brad: You're right, I corrected the code but just commented the case so other readers can see where the problem was. Thanks ! mapcastToNuc is a typo, sorry for that. Ketil: I did use it in readFrame, that's why readFrameAux takes [Amino] as argument. Actually, readFrame is slightly more than a wrapper/initialiser for readFrameAux, as it also takes care of : -Computing the right offset for the frame we want -Translating the sequence -Adjusting the result positions You're right about syntax highlighting, so I changed the identifiers that were causing trouble. Thanks for the feedback !

ADD REPLY
0
Entering edit mode

Brad: You're right, I corrected the code but just commented the case so other readers can see where the problem was. Thanks ! mapcastToNuc was a typo, sorry for that.

Ketil: I did use translate in readFrame, and edited the answer to give a better description of what it does.

You're right about syntax highlighting, I changed the identifiers that were causing trouble.

Thanks for the feedback !

ADD REPLY
4
Entering edit mode
13.8 years ago
Ketil 4.1k

Here's my attempt! It used 1-based coordinates, does not include terminating stop codons in the ORF, but I believe it is otherwise correct. It doesn't care about Mets in the ORFs, and the sequence of ORFs that is searched is ordered to produce the expected output.

Note that the bulk of the code is the formatting and output :-)

-- find longest ORF in a Fasta file

import Bio.Sequence
import System.IO
import Data.List (maximumBy, tails)

main :: IO ()
main = hReadFasta stdin >>= mapM_ (putStrLn . doit . castToNuc)

doit s = format s . maximumBy comp_orflength . all_orfs $ s

format :: Sequence Nuc -> (Frame,Length,Offset) -> String
format s (frm, l, off)
  | frm > 0 = sq++"+"++show frm++" "++show (frm+off)++" "++show (frm+off+len-1)
  | otherwise = sq++show frm++" "++show (sl+frm-off-len+2)++" "++show (sl-off+frm+1)
    where len = fromIntegral l
          sl = fromIntegral (seqlength s)
          sq = toStr (seqheader s) ++ ":\t"

comp_orflength (_,l1,_) (_,l2,_) = compare l1 l2

So much for output formatting, here's the ORF searching and selection code:

type Frame = Offset -- -3..+3
type Length = Int

all_orfs :: Sequence Nuc -> [(Frame,Length,Offset)]
all_orfs s = map findlength $ truncateStp $ transtails $ translations s
  where findlength (f,o,as) = (f,3*length as,o)

truncateStp :: [(Frame,Offset,[Amino])] -> [(Frame,Offset,[Amino])]
truncateStp = map stopSTP
  where stopSTP (f,o,as) = (f,o,takeWhile (/= STP) as) -- dumb syntax highlighting /

transtails :: [(Frame,[Amino])] -> [(Frame,Offset,[Amino])]
transtails = concatMap atails
  where atails (f,as) = [(f,i,ts) | (i,ts) <- zip [0,3..] (tails as)]

translations :: Sequence Nuc -> [(Frame,[Amino])]
translations s = [(negate i,translate (revcompl s) (i-1)) | i <- [3,2,1]]
                 ++ [(i,translate s (i-1)) | i <- [3,2,1]]

Results:

*Main> rs <- map castToNuc `fmap` readFasta "/home/ketil/test-in.fa"
*Main> mapM_ (putStrLn . doit .castToNuc) rs
t-1:    +1 1 12
t-2:    +2 2 16
t-3:    +3 3 17
t-r1:   -1 4 15
t-r2:   -2 1 18
t-r3:   -3 1 15
t-1-inner:  +1 4 18
EX720612.1: +1 1 252

And the initial test case:

*Main> let x = Seq (fromStr "foo") (fromStr "ACGTACGTACGTACGT") Nothing
*Main> mapM_ (putStrLn . doit .castToNuc) [x]
foo:    +1 1 15
ADD COMMENT
0
Entering edit mode

Clean, concise, idiomatic Haskell. I still have much to learn :) The list comprehension in transtails is pretty clever.

ADD REPLY
0
Entering edit mode

I enjoyed and learned from this one as well. The tails approach is really nice, and lets you deal with the whole thing as list processing and avoid any recursion. Including actually finding the maximum region and converting coordinates as part of "formatting and output" gave me a nice chuckle. Thanks, Ketil.

ADD REPLY
0
Entering edit mode

I wanted to make use of higher order functions (e.g. map and filter) as much as possible, and avoid explicit recursion. The requirement to carry around information like coordinates and sequence name and length complicates things a bit, so the elegance gets diluted.

ADD REPLY
0
Entering edit mode

Okay, I've included this into the biolib examples. I've commented everything a bit, and fixed the STP issue. Check it out (i.e. take a look, or use 'darcs get') at http://malde.org/~ketil/biohaskell/biolib/examples/Orf.hs.

ADD REPLY

Login before adding your answer.

Traffic: 1978 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6