Hi all,
It's been too long since we've had a Code golf! Following Pierre's suggestion in this question, here is a code golf about finding ORFs :)
So, here's the problem:
Write a program that finds the longuest ORF (closed or open) in a DNA sequence and returns the beginning and ending positions of the ORF, as well as the frame shift (-1, -2, -3, +1, +2, +3). Use the Standard Genetic Code. Note: the start/end positions are for the DNA sequence, not the corresponding amino acid sequence. In case there are no stop codons, return a +1 frame shift and the begining and end positions of the sequence.
As previously, you can use anything that can be run on a Linux terminal (your favorite language, emboss, awk...). The main interests of code golf question are to:
- See diversity of approaches
- Take on a small challenge
- Show off :)
- Most importantly: have fun!
Here is an example run: find_orf("ACGTACGTACGTACGT")
Should return:
frame: +1
ORF_start: 1
ORF_end: 16
Or, in another format:
"+1 1 16"
You can use this test file to test the output of your program.
EDIT: It appears that the challenge is a bit less simple than I thought it would be... I'll put a one week extension before giving the correct answer. Please don't hesitate to put anything you would use from the command line, including things like a command-line version of ORFinder or the such.
The correct answer will be given to the most popular answer on Thursday, March 3rd at 15:00 hour (US Eastern Time).
EDIT: Answer goes to Brad and the Clojure solution :) Nice to see some real life examples in Lisp! Could you plug a recursive function in there? :P It seems that this problem was not that small after all, so many thanks to all who contributed answers!
Cheers!
How many different genetic codes should this support all 16+? because then already the code tables become enourmous and you are better off using an existing tool like getOrf.
Hi Michael. I edited the question so that only the Standard Genetic Code be used.
What about selenocysteine/pyrrolysine?
As this is a code golf, and not an attempt to make a complete program that covers all the special cases, I'd recommend to stick with the Standard Genetic Code and the 3 standard stop codons. The aim is to show your approach in your favorite language and share it. Cheers
Any chance of another 1 week extension? I'd like to tackle this one, but have not had the time.
Hi Neil. The extension currently in effect would still leave you one week (until March 10th). I also have an almost finished version, but don't find much time either :) Is this extension long enough for your needs? I wouldn't mind reporting it one week farther, but I don't want it to become a running joke and do it until the end of civilization by the end of 2012 :D
You want to report the 1-base coordinates of the start of the first codon, and the start of the first "non-codon", i.e. first nucleotide after the reading frame? And for reverse? And do you want to include any terminating stop codon in the ORF? I think some of the test sequences have multiple ORFs of the same length, could you also provide the expected output (or all valid outputs) for the test sequences?
Ketil, I came up with those test sequences as just something short to play with. Due to being toy examples, they do have multiple ORFs of the same size. In the case of ties, I just reported the first frame ordered as (1, 2, 3, -1, -2, -3).
Ketil, I came up with those short test examples by hand. They do have multiple ORFs of the same size and I reported the first frame in (1, 2, 3, -1, -2, -3) order. They are more useful to double check everything works in all 6 coordinates than to simulate real data.
Almost 11 years later, maybe a simple question, how come that no one used regular expression to retrieve positions of start and stop codons? this way we don't need the frame shift..right? I tested it with R and all I had to do is to construct the ORF from the obtained matrix of inedices.
Because simply finding the potential start and stop codons is not enough. They also need to be in frame, meaning the distance between them is divisible by 3. This requirement cannot be implemented in regular expressions. To check this, however, makes the problem more computationally complex than it needs to be. Indeed, it becomes quadratically dependent on the number of start and stop codons, while it could be at most linear depending on the length of the sequence. Or in O notation, l: length of sequence: memory and run-time complexity (l) = l + l^2 = O(l^2)
The quadratic nature of this is implicit in what you wrote about the matrix:
Using a matrix indicates that the problem is quadratic now, and "all you have to do" remains little work only in the case of few codons found. Imagine a worst-case scenario where the sequence consists only of start and stop codons.