Can someone please check or explain whether my way of finding the longest CDS (potential coding sequence) is correct?
import re
regex = r"(?:ATG|atg)\w+(?:TAG|TGA|TAA|tag|tga|taa)"
test_str = "ttgcgttaaggtagatgatttttgtattttatattttggggagcaagaggacatggtatacagactcgtcatgttgctaatttcgccatcggcacacgggatgcagtcgaagcagcagaccggctctccctgtcggatccctttcctggtgccaggcttgcaattctcactgcacactgaccgaggtggctgaaaagcacagatgcagttgcatgacatttcctggcctaaaaattgaaaagaagtaaattatgaacagtgtgcaactgacctctaatgcttcattgttccagacaatggaatcattcttaatagtcatcctctcctctggggccaaggaagcattgtattgccccacggtcacatataatattcccccgttactgtcgagtccccaattgatgatgtcgtaatatccctccacatctctgccgtcaaagtagacctcctcatctgtgtgtggcactttgaagcgcacattcttcaagtagtacataagctataaacaaatgtcagatttcaatcaaacacgcacaagtggccacctctctcaaacgcaaacatggcaaagcgaaaaaatggattggagagaaagaacaataccagttatctgtcaagtttggcccgtataagggattgattggacaacacagagaaagtcacgtacctgccacggctcaaacttggtgatatcagcacagccgatcccaatgaatggcccgtgtcccggctcgcagtgctccagattgtgcagagcatgggcaacggcatacaccgctttgtatacgctgtggcaagacattgcagcctaccatgacaactccataataataataataggacgacaatggacaaaacctacatttgataggggactggataatgactcaagaagctagagtgtgaagtcacctgtaggtgattcgaagctgagaaatgtcagagtaggtgttgttgagctgagctaatgactcgctgccagta"
matches = re.search(regex, test_str)
if matches:
    print("{match}".format(match=matches.group()))
I am checking my results against https://web.expasy.org/translate/; some of them match, but some do not.
Did you take the forward and reverse strands into account?
And that the length should be a multiple of 3? (Not every ATG-stop combination is a valid CDS: the stop codon has to be in the same reading frame as the ATG.)
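To illustrate the modulo-3 point, here is a minimal sketch of a regex that reads the sequence codon by codon after the ATG and stops at the first in-frame stop codon. The names `ORF_RE`, `find_orfs` and `longest_orf` are made up for this example, and it assumes the sequence contains only A, C, G and T (no N or other ambiguity codes). It only looks at the strand you give it.

```python
import re

# ATG, then any number of non-stop codons (three bases at a time), then a stop codon.
# The outer lookahead (?=...) lets matches overlap, so every ATG is tried as a start.
ORF_RE = re.compile(r"(?=(ATG(?:(?!TAA|TAG|TGA)[ACGT]{3})*(?:TAA|TAG|TGA)))")

def find_orfs(seq):
    """Return every ATG...stop stretch that ends at the first in-frame stop codon."""
    return [m.group(1) for m in ORF_RE.finditer(seq.upper())]

def longest_orf(seq):
    """Return the longest such stretch, or an empty string if there is none."""
    return max(find_orfs(seq), key=len, default="")

# "ATGAATAG" contains ATG and TAG, but the TAG is not in frame with the ATG,
# so it is not reported here (a plain ATG\w+TAG pattern would match it).
print(find_orfs("ATGAATAG"))
print(longest_orf("ATGTAAccATGCCCGGGTGA"))
```

Because each match is built from whole codons, its length is always a multiple of 3, and because the pattern is tried at every position, all three frames of the strand are covered in a single pass.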
No, I did not. Am I right that there are 3 different forward and reverse strands?
Not really. There is only one forward and one reverse strand, but each of them has 3 reading frames: 3 on the forward strand and 3 on the reverse.
However, the way you look for ORFs means the frames are not an issue for your approach (or for the updated/corrected one from Michael Dondrup): you scan all frames at once. You do still have to search the reverse strand as well.
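For the reverse strand, one way to do it is to reverse-complement the sequence and run the same search on it. The following is only a sketch under a few assumptions: the helper names (`reverse_complement`, `longest_orf_both_strands`) are hypothetical, the input is plain A/C/G/T, and an ORF is taken to be ATG up to the first in-frame stop codon.

```python
import re

# ATG, non-stop codons read three bases at a time, then a stop codon.
# The lookahead wrapper allows overlapping matches, so all three frames are scanned.
ORF_RE = re.compile(r"(?=(ATG(?:(?!TAA|TAG|TGA)[ACGT]{3})*(?:TAA|TAG|TGA)))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def longest_orf_both_strands(seq):
    """Return (strand, orf) for the longest ORF on either strand, ("", "") if none."""
    best = ("", "")
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for m in ORF_RE.finditer(s):
            if len(m.group(1)) > len(best[1]):
                best = (strand, m.group(1))
    return best

# This toy sequence has no ATG on the forward strand, only on the reverse one.
print(longest_orf_both_strands("TTACCCCAT"))
```

The ORF reported for the "-" strand is written 5' to 3' on the reverse strand, which is the orientation a translation tool would expect as input.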