Finding longest CDS for the sequence
1
0
Entering edit mode
2.5 years ago
Sasha ▴ 10

Can someone please check/explain if my way of finding the longest CDS(potential coding sequence) is correct

 import re
regex = r"(?:ATG|atg)\w+(?:TAG|TGA|TAA|tag|tga|taa)"

test_str = "ttgcgttaaggtagatgatttttgtattttatattttggggagcaagaggacatggtatacagactcgtcatgttgctaatttcgccatcggcacacgggatgcagtcgaagcagcagaccggctctccctgtcggatccctttcctggtgccaggcttgcaattctcactgcacactgaccgaggtggctgaaaagcacagatgcagttgcatgacatttcctggcctaaaaattgaaaagaagtaaattatgaacagtgtgcaactgacctctaatgcttcattgttccagacaatggaatcattcttaatagtcatcctctcctctggggccaaggaagcattgtattgccccacggtcacatataatattcccccgttactgtcgagtccccaattgatgatgtcgtaatatccctccacatctctgccgtcaaagtagacctcctcatctgtgtgtggcactttgaagcgcacattcttcaagtagtacataagctataaacaaatgtcagatttcaatcaaacacgcacaagtggccacctctctcaaacgcaaacatggcaaagcgaaaaaatggattggagagaaagaacaataccagttatctgtcaagtttggcccgtataagggattgattggacaacacagagaaagtcacgtacctgccacggctcaaacttggtgatatcagcacagccgatcccaatgaatggcccgtgtcccggctcgcagtgctccagattgtgcagagcatgggcaacggcatacaccgctttgtatacgctgtggcaagacattgcagcctaccatgacaactccataataataataataggacgacaatggacaaaacctacatttgataggggactggataatgactcaagaagctagagtgtgaagtcacctgtaggtgattcgaagctgagaaatgtcagagtaggtgttgttgagctgagctaatgactcgctgccagta"

matches = re.search(regex, test_str)

if matches:
    print ("{match}".format(match = matches.group()))

I am checking some of my results using https://web.expasy.org/translate/ and some of them are matching but some are not.

Python Biopython • 1.5k views
ADD COMMENT
1
Entering edit mode

did you take forward and reverse strand into account ?

and that is should be modulo 3 ? (not all combo's of ATG-stop are thus valid CDS)

ADD REPLY
0
Entering edit mode

No, I did not. Am i right that there are 3 different forward and reverse strands?

ADD REPLY
0
Entering edit mode

not really , there is only one forward and one reverse strand , however on top of that each has 3 frames: 3 on the forward and 3 on the reverse

though the way you look for ORFs makes that this is not an issue for your (or the updated/corrected one from Michael Dondrup ). You look for all of them at once, you do have to take into account to also look on the reverse.

ADD REPLY
2
Entering edit mode
2.5 years ago
Michael 55k

No, it is incorrect in several ways. You are looking for open reading frames (ORF) on the forward strand with the (alternative) definition that an ORF is any sequence of codons between a start and a stop codon in the same frame, thus length must be divisible by 3. Also, the shortest ORF is then a start codon, followed directly by a stop codon.

So, your regex could look something like this:

(?:ATG)([ATGC]{3,3})*(?:TAG|TGA|TAA) or (?:ATG)(\w\w\w)*(?:TAG|TGA|TAA)

Alternatively, you could check if length %% 3 == 0.

If you expect mixed capital and non-capital letters you should turn on case-independent matching, e.g. what's with TaG, taG, etc? If you want to get reverse strand ORF's also, you have to do the same matching on the reverse complement as well.

To identify ORFs correctly, you should not rely on self-written code but use getOrf (EMBOSS) or similar.

ADD COMMENT
1
Entering edit mode

One minor comment here - if this is for an educational exercise or the like, i think that a (huge) part of bioinformatics is knowing what packages are out there, which are the most reliable, what makes them reliable, etc.

Nevertheless, i think it is defensible for educators to ask you to write your own solution to a bioinformatics problem like this; even if the solution one comes up with is wrong theres still substantial learning opportunity ...

So, while I agree that re-writing this bit of code is probably not the right way to go, do consider if it is worthwhile to look at the code corresponding to this task from the solution you end up choosing. How did they do it, and why?

Good luck

ADD REPLY
0
Entering edit mode

Thank you for your reply! My degree is marine biology and never had a chance to take a bioinformatics course so just trying to learn myself some bits so I can improve my final year thesis cause it still has quite a lot of potential. Thesis was on "Olfactory receptor genes in fish", so I had to use blast and python to get the gene repertoire of the fish, but while doing it I haven't been able to extract the gene sequence itself, so just trying to figure how to do it step by step.

ADD REPLY
0
Entering edit mode

I see. I was a bit confused about what I did and cannot find any guide how to find the longest CDS and even more confused that there are 1-3 frames for the forward strand and -1,-2,-3 frames for the reverse strand. Thank you for clearing what my code does tho!

ADD REPLY

Login before adding your answer.

Traffic: 1587 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6