A sequence of nucleotides containing multiple genes and UTRs is given (seq):
<--UTR--><----------------Gene 1-------------><---- UTR-----><----------- Gene 2 --------------------><-- UTR-->
TGGAGA_startCodon_AGGAAG_stopCodon_GAAGGTAAC_startCodon_AGCTCTG_stopCodon_ATCAAGA
There could be multiple out-of-frame or overlapping start codons between each primary start and stop codon shown above, or in the UTRs. The positions of all start codons, whether in-frame, out-of-frame or overlapping, can be found as follows:
import re

startCodons = ['ATG', 'GTG', 'CTG', 'TTG']
# Start positions of start codons; the lookahead (?=...) consumes no
# characters, so overlapping occurrences are reported as well
startCodons_pos = {}
for startCodon_seq in startCodons:
    startCodons_pos[startCodon_seq] = [m.start() for m in re.finditer('(?=' + startCodon_seq + ')', seq)]
While the start codons can be in-frame, out-of-frame or overlapping, I need to find only the stop codons that are in-frame with respect to each 'primary' start codon. This can be done with multiple loops, but I was wondering whether a smarter way of doing it exists in Python.
This old (and fun!) thread might help you. It is about finding the longest ORF in all six frames, but you can probably hack one of the results to include all ORFs: "Code golf: Finding ORF and corresponding strand in a DNA sequence"
I wrote something for it here: https://github.com/vsbuffalo/findorf/blob/master/findorf/orfprediction.py but this may be too project-specific. It does handle some other cases that you may find interesting, though.
If you're learning Python or bioinformatics, it's a good exercise; otherwise you can use EMBOSS sixpack.
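As a starting point for that exercise, here is a minimal sketch of one way to pair every start codon with its first in-frame stop codon. It is only an illustration, not the method used in the linked code or in sixpack. It assumes the standard stop codons TAA, TAG and TGA, scans the forward strand only, and the function names first_inframe_stop and orfs_by_start are made up for this example.

```python
import re

START_CODONS = ("ATG", "GTG", "CTG", "TTG")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def first_inframe_stop(seq, start):
    """Return the index of the first stop codon in frame with `start`, or None."""
    # Step three bases at a time so only codons in the same frame are checked.
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def orfs_by_start(seq):
    """Map each start-codon position to its first in-frame stop position."""
    # The lookahead keeps overlapping start codons, as in the question's regex.
    pattern = "(?=(" + "|".join(START_CODONS) + "))"
    return {m.start(): first_inframe_stop(seq, m.start())
            for m in re.finditer(pattern, seq)}


# The TGA at index 3 is out of frame with the ATG at index 2 and is skipped.
print(orfs_by_start("CCATGAAATAGCC"))  # {2: 8}
```

Because each start position is handled independently, overlapping and out-of-frame starts each get their own stop, and a start with no downstream in-frame stop maps to None. To cover the reverse strand you would run the same function on the reverse complement and convert the coordinates back.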