I have Sanger sequences that I trimmed by running BLAST against reference sequences and using the query start/end positions. Then, using translate from Biopython, I translate each sequence in the three 5'->3' frames (the sequences are reverse-complemented before trimming), using a snippet similar to this one:
> print("In frame")
> print(record.seq.translate())
> print("Offset by one")
> print(record.seq[1:].translate())
> print("Offset by two")
> print(record.seq[2:].translate())
Then, to get the correct translation (with sequencing errors it's not always clear), I take the translation that has the fewest "X" and "*" characters. However, that doesn't always work: sometimes the translation I need has one more X than another frame. In the example below, I need the in-frame translation, but offset two is chosen because its sum of "X" and "*" is 28, versus 29 in the in-frame translation.
CAGGNGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGAGCCTGTCCCTCACCTGCNNNGTCTNTGGTGGGTCNNTCAGNGGGTANTACNGGAGCNGGATCCNCCNCNCCCCNCNNAANAGGGGGGGAGTGGNNNGGGGGAATTCNNNNNNNTGGGNAGGNNCCACTACAACCCNNNCCCCCTAGAAGACNAGANNGGNNNANGTNNNTGAAACCNNNNNTNNNTTCTTCCTCCTCCTGGTGGCAGCTCCCAGATGGGTCCTGTCCCAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGACCGGAGCAGCAGCTGCCTACCGCCTTTGGCTACTGGGGCCAGGGAACCCTAGTCACCGTCTCCTCA
in frame: QXQLQQWGAGLLKPSESLSLTCXVXGGSXXGXYXSXIXXXPXXRGGVXXGNSXXWXGXTTTXXP*KTRXXXVXETXXXFFLLLVAAPRWVLSQVQLQQWGAGLLKPSETLSLTCAVYGGSFSGYYWSWIRQPPGKGLEWIGEINHSGSTNYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARPEQQLPTAFGYWGQGTLVTVSS
offset one: RXSYSSGAQDC*SLRRACPSPAXSXVGXSXGXTGAGSXXPXXXGGEWXGGIXXXGXXPLQPXPPRRXXXXXXXKPXXXSSSSWWQLPDGSCPRCSYSSGAQDC*SLRRPCPSPALSMVGPSVVTTGAGSASPQGRGWSGLGKSIIVEAPTTTRPSRVESPYQ*TRPRTSSP*S*AL*PPRTRLCITVRDRSSSCLPPLATGAREP*SPSP
offset two: GAATAVGRRTVEAFGEPVPHLXXLWWVXQXVXXEXDPPXPXXXGGSGXGEFXXXGRXHYNPXPLEDXXGXXX*NXXXXLPPPGGSSQMGPVPGAATAVGRRTVEAFGDPVPHLRCLWWVLQWLLLELDPPAPREGAGVDWGNQS*WKHQLQPVPQESSHHISRHVQEPVLPEAELCDRRGHGCVLLCETGAAAAYRLWLLGPGNPSHRLL
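For reference, the selection rule described above (fewest "X" plus "*") can be sketched roughly as follows. This is a minimal sketch rather than the actual script: the names count_ambiguous and pick_frame are hypothetical, and it assumes the three translations are already available as plain strings keyed by a frame label.

```python
def count_ambiguous(protein):
    """Count ambiguous residues ("X") and stop codons ("*") in a translation."""
    return protein.count("X") + protein.count("*")


def pick_frame(translations):
    """Return the label of the translation with the fewest "X" and "*".

    translations: dict mapping a frame label (hypothetical names such as
    "in frame", "offset one", "offset two") to the translated protein string.
    """
    return min(translations, key=lambda label: count_ambiguous(translations[label]))
```

With Biopython, the dictionary could be filled with something like str(record.seq[offset:].translate()) for each offset of 0, 1 and 2. Because the rule only looks at the total count, a frame that is off by a single "X" loses, which is exactly the failure shown in the example.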
This is all part of a much larger Python script, so I'm just wondering if anyone has suggestions for choosing the correct frame. I am not looking for an ORF here; there will be no start or stop codon.
I know one thing I could do is run an additional BLAST on each frame and take the one with the highest % identity or the best (lowest) e-value. However, I wanted to avoid another BLAST if I could help it.
I can't help but wonder what you are trying to achieve, which simultaneously makes it rather difficult to take a stab at providing an answer.
edited to show example.
So in your example (perhaps I misunderstood), how do you know you need the in-frame translation when it has more "X" and "*"?
The correct sequence is an antibody heavy chain V region, which is the in-frame translation. More often than not the sequencing is good. I use Biopython to extract the FASTA from the Sanger file (ab1), so I can't read the chromatogram to correct any errors prior to translation when a read has a lot of errors, as in this example. Downstream, errors can be corrected with the IMGT germline sequence, but not if the correct translation isn't chosen.
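Since the errors in a read like this tend to be clustered while the rest of the sequence is clean, one possible alternative to counting every "X" and "*" is to score each frame by its longest uninterrupted stretch without either character. This is only a sketch of that idea, not something taken from the script above and not tested on real data; the names longest_clean_run and pick_frame_by_clean_run are hypothetical.

```python
import re


def longest_clean_run(protein):
    """Length of the longest stretch containing neither "X" nor "*"."""
    # Splitting on X or * leaves only the clean stretches between them.
    return max(len(run) for run in re.split(r"[X*]", protein))


def pick_frame_by_clean_run(translations):
    """Return the label whose translation has the longest clean stretch.

    translations: dict mapping a frame label to a translated protein string.
    """
    return max(translations, key=lambda label: longest_clean_run(translations[label]))
```

A frame with a few extra "X" in a noisy region can still win under this score as long as the good part of the read translates cleanly. Whether that holds for a given data set is an assumption that would need checking against reads where the correct frame is known.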