Hello, I hope this isnt the wrong website to post this, I'm new to this and I got pretty lost
I have to write a function where there is a protein sequence that needs all possible combinations of tRNAs that it can be converted to, as long as they dont exceed 5000 combinations (e.g. for the sequence “CH” the script’s output would be “UGUCAU”, “UGUCAC”, “UGCCAU”, and “UGCCAC”, a.k.a 4 combinations). And as long as the sequence has an identifier.
I generated this code, but It doesn't seem to be working, I'm very new to python, so it probably has a lot of mistakes :
from Bio import SeqIO
from Bio import SeqUtils
import itertools
def all_combinations (file_name, sequence_ID):
seq_records = SeqIO.parse(file_name, 'fasta')
sequence_ID = record.id
for record in seq_records:
record = [record.reverse_complement(id="rc_"+record.id, description = "reverse complement")\
for record in seq_records if len(record)<5000]
for L in range(0, len(record)+1):
subset = itertools.combinations(record, L)
if record.id in seq_record:
print (subset)
else:
print ('error: sequence doesnt have an indentifier')
Im sorry if this is messy or wrong, this is my first time posting anywhere.
Thank you very much for anyone who can help.
The function is not indented, is it like this in the original script?
There is no original script, i started it on my own based on whats required, what do you mean by the function is not indented
I mean that in python you should indent each clause, e.g.:
what about now ? I tried to edit it a little.
The bottom six lines don't look correctly indented.
What Asaf was trying to tell you: in Python the indentation of code blocks is an essential part of the control flow. If you don't indent correctly, the code might work but would potentially do wrong things. In other programming languages this is often dealt with by delimiting code blocks with
{
and}
.I assume you mean codons (or anticodons) instead of tRNAs?
Concerning your code: I'm sorry, but it doesn't seem to make sense. Can you explain what your approach to the problem is and what you think your code is doing?
Oh I'm really sorry, ok... we need to do a reverse function, that goes from the protein amino acids to the tRNA that composed them, therefore we are supposed to have different combinations, like in the example i wrote, so the code takes a fasta file that contains a protein sequence, and the sequence identifier, and its supposed to write all the possible RNA combinations for the sequence in the fasta file, with a condition that the total of combinations should be less than 5000, then it should give the number of the combinations without actually printing them, like for the example it should only write 4.
I'm really thankful, and sorry to pull this, I know It's probably lame, I just thought i could find some help if i can
EDIT: I read your question more thoroughly, and it looks like you're indeed looking for anti-codons (unless your use of reverse complement is by mistake) and you have a cap on the combinations, which is good. Just a heads up: You'll reach 5000 sequences in just about 8 AAs, assuming each AA can be coded by 3 codons.
You're looking for the codons (mRNA triplet sequences), not the "tRNA sequences". The transfer RNA has the anticodon sequence that binds to the mRNA codon sequence - each tRNA brings a unique amino acid that becomes part of the polypeptide chain being assembled.You'd benefit from having a dictionary where for each amino acid key, you have an array of possible codons. This would mean that there would be a geometric explosion of DNA sequences for any polypeptide sequence as the number of amino acids increases. Say, each AA has 3 codons on average, a 2-aa input will give you 9 different sequences. You will end up with 3n output sequences wheren
is the length of the input sequence. For a 20 AA peptide, you'd have 3 billion output sequences. Exercise caution - I don't see how you can store these rapidly mutating strings in memory without abusing RAM.No worries, I just asked for the purpose of understanding how to better help you with this. Have a look at _r_am's answer below (even if it is struck through).
Build yourself said dictionary, e.g.
and then use this information to calculate the number of possible combinations for each protein sequence. For each protein with < 5000 combinations use the very same dictionary to reconstruct and print the respective sequences. There is no need to use
itertools.combinations
or similar.Oh ok, I tried that, did the dictionary, then went to write a function that turns elements of the fasta file into a list, then tried to get combinations out of it based on the dictionary, but I got an error...Im' really sorry to bother you, but can you just check the code I wrote ?
What's the error you're getting?
I'd do something like this:
Think about how you could do the
generate_all_combinations
step. I don't think you need to use anyitertools
function. Write it down in pseudocode here and then we can check how that works in Python.Edit: replaced addition with multiplication (thanks _r_am!)
I think combinations would multiply, not add.
n_combinations *= codon_table.get(aa, 0)
(if I'm understanding theget
function correctly - I don't know python).One could even map an array of the first 8 AAs to their number of codons and reduce this array with
*
. If that operation exceeds 5000, skip the sequence. If not, start at length=8 and go one by one. There should also be a cap - say 15 AAs - beyond which you can safely eliminate the sequence without even needing to calculate the number of combinations.Ouch, of course, it should be
*
, good spot, thanks!Just, fyi
codon_table.get(aa, 0)
returns0
ifaa
is not in the table instead of erroring out.Re your other suggestions -- yea, second those! Except if we have an
M
-polymer orW
-polymer with length < 5001 :)Ah, I see. Thanks for the explanation. I wonder why OP is looking for the anti-codon sequence and not the codon sequence though.
BTW I googled a bit and there exists JS code (from the Sequence Manipulation Suite) that does what OP wants and adds in a probability parameter based on codon usage statistics. Maybe OP would want to implement the logic in python: https://github.com/paulstothard/sequence_manipulation_suite/blob/master/docs/scripts/multi_rev_trans.js
Maybe it's an extra level of complexity in the task (or a misunderstanding).