I am learning Python through tutorials. I created a dictionary from a multi-FASTA file using the following code, as explained in a lecture. I can't figure out how to use the dictionary in a Python function I wrote for finding ORFs.
The code used to create the dictionary
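seqs={}
with open('dna.example.fa') as f:
    for line in f:
        line=line.rstrip()
        if not line:
            continue  # skip blank lines, where line[0] would raise IndexError
        if line[0]=='>':
            # header line: the name is the first word, without the '>'
            words=line.split()
            name=words[0][1:]
            seqs[name]=''
        else:
            # sequence line: append it to the current record
            seqs[name]=seqs[name]+line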
Code for parsing the file:
for name,seq in seqs.items():
    dna=(name,seq)
Code I wrote for finding ORFs
def orf_finder1(dna, frame=1):
    start_codon=['ATG']
    stop_codon=['TAA', 'TAG', 'TGA']
    atg_pos=[]
    stp_pos=[]
    for i in range(frame, len(dna),3):
        codon=dna[i:i+3]
        if codon in start_codon:
            atg_pos.append(i)
            x=i
            for k in range(i+3, len(dna),3):
                codons=dna[k:k+3]
                if codons in stop_codon:
                    stp_pos=[]
                    s=k+3
                    l=len(dna[x:s])
                    d=dna[x:s]
                    print(l)
                    print(d)
                    break

orf_finder1('AATGTTGACTAGCTAGCATGCAAGCTAGCTAA')
output:
15
ATGTTGACTAGCTAG
I haven't started learning Biopython yet. Could someone please help me feed the multi-FASTA file into this function? Thank you.
Thank you. But I want to use the dictionary to identify the ORFs corresponding to each FASTA header, i.e. to print the ORFs present in seqs.values for each seqs.key. I tried replacing dna in the code with seqs.values, but it throws an error. Maybe I have to identify the ORFs under each FASTA header and loop over them, but I have no clue how to do it.
Maybe I've misunderstood what you want.
The loop will go through every sequence in your dna.example.fa and apply orf_finder1 to each one. (It would be great to have this file for testing; as it is, I have to mock up my own example file.) You can check that it traverses the whole dict by adding print(name) to the for loop.
If I read your code correctly, your orf_finder1 will print only the first ORF it finds. Do you want to return all ORFs in the sequence? If so, you need to rewrite the orf_finder1 function. This has nothing to do with your input file being multi-FASTA or your data being stored in a dict.
If I may make a recommendation: do try to learn Biopython. It's not a panacea, but it will make your life simpler, especially for reading and writing files.
Also, do check http://biopython.org/DIST/docs/tutorial/Tutorial.html especially section "20.1.13 Identifying open reading frames".
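For what it's worth, here is a minimal sketch of one way to collect every ORF per FASTA header without Biopython. The names find_orfs and orfs_by_name are hypothetical, and the small seqs dictionary is made-up data standing in for the one built from dna.example.fa. It assumes frame is 1, 2 or 3 (unlike orf_finder1, which uses frame directly as a 0-based offset) and that an ORF runs from the first ATG after a stop codon to the next in-frame stop, so nested ATGs are not reported separately. The key point is that the function returns its results instead of printing them, which lets the loop store them under each name.

```python
def find_orfs(dna, frame=1):
    """Return a list of (start, end, sequence) for each ORF in one frame.

    Assumption: frame is 1, 2 or 3; start and end are 0-based slice
    indices, so dna[start:end] is the ORF including its stop codon.
    """
    stop_codons = {"TAA", "TAG", "TGA"}
    dna = dna.upper()
    orfs = []
    start = None  # index of the first ATG seen since the last stop codon
    for i in range(frame - 1, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in stop_codons:
            orfs.append((start, i + 3, dna[start:i + 3]))
            start = None
    return orfs


# Made-up stand-in for the dictionary built from dna.example.fa.
seqs = {
    "seq1": "AATGTTGACTAGCTAGCATGCAAGCTAGCTAA",
    "seq2": "ATGAAATGA",
}

# Loop over the dictionary: one list of ORFs (all three frames) per header.
orfs_by_name = {}
for name, seq in seqs.items():
    orfs_by_name[name] = [
        orf for frame in (1, 2, 3) for orf in find_orfs(seq, frame)
    ]

for name, orfs in orfs_by_name.items():
    print(name)
    for start, end, orf in orfs:
        print(" ", start, end, len(orf), orf)
```

Passing seqs.values itself to the function cannot work, because it is a method of the dictionary and not a string; the function needs one sequence at a time, which is what the loop over seqs.items() provides.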