How to feed the sequence dictionary created with multifasta file into python function
1
0
Entering edit mode
4.6 years ago
kousi31 ▴ 100

I am learning python through tutorials. I created a dictionary for multi-fasta file using the following code as explained in a lecture. I couldn't find how to use the dictionary in a python function I wrote for finding ORFs.

The code used to create the dictionary

f=open('dna.example.fa') 
seqs={} 

for line in f:
    line=line.rstrip()
    if line[0]=='>': 
        words=line.split() 
        name=words[0][1:] 
        seqs[name]=''

    else:
        seqs[name]=seqs[name]+line

Code for parsing the file,

for name,seq in seqs.items():
     dna=(name,seq)

Code I wrote for finding ORFs

    def orf_finder1(dna, frame=1):
        start_codon=['ATG']
        stop_codon=['TAA', 'TAG', 'TGA']
        atg_pos=[]
        stp_pos=[]

        for i in range(frame, len(dna),3):
            codon=dna[i:i+3]
            if codon in start_codon:
                atg_pos.append(i)
                x=i
                for k in range(i+3, len(dna),3):
                    codons=dna[k:k+3]
                    if codons in stop_codon:
                        stp_pos=[]
                        s=k+3
                        l=len(dna[x:s])
                        d=dna[x:s]         
                        print(l)
                        print(d)
                        break
orf_finder1('AATGTTGACTAGCTAGCATGCAAGCTAGCTAA')
output:
15
ATGTTGACTAGCTAG

I haven't started learning biopython yet. Kindly someone please help me to feed the multifasta file into this function. Thank you.

python multi-fasta function • 2.6k views
ADD COMMENT
0
Entering edit mode
4.6 years ago

Hello,

Your code is working, provided that your multifasta is nice. (doesn't have blank lines). I would add only seqs[name]=seqs[name]+line.strip()

so you don't have line breaks (\n) in your fasta dictionary.

To your actual question. Your function expects string as first input. So you need to provide it.

e.g: (note that we only pass seq not the dna tuple)

for name, seq in seqs.items():
    dna = (name, seq)
    print(orf_finder1(seq))
ADD COMMENT
0
Entering edit mode

Thank you. But I want to use the dictionary so as to identify the ORFs corresponding to the fasta header. i.e. to print the ORFs, present in seqs.values for each seqs.key. I tried replacing the dna in the code with seqs.values but it throws error. May be I have to identify ORFs under each fasta header and loop over. But I am not getting a clue on how to do it.

ADD REPLY
1
Entering edit mode

Maybe I've misunderstood what you want.

The loop will go through every sequence in your dna.example.fa- (it would be great to have this file for any testing - this way I have to mock up my own example file) - and it will apply the orf_finder1 to the sequence. You can check that it traverses whole dict by adding print(name) to the for loop.

If I read your code correctly, your orf_finder1 will print only the first ORF it finds.

Do you want to return all ORFs in the sequence? If so, you need to rewrite the orf_finder1 function. This has nothing to do with your input file being multifasta or your data being stored in the dict.

If I may have an recommendation - Do try to learn BioPython, It's not panacea, but it will make your life simpler, especially for file reading writing.

Also, do check http://biopython.org/DIST/docs/tutorial/Tutorial.html especially section "20.1.13 Identifying open reading frames".

ADD REPLY

Login before adding your answer.

Traffic: 1892 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6