Question

get protein sequences from genbank

1

Entering edit mode

9.3 years ago

aproposdiver ▴ 10

I have this script below and it seems to be hard coded for files in the working directory. I basically know nothing about python but I want to make this script so it can read in a many genbank files instead of just be coded for one file at a time.

from Bio import GenBank

gbk_filename = "NC_005213.gbk"
faa_filename = "NC_005213_converted.faa"

input_handle = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print "Dealing with GenBank record %s" % seq_record.id
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS" :
            assert len(seq_feature.qualifiers['translation'])==1
            output_handle.write(">%s from %s\n%s\n" % (
                   seq_feature.qualifiers['locus_tag'][0],
                   seq_record.name,
                   seq_feature.qualifiers['translation'][0]))

output_handle.close()
input_handle.close()
print "Done"

genbank biopython • 10k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by aproposdiver ▴ 10

0

Entering edit mode

i am not familiar with scripts but instead I think you can retrieve Genbank Protein GI (from http://biodbnet.abcc.ncifcrf.gov/db/db2db.php) then in Batch Entrez you can retrieve the bulk protein sequence

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Yes but the main function I am looking for is to get the sequences parsed into separate files based on the different gb accession for the overall contig.

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by aproposdiver ▴ 10

0

Entering edit mode

don't worry friend, your problem will be solved soon...

ADD REPLY • link 9.3 years ago by zizigolu ★ 4.3k

Ram · Answer 1 · 2015-07-23

4

Entering edit mode

9.3 years ago

Istvan Albert 101k

This will convert any file specified at the command line.

"""
Usage: converter.py foo.gbk

Produces a converted file: foo_converted.faa
"""
import sys, os
from Bio import GenBank, SeqIO

gbk_filename = sys.argv[1]
root_name = os.path.splitext(gbk_filename)[0]
faa_filename = root_name + "_converted.faa"

input_handle  = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print "Dealing with GenBank record %s" % seq_record.id
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS" :
            assert len(seq_feature.qualifiers['translation'])==1
            output_handle.write(">%s from %s\n%s\n" % (
                   seq_feature.qualifiers['locus_tag'][0],
                   seq_record.name,
                   seq_feature.qualifiers['translation'][0]))

output_handle.close()
input_handle.close()
print "Done"

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by Istvan Albert 101k

0

Entering edit mode

Thanks for this. I tried it but got the error:

line 3
SyntaxError: Non-ASCII character '\xe2' in file /home/kpenn/Python.scripts.dir/Split.Genbank.to.faa.std.input.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

This is the script:

#!/usr/bin/python

# """
Usage: converter.py foo.gbk

Produces a converted file: foo_converted.faa
#"""

import sys, os
from Bio import GenBank, SeqIO

gbk_filename = sys.argv[1]
root_name = os.path.splitext(gbk_filename)[0]
faa_filename = root_name + "_converted.faa"

input_handle  = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print "Dealing with GenBank record %s" % seq_record.id
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS" :
            assert len(seq_feature.qualifiers['translation'])==1
            output_handle.write(">%s from %s\n%s\n" % (
                   seq_feature.qualifiers['locus_tag'][0],
                   seq_record.name,
                   seq_feature.qualifiers['translation'][0]))

output_handle.close()
input_handle.close()
print "Done"

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.3 years ago by aproposdiver ▴ 10

0

Entering edit mode

you must have added invalid characters to the script when you've edited it - remove those

ADD REPLY • link 9.3 years ago by Istvan Albert 101k

0

Entering edit mode

By the way: Be aware of pseudogenes, those (most probably?) won't have a translation. And this would trigger the assert. But you can avoid them by: if 'pseudo' not in seq_feature.qualifiers:

ADD REPLY • link 7.7 years ago by mschmid ▴ 180

0

Entering edit mode

Do you have the corrected script @mschmid?

ADD REPLY • link 7.0 years ago by felipelira3 ▴ 40

Ram · Answer 2 · 2015-09-28

0

Entering edit mode

9.2 years ago

aproposdiver ▴ 10

I cant seem to figure out how to modify this script to extract just the coding sequence in DNA format. I am pretty sure I need to use extract but have not been able to modify this properly to get that.

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by aproposdiver ▴ 10

0

Entering edit mode

If you are willing to consider non-Python options, you can use e-utilities. I have answered a similar question to get amino acid sequences in FASTA format, given a chromosome accession.

To get protein coding, nucleotide sequences, you need to set the rettype option to fasta_cds_na

For chromosome accession NC_005213: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_015600&rettype=fasta_cds_aa&retmode=text

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Siva ★ 1.9k