Getting proteins from genomic interval of a genbank file
2
0
Entering edit mode
5.7 years ago
fhsantanna ▴ 620

I am trying to get all proteins from a genbank file considering a sequence interval. Here is the code of the script written in python:

#!/usr/bin/env python

#Script for getting proteins in a range from a genbank file
#Usage: python get_proteins.py file start:end

import sys
from Bio import SeqIO

rec = SeqIO.read(sys.argv[1], 'genbank')
feats = [feat for feat in rec.features if feat.type == "CDS"]
start, end = sys.argv[2].split(':')

desired = set(range(int(start),int(end),1))

for f in feats:
    span = set(range(f.location._start.position, f.location._end.position))
    if span & desired:
        print("%s,%s,%d:%d\n%s\n" % (
                    f.qualifiers['locus_tag'][0],
                    f.qualifiers['product'][0],
                    f.location._start.position,
                    f.location._end.position,
                    f.qualifiers["translation"][0]))

There are two problems: For some regions containing pseudogenes, printing "f.qualifiers["translation"][0]" breaks the script;

$ python get_proteins.py ../gbk/NZ_AOLM01000025.1.gbk 92686:113021 > island_proteins.faa
Traceback (most recent call last):
  File "get_proteins.py", line 24, in <module>
    f.qualifiers["translation"][0]))
KeyError: 'translation'

I know that I must use "try", but I was not successful in doing it.

The second problem is that some genbank files give the following error:

$ python get_proteins.py ../gbk/NZ_CM001555.1.gbk 530979:551299 > island_proteins.faa
Traceback (most recent call last): File "get_proteins.py", line 17, in> <module>
 span = set(range(f.location._start.position, f.location._end.position)) AttributeError: 'CompoundLocation' object
has no attribute '_start'

I did not understand this last problem.

Thank you.

genbank proteins cds biopython • 2.3k views
ADD COMMENT
0
Entering edit mode

I think I managed the problem:

#!/usr/bin/env python

#Script for getting proteins in a range from a genbank file
#Usage: python get_proteins.py file start:end


#!/usr/bin/env python

#This script is a modification of the script found in Peter Cock's site (http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/).
# Usage: python gbk2faa.py  start:end 

import sys
from Bio import GenBank
from Bio import SeqIO

input_handle  = open(sys.argv[1], "r")
output_handle = open(sys.argv[3], "w")

desired_interval = sys.argv[2].split(':')
start = int(desired_interval[0])
end = int(desired_interval[1])
#desired_interval = set(range(int(start),int(end),1))

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print("Dealing with GenBank record %s" % seq_record.id)
    for seq_feature in seq_record[start:end].features :
        try: # Without "try", it crashes when it finds a CDS without translation (pseudogene).
            if seq_feature.type=="CDS" :
                assert len(seq_feature.qualifiers['translation'])==1
                output_handle.write(">%s,%s,%s,%s\n%s\n" % (
                    seq_feature.qualifiers['locus_tag'][0],
                    seq_feature.qualifiers['product'][0],
                    seq_record.id,
                    seq_record.description,
                    seq_feature.qualifiers['translation'][0]))
                pass
        except:
            continue

output_handle.close()
input_handle.close()
print("Done")
ADD REPLY
3
Entering edit mode
5.7 years ago
Joe 21k

I've built something around that exact code before, for this problem: https://github.com/jrjhealey/bioinfo-tools/blob/master/get_spans.py

It requires that the file be a single contiguous genbank (for now). It prints the features, but it should be easy to alter the output to get fasta sequences or similar via Biopython.

ADD COMMENT
1
Entering edit mode
5.7 years ago

using xslt and then filter the TSV output with awk

:

ADD COMMENT
0
Entering edit mode

Thank you very much. But for 100,000 files this is not feasible.

ADD REPLY
0
Entering edit mode

But for 100,000 files this is not feasible.

a loop ?

ADD REPLY
0
Entering edit mode

Sorry, you are right. I did not pay attention. Maybe your solution is faster than mine. I'll test it. Thank you again.

ADD REPLY
0
Entering edit mode

Could you provide your awk code, please?

ADD REPLY
1
Entering edit mode

something like (not tested):

awk -v B=123  -v E=456 '{x1=int($2);x2=int($3);b=x1<x2?x1:x2;e=x1<x2?x2:x1; if (!(b>int(E) || e<int(B))) print; }' input.tsv
ADD REPLY

Login before adding your answer.

Traffic: 1707 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6