Modify location of a GenBank feature
2
1
Entering edit mode
10.5 years ago
kavin.pl ▴ 70

I am trying to modify the location of features within a GenBank file. I know feature.type will give gene/CDS and feature.qualifiers will then give "db_xref"/"locus_tag"/"inference" etc. Is there a feature. object which will allow me to access the location (eg: [5240:7267](+)) directly?

This URL give a bit more info, though I can't figure out how to use it for my purpose... http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html#location_operator

Essentially, I want to modify the following bit of a GenBank file:

 gene            5240..7267
                 /db_xref="GeneID:887081"
                 /locus_tag="Rv0005"
                 /gene="gyrB"
 CDS             5240..7267
                 /locus_tag="Rv0005"
                 /inference="protein motif:PROSITE:PS00177"
                 ...........................

to

 gene            5357..7267
                 /db_xref="GeneID:887081"
                 /locus_tag="Rv0005"
                 /gene="gyrB"
 CDS             5357..7267
                 /locus_tag="Rv0005"
                 /inference="protein motif:PROSITE:PS00177"
                 .............................

Note the changes from 5240 to 5357

So far I have the following python script:

from Bio import SeqIO
gb_file = "mtbtomod.gb"
gb_record = SeqIO.parse(open(gb_file, "r+"), "genbank")
rvnumber = 'Rv0005'
newstart = 5357

final_features = []

for record in gb_record:
  for feature in record.features:
    if feature.type == "gene":
        if feature.qualifiers["locus_tag"][0] == rvnumber:
            if feature.location.strand == 1:
                # Amend feature location from current to 'newstart'
            else:
                # do the reverse for the complementary strand
    final_features.append(feature)
  record.features = final_features
  with open("test.gb","w") as test:
    SeqIO.write(record, test, "genbank")

Rv0005 is just an example of a locus_tag I need to update. I have about 600 more locations to update per GenBank file, and about 10-20 GenBank files to process (with more to come)

Genbank python R • 5.7k views
ADD COMMENT
1
Entering edit mode

Cross-post from http://stackoverflow.com/questions/24636588/modify-location-of-a-genbank-feature

I guess that you didn't get an answer there. But the link and explanations over there can be useful.

ADD REPLY
0
Entering edit mode

Yes indeed - although the RE.type way is not totally appropriate and I thought there may be more GenBank centric people over here...

ADD REPLY
2
Entering edit mode
10.5 years ago
Hugues ▴ 250

Where is the question? Do you get an error message?

Well I do get one: when I do something like:

features.location.start.position = newstart

I get an error:

seqfeature AttributeError can't set attribute

And it doesn't matter whether the genbank file was open read-only as you did, or in writing mode.

I suppose that you need to create a new genbank file, and copy all the attributes.

Edit 1:

Well modifying the start is possible after all.

from Bio import SeqIO
from Bio import SeqFeature
start_pos = SeqFeature.AfterPosition(newstart)
end_pos = SeqFeature.BeforePosition(feature.location.end.position)
my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
features = my_location

But that does not solve the saving problem yet.

Edit 2:

Useful links:

ADD COMMENT
1
Entering edit mode

Basically the "can't set attribute" error is the reason for my question. I don't know how to modify the locations while importing the features into a new GB file. Any ideas on how to do that?

All I can do is add a new qualifier through

feature.qualifiers["amend_position"] = "%s:%s" % (newstart, feature.location.end+1)

but that keeps the locations (ie, gene 5240..7267) the same while adding an extra qualifier (i.e., /amend_position="5357:7267")

ADD REPLY
1
Entering edit mode

"how to modify the locations" -> I've answered in my Edit 1. Please try it out and let me know.

"while importing the features into a new GB file" -> I haven't tried hard yet. Have a go at it yourself. Edit your question where you show us your attempts.

ADD REPLY
1
Entering edit mode

As easy as this...

start_pos = SeqFeature.ExactPosition(newstart)
end_pos = SeqFeature.ExactPosition(feature.location.end.position)
my_location = SeqFeature.FeatureLocation(start_pos, end_pos, feature.location.strand)
feature.location = my_location

Thanks for pointing out the Seq.Feature implementation :)

Now, on to modifying the CDS as well, and making sure this works in the complementary strand.. All this before implementing a code to look through an excel sheet and pick up the strand and newstarts.

ADD REPLY
0
Entering edit mode

Show me some love. Vote me up! I'm paid in rep. points :-)

ADD REPLY
2
Entering edit mode
10.5 years ago
kavin.pl ▴ 70

Ok, I have something which now fully works. I'll post the code in case anyone ever needs something similar

__author__ = 'Kavin'

from Bio import SeqIO
from Bio import SeqFeature
import xlrd
import sys
import re

workbook = xlrd.open_workbook(sys.argv[2])
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]

# Create dicts to store TSS data
TSS = {}
row = {}
# For each entry (row), store the startcodon and strand information
for i in range(2, sheet.nrows - 1):
    if data[i][5] < -0.7:   # Ensures BASS score is within significant range
        Gene = data[i][0]
        row['Direction'] = str(data[i][3])
        row['StartCodon'] = int(data[i][4])
        TSS[str(Gene)] = row
        row = {}
    else:
        i += 1

# Create an output filename based on input filename
outfile_init = re.search('(.*)\.(\w*)', sys.argv[1])
outfile = str(outfile_init.group(1)) + '_modified.' + str(outfile_init.group(2))

final_features = []
for record in SeqIO.parse(open(sys.argv[1], "r"), "genbank"):
    for feature in record.features:
        if feature.type == "gene" or feature.type == "CDS":
            if TSS.has_key(feature.qualifiers["locus_tag"][0]):
                newstart = TSS[feature.qualifiers["locus_tag"][0]]['StartCodon']
                if feature.location.strand == 1:
                    feature.location = SeqFeature.FeatureLocation(SeqFeature.ExactPosition(newstart - 1),
                                                                  SeqFeature.ExactPosition(
                                                                      feature.location.end.position),
                                                                  feature.location.strand)
                else:
                    feature.location = SeqFeature.FeatureLocation(
                        SeqFeature.ExactPosition(feature.location.start.position),
                        SeqFeature.ExactPosition(newstart), feature.location.strand)
        final_features.append(feature)  # Append final features
    record.features = final_features
    with open(outfile, "w") as new_gb:
        SeqIO.write(record, new_gb, "genbank")

This assumes usage such as python program.py <genbankfile> <excel spreadsheet>

This also assumes a spreadsheet of the following format:

Table S3. TSS mapping and re-annotation of start codons.
Gene       Synonym     Tuberculist annotated start     Orientation     Re-annotated start*     BASS score*
Rv0005     gyrB        5240                            +               5357                    -1.782
Rv0012     Rv0012      14089                           +               14134                   -1.553
Rv0018c    pstP        23181                           -               23172                   -2.077
Rv0032     bioF2       34295                           +               34307                   -0.842
Rv0037c    Rv0037c     41202                           -               41163                   -0.554
Rv0038     Rv0038      41304                           +               41307                   -0.581

Edit:

I embellished the code a touch

ADD COMMENT
0
Entering edit mode

Thanks for posting your code / solution, I'm sure it can help others.

ADD REPLY

Login before adding your answer.

Traffic: 1039 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6