Question

Getting GFF annotation from sequence coordinates.

0

Entering edit mode

5.9 years ago

jsporter ▴ 60

I am looking for a way to automate with Python looking up sequence annotations from a GFF file (or a GBFF file) from genomes stored by NCBI.

I have a fasta file with kmers pulled from an NCBI genome. These kmers have coordinates such as the accession id, the contig id, and the start and end of the kmer. From that information, I would like a list of all annotations that cover or include that sequence as found in the GFF file. I would like to do it in Python.

I believe that BioPython has an experimental GFF file parser, but I do not know how to use it. It seems like there should be an easy way to do what I want with Python without having to write my own tool.

GFF NCBI gene annotation Python • 5.3k views

ADD COMMENT • link 5.9 years ago by jsporter ▴ 60

0

Entering edit mode

way to do what I want with Python ... without having to write my own tool

If a tool already exists, and you don't want to write one, why would you care that it was written in python? These 2 statements are kind of mutually exclusive - you either do or do not want to get your hands dirty coding something up in python (which for the record is absolutely what I'd recommend doing).

That aside, am I understanding correctly that you have a list of kmers from genome X and a GFF of genome X's annotations, and you want to 'intersect' the 2? Or you want to find all features which contain these kmers from many genomes in NCBI?

ADD REPLY • link 5.9 years ago by Joe 21k

0

Entering edit mode

I am writing code to analyze the data in Python, so a tool in Python is desirable. This is not mutually exclusive with writing my own GFF parser/extractor tool since such a thing could exist. Why would I want to make my own if such a thing exists already?

I have specific DNA sequences with coordinates from genome X, and I want to see how they would be annotated by the GFF file from genome X. So, I want to know what the annotation is for a given locus from genome X.

ADD REPLY • link 5.9 years ago by jsporter ▴ 60

0

Entering edit mode

You're looking for a library/package, not a tool. A tool is standalone and does not usually allow for direct integration into a python script, whereas a package is something you can use to make otherwise complicated processes (such as parsing specific file formats) easier.

ADD REPLY • link 5.9 years ago by Ram 44k

0

Entering edit mode

A package could generically be called a tool. Although a stand-alone program that does what I want would be useful as well.

ADD REPLY • link 5.9 years ago by jsporter ▴ 60

0

Entering edit mode

BioPython doesn't actually have a GFF parser as such, but it will incorporate Brad Chapman's BCBio eventually AFAIK.

Nevertheless, BCBio was written to interface directly with BioPython's object model, so you use the parser as you would any other BioPython parser in effect.

If you already have the coordinates of your kmers, its easy to probe a sequence for whether or not that coordinate resides in a feature. The code in this post should help.

I don't have time to write anything at the moment, but essentially this should boil down to a case of the following approach (pseudo-code):

# Parse the file in as you would with BioPython.
# You should end up with SeqRecords for each sequence, which will automatically create features accordingly

records = BCBIO.parse('myfile.gff', 'gff')
for record in records:
    features = [feature for feature in record.features]

# Loop all your features/kmers (whichever way round), and test whether the 2 genomic coordinate ranges overlap
for coord in kmerlist:
    for feature in features:
        if coord in feature.location():
            print(feature)

Something along those lines anyway just off the top of my head. Not sure how big your data is but they could be 2 fairly significant nested loops, so this might not be fast. There could be a better way, but conceptually at least, that's how I might do it.

ADD REPLY • link 5.9 years ago by Joe 21k

score 1 · Accepted Answer · 2019-01-10

Thanks for the help. The documentation for BCBio recommended gffutils, and I was able to do what I wanted with gffutils. It creates a database of the gff file for efficiency, and it can handle some different specifications of gff files.

import gffutils
gff_in = "filename.gff"
gff_out = "filename.db"
#Specifying the id_spec was necessary for gff files from NCBI.
db = gffutils.create_db(gff_in, gff_out, id_spec={'gene': 'db_xref'})
features = db.region(seqid="TheSeqIDFromTheFastaFile", start=11877, end=11974)
for f in features:
      print(f.featuretype)