I am looking for a way to automate with Python looking up sequence annotations from a GFF file (or a GBFF file) from genomes stored by NCBI.
I have a fasta file with kmers pulled from an NCBI genome. These kmers have coordinates such as the accession id, the contig id, and the start and end of the kmer. From that information, I would like a list of all annotations that cover or include that sequence as found in the GFF file. I would like to do it in Python.
I believe that BioPython has an experimental GFF file parser, but I do not know how to use it. It seems like there should be an easy way to do what I want with Python without having to write my own tool.
way to do what I want with Python
...without having to write my own tool
If a tool already exists, and you don't want to write one, why would you care that it was written in python? These 2 statements are kind of mutually exclusive - you either do or do not want to get your hands dirty coding something up in python (which for the record is absolutely what I'd recommend doing).
That aside, am I understanding correctly that you have a list of kmers from
genome X
and a GFF ofgenome X
's annotations, and you want to 'intersect' the 2? Or you want to find all features which contain these kmers from many genomes in NCBI?I am writing code to analyze the data in Python, so a tool in Python is desirable. This is not mutually exclusive with writing my own GFF parser/extractor tool since such a thing could exist. Why would I want to make my own if such a thing exists already?
I have specific DNA sequences with coordinates from genome X, and I want to see how they would be annotated by the GFF file from genome X. So, I want to know what the annotation is for a given locus from genome X.
You're looking for a library/package, not a tool. A tool is standalone and does not usually allow for direct integration into a python script, whereas a package is something you can use to make otherwise complicated processes (such as parsing specific file formats) easier.
A package could generically be called a tool. Although a stand-alone program that does what I want would be useful as well.
BioPython doesn't actually have a GFF parser as such, but it will incorporate Brad Chapman's
BCBio
eventually AFAIK.Nevertheless,
BCBio
was written to interface directly with BioPython's object model, so you use the parser as you would any other BioPython parser in effect.If you already have the coordinates of your kmers, its easy to probe a sequence for whether or not that coordinate resides in a feature. The code in this post should help.
I don't have time to write anything at the moment, but essentially this should boil down to a case of the following approach (pseudo-code):
Something along those lines anyway just off the top of my head. Not sure how big your data is but they could be 2 fairly significant nested loops, so this might not be fast. There could be a better way, but conceptually at least, that's how I might do it.