Python: Finding Gene Closest To A Given Location
3
2
Entering edit mode
12.2 years ago
bbio ▴ 90

What is the preferred way to, given locations in a genome, find the first genes that are upstream or downstream of these locations?

I have been looking into processing a GFF file using BioPython, but it is taking an insanely long time (2h+) to parse a file other tools can parse in seconds. I also considered using Ensembl, but there also doesn't seem to be a good API for Python there.

So what would be my best course of action here? Use MySqlDb to hook straight into the Ensembl database? Parse the GFF file myself?

python annotation gff parsing ensembl • 9.7k views
ADD COMMENT
7
Entering edit mode
12.2 years ago

You can use bedtools closestBed

closestBed 

Summary: For each feature in A, finds the closest 
     feature (upstream or downstream) in B.

Usage:   closestBed [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>

See also:

How to map a SNP to a gene around +/- 60KB ?

ADD COMMENT
0
Entering edit mode

That seems to do exactly what I want, thank you! So I guess the best way to do this in Python is to not use Python...

ADD REPLY
0
Entering edit mode

there is a implementation of bedtools incase you need to use a a part of a script. http://packages.python.org/pybedtools/

ADD REPLY
3
Entering edit mode
12.1 years ago
brentp 24k

Since noone has mentioned it, you can use BEDTools and python (your cake and eat it too) with pybedtools

With that, the code would be something like:

import sys
from bedtools import BedTool

locs_file = sys.argv[1]
genes_file = sys.argv[2]

for nearest in BedTool(locs_file).closest(genes_file):
    print nearest
ADD COMMENT
1
Entering edit mode

Thank you for the suggestion. Do you know how well this performs compared to the command-line bedtools?

ADD REPLY
0
Entering edit mode

BedTool(locs_file).closest(genes_file) will be virtually identical in performance to the command-line. Looping over the result will incur some overhead because it is converting each line into a python object. But generally, speed is very good, most of the performance-critical parts of pybedtools are written in C.

ADD REPLY
2
Entering edit mode
12.2 years ago
DG 7.3k

Pierre's bedtools approach is probably the best. If you are set on developing you're own in-house application with Python I would recommend using bx.python and IntervalTrees. You can parse a feature list (such as a gene list with coordinates and chromosome) in to IntervalTrees (one per chromosome) and look for the closest intervals to your position(s) of interest. However bedtools will work much quicker and be easier "out-of-the-box".

But learning IntervalTrees can be quite handy for dealing with genomic data in general, especially with Python, so it may be something worth looking in to as a learning exercise.

ADD COMMENT
1
Entering edit mode

Yeah intervals trees are a good way to go. I posted some code on a python implementation recently: http://blog.nextgenetics.net/?e=45

ADD REPLY
0
Entering edit mode

Looks good. There is the implementation in bx.python, which includes a cython implementation already, I highly recommend checking it out as it is a pretty robust library

ADD REPLY
0
Entering edit mode

I think I will go with the easy solution above for now, but IntervalTrees are definitely something I will have to play around with in the future, so thank you for the suggestion!

ADD REPLY
0
Entering edit mode

I agree, I think the bedtools solution is optimal for your current needs. I am using IntervalTrees pretty heavily though for downstream data analysis and additional annotations that I am doing as part of my pipeline.

ADD REPLY

Login before adding your answer.

Traffic: 2522 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6