Question

Find Nearest Gene Upstream Using Mysql And Perl Program

3

Entering edit mode

11.8 years ago

anon111 ▴ 40

There is a table of genes in mysql. I have a list of reg. elements, their start position on a chromosome, and their end position on a chromosome. How do you write a mysql command for perl program to search for the gene (from a table of genes downloaded in mysql) nearest each element. How do you make the perl program run faster since there are many elements for which you have to search for the nearest gene?

EDIT*

I am new to mysql and perl and have little programming background. Could you explain a little more? I must use a table of genes (that has over a 150K genes) that I downloaded into a mysql database. So i have a regulatory element and its start position, end position, and chromosome number. I have a table of genes. how do i match the gene closest upstream or downstream to this reg. element most easily? i will need to include a subroutine for finding the closest gene too. The thing I'm stuck on is how to find the closest gene.

• 5.6k views

ADD COMMENT • link updated 11.7 years ago by Istvan Albert 102k • written 11.8 years ago by anon111 ▴ 40

0

Entering edit mode

to the OP please don't delete this post as it contains a few outstanding contributions

ADD REPLY • link 11.7 years ago by Istvan Albert 102k

score 6 · Answer 1 · 2013-02-27

If you want to consider an alternative approach:

Export the gene table from MySQL to a UCSC-formatted BED file (you'll want at least four tab-delimited columns: chromosome, start, stop and gene name)
Export your regulatory elements to another UCSC-formatted BED file (you'll want at least three columns: chromosome, start and stop)
Both BED files should first be sorted, _e.g._ with BEDOPS sort-bed or alternatives:

$ sort-bed rawRegulatoryElements.bed > regulatoryElements.bed

$ sort-bed rawGenes.bed > genes.bed

Sorting allows other BEDOPS tools (like the one I'm about to mention shortly) to operate and return results very quickly.
Run BEDOPS closest-features on the regulatory elements and gene table BED files, _e.g._:

$ closest-features --delim '\t' --shortest regulatoryElements.bed genes.bed > answer.bed

This command reports the nearest gene upstream or downstream to each regulatory element. Results are stored in a file called answer.bed, which is also BED-formatted.

score 4 · Answer 2 · 2013-02-27

I spent a lot of time digging down that rabbit hole. If you want to follow, you can use something like this:

SELECT * from (
 (SELECT * FROM ((SELECT * FROM feature WHERE start<= ? ORDER BY start DESC  LIMIT 1) UNION (SELECT * FROM feature where start>= ?
ORDER BY start LIMIT 1)) as u)
 UNION
 (SELECT * FROM ((SELECT * FROM feature where stop<= ? ORDER BY stop   DESC  LIMIT 1) UNION (SELECT * FROM feature where stop>= ?
ORDER BY stop LIMIT 1)) as v)
) as w
ORDER BY ABS((start + stop)/2 - ?) LIMIT 1

To get the nearest neighbor using indexes (if your table has btree indexes on start and end). But, unless you have only a few intervals, it's simpler and faster to read the database features into some kind of an interval tree and query from there.

score 2 · Answer 3 · 2013-02-27

You would need to "nail down" in the SQL query the coordinate and say that you want to get all the entities with a position above or below the specified one, then you need to LIMIT the set which SQL returns to 1 entity, something like:

select * from `gene` where `start` >10000 order by `start` limit 1;

If you need to see which gene is closer, upstream or downstream, - do it in Perl. Don't worry about performance in Perl, MySQL is pretty smart to figure out that stuff for you. But, I would recommend to index your table by the gene start/end positions.

p.s editing - forgot to put order by....

score 2 · Answer 4 · 2013-02-27

2

Entering edit mode

11.8 years ago

Sean Davis 27k

If you want to use mysql, consider using either spatial indexing or hack indexes using code such as that used by UCSC and tabix. See the last section of the SAM format document for details on creating and using a simple indexing scheme. In UCSC tables, you will often see a column with the name "bin"; this is another implementation of the same scheme.

ADD COMMENT • link 11.8 years ago by Sean Davis 27k

0

Entering edit mode

well, could you tell us, why not to use mysql?

ADD REPLY • link 11.8 years ago by Pavel Senin ★ 1.9k

1

Entering edit mode

I should have been less directive. I'd suggest benchmarking the actual application using other approaches. Edited to reflect that change.

ADD REPLY • link 11.8 years ago by Sean Davis 27k

0

Entering edit mode

because, there are a ton of edge-cases. in your query below, what if you have 10 genes with start == 10000? did you remember to select by chromosome? what if you need to consider strand?

ADD REPLY • link 11.8 years ago by brentp 24k

0

Entering edit mode

I agree that it takes some time to go through hoops, like the one you say about chromosome ;), however SQL was specifically designed to retrieve subsets of data entities in meaningful way in a very fast manner, with a great degree of flexibility, moreover, it integrates with any modern language. I would not exclude SQL from workflow just because query gets long - it stays Structured.

ADD REPLY • link 11.8 years ago by Pavel Senin ★ 1.9k

0

Entering edit mode

To answer your question, performance is likely to be an issue compared to the other toolsets I mentioned. However, performance is always a tricky thing to pin down which is why I suggest benchmarking.

ADD REPLY • link 11.8 years ago by Sean Davis 27k

score 1 · Answer 5 · 2013-02-27

1

Entering edit mode

11.8 years ago

Sean Davis 27k

Take a look at the ChiPpeakAnno Bioconductor package here: http://www.bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html. The second use case in the vignette is on finding the nearest gene to a feature.

ADD COMMENT • link 11.8 years ago by Sean Davis 27k

score 1 · Answer 6 · 2013-02-28

Since Sean is breaking out the R package :). There's also a python package that can do this: https://pypi.python.org/pypi/cruzdb

Syntax would be something like:

hg19 = Genome('hg19')
for chrom, start, end in list_of_intervals:
    print hg19.knearest('refGene', chrom, start, end, k=2)

or if you want upstream only, replace "knearest" with "upstream".