Question

Fast Method To Detect Whether Snp Lies In Sequence Feature

4

Entering edit mode

13.6 years ago

Pi ▴ 520

Greetings

I am currently using a relational database to store snps and genomic sequence features. I am trying to find out whether snps reside in a sequence feature and I am finding it very slow even with judicious application of indexes and batch updates in a transaction. For example this query (which is pseudocode and just an abstraction of what i am doing for convenience)

update snp, exon set snp.isexonic = true where snp pos >= exon.start and snp.pos <=exon.end

is extremely slow even with indexes on all of the start and end positions and the updates performed in a single commit so that indexes aren't updated until the end (although there aren't indexes on the updated field). I found it much faster to loop around every entry in the exon table and get the start and end position for each exon and then do an update

  for each exon
          update snp set isexonic = true where snp.pos >= ? and snp.pos <=?

where ? are the placeholders for the positions of each exon in turn. But even this is painfully slow due to the volume of data. I was wondering if there were other techniques that I am not aware of for working with this type of data. I asked a perl programmer whether it would be quicker to extract the data into perl structures, perform the comparisons and then update the database but they suggested that was overkill and I should stick with the database.

Are there any database techniques that I could use to speed this up? Is there a column type that might help. I presume this must be a common problem in bioinformatics so I thought I would ask the esteemed members of this forum for guidance.

I've seen the binning scheme used at UCSC but do not know enough about it to know whether this is an option I could employ on my own local data or whether that too is overkill

thank you

database • 4.7k views

ADD COMMENT • link updated 13.6 years ago by Michael 55k • written 13.6 years ago by Pi ▴ 520

1

Entering edit mode

See page 1003 and Figure 7 of: Kent et al. The human genome browser at UCSC. Genome Res (2002) vol. 12 (6) pp. 996-1006 There's even an example query. In short, you would have to loop through all of the SNPs in each 128kb bin to detect overlaps. The idea is that while there may be many false positives in that bin, it will be far less than for the whole genome. All in all, databases are not well-suited to this problem.

ADD REPLY • link 13.6 years ago by Aaronquinlan 12k

0

Entering edit mode

hi. it was that figure that confused me because the smallest bin in the diagram is 1bp :) If the paper is just saying that once you have found your bin your have to loop through all the features in that bin (rather than the whole genome) then I understand the process. I thought I was missing something more profound than that.

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

i've removed the edit as it came out a bit big!

ADD REPLY • link 13.6 years ago by Pi ▴ 520

score 7 · Answer 1 · 2011-04-29

7

Entering edit mode

13.6 years ago

Istvan Albert 102k

I think your best approach is to compute the necessary flags outside of the database with bedtools or some other tool/algorithm (like tabix) designed to deal with intervals (but not Perl or Python unless you have the interval specific code written in C). It is very easy to dump the data into GFF or BED, compute the overlaps and generate the sql necessary to update your database.

I think this will save you a lot of time and aggravation.

ADD COMMENT • link 13.6 years ago by Istvan Albert 102k

2

Entering edit mode

This is correct, a RDBMS is not the right tool to support inteval-overlapping. In addition we should mention (again) IRanges in Bioconductor, which is imho the most advanced and easy to use tool for this task.

ADD REPLY • link 13.6 years ago by Michael 55k

0

Entering edit mode

thanks a lot. I am just looking into this. If i wanted to look to see if I had any co-located snps, is it correct to say I could look to see if they had the same bin because a snp is only a single base pair and colocated snps would reside in the same bin which would be the lowest level (smallest) bin?

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

thanks a lot. I am just looking into this. If i wanted to look to see if I had any co-located snps, is it correct to say that the smallest bin/interval level would have to be 1bp. I could look to see if they had the same bin because a snp is only a single base pair and colocated snps would reside in the same bin which would be the lowest level. Are there guides for chosing the interval size because the ucsc paper says their smallest bin was 128kb. This seems quite large

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

thanks a lot. I am just looking into this. If i wanted to look to see if I had any co-located snps, is it correct to say that the smallest bin/interval level would have to be 1bp. I could look to see if they had the same bin because a snp is only a single base pair and colocated snps would reside in the same bin which would be the lowest level. Are there guides for chosing the interval size because the ucsc paper says their smallest bin was 128kb. This seems quite large and wouldn't let you find colocated snps would it?

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

Also, how do you recommend calculating the bins/intervals needed to query/retrieve data from the databse? Do i need to use tabix to calculate the intervals for retrieval? I'm also interested as to why you say don't use perl or python to deal with the intervals. Is it because it would be too slow? Would perl/python be ok to calculate 'one-off' intervals needed for a retrieval query or should they be avoided altogether

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

I think i might have confused the issue with the 'isexonic' flag in the question. That was an abstraction for this question. Most of the time I will not be setting flags like this (though sometimes i do) but querying my database to retrieve overlapping features such as colocated snps or snps in exons. In that case i need to give each feature a bin or interval when it is added to the database and then calcuate bins for queries. I think tabix and bedtools will calcuate the interval/bin for a feature but not work out the bin needed to query a database? If that is correct what would you recommend

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

I should also add I've never used R or Bioconductor so that woudl not be the first choice for me given my current skill set

ADD REPLY • link 13.6 years ago by Pi ▴ 520

Ram · Answer 2 · 2011-04-28

3

Entering edit mode

13.6 years ago

Pierre Lindenbaum 164k

Yes you can use a 'binning' algorithm to speed up this kind of query. This kind of algorithm is used in most tables of the ucsc. See:

http://genome.cshlp.org/content/12/6/996.full "The Human Genome Browser at UCSC"
my blog http://plindenbaum.blogspot.com/2010/05/binning-genome.html
http://genomewiki.ucsc.edu/index.php/Bin_indexing_system

Other suggestions:

have you created some indexes in your database ?
Can you tell your SQL engine that your data are already sorted ?
Can you first create a temporary table (no transaction) containing the new data and then later, update your whole database (transaction).
(...)

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

how would i tell a mysql engine that the rows in the table are sorted. does that mean it can scan the index from the point it last looked rather than going from the beginning? have you ever used the binning scheme yourself? Thank you

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

how would i tell a mysql engine that the rows in the table are sorted. does that mean it can scan the index from the point it last looked rather than going from the beginning? Thank you

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

i've read your blog and seen the function that you create to find the appropriate bins to search. I've also read the ucsc paper but i dont understand how i would create the bins on my own database tables and assign a feature to a bin. I Don't supposed you have a blog about that :o)

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

do i just use the java function to assign the genomic feature to a bin and then store the bin in a database field bin?

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

i've seen your blog. great job! do i just use the java function to assign the genomic feature to a bin and then store the bin in a database field bin?

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

it is not standard, but I think there are some SQL engines optimizing their queries if they 'know' that the data are sorted.

ADD REPLY • link 13.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Yes, you can use the java function or the original Kent's code.

ADD REPLY • link 13.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

have you tried in on a database in practice yourself. It seems straightforward (now you have written the function!!!)

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

looking at your blog again, did you do what the commentator suggested and compared the bin to a concatenated index on the start and end fields?

ADD REPLY • link 13.6 years ago by Pi ▴ 520

0

Entering edit mode

no, I didn't .

ADD REPLY • link 13.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Kent has compared that in the 2002 paper. PostgreSQL can build R-tree index as I remember.

ADD REPLY • link 13.6 years ago by lh3 33k

Ram · Answer 3 · 2011-04-29

The question for fast interval overlapping has been asked several times here,e.g. : What Is The Quickest Algorithm For Range Overlap? A SNP is an interval of width 1. Why does it take so long with an RDBMS? Because you are using the naive algorithm which requires 2 * # SNPs * # of Genes comparisons, while with a more efficient algorithm, a lot of comparisons turn out to be redundant. I recommend to not try to attempt this in SQL but use an external implementation as mentioned in the previous question.