Greetings
I am currently using a relational database to store snps and genomic sequence features. I am trying to find out whether snps reside in a sequence feature and I am finding it very slow even with judicious application of indexes and batch updates in a transaction. For example this query (which is pseudocode and just an abstraction of what i am doing for convenience)
update snp, exon set snp.isexonic = true where snp pos >= exon.start and snp.pos <=exon.end
is extremely slow even with indexes on all of the start and end positions and the updates performed in a single commit so that indexes aren't updated until the end (although there aren't indexes on the updated field). I found it much faster to loop around every entry in the exon table and get the start and end position for each exon and then do an update
for each exon
update snp set isexonic = true where snp.pos >= ? and snp.pos <=?
where ? are the placeholders for the positions of each exon in turn. But even this is painfully slow due to the volume of data. I was wondering if there were other techniques that I am not aware of for working with this type of data. I asked a perl programmer whether it would be quicker to extract the data into perl structures, perform the comparisons and then update the database but they suggested that was overkill and I should stick with the database.
Are there any database techniques that I could use to speed this up? Is there a column type that might help. I presume this must be a common problem in bioinformatics so I thought I would ask the esteemed members of this forum for guidance.
I've seen the binning scheme used at UCSC but do not know enough about it to know whether this is an option I could employ on my own local data or whether that too is overkill
thank you
See page 1003 and Figure 7 of: Kent et al. The human genome browser at UCSC. Genome Res (2002) vol. 12 (6) pp. 996-1006 There's even an example query. In short, you would have to loop through all of the SNPs in each 128kb bin to detect overlaps. The idea is that while there may be many false positives in that bin, it will be far less than for the whole genome. All in all, databases are not well-suited to this problem.
hi. it was that figure that confused me because the smallest bin in the diagram is 1bp :) If the paper is just saying that once you have found your bin your have to loop through all the features in that bin (rather than the whole genome) then I understand the process. I thought I was missing something more profound than that.
i've removed the edit as it came out a bit big!