I was directed here from another question. I posted an answer because the top voted answer, while correct, is very inefficient. As BioStar is a professional Q&A site, I think we should get this straight for ourselves and for other users connecting to the UCSC MySQL server.
If we check the UCSC table schemas, most of tables do not have chromStart and chromeEnd indexed, which means querying on these such columns naively will incur unnecessary data loading and thus discouraged. For overlapping queries, UCSC uses the mystic `bin' field, which is explained in the UCSC paper, the SAM spec and my tabix paper. Due to the use of this strategy, most of table joining and naive SQL are inefficient. One has to write multiple queries and use a small script to handle these. The following Perl source code shows how to compute bins that overlap a query region.
sub region2bin {
my ($beg, $end) = @_;
my @bin = (1);
push(@bin, ( 1 + ($beg>>26) .. 1 + (($end-1)>>26)));
push(@bin, ( 9 + ($beg>>23) .. 9 + (($end-1)>>23)));
push(@bin, ( 73 + ($beg>>20) .. 73 + (($end-1)>>20)));
push(@bin, (585 + ($beg>>17) .. 585 + (($end-1)>>17)));
return @bin;
}
and in SQL, we should explicitly query bins as is shown in the UCSC paper. We are professionals, and I hope our answers are also of the best quality.
EDIT:
As I have just tried, the naive SQL takes 6.5 seconds:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'set profiling=1;SELECT * FROM snp130 WHERE chrom="chr1" AND chromEnd>=100000000 AND chromStart<=100010000;show profiles'
while the SQL using the bin field only takes 0.0077 second (establishing the connection takes about 1 to 2 seconds):
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'set profiling=1;SELECT * FROM snp130 WHERE chrom="chr1" AND chromEnd>=100000000 AND chromStart<=100010000 AND (bin=1 OR bin=2 OR bin=20 OR bin=168 OR bin=1347 OR bin=1348);show profiles'
This is a huge difference. On smaller tables, the difference between the two SQLs will be smaller, but still matters. An easy way to write SQL is to use batchUCSC.pl. For example:
echo "chr1 100000000 100010000" | ./batchUCSC.pl -ed hg19 -p 'snp130:::'
The NHGRI curates a list of all published GWA studies: http://genome.gov/gwastudies/