I see two straightforward ways of solving this issue. The first is database programming, which is the one I would choose if you are able to build a good query, because once you have to retrieve the data it makes sense to retrieve exactly what you're interested in (I don't know how inefficient Pierre's suggestion would be given lh3's comment, but I guess I'd still use it for a one-off query). The second is bulk downloading the data (genes and variants) and writing a script to parse it. I'm more used to the latter because of the kind of work I often do, so since database programming has already been covered I'll describe an alternative way of doing it, assuming that you are able to get a file with all genes' information (at least start and end positions) and one with all variants' information (at least start positions).
But let me start by saying that whenever I face a nested iteration I try to study it in advance (pencil + paper + time), because if your design is not optimal you may lose a lot of time executing your script, especially if it has to run several times. In this case you are dealing with variant positions (~30M) and with gene boundaries (~50K genes x2, start and end), so a naive nested loop has to be prepared to perform on the order of 30M x 100K, roughly 3x10^12, comparisons.
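To make the cost concrete, here is a toy sketch of that naive nested-loop design (all the data structures and values here are made-up examples, not part of any real pipeline): every variant is compared against every gene's boundaries, so the work grows as variants x genes.

```perl
use strict;
use warnings;

# hypothetical toy data: two genes and four variant positions
my @genes      = ( "geneA", "geneB" );
my %geneStarts = ( geneA => 100, geneB => 500 );
my %geneEnds   = ( geneA => 200, geneB => 600 );
my @variantPositions = ( 150, 300, 550, 700 );

my $geneVariants = 0;
foreach my $variantPosition (@variantPositions) {
    foreach my $gene (@genes) {
        if (   $variantPosition >= $geneStarts{$gene}
            && $variantPosition <= $geneEnds{$gene} )
        {
            $geneVariants++;
            last;    # stop at the first gene containing this position
        }
    }
}
print "$geneVariants\n";    # 2 of the 4 toy positions fall inside a gene
```

With real genome-scale inputs this inner loop runs tens of thousands of times per variant, which is exactly what the hash-based approach below avoids.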
I've seen several ways of performing this kind of region-inclusion check, but a very efficient, if not that elegant, way of performing it is to use Perl's hashing capabilities: once all the gene information is read and every position each gene spans is stored in a hash, checking whether a variant position falls inside a gene region is only a matter of an exists lookup:
# this code assumes that you have already read and stored the gene
# information (@genes, %geneStarts, %geneEnds), and that the variants file
# contains one variant position per line, so you'll have to adapt it to the
# format of the gene and variant files you download.

# store all genes' positions
my %genesPositions;
foreach my $gene (@genes) {
    for ( $geneStarts{$gene} .. $geneEnds{$gene} ) {
        $genesPositions{$_} = "";
    }
}

# check variants' positions
my $geneVariants = 0;
open my $variantsFile, '<', 'variants.txt'
    or die "cannot open variants.txt: $!";
while ( my $variantPosition = <$variantsFile> ) {
    chomp $variantPosition;    # strip the newline, or the lookup always fails
    if ( exists $genesPositions{$variantPosition} ) {
        $geneVariants++;
    }
}
close $variantsFile;

# calculate the variant density
my $density = $geneVariants / ( scalar keys %genesPositions );
I've implemented this algorithmic approach in the past and I can tell you that it's very fast indeed. The main drawback is obviously storing every genic position of the whole genome in memory, but this can be overcome by working with one chromosome at a time (the script will then need less than 1GB of memory). Note that memory consumption is completely independent of the number of variants, which is extremely useful when dealing with human variation across the whole genome.
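A minimal sketch of the chromosome-at-a-time variation follows; the per-chromosome hashes (%geneStartsByChrom, %geneEndsByChrom, %variantsByChrom) and the toy values are assumptions for illustration, and you would fill them while parsing your downloaded files. The point is that %genesPositions is rebuilt and discarded per chromosome, so peak memory is bounded by the largest chromosome rather than the whole genome:

```perl
use strict;
use warnings;

# hypothetical per-chromosome structures; adapt to your parsed gene file
my %geneStartsByChrom = ( chr1 => { geneA => 100 }, chr2 => { geneB => 500 } );
my %geneEndsByChrom   = ( chr1 => { geneA => 200 }, chr2 => { geneB => 600 } );
my %variantsByChrom   = ( chr1 => [ 150, 300 ],    chr2 => [ 550, 700 ] );

my ( $geneVariants, $genicPositions ) = ( 0, 0 );
foreach my $chrom ( keys %geneStartsByChrom ) {
    my %genesPositions;    # rebuilt per chromosome, freed when it goes out of scope
    foreach my $gene ( keys %{ $geneStartsByChrom{$chrom} } ) {
        $genesPositions{$_} = ""
            for $geneStartsByChrom{$chrom}{$gene} .. $geneEndsByChrom{$chrom}{$gene};
    }
    $genicPositions += scalar keys %genesPositions;
    foreach my $variantPosition ( @{ $variantsByChrom{$chrom} || [] } ) {
        $geneVariants++ if exists $genesPositions{$variantPosition};
    }
}
my $density = $geneVariants / $genicPositions;
print "$geneVariants variants over $genicPositions genic positions\n";
```

Summing the per-chromosome counts at the end gives the same genome-wide density as the single-pass version.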
-1. In the GENE table, seq_region_start/end are not indexed. In the VARFEAT table, seq_region_end is not indexed. The SQL above will be very inefficient. It is important to read the table schema before writing queries.