I see two straightforward ways of solving this issue. The first is database programming, which is the one I would choose if you are able to build a good query, because once you have to retrieve the data it makes sense to retrieve exactly what you're interested in (I don't know how inefficient Pierre's suggestion would be given lh3's comment, but I guess I'd still use it for a one-off query). The second is bulk downloading the data (genes and variants) and writing a script to parse it. I'm more used to the latter because of the kind of work I often do, so since database programming has already been covered I'll describe an alternative way of doing it, assuming that you are able to get a file with all genes' information (at least start and end positions) and one with all variants' information (at least start positions).
But let me start by saying that whenever I face a nested iteration I try to study it in advance (pencil + paper + time), because if your design is not optimal you may lose a lot of time executing your script, especially if it has to run several times. In this case you are dealing with variant positions (~30M) and with gene boundaries (~50K genes x2, start and end), so a naive nested loop has to be prepared to perform on the order of 30M x 100K, roughly 3x10^12, comparisons.
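To make the cost concrete, here is a toy sketch of that naive nested-loop design (all the data structures and values here are made-up examples, not part of any real pipeline): every variant is compared against every gene's boundaries, so the work grows as variants x genes.

```perl
use strict;
use warnings;

# hypothetical toy data: two genes and four variant positions
my @genes      = ( "geneA", "geneB" );
my %geneStarts = ( geneA => 100, geneB => 500 );
my %geneEnds   = ( geneA => 200, geneB => 600 );
my @variantPositions = ( 150, 300, 550, 700 );

my $geneVariants = 0;
foreach my $variantPosition (@variantPositions) {
    foreach my $gene (@genes) {
        if (   $variantPosition >= $geneStarts{$gene}
            && $variantPosition <= $geneEnds{$gene} )
        {
            $geneVariants++;
            last;    # stop at the first gene containing this position
        }
    }
}
print "$geneVariants\n";    # 2 of the 4 toy positions fall inside a gene
```

With real genome-scale inputs this inner loop runs tens of thousands of times per variant, which is exactly what the hash-based approach below avoids.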
I've seen several ways of performing this kind of region-inclusion check, but a very efficient, if not that elegant, way of performing it is to use Perl's hashing capabilities: once all the gene information is read and every position each gene spans is stored in a hash, checking whether a variant position falls inside a gene region is only a matter of an exists lookup:
# this code assumes that you have already read and stored the gene
# information (@genes, %geneStarts, %geneEnds), and that the variants file
# contains one variant position per line, so you'll have to adapt it to the
# format of the gene and variant files you download.

# store all genes' positions
my %genesPositions;
foreach my $gene (@genes) {
    for ( $geneStarts{$gene} .. $geneEnds{$gene} ) {
        $genesPositions{$_} = "";
    }
}

# check variants' positions
my $geneVariants = 0;
open my $variantsFile, '<', 'variants.txt'
    or die "cannot open variants.txt: $!";
while ( my $variantPosition = <$variantsFile> ) {
    chomp $variantPosition;    # strip the newline, or the lookup always fails
    if ( exists $genesPositions{$variantPosition} ) {
        $geneVariants++;
    }
}
close $variantsFile;

# calculate the variant density
my $density = $geneVariants / ( scalar keys %genesPositions );
I've implemented this algorithmic approach in the past and I can tell you that it's very fast indeed. The main drawback is obviously storing every genic position of the whole genome in memory, but this can be overcome by working with one chromosome at a time (the script will then need less than 1GB of memory). Note that memory consumption is completely independent of the number of variants, which is extremely useful when dealing with human variation across the whole genome.
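A minimal sketch of the chromosome-at-a-time variation follows; the per-chromosome hashes (%geneStartsByChrom, %geneEndsByChrom, %variantsByChrom) and the toy values are assumptions for illustration, and you would fill them while parsing your downloaded files. The point is that %genesPositions is rebuilt and discarded per chromosome, so peak memory is bounded by the largest chromosome rather than the whole genome:

```perl
use strict;
use warnings;

# hypothetical per-chromosome structures; adapt to your parsed gene file
my %geneStartsByChrom = ( chr1 => { geneA => 100 }, chr2 => { geneB => 500 } );
my %geneEndsByChrom   = ( chr1 => { geneA => 200 }, chr2 => { geneB => 600 } );
my %variantsByChrom   = ( chr1 => [ 150, 300 ],    chr2 => [ 550, 700 ] );

my ( $geneVariants, $genicPositions ) = ( 0, 0 );
foreach my $chrom ( keys %geneStartsByChrom ) {
    my %genesPositions;    # rebuilt per chromosome, freed when it goes out of scope
    foreach my $gene ( keys %{ $geneStartsByChrom{$chrom} } ) {
        $genesPositions{$_} = ""
            for $geneStartsByChrom{$chrom}{$gene} .. $geneEndsByChrom{$chrom}{$gene};
    }
    $genicPositions += scalar keys %genesPositions;
    foreach my $variantPosition ( @{ $variantsByChrom{$chrom} || [] } ) {
        $geneVariants++ if exists $genesPositions{$variantPosition};
    }
}
my $density = $geneVariants / $genicPositions;
print "$geneVariants variants over $genicPositions genic positions\n";
```

Summing the per-chromosome counts at the end gives the same genome-wide density as the single-pass version.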
-1. In the GENE table, seq_region_start/end are not indexed. In the VARFEAT table, seq_region_end is not indexed. The SQL above will be very inefficient. It is important to read the table schema before writing queries.