How To Retrieve Gene Variations From Ensembl Using The Perl Api?
4
2
Entering edit mode
13.7 years ago

Dear all,

A very practical question. I need to retrieve all variations (loose sense) of the MEN1 gene from the human genome at ensembl. Of course, I want to do it programmatically. But, I'm having a hard time using the perl API. I think I didn't grasp the adaptor idea (maybe the nomenclature scheme). Has someone a hint on how to use a gene ID to recover variations?

I've tried some examples from the tutorial. But, something went wrong.

ensembl variation perl api • 8.0k views
ADD COMMENT
6
Entering edit mode
13.7 years ago

Hi,

The numbers you are looking at in the variation table on the gene page in Ensembl are the number of different consequences that affects the transcripts of the gene by variations that fall within it. In other words, one single variation can have multiple consequences on one transcript (i.e. 'intronic' and 'essential splice site') and it can also have different consequences depending on what transcript you're looking at (e.g. 'intronic' in one transcript but 'non-synonymous coding' in another). The table summarizes these numbers.

To get just the unique variations that falls within this gene, the API script provided above will work fine. If you need the variation consequences for each variation on each transcript, you can do something like this:

my $species='human';
my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(-host => 'ensembldb.ensembl.org',-user => 'anonymous');

my $gene_adaptor =$reg->get_adaptor($species, 'core','Gene');
my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000133895');
my $tv_adaptor = $reg->get_adaptor($species,'variation','transcriptvariation');

my $tvs = $tv_adaptor->fetch_all_by_Transcripts($gene->get_all_Transcripts());
foreach my $tv (@{$tvs}) {
   print join("\t",($tv->variation_feature->variation_name(),$tv->transcript_stable_id,join(",",@{$tv->consequence_type()}))) . "\n";
}

Hope this helps
/Pontus Larsson - Ensembl Variation

PS. Note that in the Ensembl pipeline, we do some consistency checking on the data we import. Prior to release 61, we deleted data for variations that fail these checks from the database but starting from release 61, we just flag them and keep the data. By default, these flagged variations are not displayed in the web browser so they won't be included in the variation consequence table. These flagged variations can be viewed in the location view by turning on the 'Failed variations' track. The default API behaviour is also to not return these variations but this behaviour can be changed (see the variation API tutorial for details). The MySQL query given above will return these flagged variations and you would need to do a left join to the 'failed_variation' table to properly filter them out.

ADD COMMENT
0
Entering edit mode

very interesting Pontus, thanks. (And welcome on Biostar)

ADD REPLY
0
Entering edit mode

Wonderful! Now, everything makes sense!!! Gonna flag this as the right one for the excelent explanation. But, everyone contributed to made my day!!! YEAH! Variations in hand right now via perl API and SQL!!! Certainly will include a mention to Biostar in the paper acknowledgement!

ADD REPLY
3
Entering edit mode
13.7 years ago
User 6659 ▴ 980
my $species='human';
my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(-host => 'ensembldb.ensembl.org',-user => 'anonymous');
my $vfa = $reg->get_adaptor($species, 'variation', 'variationfeature');
my $slice_adaptor = $reg->get_adaptor($species, 'core', 'slice');
my $gene_adaptor =$reg->get_adaptor($species, 'core','Gene');
my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000133895');

my  $slice = $gene->feature_Slice();
my $counter = 0;
foreach my $vf (@{$vfa->fetch_all_by_Slice($slice)}) {
    $counter++;
    print $vf->seq_region_start(), '-', $vf->seq_region_end(), ' ', $vf->allele_string(), "\n";
}

print "counter: $counterndonen";
ADD COMMENT
1
Entering edit mode

the api and db return different numbers becase the api only gets dbSNP and hgmd whereas the db gets everything (cosmic too). I don't know where the website gets the figure 6348 from at all. I get 473 vfs in dbSNP and cosmic with 13699 transcript variants with the db, and 474 and 13700 with the api. One has gone missing somewhere though :)

ADD REPLY
1
Entering edit mode

if you want somatic mutations as well i don't know how to get them with the api. if you just want inherited variants you can use the api code already give but recurse over transcripts (and you'll get 13700). if you do need somati variants then you can use this query in the db

ADD REPLY
0
Entering edit mode

That's good too. I've tried smth very similar. Your script returned 474 results. But, I can't understand exactly what's happening. It returned HGMD mutations, but just part of them (6348 in total). Should I recurse over transcripts?

ADD REPLY
0
Entering edit mode

i don't know what information you want to be fair. Do you want just HGMD variants? Do you want the variation_features (i.e. the locations of the variations) or all the transcripts affected by the variants. Do you want SNPs and indels?

ADD REPLY
0
Entering edit mode

i don't know what information you want. Do you want just HGMD variants? Do you want the variation_features (i.e. the locations of the variations) or all the transcripts affected by the variants. Do you want SNPs and indels?

ADD REPLY
0
Entering edit mode

In fact I want all variants of a gene. I need them to model how this gene is evolving in the human population. I don't understand Ensembl numbers. Check this numbers http://www.ensembl.org/Homo_sapiens/Gene/Variation_Gene/Table?g=ENSG00000133895;r=11:64570988-64578766

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

You want somatic variations too?

ADD REPLY
0
Entering edit mode
mysql> select count(*) from variation_feature vf \
        inner join seq_region sr on sr.seq_region_id = vf.seq_region_id \
        inner join variation v on vf.variation_id = v.variation_id \
        left outer join failed_variation fv on fv.variation_id = v.variation_id \
        inner join transcript_variation tv \
            on tv.variation_feature_id = vf.variation_feature_id \
        inner join source s on s.source_id = vf.source_id \
        where sr.name = '11' and vf.seq_region_start >=64570988 and \
        vf.seq_region_end <= 64578766 and fv.variation_id is null;
ADD REPLY
0
Entering edit mode

I'll surely try this. The number differences are really puzzling me. Anyway, MEN1 is a much more complicated gene than I tought.

ADD REPLY
0
Entering edit mode

some bad luck with numbers here actually. I thought that because i got the same number of variants as the previous respondent who filtered by gene, there was only one gene on that chromosome. I should have checked but there is > 1 gene there. The perl and sql given assumes only one gene there. I've modified the perl api code and sql looking for only transcripts in that gene and get figures of about 9000 but not 6000

ADD REPLY
0
Entering edit mode

i deleted some previous comments for clarity. some bad luck with numbers here actually. I thought that because i got the same number of variants with my sql without filtering by gene as the previous respondent (whose query seemed to filter by gene), there was only one gene on that chromosome. I should have checked but there is > 1 gene there. The perl given assumes only one gene there. I've modified the perl api code and sql looking for only transcripts in that gene

ADD REPLY
3
Entering edit mode
13.7 years ago

Just to add to Pontus' answer, if you are also interested in somatic mutations (which are included in the numbers in the table on the website) you can fetch these with the fetch_all_somatic_by_Transcripts() method on the TranscriptVariationAdaptor. The script above could be modified to include these as follows:

my $tvs = $tv_adaptor->fetch_all_by_Transcripts($gene->get_all_Transcripts());
push @$tvs, @{ $tv_adaptor->fetch_all_somatic_by_Transcripts($gene->get_all_Transcripts()) };

And then loop over $tvs as before.

Cheers,
Graham Ritchie - Ensembl Variation

ADD COMMENT
0
Entering edit mode

Many thanks for the tip! Be welcome you too to Biostar! Now I've got the gist of the API. Let's see what can I do . . .

ADD REPLY
1
Entering edit mode
13.7 years ago

I thought it would be interesting to find the mysql query for this request.

I'm not a specialist of the SQL schema for ENSEMBL and I would be interested if someone could improve and/or criticize this SQL query.

> mysql -A -h ensembldb.ensembl.org -u anonymous -P 5306

use homo_sapiens_core_61_37f;

select distinct
  XREF.display_label,
  VAR.name
from
   external_db as EXTDB,
   xref as XREF,
   gene as GENE,
   homo_sapiens_variation_61_37f.variation_feature as VARFEAT,
   homo_sapiens_variation_61_37f.variation as VAR
where
   EXTDB.db_name="HGNC" and
   XREF.external_db_id = EXTDB.external_db_id and
   XREF.display_label="MEN1" and
   XREF.xref_id = GENE.display_xref_id and
   GENE.seq_region_id = VARFEAT.seq_region_id and
   VARFEAT.seq_region_start &gt;= GENE.seq_region_start and 
   VARFEAT.seq_region_end &lt;= GENE.seq_region_end and 
   VARFEAT.variation_id = VAR.variation_id
order by VARFEAT.seq_region_start
;

+---------------+-------------+
| display_label | name        |
+---------------+-------------+
| MEN1          | rs117705251 |
| MEN1          | rs34171410  |
| MEN1          | rs71581758  |
| MEN1          | rs1804850   |
| MEN1          | rs71581759  |
| MEN1          | rs71996491  |
| MEN1          | rs1804848   |
| MEN1          | rs1804849   |
| MEN1          | rs111895237 |
| MEN1          | rs71581749  |
(...)
ADD COMMENT
1
Entering edit mode
mysql> select * from variation_feature vf \
        inner join seq_region sr on sr.seq_region_id = vf.seq_region_id \
        inner join variation v on vf.variation_id = v.variation_id left \
        outer join failed_variation fv on fv.variation_id = v.variation_id \
        where sr.name = '11' and vf.seq_region_start >=64570988 and \
        vf.seq_region_end <= 64 578766 and fv.variation_id is null;
ADD REPLY
1
Entering edit mode

if you choose to go with the database schema directly and not use the api you are going to lose all of the 'protection' the api gives you such as sorting the failed variation features from the good ones

ADD REPLY
0
Entering edit mode

Absolutely nice! A SQL approach would be equally suitable! And you're right about ensembl schemas. They're a complexity nightmare. So, let's see how this query behaves...

ADD REPLY
0
Entering edit mode

Hum... I think we hit the same problem with different approaches. When you enter MEN1 in the websearch, it returns 542 variants (mutations+variants in ensembl sense). But, when you browse the human gene there are 9751! Your query returned 621. I'm a bit lost right now. Time to tweak a little ...

ADD REPLY
0
Entering edit mode

That query returns 621 because it includes failed variants. There are 618 without failed variants.

ADD REPLY
0
Entering edit mode

but in the web search ~600 (!) SNPs have been duplicated. e.g: I counted "rs653534" 19 times.

ADD REPLY
0
Entering edit mode

-1 for inefficient query. Use Ensembl API please.

ADD REPLY

Login before adding your answer.

Traffic: 1943 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6