Question

Getting Allele Frequencies From 1000 Genomes

28

Entering edit mode

14.1 years ago

Stephen 2.8k

Hi folks.

I'm fine-mapping an association to a gene using next-generation sequencing data in African-Americans. I have a list of a few thousand variants on chromosome 12 listed by position based on UCSC hg19 build 36.

I want to retrieve the reference/variant alleles and minor allele frequency from 1000 genomes project for YRI samples for comparison to my own sequencing data. How might I best do this without downloading the 1000 genomes data and recomputing allele frequencies? Is there a way to query ensembl or UCSC for this information?

Thanks in advance.

genome hapmap maf ucsc ensembl • 24k views

ADD COMMENT • link updated 8.1 years ago by Isaac Joseph ▴ 170 • written 14.1 years ago by Stephen 2.8k

0

Entering edit mode

just to check do you mean ncbi36/hg18 or GRCh37/hg19

ADD REPLY • link 14.1 years ago by Laura ★ 1.8k

6

Entering edit mode

14.1 years ago

Bert Overduin ★ 3.7k

Using the Ensembl Perl API:

#!/usr/local/bin/perl
use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';

$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# get a variation adaptor
my $va = $reg->get_adaptor("human", "variation", "variation");

# fetch the variation by name
my $v = $va->fetch_by_name('rs123');

# get all the alleles
my @alleles = @{$v->get_all_Alleles()};

foreach my $allele(@alleles) {
    if($allele->population && $allele->population->name =~ /1000GENOMES.*YRI/){
        print
          $v->name, "\t",
          $allele->allele, "\t",
          (defined($allele->frequency) ? $allele->frequency : "-"), "\t",
          $allele->population->name, "\n";
    }
}

Or starting out from the position of a variant instead of the rs number:

#!/usr/local/bin/perl
use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';

$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $sa = $reg->get_adaptor("human", "core", "slice");
my $vfa = $reg->get_adaptor("human", "variation", "variationfeature");

my $chromosome = '7';
my $position = 24966446;

my $slice = $sa->fetch_by_region('chromosome', $chromosome, $position, $position);

my @vfs = @{$vfa->fetch_all_by_Slice($slice)};

foreach my $vf(@vfs){

  my @alleles = @{$vf->variation->get_all_Alleles()};

  foreach my $allele(@alleles) {
    if($allele->population && $allele->population->name =~ /1000GENOMES.*YRI/){
      print
        $vf->seq_region_name, "\t",
        $vf->seq_region_start, "\t",         
        $vf->seq_region_end, "\t",
        $vf->variation->name, "\t",
        $allele->allele, "\t",
        (defined($allele->frequency) ? $allele->frequency : "-"), "\t",
        $allele->population->name, "\n";
    }
  }
}

Ensembl questions can also be posted to the Ensembl Helpdesk or the Ensembl developers mailing list.

ADD COMMENT • link 14.1 years ago by Bert Overduin ★ 3.7k

0

Entering edit mode

Thanks Bert. I've never used the Ensembl Perl API before. It looks like this is fetching by RS number. If all I have is a list of positions (UCSC hg19/b36) on chromosome 12, is there an easy modification to this perl script that wouldn't require figuring out how to get RS numbers for those positions?

ADD REPLY • link 14.1 years ago by Stephen 2.8k

0

Entering edit mode

Stephen, I added a second code example that starts with a genomic position instead of a rs number. Please note that in Ensembl we distinguish variations and variation features. A variation feature is the genomic mapping of a variation (one variation can in principle map to multiple positions in the genome!). For more information regarding the Ensembl API, please have a look at http://www.ensembl.org/info/data/api.html. If you have more questions, these can be directed to the Ensembl Helpdesk (helpdesk@ensembl.org).

ADD REPLY • link 14.1 years ago by Bert Overduin ★ 3.7k

3

Entering edit mode

14.1 years ago

David Quigley 11k

In Pierre's answer to this question 1000 genome allele frequencies are being returned (along frequencies from other data sources) for a query to ensembl. That suggests that you could adapt his SQL to get the answer you want without recomputing the allele frequencies.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by David Quigley 11k

3

Entering edit mode

14.1 years ago

Jorge Amigo 14k

in case you prefer an interfaced solution rather than an automated script or sql query, maybe you could be interested in using BioMart (as several times suggested here) by selecting the most up to date human database (ensembl variation 61) and dataset (human variation 132), filling in the "Multiple Chromosomal Regions" field with your variants' positions. note that these positions should be formatted as Chr:Start:End:Strand. you will then be able to select the attributes you are interested in, such as "allele", "population name" and "genotype frequency", and once you generate such query you'll be able to filter (now locally) by the YRI population.

[self-promoting mode on] or you may want to directly access SPSmart, since we've recently incorporated 1000 Genomes data (accessible through here, although we are waiting for a BMC Bioinformatics paper to be accepted in order to allow proper merging with the rest of the available datasets). all you'll have to do is to select the YRI population and then query either the chr12 region of interest or the chromosome positions with a "chr:pos" format. [self-promoting mode off]

ADD COMMENT • link 14.1 years ago by Jorge Amigo 14k

0

Entering edit mode

8.1 years ago

Isaac Joseph ▴ 170

Here's an example programmatically, using python and the handy-dandy myvariant package (thanks Cyrus Afrasiabi, Chunlei Wu):

import myvariant
myvariant.MyVariantInfo()
mv.query('dbsnp.rsid:rs58991260', fields='dbsnp')['hits'][0]['dbsnp']['gmaf'] # returns 0.02157

Note that this selects the first possible hit in dbSNP for the given query. Queries can be made based on base pair/sequence name in addition to by rsID as shown.

ADD COMMENT • link 8.1 years ago by Isaac Joseph ▴ 170

score 25 · Accepted Answer · 2011-03-30

Thanks for all the responses. I actually got a very helpful response from a friend of mine. It was dead simple to download and compile vcftools and tabix on my virtual linux system. It took about 30 seconds to download my genotype calls of interest from 1000 genomes using tabix:

tabix -f -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz 12:101266000-101422000 > afr.vcf

Then less than a second to analyze this data for allele frequency using vcftools:

vcftools --vcf afr.vcf --freq --out freq-afr

Which gives me exactly what I want.

CHROM    POS    N_ALLELES    N_CHR    {ALLELE:FREQ}
12    101266049    2    348    A:0.87069    G:0.12931
12    101266292    2    348    A:0.954023    T:0.045977
12    101266307    1    10    G:1
12    101266725    1    10    T:1
12    101266729    2    348    G:0.991379    A:0.00862069
12    101266756    2    348    T:0.985632    G:0.0143678

Now, I have frequency info for my data mapped to UCSC hg19/b36. Now I just need to figure out how to match these up to the 1000 genomes frequencies using their positions.