Getting Allele Frequencies From 1000 Genomes
5
28
Entering edit mode
13.7 years ago
Stephen 2.8k

Hi folks.

I'm fine-mapping an association to a gene using next-generation sequencing data in African-Americans. I have a list of a few thousand variants on chromosome 12 listed by position based on UCSC hg19 build 36.

I want to retrieve the reference/variant alleles and minor allele frequency from 1000 genomes project for YRI samples for comparison to my own sequencing data. How might I best do this without downloading the 1000 genomes data and recomputing allele frequencies? Is there a way to query ensembl or UCSC for this information?

Thanks in advance.

genome hapmap maf ucsc ensembl • 23k views
ADD COMMENT
0
Entering edit mode

just to check do you mean ncbi36/hg18 or GRCh37/hg19

ADD REPLY
25
Entering edit mode
13.7 years ago
Stephen 2.8k

Thanks for all the responses. I actually got a very helpful response from a friend of mine. It was dead simple to download and compile vcftools and tabix on my virtual linux system. It took about 30 seconds to download my genotype calls of interest from 1000 genomes using tabix:

tabix -f -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz 12:101266000-101422000 > afr.vcf

Then less than a second to analyze this data for allele frequency using vcftools:

vcftools --vcf afr.vcf --freq --out freq-afr

Which gives me exactly what I want.

CHROM    POS    N_ALLELES    N_CHR    {ALLELE:FREQ}
12    101266049    2    348    A:0.87069    G:0.12931
12    101266292    2    348    A:0.954023    T:0.045977
12    101266307    1    10    G:1
12    101266725    1    10    T:1
12    101266729    2    348    G:0.991379    A:0.00862069
12    101266756    2    348    T:0.985632    G:0.0143678

Now, I have frequency info for my data mapped to UCSC hg19/b36. Now I just need to figure out how to match these up to the 1000 genomes frequencies using their positions.

ADD COMMENT
2
Entering edit mode

+1. I usually recommend to directly look at the 1000g ftp rather than from a 3rd-party database. Nonetheless, you should read the policy when you use 1000g data in large scale.

ADD REPLY
6
Entering edit mode
13.7 years ago
Bert Overduin ★ 3.7k

Using the Ensembl Perl API:

#!/usr/local/bin/perl
use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';

$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# get a variation adaptor
my $va = $reg->get_adaptor("human", "variation", "variation");

# fetch the variation by name
my $v = $va->fetch_by_name('rs123');

# get all the alleles
my @alleles = @{$v->get_all_Alleles()};

foreach my $allele(@alleles) {
    if($allele->population && $allele->population->name =~ /1000GENOMES.*YRI/){
        print
          $v->name, "\t",
          $allele->allele, "\t",
          (defined($allele->frequency) ? $allele->frequency : "-"), "\t",
          $allele->population->name, "\n";
    }
}

Or starting out from the position of a variant instead of the rs number:

#!/usr/local/bin/perl
use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';

$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $sa = $reg->get_adaptor("human", "core", "slice");
my $vfa = $reg->get_adaptor("human", "variation", "variationfeature");

my $chromosome = '7';
my $position = 24966446;

my $slice = $sa->fetch_by_region('chromosome', $chromosome, $position, $position);

my @vfs = @{$vfa->fetch_all_by_Slice($slice)};

foreach my $vf(@vfs){

  my @alleles = @{$vf->variation->get_all_Alleles()};

  foreach my $allele(@alleles) {
    if($allele->population && $allele->population->name =~ /1000GENOMES.*YRI/){
      print
        $vf->seq_region_name, "\t",
        $vf->seq_region_start, "\t",         
        $vf->seq_region_end, "\t",
        $vf->variation->name, "\t",
        $allele->allele, "\t",
        (defined($allele->frequency) ? $allele->frequency : "-"), "\t",
        $allele->population->name, "\n";
    }
  }
}

Ensembl questions can also be posted to the Ensembl Helpdesk or the Ensembl developers mailing list.

ADD COMMENT
0
Entering edit mode

Thanks Bert. I've never used the Ensembl Perl API before. It looks like this is fetching by RS number. If all I have is a list of positions (UCSC hg19/b36) on chromosome 12, is there an easy modification to this perl script that wouldn't require figuring out how to get RS numbers for those positions?

ADD REPLY
0
Entering edit mode

Stephen, I added a second code example that starts with a genomic position instead of a rs number. Please note that in Ensembl we distinguish variations and variation features. A variation feature is the genomic mapping of a variation (one variation can in principle map to multiple positions in the genome!). For more information regarding the Ensembl API, please have a look at http://www.ensembl.org/info/data/api.html. If you have more questions, these can be directed to the Ensembl Helpdesk (helpdesk@ensembl.org).

ADD REPLY
3
Entering edit mode
13.7 years ago

In Pierre's answer to this question 1000 genome allele frequencies are being returned (along frequencies from other data sources) for a query to ensembl. That suggests that you could adapt his SQL to get the answer you want without recomputing the allele frequencies.

ADD COMMENT
3
Entering edit mode
13.7 years ago

in case you prefer an interfaced solution rather than an automated script or sql query, maybe you could be interested in using BioMart (as several times suggested here) by selecting the most up to date human database (ensembl variation 61) and dataset (human variation 132), filling in the "Multiple Chromosomal Regions" field with your variants' positions. note that these positions should be formatted as Chr:Start:End:Strand. you will then be able to select the attributes you are interested in, such as "allele", "population name" and "genotype frequency", and once you generate such query you'll be able to filter (now locally) by the YRI population.

[self-promoting mode on] or you may want to directly access SPSmart, since we've recently incorporated 1000 Genomes data (accessible through here, although we are waiting for a BMC Bioinformatics paper to be accepted in order to allow proper merging with the rest of the available datasets). all you'll have to do is to select the YRI population and then query either the chr12 region of interest or the chromosome positions with a "chr:pos" format. [self-promoting mode off]

ADD COMMENT
0
Entering edit mode
7.8 years ago
Isaac Joseph ▴ 170

Here's an example programmatically, using python and the handy-dandy myvariant package (thanks Cyrus Afrasiabi, Chunlei Wu):

import myvariant
myvariant.MyVariantInfo()
mv.query('dbsnp.rsid:rs58991260', fields='dbsnp')['hits'][0]['dbsnp']['gmaf'] # returns 0.02157

Note that this selects the first possible hit in dbSNP for the given query. Queries can be made based on base pair/sequence name in addition to by rsID as shown.

ADD COMMENT

Login before adding your answer.

Traffic: 2724 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6