Retrieve All Population Frequency Data For A SNP In 1000 Genomes Phase 1
11.9 years ago
haansi ▴ 90

Hi all!

I just found this entry: Retrieving All Available Frequency Data For A Snp Using Ensembl Api Tools, which is very close to what I need. Like Krisr, I would like to retrieve all population frequency data available from 1000 Genomes Phase 1 for a SNP, if possible via SQL.

Ensembl's BioMart provides minor allele information for the ALL super-population only. Pierre Lindenbaum's solution almost gets me to the desired result, but when I run the SQL statement (on homo_sapiens_variation_69_37), I only get results from 1000Genomes:pilot_1, not from Phase 1.

select distinct V.name, S.handle, A.frequency, M.name, F.allele_string
  from (allele as A, variation as V, subsnp_handle as S, variation_feature as F)
  left join sample as M
    on (M.sample_id = A.sample_id)
  where
    V.variation_id = A.variation_id and
    S.subsnp_id = A.subsnp_id and
    F.variation_id = V.variation_id and
    V.name = "rs3"
  order by 2;

Any suggestions on where I could find this data? Alternatively, is there a way to get the SQL statements out of BioPerl? Bert Overduin provided a nice Perl script, but I need SQL for my workflow.

11.9 years ago
Peixe ▴ 660

Maybe dbSNP-Q could be useful...

It lets you query dbSNP, 1KG, HapMap and more, all in one place, through simple or customized MySQL queries.


Hi Peixe! Thank you very much for this very interesting web app! I just gave it a try. Unfortunately, it only has data from 1000 Genomes Pilot 1, not from 1000 Genomes Phase 1. Otherwise a very cool and fast application!

11.9 years ago

This isn't an SQL solution, but it will do the trick.

Use tabix and point it at this file:

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10523   .       TCCG    T       152     PASS    VT=INDEL;RSQ=0.5246;ERATE=0.0023;AN=2184;AA=.;THETA=0.0172;AC=5;AVGPOST=0.9954;LDAF=0.0045;AF=0.00;AMR_AF=0.00;AFR_AF=0.01

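AF = global allele frequency

AMR_AF = allele frequency in the AMR (Ad Mixed American) super-population

etc. for the other super-populations (e.g. AFR_AF).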
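For use in a pipeline, the per-population tags can be pulled out of the INFO column with a few lines of Python. This is only a minimal sketch: the function name `population_frequencies` is made up for illustration, it assumes a tab-separated sites line with a single ALT allele, and it assumes every population tag follows the `<POP>_AF` pattern seen in the record above.

```python
def population_frequencies(vcf_line):
    """Return {population: allele frequency} from one VCF sites line.

    Assumes tab-separated columns with INFO in column 8 and a single
    ALT allele, so each *_AF value is a single number.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]
    freqs = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if not sep:
            continue  # flag-style entries carry no value
        if key == "AF":
            freqs["ALL"] = float(value)  # global frequency
        elif key.endswith("_AF"):
            freqs[key[:-3]] = float(value)  # e.g. AMR_AF -> AMR
    return freqs


if __name__ == "__main__":
    # The record shown above, rebuilt with tab separators.
    example = "\t".join([
        "1", "10523", ".", "TCCG", "T", "152", "PASS",
        "VT=INDEL;RSQ=0.5246;ERATE=0.0023;AN=2184;AA=.;THETA=0.0172;"
        "AC=5;AVGPOST=0.9954;LDAF=0.0045;AF=0.00;AMR_AF=0.00;AFR_AF=0.01",
    ])
    for population, frequency in population_frequencies(example).items():
        print(population, frequency)
```

Note that LDAF is deliberately not picked up, since it does not match the `<POP>_AF` pattern.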


Hi Zev! Thank you very much for your answer! I had also found this approach in a previous question (Getting Allele Frequencies From 1000 Genomes), but I wasn't aware of the FTP file you mentioned. I gave it a try; the query time was about 5 seconds, which I could live with. For my pipeline, however, I need all of the frequency information shown in the 1000 Genomes table on Ensembl, as here (just an example with a random SNP): http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=12:51419664-51420664;v=rs6580779;vdb=variation;vf=4447658 Are there other locations I could query?


The file is only a couple of gigabytes (~2?). You could just download it and use tabix locally. The tabix index makes region queries trivial and very fast.
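A local lookup could be wrapped for a pipeline roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, `tabix` must be on the PATH, the downloaded file needs its `.tbi` index next to it, and the local file name is a placeholder. tabix queries by region, so the SNP's chromosome and position must be known beforehand.

```python
import subprocess


def build_tabix_command(vcf_path, chrom, pos):
    """Build the tabix call for a single position (region chrom:pos-pos)."""
    return ["tabix", vcf_path, f"{chrom}:{pos}-{pos}"]


def query_site(vcf_path, chrom, pos):
    """Run tabix and return the matching VCF lines as a list of strings."""
    result = subprocess.run(
        build_tabix_command(vcf_path, chrom, pos),
        capture_output=True,
        text=True,
        check=True,  # raise if tabix exits with an error
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Placeholder path: wherever the downloaded sites file was saved.
    local_vcf = "ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz"
    for line in query_site(local_vcf, "1", 10523):
        print(line)
```

Chromosomes in this file are named without a "chr" prefix, as in the record shown in the answer above.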
