1000 Genomes Api
1
1
Entering edit mode
11.1 years ago
User 1933 ▴ 360

So basically, I wonder if there is any API for working with 1000 genome data - to makes it easier to parse the VCF file. For example, I would like to see how many people have variants in gene "CHAT", and then generate the corresponding VCF file, and running variant effect predictor on that vcf file.

I am fine to make my own pipeline to some extend, but I could not discover how to get the VCF file with individual information for a given gene automatically.

1000genomes vcf parsing • 4.0k views
ADD COMMENT
4
Entering edit mode
11.1 years ago

a one-liner:

using the ucsc mysql server, get the coordinate of the gene "CHAT", build the 100G URL and the region, and xargs to tabix:

 mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N -e  'select concat("ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.",K.chrom,".phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"),concat(RIGHT(K.chrom,LENGTH(chrom)-3),":",MIN(K.txStart)+1,"-",MAX(K.txEnd)) from knownGene as K,kgXref as X where K.name=X.kgId and X.geneSymbol="CHAT" '   | xargs ./tabix-0.2.5/tabix -h

output:

##fileformat=VCFv4.1
(...)
##reference=GRCh37
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    HG00096    HG00097    HG00099    HG00100    HG
10    50817173    rs186028903    C    T    100    PASS    ERATE=0.0005;AA=C;THETA=0.0032;RSQ=0.6818;L
10    50817210    rs80200823    G    C    100    PASS    AC=258;AA=G;AN=2184;LDAF=0.1214;VT=SNP;ERATE
10    50817321    rs190746084    G    T    100    PASS    AVGPOST=0.9963;THETA=0.0015;RSQ=0.3962;AA=G
10    50817546    rs139389571    C    A    100    PASS    LDAF=0.0052;AA=C;AVGPOST=0.9966;RSQ=0.7248;
10    50817665    rs183262975    C    A    100    PASS    THETA=0.0039;RSQ=0.6013;AA=C;AN=2184;VT=SNP
10    50817698    rs150036809    G    A    100    PASS    LDAF=0.0120;AA=G;AN=2184;RSQ=0.4837;VT=SNP;
10    50817735    rs8178981    C    T    100    PASS    AA=C;AN=2184;VT=SNP;THETA=0.0006;AC=14;RSQ=0.
10    50817751    rs12774253    C    T    100    PASS    AC=268;AA=C;RSQ=0.9685;AN=2184;THETA=0.0008;
10    50817803    rs191436474    G    T    100    PASS    ERATE=0.0005;THETA=0.0002;AA=G;AN=2184;LDAF
10    50817960    rs182461778    G    A    100    PASS    THETA=0.0002;AA=G;AN=2184;AC=20;RSQ=0.3961;
10    50818011    rs885835    C    T    100    PASS    AA=C;AN=2184;ERATE=0.0025;AVGPOST=0.9357;VT=SN
 (...)
ADD COMMENT
0
Entering edit mode

can you make some comment what "concat(RIGHT(K.chrom,LENGTH(chrom)-3),":",MIN(K.txStart)-1,"-",MAX(K.txEnd)) from knownGene as K,kgXref as X where K.name=X.kgId and X.geneSymbol="x" is ?! thanks :)

ADD REPLY
0
Entering edit mode

K.chrom,LENGTH(chrom)-3) : remove the "chr' prefix MIN(K.txStart)-1: left/min part of the area. Hum that should be +1 , not -1.. I'll fix it. ,MAX(K.txEnd): right part of the area.

see Genomic cordinates from UCSC

ADD REPLY
0
Entering edit mode

How about also adding a specific region ?! say from 50817173 to 50817735 ! I appreciate your help

ADD REPLY
2
Entering edit mode
(. where K.chrom="chr1" and NOT (K.txStart>50817735 or K.txEnd < 50817173 )
ADD REPLY
0
Entering edit mode

Thanks a lot ! It was working nicely - but now I am facing

[ti_index_load] wrong magic number. [ti_index_load] fail to load the index: ALL.chr10.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi [tabix] failed to load the index file.

What could be the reason ?!

ADD REPLY
0
Entering edit mode

I feel stupid to say, but this piece does not work :( !

ADD REPLY
0
Entering edit mode

hi, my friend... is the SNP archive like "phase1_release_v3.20101123.snps" often updated? If it is often updated, where can I find the information to update the information like "phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz" in your one-liner?

ADD REPLY
0
Entering edit mode

I took the first VCF in the 1000G project I found. It should work for any tabix-indexed VCF on the web.

ADD REPLY
0
Entering edit mode

hi, my friend, I found "The 1000 genomes snp and short indel all get submitted to dbSNP....." in this link "http://www.1000genomes.org/category/dbsnp". Is there any one-liner or programmatically method to get the similiar result from dbSNP?

If I have got the elements for our own SNP as below:

  1. Flank DNA;
  2. The gene name.

How can we retrieve data related with our own SNP from dbSNP programmatically?

ADD REPLY

Login before adding your answer.

Traffic: 1855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6