Question

Loading 1000 Genomes VCF Files in R

4

13.5 years ago

Paul ▴ 760

Hi,

I'm looking to get genotypes for SNPs in a particular region of the genome from CEU and YRI HapMap individuals. I need more SNPs than just those genotyped for HapMap, and the 1000 Genomes Project has recently released SNP call data, generated from sequence data, in VCF format.

This data is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20101123/interim_phase1_release/

The uncompressed file (ALL.wgs.phase1.projectConsensus.snps.sites.vcf) is 11 GB. Does anyone know the best way to load this and extract the genotypes from the region I need? Is there a tool in R, or anything else, that anyone could suggest for loading and working with this kind of data?

Thanks,

Paul

genome hapmap snp • 8.7k views

ADD COMMENT • link updated 13.5 years ago by zhanxw ▴ 20 • written 13.5 years ago by Paul ▴ 760

That file does not give you genotypes. The file containing the genotypes is going to be half a terabyte uncompressed, I guess.

ADD REPLY • link 13.5 years ago by lh3 33k

It appears you are correct! Have the genotypes not been released yet?

ADD REPLY • link 13.5 years ago by Paul ▴ 760

SNP data is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/

ADD REPLY • link 13.4 years ago by Paul ▴ 760

Those are from a previous release based on 629 individuals.

ADD REPLY • link 13.4 years ago by Laura ★ 1.8k

Ram · Answer 1 · 2011-06-13

"I'm wondering if anyone has any idea of the best way to load this and extract the genotypes from the region I need"

See tabix

Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file in.tab.bgz.tbi when region is absent from the command-line. The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface. After indexing, tabix is able to quickly retrieve data lines overlapping regions specified in the format "chr:beginPos-endPos". Fast data retrieval also works over network if URI is given as a file name and in this case the index file will be downloaded if it is not present locally.
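
To make the "chr:beginPos-endPos" region format concrete, here is a minimal Python sketch of the same kind of region query. It is not tabix: it scans the whole compressed file line by line instead of using the .tbi index, so it is only practical for small files. The function names `parse_region` and `iter_region` are made up for this illustration, and the sketch assumes SNP records, where a record overlaps the region when its POS lies inside it.

```python
import gzip
import re


def parse_region(region):
    """Split 'chr:beginPos-endPos' into (chrom, begin, end), 1-based inclusive."""
    match = re.fullmatch(r"([^:]+):(\d+)-(\d+)", region.replace(",", ""))
    if match is None:
        raise ValueError(f"expected chr:beginPos-endPos, got {region!r}")
    chrom, begin, end = match.groups()
    return chrom, int(begin), int(end)


def iter_region(vcf_gz_path, region):
    """Yield VCF data lines whose POS falls inside the region.

    Assumption: this is a plain linear scan. tabix would instead use the
    .tbi index to jump straight to the relevant compressed blocks.
    """
    chrom, begin, end = parse_region(region)
    # bgzip output is gzip-compatible, so the standard gzip module can read it.
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and header/meta lines
            fields = line.rstrip("\n").split("\t", 2)
            if fields[0] == chrom and begin <= int(fields[1]) <= end:
                yield line.rstrip("\n")
```

With tabix itself, the indexed equivalent would be along the lines of `tabix in.vcf.gz 1:50-200` on a bgzip-compressed, tabix-indexed file, which avoids reading anything outside the requested region.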

score 0 · Answer 2 · 2011-07-13

13.4 years ago

Angel • 0

What is the difference between this file (ALL.wgs.phase1.projectConsensus.snps.sites.vcf) and those named by chromosome?

Please ask a new question.

ADD REPLY • link 13.4 years ago by Pierre Lindenbaum 164k

score 0 · Answer 3 · 2011-07-18

13.4 years ago

Fede ▴ 10

"sites.vcf" doesn't contain any data about the genotypes ;)

By the way, is anyone else currently seeing "incorrect" .tbi files? (A size of 90 bytes is far too small...)
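
On the first point, a quick way to tell whether a VCF carries genotypes is to look at its `#CHROM` header line: a sites-only file stops at the eight fixed columns, while a genotype file adds a FORMAT column followed by one column per sample. A minimal Python sketch, where the function name `sample_names` and the sample IDs in the usage lines are made up for illustration:

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def sample_names(header_line):
    """Return the sample IDs from a VCF '#CHROM' line, or [] for a sites-only file."""
    columns = header_line.rstrip("\n").split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("not a VCF column header line")
    # Genotype-bearing files add FORMAT, then one column per sample.
    if len(columns) > 9 and columns[8] == "FORMAT":
        return columns[9:]
    return []


# Hypothetical header lines, for illustration only.
sites_only = "\t".join(FIXED_COLUMNS)
with_genotypes = "\t".join(FIXED_COLUMNS + ["FORMAT", "SAMPLE1", "SAMPLE2"])
print(sample_names(sites_only))
print(sample_names(with_genotypes))
```

An empty list from the header of a file means there are no per-individual genotypes to extract from it, whatever tool is used to load it.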

That problem has not been fixed.

ADD REPLY • link 13.4 years ago by Laura ★ 1.8k

score 0 · Answer 4 · 2012-09-14

12.2 years ago

zhanxw ▴ 20

You can try vcf2geno (http://cran.r-project.org/web/packages/vcf2geno/index.html). It takes a tabix-indexed VCF file and extracts genotypes for you.
