I'm looking to get genotypes for SNPs in a particular region of the genome from CEU and YRI HapMap individuals. I need more SNPs than just those genotyped for HapMap, and the 1000 Genomes Project has recently released SNP call data, generated from sequence data, in VCF format.
The uncompressed file (ALL.wgs.phase1.projectConsensus.snps.sites.vcf) is 11 GB. Does anyone know the best way to load it and extract the genotypes from the region I need? Is there a tool in R, or anything else, that anyone could suggest for loading and handling this kind of data?
Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file in.tab.bgz.tbi when region is absent from the command-line. The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface. After indexing, tabix is able to quickly retrieve data lines overlapping regions specified in the format "chr:beginPos-endPos". Fast data retrieval also works over network if URI is given as a file name and in this case the index file will be downloaded if it is not present locally.
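As a rough sketch of that workflow (compress with bgzip, index with tabix, then query a region), something like the following Python could build the three command lines. The helper names `region_string` and `tabix_commands` and the example coordinates are made up for illustration, and the chromosome name is assumed to match whatever naming the VCF itself uses. The script only prints the commands; running them needs bgzip and tabix installed.

```python
def region_string(chrom, begin, end):
    """Format a region the way tabix expects: chr:beginPos-endPos."""
    if begin < 1 or end < begin:
        raise ValueError("expected 1 <= begin <= end")
    return f"{chrom}:{begin}-{end}"


def tabix_commands(vcf_path, region):
    """Return the compress, index and query command lines as argument lists."""
    bgz_path = vcf_path + ".gz"  # bgzip writes <file>.gz, like gzip
    return [
        ["bgzip", vcf_path],                 # compress the position-sorted VCF
        ["tabix", "-p", "vcf", bgz_path],    # build the <file>.gz.tbi index
        ["tabix", "-h", bgz_path, region],   # print header plus overlapping lines
    ]


if __name__ == "__main__":
    # Hypothetical region; substitute the coordinates you actually need.
    region = region_string("6", 31000000, 31100000)
    vcf = "ALL.wgs.phase1.projectConsensus.snps.sites.vcf"
    for cmd in tabix_commands(vcf, region):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are on your PATH. The query step only reads the index and the relevant compressed blocks, so the 11 GB file never has to be loaded whole.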
That file does not give you genotypes. The file containing the genotypes is going to be about half a terabyte uncompressed, I guess.
It appears you are correct! Have the genotypes not been released yet?
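One quick way to check whether a given VCF carries genotypes is to look at its `#CHROM` header line: a VCF has eight fixed columns, and genotype data only appears when a FORMAT column and per-sample columns follow them. A minimal sketch, with a made-up helper name `sample_names` and sample IDs chosen purely for illustration:

```python
def sample_names(header_line):
    """Return the sample IDs from a VCF '#CHROM' header line.

    A sites-only file has just the eight fixed columns, so the result is
    an empty list. Files with genotypes add FORMAT plus one column per sample.
    """
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "#CHROM":
        raise ValueError("not a VCF column header line")
    return fields[9:]  # skip the 8 fixed columns and FORMAT


# Illustrative header lines, not copied from the actual release files.
sites_header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
genotype_header = sites_header + "\tFORMAT\tNA12878\tNA19238"

print(sample_names(sites_header))
print(sample_names(genotype_header))
```

If the list comes back empty, the file only describes variant sites and you will need a genotype release instead.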
SNP data is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/
Those are from a previous release based on 629 individuals.