Question

1000 Genomes Individual Genotype Data

12.4 years ago

win ▴ 990

I was wondering if someone could help.

When using the 1000 Genomes browser, I came across the statement “1000 Genomes individual genotypes display” on the search results page. If I understand correctly, this means that individual genotypes for a variant are not stored in the Ensembl database but in the 1000 Genomes database (the public MySQL instance).

If that is true, we can view the genotypes from the 1K genomes browser, but which table in the database contains this information?

There is a table named “compressed_genotype_single_bp”; is this the table that contains this information? If so, how does one convert the binary data fields back to text?

I am trying to determine the genotypes for several variants, and working with the individual chromosome VCF files is not turning out to be practical: it is very slow and has a huge computational overhead.

Any help in this direction will be highly appreciated.

genotyping • 4.4k views

updated 12.4 years ago by Laura ★ 1.8k • written 12.4 years ago by win ▴ 990

score 3 · Answer 1 · 2012-07-12

As Joachim told you, the data is not stored in a database but in VCF files.

There have been some discussions about storing the data contained in VCF files in a SQL database, but the conversion is more difficult than it seems, and in the end it looks like people prefer to use the VCF files. For example, you can read this comment by James Casbon, who maintains a Python parser for VCF files, on a Google Summer of Code project to implement a SQL version of VCF files: http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009688.html

In any case, working with VCF files is slow only if you try to implement your own parser. There are much better ways to do it: for example, you can use tabix to extract regions directly from the files on the 1000 Genomes website (search this forum for how to do it, e.g. http://www.biostars.org/search/?q=tabix+1000genomes ), and use VCFtools or PyVCF for more complex operations. I work with a lot of VCF files, and with these tools I have never had a performance problem.
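As a rough sketch of that workflow, the Python below builds a tabix command for one region and then reads per-sample genotypes out of the small slice it returns. The function names, the region, and the example records are made up for illustration; the only tabix behaviour assumed is the usual `tabix -h <file-or-URL> <region>` form, where `-h` keeps the VCF header. The hand-rolled parsing is only there to show what the slice contains; for anything beyond a quick lookup, VCFtools or PyVCF remain the better choice.

```python
import subprocess


def tabix_command(vcf_url, region):
    """Build the argv for a tabix region query; -h keeps the VCF header."""
    return ["tabix", "-h", vcf_url, region]


def fetch_region(vcf_url, region):
    """Run tabix (must be installed and on PATH) and return the VCF text."""
    done = subprocess.run(
        tabix_command(vcf_url, region),
        check=True, capture_output=True, text=True,
    )
    return done.stdout


def parse_genotypes(vcf_text):
    """Map each variant (ID, or chrom:pos if the ID is '.') to {sample: GT}.

    Assumes every record has a GT entry in its FORMAT column.
    """
    samples = []
    genotypes = {}
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at column 10
            continue
        chrom, pos, var_id, fmt = fields[0], fields[1], fields[2], fields[8]
        key = var_id if var_id != "." else f"{chrom}:{pos}"
        gt_index = fmt.split(":").index("GT")
        genotypes[key] = {
            sample: call.split(":")[gt_index]
            for sample, call in zip(samples, fields[9:])
        }
    return genotypes


# Hypothetical two-sample slice, standing in for real tabix output.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "HG00096", "HG00097"]),
    "\t".join(["2", "39967768", "rs0000001", "A", "G", "100", "PASS",
               ".", "GT:DP", "0|1:12", "1|1:9"]),
    "\t".join(["2", "39967900", ".", "C", "T", "100", "PASS",
               ".", "GT", "0|0", "0|1"]),
])

if __name__ == "__main__":
    for variant, calls in parse_genotypes(EXAMPLE_VCF).items():
        print(variant, calls)
```

With a real file you would pass the output of `fetch_region(url, "2:39967768-39967768")` to `parse_genotypes` instead of the example text; because tabix only transfers the indexed block for that region, the whole-chromosome file never has to be downloaded or scanned.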

score 2 · Answer 2 · 2012-07-11

Yes, individual genotypes of the 1000 genomes project are available according to: http://www.1000genomes.org/faq/can-i-get-individual-genotype-information-browser1000genomesorg

The data is available via FTP: http://www.1000genomes.org/data#DataAccess

There are README files on the FTP site, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/, which explain the directory hierarchy and file contents rather nicely.

Hope that helps.

score 1 · Answer 3 · 2012-07-12

As both Joachim and Giovanni have said, the genotypes for the 1000 Genomes data aren't stored in our MySQL instance. Loading the genotypes unfortunately takes longer than is ideal for our website production, so we decided that using the VCF files would be better, both for the speed of producing the website and for the speed of loading the genotypes.

The information others have given should point you in the right direction for getting the genotypes you want quickly and easily.

I would recommend looking at tabix and the VCFtools script vcf-subset, which are both described here: http://www.1000genomes.org/faq/how-do-i-get-slice-your-vcf-files
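To make the idea of subsetting concrete, here is a minimal Python sketch of what restricting a VCF slice to chosen samples involves: keep the nine fixed columns and only the genotype columns whose header names were requested. It is an illustration of the column selection, not a reimplementation of vcf-subset; the function name and the example records are hypothetical.

```python
def subset_samples(vcf_lines, keep):
    """Yield VCF lines containing only the sample columns named in `keep`."""
    keep = set(keep)
    columns = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Columns 0-8 are fixed (CHROM..FORMAT); samples follow.
            columns = list(range(9)) + [
                i for i, name in enumerate(fields) if i >= 9 and name in keep
            ]
        if columns is None:
            raise ValueError("record found before the #CHROM header line")
        yield "\t".join(fields[i] for i in columns)


# Hypothetical three-sample slice used only to exercise the function.
SUBSET_EXAMPLE = [
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "HG00096", "HG00097", "HG00099"]),
    "\t".join(["2", "39967768", "rs0000001", "A", "G", "100", "PASS",
               ".", "GT", "0|1", "1|1", "0|0"]),
]

if __name__ == "__main__":
    for out_line in subset_samples(SUBSET_EXAMPLE, ["HG00097"]):
        print(out_line)
```

Because the function works line by line, it can be fed the (already small) output of a tabix region query without holding a whole chromosome in memory.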