Question

How To Filter VCF Files From 1000 Genomes Release V3.2010-11 (Alternative Source)?

0

13.2 years ago

user56 ▴ 300

I want to use VCF files from WGS to arrive at pharmacogenomics clinical recommendations (relevant to a single patient, not a population).

I decided to use VCF as the standard for input data and 1000 Genomes as the test population. I believe the files I need are here: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ (if not, please comment on that).

The problem is that the files are too big. For example, the chromosome 6 data for all populations is 9 GB. The data for all chromosomes would be 80+ GB.

Example of chr6 file: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr6.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz

Is there an alternative source where the 1000 Genomes data is available in a different form?

What I would like to do is:

Use only calls coming from SNPSOURCE=EXOME.
Filter the file down to known SNPs only (those in dbSNP).
Filter out all indels.
Reduce the number of genomes (e.g., 1 patient, or no more than 50 patients).

Is the only way to download 80 GB and let it crunch for a long time (and, for me, also to improve my Linux knowledge; I am a Windows, SQL and R person)? Any advice is greatly appreciated.

P.S. It seems all genomic data is kept in files. I am good with large databases and could do what I need much more easily in a database. After all, a VCF file is like a database table.

1000genomes vcf • 4.0k views

score 1 · Answer 1 · 2012-05-31

13.1 years ago

Laura ★ 1.8k

You could use tabix to stream these files from the FTP site and filter out the sites you don't want. You could also reduce the number of individuals if you wanted to.

We have more information about how to use tabix in our FAQ: http://www.1000genomes.org/faq/how-do-i-get-sub-section-vcf-file
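
As a rough sketch of the filtering step, assuming a region has already been streamed as text with something like `tabix -h <vcf.gz URL> 6:1-1000000` (the region here is an arbitrary example), the criteria from the question could be applied in Python as below. The function name `filter_vcf`, the rsID pattern used as a stand-in for "in dbSNP", and the handling of SNPSOURCE as a comma-separated INFO value are illustrative assumptions, not something taken from the FAQ.

```python
import re

RSID = re.compile(r"^rs\d+$")   # assumption: an rs number in the ID column means "known to dbSNP"
BASES = set("ACGT")             # single-base alleles only, so indels and SVs are dropped


def info_to_dict(info):
    """Parse an INFO string such as 'SNPSOURCE=LOWCOV,EXOME;VT=SNP' into a dict."""
    parsed = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def filter_vcf(lines, keep_samples):
    """Yield header lines plus records that are exome-sourced, rs-named SNPs,
    keeping only the genotype columns of the samples in keep_samples."""
    cols = None  # column indices to keep; set once the #CHROM line is seen
    wanted = set(keep_samples)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            fixed = list(range(min(9, len(fields))))  # CHROM..FORMAT
            chosen = [i for i, name in enumerate(fields) if i >= 9 and name in wanted]
            cols = fixed + chosen
            yield "\t".join(fields[i] for i in cols)
            continue
        if cols is None:
            raise ValueError("record found before the #CHROM header line")
        ref, alts = fields[3], fields[4].split(",")
        info = info_to_dict(fields[7])
        is_snp = ref in BASES and all(alt in BASES for alt in alts)
        in_dbsnp = any(RSID.match(name) for name in fields[2].split(";"))
        from_exome = "EXOME" in info.get("SNPSOURCE", "").split(",")
        if is_snp and in_dbsnp and from_exome:
            yield "\t".join(fields[i] for i in cols)
```

In a small script, looping over `filter_vcf(sys.stdin, ["HG00096"])` and printing each line would let the tabix output be piped straight in, so only the requested region is transferred; the sample ID is just an example.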

score 0 · Answer 2 · 2012-05-15

13.2 years ago

hershman ▴ 40

I was unable to find exome calls from the 1000 Genomes Project when I looked about a month ago. One option to avoid downloading the files is to work with them on Amazon Web Services, which hosts a copy of the 1000 Genomes data.

score 0 · Answer 3 · 2012-05-16

13.2 years ago

thamathpanda ▴ 40

VCFtools bro

A database would probably be slower fyi.
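
For the VCFtools route, a possible invocation can be sketched by assembling the command line in Python, assuming a local copy of the chromosome file and VCFtools on the PATH. The helper name `vcftools_command`, the sample IDs, and the output prefix are made up for illustration; this only covers indel removal and sample subsetting, so the SNPSOURCE and dbSNP criteria would need a separate step.

```python
import shlex


def vcftools_command(vcf_gz, samples, out_prefix):
    """Build (but do not run) a vcftools call that drops indels and
    keeps only the named individuals, writing a new VCF."""
    cmd = ["vcftools", "--gzvcf", vcf_gz, "--remove-indels"]
    for sample in samples:
        cmd += ["--indv", sample]  # one --indv per individual to keep
    # --recode writes a filtered VCF; --recode-INFO-all keeps the INFO fields
    cmd += ["--recode", "--recode-INFO-all", "--out", out_prefix]
    return cmd


cmd = vcftools_command(
    "ALL.chr6.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz",
    ["HG00096", "HG00097"],   # example sample IDs
    "chr6.subset",            # hypothetical output prefix
)
print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```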
