I want to use VCF files from WGS to arrive at pharmacogenomics clinical recommendations (relevant to a single patient, not a population).
I decided to use VCF as the standard input format and 1000 Genomes as the test population. I believe the files I need are here: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ (if not, please comment on that)
The problem is that the files are too big. For example, the chromosome 6 file for all populations is 9 GB. Data for all chromosomes would be 80+ GB.
Example of chr6 file: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr6.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
Is there an alternative source that offers the 1000 Genomes data in a different, more manageable form?
What I would like to do:
- use only calls with SNPSOURCE=EXOME
- keep only known SNPs (those in dbSNP)
- filter out all indels
- reduce the number of genomes (e.g., 1 patient, or no more than 50 patients)
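To make the four steps above concrete, here is a minimal Python sketch of the filtering I have in mind. It rests on assumptions that may not hold for every release: that a "known" SNP is one whose ID column starts with `rs`, that an indel or structural variant can be recognised by a REF or ALT allele longer than one base, and that the INFO column carries a `SNPSOURCE` key with comma-separated values. The function names (`parse_info`, `keep_record`, `filter_vcf`) are made up for illustration.

```python
def parse_info(info):
    """Turn an INFO string like 'A=1;B;C=x,y' into a dict (flags map to True)."""
    out = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def keep_record(fields):
    """Keep known SNPs (rs ID), single-base alleles, SNPSOURCE including EXOME."""
    rsid, ref, alt, info = fields[2], fields[3], fields[4], fields[7]
    if not rsid.startswith("rs"):  # assumption: dbSNP membership == rs ID present
        return False
    if len(ref) != 1 or any(len(a) != 1 for a in alt.split(",")):
        return False  # drops indels and symbolic alleles such as <DEL>
    sources = str(parse_info(info).get("SNPSOURCE", "")).split(",")
    return "EXOME" in sources


def filter_vcf(lines, wanted_samples):
    """Yield filtered VCF lines, keeping only the requested sample columns."""
    keep_cols = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):  # meta-information lines pass through unchanged
            yield line
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed; sample columns start at index 9.
            keep_cols = list(range(9)) + [
                i for i, name in enumerate(fields)
                if i >= 9 and name in wanted_samples
            ]
            yield "\t".join(fields[i] for i in keep_cols)
            continue
        if keep_cols is None:
            raise ValueError("data line seen before the #CHROM header line")
        if keep_record(fields):
            yield "\t".join(fields[i] for i in keep_cols)
```

This would be used by passing a text-mode handle from `gzip.open(path, "rt")` as `lines` and a set of sample IDs as `wanted_samples`. It streams line by line, so memory use stays small, but it still has to read the whole compressed file once.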
Is the only way to download 80 GB and let it crunch for a long time (and, in my case, also to improve my Linux knowledge, since I am a Windows, SQL and R person)? Any advice is greatly appreciated.
P.S. It seems all genomic data lives in files. I am good with large databases and could do what I need much more easily in a database. After all, a VCF file is like a database table.
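To illustrate the "VCF as a database table" idea, here is a rough sketch that loads genotypes into SQLite in long format, one row per variant and sample. The table name `genotype`, its columns, and the function `load_vcf` are my own invention, not an established schema, and the sketch assumes every record has a `GT` entry in its FORMAT column.

```python
import sqlite3


def load_vcf(conn, lines):
    """Insert one row per (variant, sample) genotype into a 'genotype' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS genotype ("
        "chrom TEXT, pos INTEGER, rsid TEXT, ref TEXT, alt TEXT, "
        "sample TEXT, gt TEXT)"
    )
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the 9 fixed columns
            continue
        chrom, pos, rsid, ref, alt = fields[:5]
        # Assumption: GT is listed in FORMAT (raises ValueError otherwise).
        gt_idx = fields[8].split(":").index("GT")
        rows = [
            (chrom, int(pos), rsid, ref, alt, sample, call.split(":")[gt_idx])
            for sample, call in zip(samples, fields[9:])
        ]
        conn.executemany("INSERT INTO genotype VALUES (?,?,?,?,?,?,?)", rows)
    conn.commit()
```

With the data loaded, a single patient's genotypes become an ordinary query such as `SELECT rsid, gt FROM genotype WHERE sample = ?`. The long format multiplies row counts by the number of samples, so it is probably only practical after the file has been cut down to a few dozen genomes.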