We are trying to determine the sensitivity of our CNV detection method on various array types. We have two HapMap/1000 Genomes samples, NA10851 and NA18505, that were plated multiple times on different array types. I understand that these are relatively well-characterized samples. Deletions are expected to be easier to detect than duplications, so we were hoping to use the 1000 Genomes sequencing data as a gold standard to see how often we correctly identify homozygous deletions. I am pulling data from 1000 Genomes for these individuals using the following command.
~/tabix-0.2.5/tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 4:1-191154276 | perl ~/vcftools_0.1.7/perl/vcf-subset -c NA10851,NA18505 > chr4.vcf &
So my question has two parts: 1. Is there a way to modify this command so that only SVs are pulled (and not SNPs/indels)? 2. Is there a better source than 1000 Genomes for the known SVs (particularly deletions) in these individuals, against which we could compare the sensitivity of our detection on different arrays?
Thanks,
Ryan
Update: after downloading the 1kG files and extracting the SVs (n=14,422) for all 1,092 individuals, I joined all SVs into a single VCF file (a bit over 500 MB). I was then able to output the VCF for those two individuals much more quickly with vcftools, using the command:
vcftools --vcf g1k_sv_all.vcf --keep ~/id_of_two.txt --recode --out NA10851_NA18505
Here id_of_two.txt is a file containing the NA10851 and NA18505 IDs, one per line. It ran in 18 seconds, whereas the Perl vcf-subset command had not finished after several hours.
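In case it helps anyone doing the same thing, the SV-extraction and two-sample subsetting steps can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: it assumes SV records can be recognized by `VT=SV` or an `SVTYPE=` key in the INFO column (worth checking against the header of the actual release files), and that a homozygous deletion appears as an `SVTYPE=DEL` record with genotype `1|1` or `1/1`. The function names `extract_svs` and `is_hom_del` and the demo records are made up for illustration; they are not part of tabix or vcftools.

```python
def extract_svs(vcf_lines, samples):
    """Yield header lines and SV-only records, keeping only the given samples.

    Assumption: SV records carry VT=SV or an SVTYPE= key in the INFO column.
    """
    keep = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # pass meta-information lines through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # First 9 columns are fixed VCF columns; then the requested samples.
            keep = list(range(9)) + [fields.index(s) for s in samples]
            yield "\t".join(fields[i] for i in keep)
            continue
        info = fields[7].split(";")
        if "VT=SV" in info or any(x.startswith("SVTYPE=") for x in info):
            yield "\t".join(fields[i] for i in keep)


def is_hom_del(record, sample_col):
    """True if the record is a deletion and the sample genotype is homozygous alt."""
    fields = record.split("\t")
    genotype = fields[sample_col].split(":")[0]
    return "SVTYPE=DEL" in fields[7].split(";") and genotype in ("1|1", "1/1")


# Made-up demo records (not real 1000 Genomes data).
DEMO = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA10851\tNA12878\tNA18505",
    "4\t1000\trs1\tA\tG\t100\tPASS\tVT=SNP\tGT\t0|1\t0|0\t1|1",
    "4\t5000\tsv1\tT\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=9000;VT=SV\tGT\t1|1\t0|1\t0|0",
])

out = list(extract_svs(DEMO.splitlines(), ["NA10851", "NA18505"]))
for row in out:
    print(row)
```

On real data the lines would come from an open file handle (for example `gzip.open(path, "rt")`) rather than the demo string, and `is_hom_del(record, 9)` would then test the first kept sample on each SV record.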
Thanks, Matt. I am familiar with Park; that is a good paper. I guess the supplement has all of the CNVs for the Asian samples (http://www.nature.com/ng/journal/v42/n5/full/ng.555.html#/supplementary-information), but I didn't see them for NA10851. They are also not in DGV under Park: http://projects.tcag.ca/variation/downloads/Park2010.txt