Question

Getting Common CNVs / SVs for HapMap/1KG Samples

12.7 years ago

Ryan D ★ 3.4k

We are trying to determine the sensitivity of our CNV detection method on various types of arrays. We have two HapMap/1000 Genomes samples, NA10851 and NA18505, that are plated multiple times on different array types. I understand that these are relatively well-characterized samples. Deletions are expected to be easier to detect than duplications. We were hoping to use the 1000 Genomes sequencing data as a gold standard to see how well we identify events where there is a homozygous deletion. I am pulling data from 1000 Genomes for these individuals using the following command.

~/tabix-0.2.5/tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 4:1-191154276 | perl ~/vcftools_0.1.7/perl/vcf-subset -c NA10851,NA18505 > chr4.vcf &

So my question has two parts: 1. Is there a way to modify this command so that only SVs are pulled (and not SNPs/indels)? 2. Is there a better source than 1000 Genomes for the known SVs (particularly deletions) in these individuals, against which we could compare the sensitivity of our detection on different arrays?

Thanks,

Ryan

Update: After downloading the 1000 Genomes files and extracting the SVs (n=14,422) for all 1,092 individuals, I joined all SVs into a single VCF file (a bit over 500 MB). I was able to output the VCF file for those two individuals much more quickly with vcftools using the command:

vcftools --vcf g1k_sv_all.vcf --keep ~/id_of_two.txt --recode --out NA10851_NA18505

Where id_of_two.txt is a file containing the NA10851 and NA18505 IDs, one per line. It ran in 18 seconds, whereas the Perl vcf-subset command had not finished after several hours.
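For part 1 of the question (keeping only SVs), here is a rough Python sketch of one way to do the filtering and the sample subsetting in a single pass. It assumes that SV records can be recognized by an SVTYPE key in the INFO column or by a symbolic ALT allele such as <DEL>, which is how the VCF specification describes structural variants; I have not checked that every record in the 1000 Genomes release follows that convention. The function name filter_sv_records and the demo records are made up for illustration.

```python
import io

def filter_sv_records(vcf_lines, samples):
    """Yield VCF lines keeping only SV records and the requested sample columns.

    Assumption: an SV record has an SVTYPE key in INFO or a symbolic ALT
    allele like <DEL>. Sample columns keep their original file order.
    """
    keep_idx = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # pass meta-information lines through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # The first nine columns are fixed; sample columns follow them.
            keep_idx = list(range(9)) + [
                i for i, name in enumerate(fields) if i >= 9 and name in samples
            ]
            yield "\t".join(fields[i] for i in keep_idx)
            continue
        if keep_idx is None:
            raise ValueError("data line found before the #CHROM header")
        alt, info = fields[4], fields[7]
        info_keys = {item.split("=", 1)[0] for item in info.split(";")}
        is_sv = "SVTYPE" in info_keys or (alt.startswith("<") and alt.endswith(">"))
        if is_sv:
            yield "\t".join(fields[i] for i in keep_idx)

# Hypothetical three-sample VCF with one SNP and one deletion.
demo = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA10851\tNA12878\tNA18505",
    "4\t1000\trs1\tA\tG\t100\tPASS\tVT=SNP\tGT\t0|1\t0|0\t1|1",
    "4\t5000\tsv1\tT\t<DEL>\t100\tPASS\tSVTYPE=DEL;END=9000\tGT\t1|1\t0|0\t0|1",
])
filtered = list(filter_sv_records(io.StringIO(demo), {"NA10851", "NA18505"}))
for out in filtered:
    print(out)
```

For a real file, the lines could come from gzip.open(path, "rt") instead of the in-memory demo. Homozygous deletions would then be the records whose genotype for a sample is 1|1 (or 1/1).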

cnv sv hapmap • 3.9k views

score 1 · Answer 1 · 2012-09-17

12.7 years ago

Matt Shirley 10k

For NA10851, you should take a look at Park, H. et al. Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nature Genetics 42, 400–405 (2010). This should give you a gold standard against which to compare your CNVs.

Based on what I could find, NA18505 may be more difficult. Check out Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–454 (2006).
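Once you have a gold-standard list, the comparison itself might look something like the Python sketch below. It counts a known deletion as detected if some call on the same chromosome shares at least 50% reciprocal overlap with it. The 50% threshold, the half-open (chrom, start, end) interval representation, and the function names are assumptions for illustration, not something taken from the papers above.

```python
def overlaps_reciprocally(a, b, min_frac=0.5):
    """Return True if intervals a and b share at least min_frac of each one.

    Intervals are (chrom, start, end) tuples, treated as half-open.
    """
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return False
    return shared >= min_frac * (a[2] - a[1]) and shared >= min_frac * (b[2] - b[1])

def sensitivity(truth, calls, min_frac=0.5):
    """Fraction of gold-standard events matched by at least one call."""
    if not truth:
        raise ValueError("the gold-standard list is empty")
    found = sum(
        any(overlaps_reciprocally(t, c, min_frac) for c in calls) for t in truth
    )
    return found / len(truth)

# Hypothetical coordinates: three known deletions, two array calls.
truth = [("4", 1000, 2000), ("4", 5000, 9000), ("4", 20000, 21000)]
calls = [("4", 1100, 2100), ("4", 5000, 6000)]
result = sensitivity(truth, calls)
print(f"sensitivity = {result:.3f}")
```

Here only the first known deletion is matched: the second call covers just a quarter of the 4 kb event, so it fails the reciprocal test. Running this per array type would give comparable sensitivity figures, though the right overlap threshold likely depends on probe density.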

Thanks, Matt. I am familiar with Park. That is a good paper. I guess the supplementary information has all of the CNVs for the Asian samples (http://www.nature.com/ng/journal/v42/n5/full/ng.555.html#/supplementary-information), but I didn't see them for NA10851. They are also not in DGV under Park: http://projects.tcag.ca/variation/downloads/Park2010.txt

Reply by Ryan D ★ 3.4k, 12.7 years ago

score 1 · Answer 2 · 2012-09-17

Matt got me looking through DGV a little further. It seems NA10851 has CNV calls made by many of the same authors who wrote the Park et al. paper. The link to that data is here. Altshuler et al. did NA18505 and other HapMap 3 samples in this paper, and the data for it is also in DGV, here. If someone can find better sources, please let me know.

I have also extracted the SVs for these two samples from the 1000 genomes data and am happy to provide the VCF or PLINK/PED file for anyone who cares to have it.