Getting Common CNVs / SVs for HapMap/1KG Samples
12.2 years ago
Ryan D ★ 3.4k

We are trying to determine the sensitivity of our CNV detection method for various types of arrays. We have two HapMap/1000 Genomes samples that are plated multiple times on different array types: NA10851 and NA18505. I understand that these are relatively well-characterized samples. Deletions are expected to be easier to detect than duplications. We were hoping to use the 1000 Genomes sequencing data as a gold standard to see how often we correctly identify homozygous deletion events. I am pulling data from 1000 Genomes for these individuals using the following command.

~/tabix-0.2.5/tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 4:1-191154276 | perl ~/vcftools_0.1.7/perl/vcf-subset -c NA10851,NA18505 > chr4.vcf &

So my question has two parts:

1. Is there a way to modify this command so that only SVs are pulled (and not SNPs/indels)?
2. Is there a better source than 1000 Genomes for the known SVs (particularly deletions) in these individuals, against which we could compare the sensitivity of our detection on different arrays?

Thanks,

Ryan

Update: after downloading the 1000 Genomes files and extracting the SVs (n=14,422) for all 1,092 individuals, I joined all SVs into a single VCF file (a bit over 500 MB). I was able to output the VCF for those two individuals much more quickly with vcftools using the command:

vcftools --vcf g1k_sv_all.vcf --keep ~/id_of_two.txt --recode --out NA10851_NA18505

where id_of_two.txt is a file containing the NA10851 and NA18505 IDs, one per line. It ran in 18 seconds, whereas the Perl script had not finished after several hours.
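For the first part of the question (keeping only SVs), one option is to filter the VCF stream on the INFO column before subsetting samples. The following is a minimal sketch, not the method actually used above. It assumes that SV records carry an `SVTYPE=` key in INFO, which is the convention in the VCF specification but should be checked against the actual file. The function names `is_sv_record` and `filter_sv_lines` are made up for illustration.

```python
def is_sv_record(line):
    """Return True if a VCF data line has an SVTYPE key in its INFO column.

    Assumption: SVs are annotated with SVTYPE=... in INFO (column 8).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return False
    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_keys = {entry.split("=", 1)[0] for entry in fields[7].split(";")}
    return "SVTYPE" in info_keys


def filter_sv_lines(lines):
    """Yield header lines unchanged and only those data lines that are SVs."""
    for line in lines:
        if line.startswith("#") or is_sv_record(line):
            yield line
```

Passing `sys.stdin` to `filter_sv_lines` and writing the result to `sys.stdout` would let this sit in a pipe between tabix and the sample-subsetting step, so that the slow subsetting only sees the much smaller set of SV records.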

cnv sv hapmap • 3.7k views
12.2 years ago

For NA10851, you should take a look at Park, H. et al. Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nature Genetics 42, 400–405 (2010). This should give you a gold standard against which to compare your CNVs.

Based on what I could find, NA18505 may be more difficult. Check out Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–454 (2006).


Thanks, Matt. I am familiar with Park. That is a good paper. I guess the supplementary information has all of the CNVs for the Asian samples (http://www.nature.com/ng/journal/v42/n5/full/ng.555.html#/supplementary-information), but I didn't see them for NA10851. They are also not in DGV under Park: http://projects.tcag.ca/variation/downloads/Park2010.txt .

12.2 years ago
Ryan D ★ 3.4k

Matt got me looking through DGV a little further. It seems NA10851 has CNV calls made by many of the same authors who wrote the Park et al. paper. The link to that data is here. Altshuler et al. did NA18505 and other HapMap 3 samples in this paper, and the data for it is also in DGV, here. If someone can find better sources, please let me know.

I have also extracted the SVs for these two samples from the 1000 genomes data and am happy to provide the VCF or PLINK/PED file for anyone who cares to have it.
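With a truth set of deletions and a set of array calls for the same sample, sensitivity can be estimated as the fraction of truth deletions matched by at least one call. The sketch below is one way to do that, under stated assumptions: intervals are `(chrom, start, end)` tuples, and a match requires 50% reciprocal overlap. That threshold is a common convention and not something established in this thread, and the function names are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions between two intervals."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def sensitivity(truth, calls, min_overlap=0.5):
    """Fraction of truth intervals matched by at least one call.

    truth, calls: lists of (chrom, start, end) tuples.
    min_overlap: assumed reciprocal-overlap threshold for calling a match.
    """
    if not truth:
        raise ValueError("truth set is empty")
    detected = sum(
        1
        for t_chrom, t_start, t_end in truth
        if any(
            c_chrom == t_chrom
            and reciprocal_overlap(t_start, t_end, c_start, c_end) >= min_overlap
            for c_chrom, c_start, c_end in calls
        )
    )
    return detected / len(truth)
```

Running this separately per array type, with the truth set restricted to homozygous deletions in the sample, would give the per-array comparison described in the question. The all-pairs loop is fine for a few thousand events; a larger truth set would call for an interval index.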
