Hi.
I'm writing a simple CNV detector (DoC method based) and I would like to verify it's working well. Could you recommend some alignment .bam files with reported CNV I could test my prototype with?
Regards.
If you want to look at established copy number polymorphisms (CNPs), the Database of Genomic Variants (DGV) has a number of files that can be downloaded and adapted. Many of them are based on the most common HapMap reference samples and contain finely mapped, validated losses and gains. You can download them here:
http://projects.tcag.ca/variation/tableview.asp?table=DGV_Content_Summary.txt
The current DGV table for hg19 also lists many CNVs, along with the method by which each was identified:
http://projects.tcag.ca/variation/downloads/variation.hg19.v10.nov.2010.txt
As Chris said, if you have array data for your samples, and some of them are among the well-characterized ones, you will have an easier time validating whether or not your detector is working. The "gold standard" of CNVs used by one paper (for array, not sequencing data) is described here:
http://www.nature.com/nbt/journal/v29/n6/full/nbt.1852.html#/supplementary-information
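One way to compare your calls against array-derived CNVs for the same sample is reciprocal overlap. The following is only a minimal sketch: the function names, the `(chrom, start, end)` tuple layout, and the 50% threshold are assumptions chosen for illustration, not something taken from the paper above.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two half-open intervals (chrom, start, end).

    Returns the smaller of the two overlap fractions, or 0.0 if the
    intervals are on different chromosomes or do not overlap.
    """
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def concordance(seq_calls, array_calls, min_ro=0.5):
    """Fraction of array calls matched by at least one sequencing call.

    min_ro is an assumed threshold; tune it to the resolution of the array.
    """
    if not array_calls:
        return 0.0
    matched = [
        c for c in array_calls
        if any(reciprocal_overlap(c, s) >= min_ro for s in seq_calls)
    ]
    return len(matched) / len(array_calls)


# Made-up coordinates, purely for illustration.
array_calls = [("chr1", 1000, 2000), ("chr2", 500, 900)]
seq_calls = [("chr1", 1100, 2100)]
fraction_matched = concordance(seq_calls, array_calls)
```

This only tells you how many of the array events you recover. Calls you make that the array does not report need a separate look, since the array may simply lack probes in that region.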
You may also want to check the answers to this question, and see whether any gold standard sample BAM file from 1000G might suit your purposes:
What Are The 'Copy Number Detection' Tools Out There For Exome Capture Ngs Data.
I'm not aware of any "gold standard" CNV BAMs (though it's certainly possible that I just haven't heard about them).
The usual approach to evaluating performance of CNV detection that I've seen in the literature is twofold:
1) Simulation: generate some reads that contain a CNV. Ideally, this should take into account a bunch of factors like GC-bias, mappability, and random variance in the coverage of the genome. Can your method reliably detect these at different CNV sizes and depths of coverage?
2) Concordance with arrays: Look at a sample that has both sequencing reads and high-resolution array data (Affy SNP 6.0 is common, but high-res Agilent or Illumina arrays would be fine too). Make sure that you're detecting the same events, at least at the gross scale.
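As a very rough illustration of the simulation idea in point 1, here is a sketch that skips read generation entirely and simulates per-window read depth directly as Poisson counts, with a deletion or gain spiked in. Everything here is an assumption made for the example: the function names, the window counts, the diploid baseline, and the naive log2-ratio threshold. It deliberately ignores GC bias and mappability, which a realistic simulation should model.

```python
import math
import random
from statistics import median


def poisson(rng, lam):
    """Draw one Poisson(lam) sample with Knuth's method (fine for lam up to a few hundred)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_window_depth(n_windows, mean_depth, cnv_start, cnv_end,
                          copy_number, seed=0):
    """Simulate read depth per window for a diploid genome.

    Windows in [cnv_start, cnv_end) have their expected depth scaled by
    copy_number / 2 (1 = heterozygous deletion, 3 = single-copy gain).
    """
    rng = random.Random(seed)
    depths = []
    for i in range(n_windows):
        in_cnv = cnv_start <= i < cnv_end
        lam = mean_depth * (copy_number / 2.0 if in_cnv else 1.0)
        depths.append(poisson(rng, lam))
    return depths


def log2_ratios(depths):
    """Log2 ratio of each window against the median depth (0.5 pseudocount)."""
    base = median(depths)
    return [math.log2((d + 0.5) / (base + 0.5)) for d in depths]


# Heterozygous deletion covering windows 80..119 at a mean depth of 100.
depths = simulate_window_depth(200, 100, 80, 120, copy_number=1, seed=42)
ratios = log2_ratios(depths)
called = [i for i, r in enumerate(ratios) if r < -0.5]  # naive loss calls
```

Repeating this over a range of `mean_depth` values and CNV lengths, and counting how often the spiked-in windows are recovered, gives a first idea of sensitivity before moving on to simulated reads and real alignments.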
You'll also probably want to look at this previous question, which gives much the same answer: Validated Copy Number Variation(Cnv) Standard
Thanks, Chris, for the answer and for pointing to the other thread! If you have any recommendations for simulating CNVs (point 1 of your answer), feel free to tell me.