Why are SNP6 arrays and array CGH the gold standards for somatic copy number detection?
4.7 years ago

Why are array comparative genomic hybridization (CGH) and SNP6 arrays considered to be the gold standard for benchmarking somatic copy number variation (CNV) calls in tumor samples?

And would one expect data from these technologies to be more sensitive and precise than NGS approaches based on read depth? If so, why?

I am trying to reason why it is fair to consider array CGH and SNP6 as more reliable, and as a consequence, use them to classify CNV calls from NGS-based methods as false positives and false negatives, for example.
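
To make the benchmarking idea concrete, here is a minimal sketch of how NGS calls might be scored against an array-derived truth set. The segment tuples, the 50% reciprocal-style overlap rule, and the function names are all assumptions chosen for illustration; neither the CNVkit nor the PureCN paper is claimed to use exactly this scheme.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end, state), where state is "gain" or "loss".
# All coordinates below are made up for illustration.
Segment = Tuple[str, int, int, str]

def overlap_fraction(a: Segment, b: Segment) -> float:
    """Fraction of segment a that is covered by segment b."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return max(shared, 0) / (a[2] - a[1])

def matches(a: Segment, b: Segment, min_overlap: float) -> bool:
    """Same direction of change and enough of a covered by b."""
    return a[3] == b[3] and overlap_fraction(a, b) >= min_overlap

def benchmark(calls: List[Segment], truth: List[Segment],
              min_overlap: float = 0.5) -> Dict[str, int]:
    """Count true positives, false positives and false negatives."""
    tp = sum(any(matches(c, t, min_overlap) for t in truth) for c in calls)
    fp = len(calls) - tp
    fn = sum(not any(matches(t, c, min_overlap) for c in calls) for t in truth)
    return {"TP": tp, "FP": fp, "FN": fn}

truth = [("chr1", 1000, 5000, "gain"), ("chr2", 2000, 8000, "loss")]  # e.g. from the array
calls = [("chr1", 1200, 5200, "gain"), ("chr3", 100, 900, "gain")]    # e.g. from NGS
result = benchmark(calls, truth)
print(result)
```

Note that every "false positive" in such a table is only false relative to the array, which is exactly why the reliability of the reference matters.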

Array CGH and SNP6 are used for benchmarking the copy-number calling algorithms in the CNVkit paper (Talevich et al 2015) and the recent PureCN paper (Oh et al 2020), respectively.

Thanks!

CNV CGH SNP6 Copy Number NGS
4.7 years ago

You will find mixed opinions on this, for various reasons. My answer is in support of using the Affymetrix SNP 6.0 array for global copy number (CN) screening.

The sixth generation of Affymetrix SNP arrays was the Genome-Wide Human SNP Array 6.0 (Affymetrix SNP 6.0). Although it genotypes far fewer SNPs than the total number now known, it is capable of genotyping 906,600 SNPs and contains an additional 945,826 probes for detecting CN. Of the CN probes, 115,000 target CNVs that were already known at the time of array design. Overall, markers are distributed evenly across the genome, with a median marker spacing of 1-5Kb.
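
As a quick sanity check on those numbers, the arithmetic below computes the total marker count and a back-of-envelope mean spacing. The genome length of roughly 3.1 Gb is an assumption of mine, and a mean over the whole genome is not the same thing as the median spacing quoted above, so treat it only as an order-of-magnitude check.

```python
snp_probes = 906_600
cn_probes = 945_826
genome_bp = 3_100_000_000  # assumption: approximate haploid human genome length

total_markers = snp_probes + cn_probes
mean_spacing_bp = genome_bp / total_markers  # naive mean, ignores gaps and clustering

print(f"total markers: {total_markers:,}")
print(f"mean spacing: about {mean_spacing_bp / 1000:.1f} kb")
```

With about 1.85 million markers, the naive mean comes out at roughly 1.7 kb.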

The probes are 25-mers and only 40 ng/μl of DNA is required, which was (and is) much less than that required by the equivalent Illumina and Agilent arrays at the time. The SNP probes have 6 or 8 replicates (3 or 4 for each of the A and B alleles), while each CN probe has just a single replicate. The Birdseed algorithm was used (perhaps still is) to genotype the SNP probes, while the Canary algorithm was used to process the CN probes. In addition, by combining the SNP and CN data, one can also infer LOH via the CN5 BRLMM-P+ algorithm.

The beauty of the array, thus, is how evenly its probes are distributed across the genome (with some notable exceptions). Moreover, both the SNP and CN probes can be used to determine CN via the CRMA v2 method by Henrik Bengtsson. Once the copy number signal has been defined at each site, the Circular Binary Segmentation (CBS) algorithm, popular since around 2013, can be used to group these signals genome-wide into segments with similar CN profiles. It was the CBS algorithm that was ultimately used by the TCGA for all of their Level 3 CN data.
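
To illustrate what segmentation does to per-probe signals, here is a deliberately simplified sketch. It is plain recursive binary segmentation with a fixed score threshold, not the real CBS algorithm (which tests circularised sub-segments and uses permutation-based significance). The function names, the threshold of 0.3, and the toy log2 ratios are all assumptions for illustration.

```python
from statistics import mean

def best_split(values):
    """Return (index, score) of the split that best separates the two
    halves by mean, or (None, 0.0) if the input is too short to split."""
    n = len(values)
    best_i, best_score = None, 0.0
    for i in range(2, n - 1):  # keep at least 2 points on each side
        left, right = values[:i], values[i:]
        diff = mean(left) - mean(right)
        # Reduction in the sum of squares achieved by splitting at i
        score = diff * diff * len(left) * len(right) / n
        if score > best_score:
            best_i, best_score = i, score
    return best_i, best_score

def segment(values, start=0, threshold=0.3):
    """Recursively split; return (start, end, mean) tuples, end exclusive."""
    i, score = best_split(values)
    if i is None or score < threshold:
        return [(start, start + len(values), round(mean(values), 3))]
    return (segment(values[:i], start, threshold)
            + segment(values[i:], start + i, threshold))

# Toy log2 ratios along one chromosome: neutral, a gained block, neutral again
log2_ratios = [0.0, 0.1, -0.1, 0.0, 0.1, -0.1,
               1.0, 0.9, 1.1, 1.0,
               0.0, 0.1, -0.1, 0.0]
segments = segment(log2_ratios)
for seg in segments:
    print(seg)
```

On this toy input the noisy per-probe values collapse into three segments, with the gained block recovered between the two neutral flanks. For real data one would use an established implementation rather than a sketch like this.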

I have no huge issues with CN determination via NGS. Both technologies suffer from (but can both correct for) GC content biases, and both also suffer from repetitive sequence and sequence similarity. Some regions of the human genome are just impossible to sequence faithfully with short-read technology; however, a 25-mer probe could not faithfully genotype such a region either. On the other hand, the array will suffer from things like signal 'cross-talk', while an NGS instrument has its own major inherent issues (assuming sequencing by synthesis, SBS).
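
As a rough illustration of what "correcting for GC content" can mean, the sketch below rescales each bin's signal by the median signal of bins with similar GC content. This is one simple stratified-median approach, assumed here for illustration only; real tools typically fit a smoother (for example a loess curve), and the bin values, the 10% stratum width, and the function name are all made up.

```python
from collections import defaultdict
from statistics import median

def gc_correct(bins, gc_step=10):
    """bins: list of (gc_percent, depth) with integer GC percent.
    Returns depths rescaled so each GC stratum shares the global median.
    Assumes every stratum has a non-zero median depth."""
    strata = defaultdict(list)
    for gc, depth in bins:
        strata[gc // gc_step].append(depth)
    stratum_median = {key: median(vals) for key, vals in strata.items()}
    global_median = median(depth for _, depth in bins)
    return [depth * global_median / stratum_median[gc // gc_step]
            for gc, depth in bins]

# Toy data: depth peaks at mid GC and drops at high GC, with no true CN change
bins = [(35, 80.0), (38, 82.0), (45, 100.0), (47, 102.0), (62, 60.0), (65, 62.0)]
corrected = gc_correct(bins)
print([round(d, 1) for d in corrected])
```

After correction the depths sit close together, so the remaining variation is no longer dominated by GC content. The same idea applies to probe intensities on the array as to read depth in NGS bins.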

Some benefits of the array are that it is lightweight and can produce results rapidly. Also, many of the probes have been in development at Affymetrix for years (on previous arrays), and have therefore already passed through QC. Regarding NGS, I think that it is not yet clear that read depth is any more accurate for determining true CN than a simple probe intensity.

I will say this: I have used so many copy number calling programs for NGS data and only one gives me confidence in the results: Control-FREEC.

Kevin
