After reading the paper and docs, I am having a little trouble understanding the difference between calling at the segment level and at the bin level, and when to use each. I am running CNVkit on human exome sequencing data.
My questions are:
What use case would the segment-level calls be best for?
What use case would the bin-level calls be best for?
Which one is more accurate, and in what context?
What is the point of having two? Is the bin-level output noisier but at a higher resolution than the segment-level output? Assume default segmentation settings and bin sizes. I'm really confused here, so anything helps!
The two file types might be more intuitive if you have experience with an older microarray-based method, array comparative genomic hybridization (aCGH). Bins are equivalent to microarray probes there.
The bins provide a fine-grained genome-wide copy number signal plus some noise. Segmentation attempts to remove the noise and infer the location of discrete copy number alterations, i.e. the individual regions that have been duplicated or deleted. Segments that are not neutral (i.e. diploid, log2=0) are putative copy number alterations. CNVkit's call command helps infer more about the segments beyond their breakpoints.
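To make the "non-neutral segments are putative alterations" idea concrete, here is a minimal sketch in plain Python. It is not CNVkit's own logic: the rows are made up, only a subset of the columns found in a tab-separated .cns file is shown, and the 0.2 cutoff is an arbitrary value I picked for illustration rather than a recommended threshold.

```python
import csv
import io

# Made-up rows laid out like a tab-separated .cns file (subset of columns).
CNS_TEXT = (
    "chromosome\tstart\tend\tgene\tlog2\n"
    "chr1\t10000\t5000000\tGENE_A\t0.02\n"
    "chr1\t5000000\t7500000\tGENE_B\t0.58\n"
    "chr2\t20000\t9000000\tGENE_C\t-0.95\n"
)


def non_neutral_segments(cns_text, min_abs_log2=0.2):
    """Return segments whose log2 ratio is at least min_abs_log2 away from 0.

    min_abs_log2 is a hypothetical cutoff chosen for this example only.
    """
    reader = csv.DictReader(io.StringIO(cns_text), delimiter="\t")
    hits = []
    for row in reader:
        log2 = float(row["log2"])
        if abs(log2) >= min_abs_log2:
            hits.append((row["chromosome"], int(row["start"]), int(row["end"]), log2))
    return hits


putative = non_neutral_segments(CNS_TEXT)
for chrom, start, end, log2 in putative:
    kind = "gain" if log2 > 0 else "loss"
    print(f"{chrom}:{start}-{end} log2={log2:+.2f} putative {kind}")
```

In practice you would read a real .cns file instead of the inline string, and pick a cutoff that suits your sample's purity and noise level.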
So, in general, use the segments (.cns) for most follow-up analysis. The bin-level data (.cnr) is useful for plotting, for showing the level of support for each segment, and for tracking down potential artifacts such as especially noisy regions of the genome. Also, a small or single-exon CNA will typically not appear in the .cns file, but the .cnr file may show some evidence for the copy number change, which you could then look to confirm independently.
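Here is a rough sketch of how the two levels can be used together: for each segment, collect the bins that fall inside it, measure how much they scatter, and flag bins that sit far from the segment's log2 value. Again this is plain Python on made-up coordinates and values, not CNVkit's API, and the names (`bins_in_segment`, `summarize`, `outlier_delta`) and the 0.5 cutoff are my own assumptions for illustration.

```python
import statistics

# Made-up data: (chromosome, start, end, log2)
segments = [
    ("chr1", 0, 1000, 0.0),
    ("chr1", 1000, 2000, 0.6),
]
bins = [
    ("chr1", 0, 100, 0.05),
    ("chr1", 200, 300, -0.04),
    ("chr1", 400, 500, -0.90),  # a single deviant bin inside a neutral segment
    ("chr1", 600, 700, 0.02),
    ("chr1", 800, 900, -0.03),
    ("chr1", 1000, 1100, 0.55),
    ("chr1", 1300, 1400, 0.65),
    ("chr1", 1700, 1800, 0.60),
]


def bins_in_segment(segment, bins):
    """Return the bins that lie entirely within the segment."""
    chrom, seg_start, seg_end, _ = segment
    return [b for b in bins if b[0] == chrom and b[1] >= seg_start and b[2] <= seg_end]


def summarize(segment, bins, outlier_delta=0.5):
    """Summarize bin-level support for one segment.

    outlier_delta is a hypothetical cutoff: bins whose log2 differs from the
    segment's log2 by at least this much are reported for a closer look.
    """
    members = bins_in_segment(segment, bins)
    values = [b[3] for b in members]
    spread = statistics.pstdev(values) if values else None
    outliers = [b for b in members if abs(b[3] - segment[3]) >= outlier_delta]
    return {"n_bins": len(members), "spread": spread, "outliers": outliers}


reports = [summarize(seg, bins) for seg in segments]
for seg, rep in zip(segments, reports):
    print(seg, rep["n_bins"], rep["outliers"])
```

A flagged bin like the one at chr1:400-500 is only a hint. It could be a real single-exon deletion or just a noisy target, which is why independent confirmation matters.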
Thank you very much.