Hi all,
I need some help with understanding the output of CNVkit, specifically Segmented log2 ratios (.cns) and the exported CNVs in VCF format.
I'm looking to get the copy number of every region found by CNVkit.
For Segmented log2 ratios (.cns) file:
Please correct me if I'm wrong: to get the actual estimated copy number, I should simply anti-log the log2 column, right?
In that case,
What's the correlation/connection, if there is any, between the log2 value in the .cns file and the inferred SVLEN value in the .vcf file?
Is there any connection to the "CN" (copy number genotype...) value in the .vcf? Also, why does "CN" only appear in duplication events? I tried calculating the copy number from the log2 value in the .cns file, but the values are different from what I expected.
An example if it helps:
For a certain region in the .cns file, the log2 value is 0.382191.
In the .vcf file, SVLEN is 6812878 and the CN value is 3. What is the copy number then?
Thanks,
Alon
Thank you very much for the detailed answer!
I assume that for copy number losses the calculation is the same?
Also, how would I go about finding the "original" (reference) copy number, so I can find out exactly what the copy number was before the mutation / aberration?
Yes, the calculation is the same for copy number losses.
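As a rough sketch of that calculation, assuming a diploid reference (2 copies), a pure tumor sample, and no subclonal mixture: the estimated copy number is the reference ploidy times 2 raised to the log2 ratio. The function name below is just for illustration and is not part of CNVkit.

```python
def copy_number_from_log2(log2_ratio, ploidy=2):
    """Estimate absolute copy number from a segment's log2 ratio.

    Assumes the reference has `ploidy` copies at this locus and that the
    sample is 100% pure; both are simplifying assumptions.
    """
    return ploidy * 2 ** log2_ratio


# Gain: the log2 value from the example in the question.
gain = copy_number_from_log2(0.382191)
print(round(gain, 2), round(gain))  # about 2.61, which rounds to 3

# Loss: the same formula applies; log2 = -1 means half the reference copies.
loss = copy_number_from_log2(-1.0)
print(loss)  # 1.0
```

With these assumptions, the log2 value of 0.382191 from the question gives roughly 2.61 copies, which rounds to the CN of 3 reported in the VCF. Simply anti-logging the value without multiplying by the ploidy gives the ratio to the reference (about 1.3), not the copy number.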
To find the copy number status of the normal sample, just run the same pipeline on it. If the normal sample was included in the CNV reference (cnv_reference.cnn), you can alternatively run the pipeline on the normal sample using a "flat" reference instead.
Great. If I don't have a normal sample should I create a "flat" reference from the reference.fa file and then run batch on it to get the normal copy number so I could calculate the exact copy number loss/gain?
To be more specific, I want to get the original copy number so I could know the number of copies in the ref as opposed to the tumor. For example, I now see that I have 3 times the copy number of the ref, but what was the original copy number and consequently, what's the tumor copy number?
Yes. The original copy number is the ploidy of your organism, e.g. humans are diploid, 2 copies of each autosome, and the sex chromosomes are XX or XY normally. If you use a flat reference for both tumor and normal, then you can interpret the log2 values as they are. If you used a single normal reference, then you should first check that the normal sample is copy-number-neutral at the location of interest (it probably is) before interpreting the tumor log2 ratio.
Regarding SVLEN -- this is just the length of the altered genomic region, in base pairs. It's not related to the log2 value or copy number.
So is it safe to assume that if the log2 < 0 it's a deletion, otherwise a dup?
Sort of. If the log2 value is close to 0, it could instead just be noise or imperfect centering. But it is true that the neutral value, i.e. the cutoff between loss and gain, is zero. In array CGH analysis (which CNVkit mimics) it's common to treat log2 values between -0.2 and +0.2 as effectively neutral copy number, and to focus on greater deviations from zero.
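To make that rule of thumb concrete, here is a minimal sketch that labels a segment by its log2 value using a +/- 0.2 neutral band. The function name and the default threshold are illustrative assumptions, not CNVkit's own calling logic, and the right band depends on how noisy your data are.

```python
def classify_log2(log2_ratio, neutral_band=0.2):
    """Label a segment as gain, loss, or neutral from its log2 ratio.

    Values within +/- neutral_band of zero are treated as neutral, since
    they may reflect noise or imperfect centering rather than a real event.
    """
    if log2_ratio > neutral_band:
        return "gain"
    if log2_ratio < -neutral_band:
        return "loss"
    return "neutral"


for value in (0.382191, 0.1, -0.5):
    print(value, classify_log2(value))
```

Here the example segment from the question (0.382191) is labeled a gain, while a value such as 0.1 stays neutral.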