I have data from three tumor samples from the same patient at different stages. The samples are derived from FFPE sections, and they underwent exome sequencing.
I used CNVkit to calculate the copy number ratio and segment the genome with parameters -m cbs -t 1e-4 --drop-low-coverage --drop-outliers 5.
I noticed that I have copy number gain or loss throughout the entire genome, and when I plot the distribution of the log2 copy number ratio I get two peaks, above and below log2==0 (I would expect many segments to have log2==0). Is it possible that the log2 values are shifted in some way from log2==0?
When copy number ratios are calculated, you need to normalize tumor and normal for sequencing coverage differences. The problem is, however, that tumor cells can have more or less DNA than normal. Since you don't know the ploidy (average tumor copy number) when you calculate the log-ratios, you cannot adjust for ploidy differences at this step.
In your case, it looks like you have more losses than gains, so your tumor ploidy is below 2. Because the ratio calculation implicitly assumes a ploidy of 2, the coverage is slightly overestimated and the peaks are shifted to the right. If the tumor ploidy were above 2, you would "remove" too many reads in the coverage normalization and the peaks would be shifted to the left.
See the ABSOLUTE paper for a nice explanation of how tumor purity, ploidy and log-ratios relate to each other.
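To make the direction of the shift concrete, here is a minimal sketch of the usual purity/ploidy relation for the expected log-ratio of a segment. The function name and the example values (purity 0.8, ploidy 1.6) are made up for illustration and are not taken from your data; the formula assumes the normal cells are diploid and that coverage is normalized to the sample-wide average.

```python
import math

def expected_log2_ratio(copy_number, purity, ploidy):
    """Expected log2 ratio of a segment, assuming diploid normal cells.

    copy_number: tumor copy number of the segment
    purity:      fraction of tumor cells in the sample (0..1)
    ploidy:      average tumor copy number across the genome
    """
    # DNA contributed at this segment by tumor cells plus diploid normal cells.
    segment_signal = purity * copy_number + (1.0 - purity) * 2.0
    # Average DNA content per cell across the genome; coverage normalization
    # effectively divides by this, which is where ploidy enters.
    sample_average = purity * ploidy + (1.0 - purity) * 2.0
    if segment_signal <= 0:
        return float("-inf")  # homozygous deletion in a 100% pure sample
    return math.log2(segment_signal / sample_average)

# Hypothetical tumor with more losses than gains (ploidy below 2):
two_copy = expected_log2_ratio(2, purity=0.8, ploidy=1.6)
one_copy = expected_log2_ratio(1, purity=0.8, ploidy=1.6)
print(two_copy, one_copy)
```

With a ploidy below 2, the two-copy segments come out above zero and the one-copy segments below zero, which is the two-peak pattern you describe. With a ploidy of exactly 2 the two-copy peak sits at 0.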
To help address this issue in your data without a full analysis of tumor ploidy and heterogeneity, you can run the call command with the option --center median. In CNVkit 0.8.5, this takes a "majority rules" decision on the appropriate center value by first taking the median log2 value of each chromosome, then the median of those values, which ensures that at least one chromosome with fairly representative ploidy will be centered at a log2 value of 0.0.
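The idea of a median of per-chromosome medians can be sketched in a few lines. This is only an illustration of the description above, not CNVkit's actual code: the function names and the input layout (a list of (chromosome, log2) pairs) are assumptions, and it ignores details such as bin weights and sex chromosomes.

```python
from collections import defaultdict
from statistics import median

def median_of_chromosome_medians(bins):
    """bins: iterable of (chromosome, log2) pairs. Returns the center value."""
    by_chrom = defaultdict(list)
    for chrom, log2 in bins:
        by_chrom[chrom].append(log2)
    # First the median within each chromosome, then the median across
    # chromosomes, so a single heavily altered chromosome cannot dominate.
    return median(median(values) for values in by_chrom.values())

def recenter(bins, center):
    """Subtract the chosen center so that it becomes log2 == 0."""
    return [(chrom, log2 - center) for chrom, log2 in bins]

# Toy data: two chromosomes sit at +0.3, one chromosome is lost (-0.7).
toy_bins = [
    ("chr1", 0.3), ("chr1", 0.3), ("chr1", 0.3),
    ("chr2", 0.3), ("chr2", 0.3),
    ("chr3", -0.7), ("chr3", -0.7), ("chr3", -0.7), ("chr3", -0.7),
]
center = median_of_chromosome_medians(toy_bins)
centered = recenter(toy_bins, center)
```

In this toy case the center is the +0.3 level shared by chr1 and chr2, so after re-centering those bins sit at 0 and the chr3 bins move to about -1.0.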
It's still possible for this approach to leave the log2 values centered off from the true neutral value. In the development version of CNVkit (available on GitHub), there is a new option --center-at which lets you specify the log2 value that you have independently determined should be the neutral value.
@markus, what is the basis for saying there are more losses than gains in the OP's post? Is it the right skew? Thanks.
Yeah, it's not 100% clear from the plots, because it looks like the log ratios are not weighted by segment size. But the first peak is always higher than the third+ (assuming the second corresponds to normal 2). That's all.
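For reference, weighting the log ratios by segment size is straightforward to do yourself. The sketch below is a hypothetical helper, not part of CNVkit; it assumes the segments are available as (start, end, log2) tuples and uses an arbitrary bin width of 0.1. It reports the fraction of the segmented genome falling into each log2 bin rather than the number of segments.

```python
from collections import Counter

def weighted_log2_histogram(segments, bin_width=0.1):
    """segments: iterable of (start, end, log2) tuples.

    Returns {bin_center: fraction of total segment length}, so long
    segments count for more than short ones.
    """
    weights = Counter()
    for start, end, log2 in segments:
        bin_index = round(log2 / bin_width)  # nearest bin center
        weights[bin_index] += end - start    # weight by segment length
    total = sum(weights.values())
    return {
        round(index * bin_width, 6): weight / total
        for index, weight in sorted(weights.items())
    }

# Toy example: one long segment near -0.4 and two short ones near +0.3.
toy_segments = [
    (0, 9_000_000, -0.42),
    (0, 500_000, 0.31),
    (0, 500_000, 0.29),
]
histogram = weighted_log2_histogram(toy_segments)
```

Counting segments, the +0.3 bin would look twice as large as the -0.4 bin here; weighted by length, the -0.4 bin holds 90% of the segmented genome.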
cpad0112, there is further support for that from the BubbleTree plots (generated using this R package). The x-axis shows the copy number ratio relative to the normal sample. The y-axis, HDS, is related to LOH events: it depicts the deviation from heterozygosity of SNPs in the tumor. The peak (cluster of bubbles) with the lower copy number ratio values also shows extensive LOH, so it makes sense that it represents only 1 copy of the chromosome.
Thank you, Markus.