Hi,
I have a callset from cancer NGS data (tumor/normal design, ~100 samples) with CNVs already called. An additional complication is that some samples are close to a triploid or tetraploid state.
I study the impact of genomic variants on therapy response.
In some papers I have come across the term "ploidy-corrected CNVs", but without a clear explanation of how the correction was done. If I understood correctly, people take the average ploidy across the genome as a baseline and call everything more than 0.5 copies below this baseline a deletion and everything more than 0.5 copies above it a duplication. So a diploid gene X in a tetraploid sample is considered largely deleted.
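To make my reading of that rule concrete, here is a minimal sketch. The function name and the `margin` parameter are my own, hypothetical choices, and this only illustrates the rule as I understood it, not how any particular paper implements it.

```python
def classify_relative_to_ploidy(copy_number, ploidy, margin=0.5):
    """Label a segment relative to the sample's average ploidy.

    Assumption: 'ploidy corrected' means using the genome-wide average
    ploidy as the baseline, with a +/- margin (0.5 copies here).
    """
    if copy_number < ploidy - margin:
        return "deletion"
    if copy_number > ploidy + margin:
        return "duplication"
    return "neutral"


# A diploid gene in a tetraploid sample falls below the 3.5-copy cutoff.
print(classify_relative_to_ploidy(2, 4.0))  # deletion
# The same gene in a diploid sample sits at the baseline.
print(classify_relative_to_ploidy(2, 2.0))  # neutral
```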
Questions:
1) How do I perform ploidy correction correctly?
2) Should we correct for ploidy in tumors with large-scale deletions, e.g. where the average ploidy is 1.5 (half of the genome was deleted)? Common sense tells me no, but I wonder whether that is correct.
3) Bonus question: do you know how to take the clonality of a variant (expressed as cancer cell fraction, CCF) into account? A duplication in 10% of cells seems likely to affect therapy response less than a duplication in 90%, but I also doubt that the effect of CCF is linear.
Update: there is a related question, "Ploidy in copy number analysis", but it is not exactly about this. The answer there says "Otherwise, I agree with you, calling relative to ploidy doesn't make a lot of sense biologically." As I see it, though, what matters is relative gene dosage, not absolute, and if everything is tetraploid we may expect some sort of gene product balancing.
Update 1: I apologize for the wrong usage of terminology. I am blocked from responding until tomorrow because I have reached the 5-post limit on this website.
Thanks for the answer! I use my own tool, ClinCNV, which does all these things. I just wonder how to interpret the data when I want to feed calls of the form (sample purity, sample ploidy, gene copy number, CCF of this gene copy number) into a machine learning model. Some ways of adjusting for ploidy give me good results with lasso regression and some do not, and it is not clear from a biological point of view what counts as a deletion and what as a duplication in a non-diploid sample. I understand your point 2), but to me a sample with ploidy 1.25 is totally different from a sample with ploidy 2.5, even if it looks like just a linear scaling: the first one does not have a full diploid genome, so it seems we should treat them differently. I don't know whether there are any literature sources for that.
In a regression, I would probably just use a gene-level log-ratio, ideally purity and ploidy adjusted; see for example Zack et al. CCFs of CNAs are tricky, except maybe for single deletions and gains, so be careful here: GIGO. If ploidy adjustment has a big impact, then maybe what you are seeing is that genome doublings are correlated with your outcome of interest.
It looks like you came here for a discussion as opposed to having any particular question answered (?). You say you have your own tool, so, you obviously already understand the very question that you have asked (?). The answer by Markus is very good.
I personally only use Control-FREEC for everything copy number related these days. I actually still see some people merely comparing log2 ratios, like, just dividing them, and calling that copy number, even in cancer samples. Control-FREEC adjusts for purity, ploidy, and GC content, too.
I am quite confident in terms of CNA calling. What I don't understand is the interpretation. I will try to reformulate my question. When we have a tetraploid sample and gene X (and only gene X) has copy number 3, is it a deletion or a duplication? How should I transform it for use as input to a machine learning model that has tumor growth in response to therapy as Y?
A raw CN of 3 in a tetraploid sample would be a 1-copy deletion; or, if the 3 is a linear ratio relative to a normal sample (also tetraploid), then it is a raw CN of 12. To me, it makes more sense to use ploidy-corrected log2 ratios as input to any downstream modeling, but I have never exactly worked extensively in this area.
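As a rough illustration of that arithmetic, here is a small sketch. It assumes that "ploidy-corrected" means expressing the absolute copy number relative to the sample ploidy; the function names are hypothetical, and other definitions of the correction exist.

```python
import math


def copies_vs_baseline(copy_number, ploidy):
    """Difference in copies between a gene and the sample's ploidy baseline."""
    return copy_number - ploidy


def ploidy_relative_log2(copy_number, ploidy):
    """log2(copy number / ploidy); 0 means the gene sits at the baseline.

    Note: copy_number == 0 (homozygous deletion) raises ValueError here,
    so a real pipeline would need a floor or pseudo-count.
    """
    return math.log2(copy_number / ploidy)


# Gene X with raw CN 3 in a tetraploid sample: one copy below baseline.
print(copies_vs_baseline(3, 4))                # -1
print(round(ploidy_relative_log2(3, 4), 3))    # -0.415

# If the 3 is instead a linear ratio against a tetraploid baseline,
# the implied raw copy number is 3 * 4 = 12.
print(3 * 4)                                   # 12
```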
Yep, and this is the thing... Samples rarely have a clean integer ploidy, e.g. https://iovs.arvojournals.org/article.aspx?articleid=2518413, fig 2, and about 20 percent of samples are polyploid, so throwing them out would be strange, and our collaborators would not understand why we threw away 20 percent of their money and effort :(
I don't understand why you would have to throw away the samples? I would just aim to adjust for the ploidy. You could even do some clonality analysis, broadly speaking. Some people I knew at UCL Cancer Institute have done some great work in that area.
Thanks for the answer! Do you know any papers from that UCL lab? And I still don't understand what it means to adjust for ploidy. Is it "to establish the baseline correctly"?
Well, I'm on one manuscript that is currently under review with them. The particular one to which I was referring, though, is not yet published (and maybe not even submitted), but it relates to being able to define a lineage of the clonotypes that exist in a tumour.
Yes, as far as I am aware, adjusting for ploidy means to bring everything to baseline.
You mean adjusting the measured log2-ratio for purity and ploidy? ABSOLUTE and similar algorithms calculate integer copy numbers for all segments. If you need adjusted log-ratios, you can do some simple algebra (see Zack et al.). The end result is that diploid regions are centered at 0 (because log2(2/2) = 0), single losses are at -1 (log2(1/2) = -1), etc., independent of purity and ploidy.
They did a great job. I'll try to find the method Zack described in his paper (he cites it as unpublished) for reconstructing the maximum parsimonious sequence of CNA events. Hopefully it will help.
I mean, maybe I explained it wrong, but basically my tool identifies the diploid parts and performs calling relative to diploid (just like FACETS does), so in theory there is no need for ploidy adjustment. I just thought that such an adjustment is performed for better interpretability, not for better calling :(
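A minimal sketch of that algebra follows. It assumes the commonly used mixture model in which the observed ratio is (purity * CN + 2 * (1 - purity)) / (purity * ploidy + 2 * (1 - purity)), with a diploid normal contaminant. The function names and the `floor` parameter are hypothetical, and this is not necessarily the exact formulation in Zack et al.

```python
import math


def absolute_copy_number(log_ratio, purity, ploidy, normal_cn=2):
    """Invert the purity/ploidy mixture model to get tumor copy number.

    Assumed model: ratio = (purity * CN + normal_cn * (1 - purity))
                         / (purity * ploidy + normal_cn * (1 - purity))
    """
    ratio = 2.0 ** log_ratio
    normal_part = normal_cn * (1.0 - purity)
    average_cn = purity * ploidy + normal_part
    return (ratio * average_cn - normal_part) / purity


def adjusted_log_ratio(log_ratio, purity, ploidy, floor=0.01):
    """log2(CN / 2): 0 for two copies, -1 for one copy, and so on."""
    # Floor the copy number so noisy or homozygously deleted segments
    # do not produce log2 of zero or of a negative number.
    cn = max(absolute_copy_number(log_ratio, purity, ploidy), floor)
    return math.log2(cn / 2.0)


# Tetraploid tumor at 50% purity: a two-copy segment is observed at
# log2(2/3), yet its adjusted log-ratio comes back at (about) 0.
observed = math.log2(2.0 / 3.0)
print(round(absolute_copy_number(observed, purity=0.5, ploidy=4.0), 6))
print(round(abs(adjusted_log_ratio(observed, purity=0.5, ploidy=4.0)), 6))
```

With these definitions a one-copy segment in the same sample is observed at a log-ratio of -1 and stays at -1 after adjustment, which matches the "single losses at -1" statement.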
We commented at the same time. It may be reasonable to assume diploid, but obviously you just have to state that in your methods. Not all tumours exhibit large-scale aberrations, after all. The way that Control-FREEC tests ploidy is by the user merely selecting ploidy states to test. The algorithm will then test each over each region, and choose the ploidy that best explains the data at hand.
Yeah, I am asking about tumors that exhibit such almost whole-genome alterations :( I just cannot throw them away because they are polyploid and there is no consensus; these samples were obtained with a lot of effort from my colleagues. Control-FREEC, as far as I know, does not work with heterogeneous tumors, but since all my samples went through several rounds of therapy, all of them consist of multiple clones due to selective pressure :(
Yes, this FACETS heuristic to find the diploid state works very well in the majority of cases, but indeed fails in cases without clear diploid state (it warns you though). The alternative is trying all reasonable purity and ploidy cases (e.g. what Sequenza and many others do), at the cost of larger runtime. Most tools support at least some degree of heterogeneity.
Thanks! Will check Sequenza.