Question

correlating focal CNV values with RNA-seq expression

0

5.0 years ago

demoraesdiogo2017 ▴ 110

Hello

I would like to use the masked copy number segment from TCGA found on Xena browser and correlate it with gene expression values. Both data sets can be found here https://xenabrowser.net/datapages/?cohort=GDC%20TCGA%20Liver%20Cancer%20(LIHC)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

The Xena web tools themselves support this kind of correlation and give plausible results: CNV values often do correlate with expression level. On the other hand, I see that in the GISTIC pipeline only genes with a value ≤ -0.3 or ≥ 0.3 are considered deleted/amplified.
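
As a rough illustration of the hard thresholding described above, here is a minimal Python sketch. The function name `call_cnv` and the example values are hypothetical, and the ±0.3 cutoff is simply the one quoted in this post, not something taken from the GISTIC code itself.

```python
def call_cnv(log2_ratio, threshold=0.3):
    """Classify a focal CNV value with a symmetric hard cutoff (hypothetical helper)."""
    if log2_ratio >= threshold:
        return "amplified"
    if log2_ratio <= -threshold:
        return "deleted"
    return "neutral"


# Made-up focal CNV values, for illustration only.
example_values = [-0.45, -0.1, 0.0, 0.29, 0.3, 1.2]
calls = [call_cnv(v) for v in example_values]
print(list(zip(example_values, calls)))
```

Note how 0.29 and 0.3 land in different classes even though they are nearly identical, which is part of why I am asking about using the continuous values instead.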

I have seen a few other papers that did something similar, but I would like to know whether a statistical test like this makes sense and, if so, which expression normalization would be best.

CNV RNA-seq • 1.9k views

ADD COMMENT • link updated 5.0 years ago by jared.andrews07 ★ 18k • written 5.0 years ago by demoraesdiogo2017 ▴ 110

0

What exactly is your statistical test?

ADD REPLY • link 5.0 years ago by German.M.Demidov ★ 2.9k

0

Pearson correlation; I forgot to mention that.

ADD REPLY • link 5.0 years ago by demoraesdiogo2017 ▴ 110

0

Pearson correlation can be fooled in many ways =) See https://en.wikipedia.org/wiki/Anscombe%27s_quartet. I would not recommend it unless you are specifically interested in this measure and are sure that the data meet its assumptions.
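
A tiny sketch of one way this can happen, with made-up numbers (not TCGA data) and a hand-written `pearson` helper so that it runs without extra packages: a single extreme point flips a perfect negative correlation into a strong positive one.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Five points on a perfectly decreasing line (made-up values).
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
r_without = pearson(x, y)

# The same points plus one extreme observation.
r_with = pearson(x + [20], y + [20])

print(f"without outlier: {r_without:.2f}, with outlier: {r_with:.2f}")
```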

ADD REPLY • link 5.0 years ago by German.M.Demidov ★ 2.9k

score 2 · Answer 1 · 2019-11-13

2

5.0 years ago

jared.andrews07 ★ 18k

This is going to be heavily sample-dependent, particularly if you have mixed tissue where disease burden may vary. The amplification/deletion thresholds can (and should) change between samples, often requiring the user to set them by eye. I would likely just directly compare normalized gene counts (via salmon/kallisto -> DESeq2/edgeR or similar) in altered regions between samples with the CNV(s) of interest and those without, using a Mann-Whitney U test or something similar.
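
A minimal sketch of that group comparison, assuming you already have normalized counts for one gene split by CNV status. The numbers and the helper names (`average_ranks`, `mann_whitney_u`) are made up for illustration. It uses a normal approximation without tie correction to stay dependency-free; in practice something like `scipy.stats.mannwhitneyu` is the more sensible choice, and for groups this small an exact test would be preferable.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """U statistic for group_a and a two-sided normal-approximation p-value."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_value


# Hypothetical normalized counts for one gene.
with_cnv = [310.0, 295.5, 402.1, 350.7]
without_cnv = [120.3, 150.8, 98.4, 201.9]
u, p = mann_whitney_u(with_cnv, without_cnv)
print(f"U = {u}, approximate p = {p:.4f}")
```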

ADD COMMENT • link 5.0 years ago by jared.andrews07 ★ 18k

0

Dichotomization is bad =( Why not Spearman correlation?

ADD REPLY • link 5.0 years ago by German.M.Demidov ★ 2.9k

0

Care to elaborate? If it fits your question, I see no reason why dichotomization is inherently bad. Then again, I'm no stats guru. You can totally take a correlation approach too, which may be a bit easier if you have a lot of variability in the log2 ratios for CNAs, as you can avoid setting a hard and fast threshold for CNAs and just correlate the gene counts with the log2 ratios instead.

ADD REPLY • link 5.0 years ago by jared.andrews07 ★ 18k

2

Your approach is valid, of course. I was mainly thinking about difficult cases: malignancies of low purity (rich in stromal and immune cells), which will be cut out by hard thresholds, or genes with small changes in expression. Dichotomization lowers the power of detection. Small sample size is a difficult case too, as are samples with small subclones. I am actually not sure whether the correlation approach is better, which is why I asked. Simulations are needed.

ADD REPLY • link 5.0 years ago by German.M.Demidov ★ 2.9k

2

That's fair, all good points. Both are pretty easy to do, and I'd expect them to be complementary. Thanks for explaining.

ADD REPLY • link 5.0 years ago by jared.andrews07 ★ 18k

0

Thank you both for the insightful discussion. Let me go into more detail about our pipeline. A lot of samples do have low purity and likely have normal tissue mixed in (the truth about TCGA: a lot of samples, but low quality, imo). What I did first was a global comparison between groups I suspected to differ in CNVs (discovery set), and I found a few amplified genes in a high-risk group using a chi-square test based on the GISTIC threshold. Then I wanted to see whether patients with those amplified genes really had lower survival (validation set), which they did. Now I want to see whether patients with those CNVs indeed express the genes more, which is how we came to consider a Pearson correlation test between the continuous focal CNV values and expression in all samples (both validation and discovery).

I was reluctant to use Pearson because we discovered those genes using dichotomous tests first. If you both believe those tests should complement each other, I feel more confident in this pipeline now. Does that make sense?

ADD REPLY • link 5.0 years ago by demoraesdiogo2017 ▴ 110

1

I would modify this a bit, but it is too long to elaborate, and I believe that you know your data better than I do, so it is fine. Check out fig 4 from this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4405206/. Pearson correlation, which assumes linearity and homogeneity of variance [to be meaningful], may not be a good choice. Use Spearman instead (I know I have said it several times already, but I repeat it once again XD).

ADD REPLY • link 5.0 years ago by German.M.Demidov ★ 2.9k

0

Haha, I must say that even though I know the data, I don't feel 100% confident in our decisions yet. Thank you for the paper; I believe we will make a box plot similar to fig 4 for the genes we found. Regarding Pearson, we also tested log2-transformed values, but now that you mention Spearman's test, it makes sense that it is superior, and I believe it would make the log2 transformation unnecessary, which is nice.

ADD REPLY • link 5.0 years ago by demoraesdiogo2017 ▴ 110

1

I would not call Spearman superior: you just take ranks instead of your actual values and then calculate the same Pearson correlation. But it is definitely more appropriate here, and it captures any monotonic dependency, not only linear ones, so log2 will not play any role. For you, a scatterplot would be a better option, since you don't have integer copy numbers.
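
To make the "Pearson on ranks" point concrete, here is a small dependency-free sketch with made-up CNV and expression values (the helper names are hypothetical; `scipy.stats.spearmanr` would normally do this for you). Because log2 is monotonic, it leaves the ranks, and therefore Spearman's rho, unchanged, while Pearson's r does change.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up focal CNV values and expression values for six samples.
cnv = [-0.5, -0.1, 0.0, 0.2, 0.4, 0.9]
expr = [12.0, 30.0, 45.0, 44.0, 160.0, 900.0]
log_expr = [math.log2(v) for v in expr]

rho_raw, rho_log = spearman(cnv, expr), spearman(cnv, log_expr)
r_raw, r_log = pearson(cnv, expr), pearson(cnv, log_expr)
print(f"Spearman raw/log2: {rho_raw:.3f} / {rho_log:.3f}")
print(f"Pearson  raw/log2: {r_raw:.3f} / {r_log:.3f}")
```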

ADD REPLY • link 5.0 years ago by German.M.Demidov ★ 2.9k