correlating focal CNV values with RNA-seq expression
1
0
5.0 years ago

Hello

I would like to use the masked copy number segment from TCGA found on Xena browser and correlate it with gene expression values. Both data sets can be found here https://xenabrowser.net/datapages/?cohort=GDC%20TCGA%20Liver%20Cancer%20(LIHC)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

Xena's web tools themselves enable this sort of correlation, with plausible results: CNV values often do correlate with expression level. On the other hand, I see that in the GISTIC pipeline only genes with a value ≤ -0.3 or ≥ 0.3 are considered deleted/amplified.
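To make that cutoff concrete, here is a minimal sketch of how continuous values would be turned into calls with the ±0.3 threshold mentioned above. The function name `call_cnv` and the example values are made up for illustration, and the inclusive bounds follow the wording of this post rather than a check against the GISTIC documentation.

```python
def call_cnv(value, threshold=0.3):
    """Label a continuous copy-number value using a symmetric cutoff."""
    if value >= threshold:
        return "amplified"
    if value <= -threshold:
        return "deleted"
    return "neutral"

# Hypothetical values for one gene across five samples (not real TCGA data)
values = [-0.8, -0.1, 0.05, 0.3, 1.2]
calls = [call_cnv(v) for v in values]
print(calls)
```

Everything between the two cutoffs collapses into a single "neutral" class, which is the information a correlation on the continuous values would keep.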

I have seen a few other papers that did something similar, but I would like to know whether a statistical test like this makes sense and, if so, what sort of expression normalization would be best.

CNV RNA-seq • 1.9k views
ADD COMMENT
0

What exactly is your statistical test?

ADD REPLY
0

Pearson correlation, forgot to add

ADD REPLY
0

Pearson correlation can be misled in many ways =) https://en.wikipedia.org/wiki/Anscombe%27s_quartet - I would not recommend it (only if you are particularly interested in Pearson specifically and are sure that the data follow its assumptions)

ADD REPLY
2
5.0 years ago

This is going to be heavily sample-dependent, particularly if you have mixed tissue where disease burden may vary. The amplification/deletion thresholds can (and should) change between samples, often requiring the user to set them by eye. I would likely just directly compare normalized gene counts (via salmon/kallisto -> DESeq2/edgeR or similar) in altered regions between samples with the CNV(s) of interest and those without, using a Mann-Whitney U test or something similar.
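A rough sketch of that group comparison for a single gene, assuming the expression values are already normalized and the samples have already been split by CNV status. The data here are simulated and the variable names are placeholders, not part of any pipeline.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Simulated normalized expression (log scale) for one gene; not real TCGA data
expr_with_cnv = rng.normal(loc=6.0, scale=1.0, size=30)     # samples carrying the CNV
expr_without_cnv = rng.normal(loc=5.0, scale=1.0, size=70)  # samples without it

# Two-sided test: is expression shifted in either direction between the groups?
stat, p_value = mannwhitneyu(expr_with_cnv, expr_without_cnv, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3g}")
```

If this is run over many genes, the resulting p-values would need a multiple-testing correction.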

ADD COMMENT
0

Dichotomization is bad =( Why not Spearman correlation?

ADD REPLY
0

Care to elaborate? If it fits your question, I see no reason why dichotomization is inherently bad. Then again, I'm no stats guru. You can totally take a correlation approach too, which may be a bit easier if you have a lot of variability in the log2 ratios for CNAs, as you can avoid setting a hard and fast threshold for CNAs and just correlate the gene counts with the log2 ratios instead.

ADD REPLY
2

Your approach is valid, of course. I was mainly thinking about difficult cases: malignancies of low purity (rich in stromal and immune cells, which will be cut out by hard thresholds) or genes with small changes in expression, where dichotomization lowers the power of detection. Small sample size is a difficult case too, as are samples with small subclones. I am actually not sure whether the correlation approach is better, which is why I asked. Simulations are needed.
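A toy version of such a simulation might look like the sketch below. Everything in it is an assumption chosen for illustration: the linear relation between CNV and expression, the noise levels, the 0.3 cutoff, and the function name `estimate_power`. It only shows how the two approaches could be compared, not which one wins on real data.

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

def estimate_power(n_samples=40, effect=0.5, n_sim=200, alpha=0.05, cutoff=0.3, seed=1):
    """Fraction of simulated data sets in which each test gives p < alpha."""
    rng = np.random.default_rng(seed)
    hits_spearman = hits_mwu = n_valid = 0
    for _ in range(n_sim):
        cnv = rng.normal(0.0, 0.4, size=n_samples)                   # toy log2 ratios
        expr = effect * cnv + rng.normal(0.0, 0.5, size=n_samples)   # toy expression
        amplified = cnv >= cutoff                                    # hard threshold
        if amplified.sum() < 2 or (~amplified).sum() < 2:
            continue  # skip data sets where one group is nearly empty
        n_valid += 1
        hits_spearman += int(spearmanr(cnv, expr).pvalue < alpha)
        p_mwu = mannwhitneyu(expr[amplified], expr[~amplified],
                             alternative="two-sided").pvalue
        hits_mwu += int(p_mwu < alpha)
    if n_valid == 0:
        raise ValueError("no simulated data set had both groups populated")
    return hits_spearman / n_valid, hits_mwu / n_valid

power_spearman, power_mwu = estimate_power()
print(f"Spearman: {power_spearman:.2f}, dichotomized MWU: {power_mwu:.2f}")
```

Purity, subclonality, and sample size could be varied through the parameters or by changing how `cnv` and `expr` are generated.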

ADD REPLY
2

That's fair, all good points. Both are pretty easy to do, and I'd expect them to be complementary. Thanks for explaining.

ADD REPLY
0

Thank you both for the insightful discussion. Let me go into more detail about our pipeline. A lot of samples do have low purity and likely have normal tissue mixed in (the truth about TCGA: a lot of samples but low quality, imo). What I did first was a global comparison between groups (discovery set) that I suspected to differ in their CNVs, and I found a few amplified genes in a high-risk group using a chi-square test based on the GISTIC threshold. Then I wanted to see if patients with those amplified genes really had low survival (validation set), which they did. Now I want to see if patients with those CNVs do indeed express those genes more, and this is how we thought of using a Pearson correlation test between the continuous focal CNV values and expression in all samples (both validation and discovery).

I was reluctant to use Pearson because we discovered those genes using dichotomous tests first. If you both believe those tests should complement each other, I feel more confident in this pipeline now. Does that make sense?

ADD REPLY
1

I would modify this a bit, but it is too long to elaborate, and I believe that you know your data better than I do, so it is fine. Check out fig 4 from this paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4405206/ - Pearson correlation, which assumes linearity and homogeneity of variance [to be meaningful], may not be a good choice. Use Spearman instead (I know I have said it several times already, but I repeat it once again XD)

ADD REPLY
0

Haha, I must say that even though I know the data, I don't feel 100% confident in our decisions yet. Thank you for the paper; I believe we will do a box plot similar to fig 4 for the genes we found. Regarding Pearson, we also tested log2-transformed values, but now that you mention Spearman's test, it makes sense that it is superior, and I believe it would make the log2 transformation unnecessary, which is nice.

ADD REPLY
1

I would not call Spearman superior - you just take ranks instead of your actual variables and then calculate the same Pearson correlation. But it is definitely more appropriate here, and it measures any monotonic dependency, not only linear ones, so log2 will not play any role. For you, a scatterplot would be a better option, since you don't have integer copy numbers.
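To illustrate both points, here is a small self-contained sketch with made-up numbers: Spearman is computed literally as Pearson on average ranks, and a log2 transform (with a pseudocount of 1, my assumption) changes Pearson but not Spearman. The helper names are mine, not from any library.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up values for one gene across six samples
cnv = [-0.4, -0.1, 0.0, 0.2, 0.5, 0.9]
expr = [3.0, 5.0, 4.0, 9.0, 30.0, 200.0]
log_expr = [math.log2(v + 1) for v in expr]

rho_raw, rho_log = spearman(cnv, expr), spearman(cnv, log_expr)
r_raw, r_log = pearson(cnv, expr), pearson(cnv, log_expr)
print(f"Spearman raw/log2: {rho_raw:.4f} / {rho_log:.4f}")
print(f"Pearson  raw/log2: {r_raw:.4f} / {r_log:.4f}")
```

The two Spearman values are identical because log2 is monotonic and leaves the ranks unchanged, while the two Pearson values differ.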

ADD REPLY
