Question

TCGA - Correlation between gene expression and CNV

6.4 years ago

rin ▴ 40

Hi everyone

I am new here and to the bioinformatics world, and I would appreciate your help. I am currently looking into correlating gene expression and CNV data from TCGA, most probably for colorectal or ovarian cancer. After some data exploration, I found out that only a small percentage of samples are from normal tissues. That being said, should DEG identification be done only between paired (tumor - normal) samples, even if the statistical power would be low? And for correlating these data, which would be the more meaningful analysis: 1. a correlation between DEGs and amplified/deleted genes, or 2. a correlation between expression (all the expression data from tumor samples, without taking differential expression into account) and CNV?

Thanks for helping!

RNA-Seq R correlation tcga cnv • 2.4k views

ADD COMMENT • link updated 6.4 years ago by Kevin Blighe 88k • written 6.4 years ago by rin ▴ 40

score 1 · Answer 1 · 2018-07-20

6.4 years ago

Kevin Blighe 88k

Yes, the number of Tumour-Normal pairs in the TCGA RNA-seq data is low. Others have somewhat circumvented this issue by not doing any direct comparisons and instead answering the question: 'What is highly and lowly expressed in the tumour and normal samples separately?' This is how cBioPortal does it, and the default is Z-score > 2 for highly expressed and Z-score < -2 for lowly expressed. Z-scores should ideally be produced from the logged, normalised counts.
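As a rough sketch of the thresholding idea, the base R snippet below labels each entry of a Z-score matrix as highly or lowly expressed. The helper `flag_by_z`, the toy matrix `z`, and the gene and sample names are made up for illustration; this is not cBioPortal's code.

```r
# Hypothetical helper: label each gene/sample entry by its Z-score.
flag_by_z <- function(z, cutoff = 2) {
  flags <- matrix("none", nrow = nrow(z), ncol = ncol(z), dimnames = dimnames(z))
  flags[z >  cutoff] <- "high"   # Z above +cutoff
  flags[z < -cutoff] <- "low"    # Z below -cutoff
  flags
}

# Toy Z-score matrix (genes in rows, samples in columns); values are invented.
z <- matrix(c( 2.5, -0.3,  0.1,
              -2.7,  1.9, -0.4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

flags <- flag_by_z(z)
```

Raising `cutoff` to 3 gives a stricter call, at the cost of flagging fewer genes.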

I would take this approach (above) and correlate the highly and lowly expressed genes to the CNVs.

Of course, any logical approach will be fine.

Kevin

ADD COMMENT • link 6.4 years ago by Kevin Blighe 88k

Thank you a lot for your comments and help, Kevin!

ADD REPLY • link 6.3 years ago by rin ▴ 40

Hi again!

Looking at it a little more, I have seen that DESeq2 and edgeR use a negative binomial (NB) distribution to model gene expression counts, meaning that a Z-score would not be valid (or at least would not have the same interpretation) as it would with a normal distribution. Am I understanding something wrong?

Elaborating a little more to make myself as clear as possible. A possible workflow would be:

1. Check whether the raw count data downloaded from TCGA follow a normal distribution.
2. If not, log2 transform.
3. Remove genes with low read counts.
4. Calculate the mean and standard deviation of Gene A across samples, then get a Z-score for Gene A.
5. Repeat for all genes.
6. Select genes with Z-score > 2 or < -2.
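The distribution check, log transform and low-count filter from the list above can be sketched in base R on simulated data. Everything here is an assumption for illustration: the negative binomial parameters, the mean-count cut-off of 10, and the `skewness` helper. Library-size normalisation is ignored in this sketch.

```r
set.seed(1)

# Simulated raw counts: 200 genes x 30 samples, negative binomial (assumed parameters).
counts <- matrix(rnbinom(200 * 30, mu = 100, size = 2), nrow = 200)
# Make the first 20 genes nearly unexpressed so the filter has something to remove.
counts[1:20, ] <- rnbinom(20 * 30, mu = 0.5, size = 2)

# Hypothetical helper: sample skewness (third standardised moment).
skewness <- function(x) mean((x - mean(x))^3) / (mean((x - mean(x))^2))^1.5

# Remove genes with low read counts (arbitrary rule: mean count below 10).
keep <- rowMeans(counts) >= 10
filtered <- counts[keep, , drop = FALSE]

# Raw counts have a long right tail; log2(x + 1) pulls that tail in.
logged <- log2(filtered + 1)

# Median per-gene skewness before and after the log transform.
skew_raw <- median(apply(filtered, 1, skewness))
skew_log <- median(apply(logged, 1, skewness))
```

Comparing `skew_raw` with `skew_log` is a quicker sanity check than a formal normality test on every gene, which would tend to reject for almost any real count data.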

Are there any steps that I am missing or not understanding correctly? In other words, should the proposed normalization techniques, such as those using the median or quantiles, not be considered?

When it comes to the correlation: CNV calling will have to be done by pairwise comparison of normal-tumor samples. Would it still be valid to correlate the CNVs with the genes found from the process above?

Thanks once again!

ADD REPLY • link 6.3 years ago by rin ▴ 40

The idea was to download TCGA RSEM counts, normalise them in DESeq2 / edgeR, produce logged data from this (via regularised log in DESeq2 or logCPM in edgeR), and then transform to Z-scale. I would then obtain the CN segment data from the Broad Institute's FireBrowse server and, finally, conduct either a correlation or regression analysis between the RNA-seq genes with |Z| > 2 or 3 and the CN segments identified. There will obviously be other issues along the way.
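A minimal base R sketch of the normalise-then-correlate part, on simulated data. The hand-rolled log2 CPM is only a simplified stand-in for what DESeq2 or edgeR would produce, and the matrices `counts` and `cn` (gene-level copy-number values, assumed to be already mapped from segments to genes and matched to the same samples) are invented for illustration. For simplicity it correlates every gene rather than only those passing a Z-score cut-off.

```r
set.seed(42)
n_genes <- 50; n_samples <- 40
genes <- paste0("gene", seq_len(n_genes))
samples <- paste0("s", seq_len(n_samples))

# Invented gene-level copy number (log2 ratio scale), genes x samples.
cn <- matrix(rnorm(n_genes * n_samples, sd = 0.5), nrow = n_genes,
             dimnames = list(genes, samples))

# Invented counts: expression of the first 10 genes is driven by copy number.
mu <- matrix(200, n_genes, n_samples)
mu[1:10, ] <- 200 * 2^(2 * cn[1:10, ])
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 10), nrow = n_genes,
                 dimnames = list(genes, samples))

# Simplified log2 counts-per-million with a pseudocount (stand-in for logCPM).
lib_size <- colSums(counts)
log_cpm <- log2(t(t(counts + 0.5) / (lib_size + 1)) * 1e6)

# Per-gene Spearman correlation between expression and copy number across samples.
rho <- vapply(genes, function(g) {
  cor(log_cpm[g, ], cn[g, ], method = "spearman")
}, numeric(1))

# Genes with the strongest positive expression/copy-number correlation.
head(sort(rho, decreasing = TRUE))
```

Spearman is used here because it does not assume a linear relationship; a per-gene linear regression would be the analogue of the regression option mentioned above.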

ADD REPLY • link 6.3 years ago by Kevin Blighe 88k

Hi Kevin! Coming back to this (almost ancient now) post for a follow-up question!

I did indeed use DESeq2 with a design of ~tumor + normal. One thing I am quite unsure about is whether I should compute the Z-score from the rlog results as (expression in my samples - mean expression in normal samples) / st. dev. of expression in normal samples.
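The calculation described above, with the normal samples as the reference, could be sketched as follows. The matrix `expr` stands in for rlog output and the `group` annotation is assumed; both are simulated here.

```r
set.seed(7)

# Invented log-scale expression matrix (rlog-like values), genes x samples.
expr <- matrix(rnorm(5 * 12, mean = 8, sd = 1), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("s", 1:12)))
# Assumed sample annotation: first 4 samples are normal, the rest tumour.
group <- rep(c("normal", "tumor"), times = c(4, 8))

# Per-gene mean and standard deviation estimated from the normal samples only.
normal_mean <- rowMeans(expr[, group == "normal"])
normal_sd <- apply(expr[, group == "normal"], 1, sd)

# Z-score of every sample relative to the normal reference
# (the per-gene vectors are recycled down the rows of the matrix).
z_vs_normal <- (expr - normal_mean) / normal_sd
```

With only a handful of normal samples, the per-gene standard deviation is a noisy estimate, so individual Z-scores from this version can be unstable.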

Am I missing something?

Thank you!

ADD REPLY • link 6.2 years ago by rin ▴ 40

Hey rin, to transform to Z-scores, you just need to do:

t(scale(t(data)))
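To see what that one-liner does, here is a small check on a toy matrix (the values and names are invented). Note that it centres and scales each gene using all samples, not only the normal ones.

```r
# Toy log-scale expression matrix, genes in rows and samples in columns (invented values).
data <- matrix(c(5.0, 6.0, 7.0, 8.0,
                 2.0, 2.5, 9.0, 2.5),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"), paste0("s", 1:4)))

# scale() centres and scales columns, so transpose to put genes in columns,
# scale, then transpose back to the original genes x samples layout.
z <- t(scale(t(data)))

rowMeans(z)       # each gene now has mean 0 (up to floating-point error)
apply(z, 1, sd)   # and standard deviation 1
```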

ADD REPLY • link 6.2 years ago by Kevin Blighe 88k