Hi everyone
I am new here and to the bioinformatics world, and I would appreciate your help. I am currently looking into correlating gene expression and CNV data from TCGA, most probably for colorectal or ovarian cancer. After some data exploration, I found out that only a small percentage of samples are from normal tissue. That being said, should the identification of DEGs be done only between paired (tumor - normal) samples, even if the statistical power would be low? Also, with the aim of correlating the above-mentioned data, which would be the more meaningful analysis: 1. a correlation between DEGs and amplified/deleted genes, or 2. a correlation between expression (not taking differential expression into account, but using all the expression data from tumor samples) and CNV?
Thanks for helping!
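To make option 2 concrete, here is a minimal sketch of a per-gene correlation between expression and copy number across tumor samples. It assumes both data types are already on a log scale and that the samples are in the same order in both tables; the gene names, values, and the `pearson` helper are hypothetical and for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either vector is constant
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data: one value per tumor sample, same sample order in both.
log_expr = {
    "GENE_A": [5.1, 6.0, 7.2, 8.1],   # expression tracks copy number
    "GENE_B": [4.0, 4.1, 3.9, 4.0],   # expression roughly flat
}
log2_cn = {
    "GENE_A": [-0.5, 0.0, 0.6, 1.0],
    "GENE_B": [0.8, -0.7, 0.1, 0.0],
}

# One correlation coefficient per gene, using tumor samples only.
cor_by_gene = {g: pearson(log_expr[g], log2_cn[g]) for g in log_expr}
```

A rank-based (Spearman) correlation may be a reasonable alternative if the relationship is not expected to be linear.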
Thank you a lot for your comments and help, Kevin!
Hi again!
Looking at it a little more, I have seen that a NB (negative binomial) distribution is used by DESeq2 and edgeR to model gene expression data, meaning that a Z-score would not be valid (or at least would not have the same interpretation) as it would under a normal distribution. Am I misunderstanding something?
Elaborating a little more to make myself as clear as possible. A possible workflow would be:
Are there any steps that I am missing or not understanding correctly? In other words, should the other proposed normalization techniques, such as those using the median or quantiles, not be considered?
When it comes to the correlation: CNVs will have to be called by pairwise comparison of normal-tumor samples. Would it still be valid to correlate them with the genes found by the process above?
Thanks once again!
The idea was to download TCGA RSEM counts, normalise them in DESeq2 / edgeR, produce logged data from this (via regularised log in DESeq2 or logCPM in edgeR), and then transform to Z-scale. I would then obtain the CN segment data from Broad Institute's Firebrowse server, and, finally, conduct either a correlation or regression analysis between the RNA-seq genes with |Z|>2 or 3 and the CN segments identified. There will obviously be other issues along the way.
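As a rough illustration of the Z-scale step, the sketch below scales each gene to mean 0 and standard deviation 1 across samples and flags samples with |Z|>2. It assumes the input is already normalised and logged (for example rlog or logCPM output exported to a table); the values and the `zscore_rows` helper are hypothetical, and this is not the DESeq2 / edgeR code itself.

```python
from statistics import mean, stdev

def zscore_rows(matrix):
    """Scale each gene (row) to mean 0 and sample SD 1 across samples."""
    scaled = {}
    for gene, values in matrix.items():
        m, s = mean(values), stdev(values)
        # A gene with zero variance carries no information; set its Z-scores to 0.
        scaled[gene] = [(v - m) / s if s > 0 else 0.0 for v in values]
    return scaled

# Hypothetical logged expression values: gene -> one value per sample.
logged = {
    "GENE_A": [5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 5.0, 9.0],  # last sample is high
    "GENE_B": [3.0, 3.1, 2.9, 3.0, 3.1, 2.9, 3.0, 3.0],  # no outlying sample
}

z = zscore_rows(logged)

# Indices of samples whose absolute Z-score exceeds the chosen cut-off.
cutoff = 2
outliers = {g: [i for i, v in enumerate(zs) if abs(v) > cutoff]
            for g, zs in z.items()}
```

The genes and samples flagged in `outliers` would then be the ones carried forward to the correlation or regression against the CN segments.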
Hi Kevin! Coming back to this (almost ancient now) post for a follow-up question!
I did indeed use DESeq2 with a design of ~tumor + normal. One thing I am quite unsure about is whether I should compute the Z-score from the rlog results as (expression in my samples - mean expression in normal samples) / standard deviation of expression in normal samples.
Am I missing something?
Thank you!
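For reference, the normal-referenced Z-score described in the question can be written out as below. This is only a sketch of that formula for a single gene, assuming log-scale (for example rlog) values; the numbers and the `z_vs_normal` helper are hypothetical.

```python
from statistics import mean, stdev

def z_vs_normal(tumor_values, normal_values):
    """Z-score of each tumor value relative to the normal-sample distribution."""
    ref_mean = mean(normal_values)
    ref_sd = stdev(normal_values)  # sample SD; needs at least 2 normal samples
    if ref_sd == 0:
        raise ValueError("normal samples have zero variance for this gene")
    return [(v - ref_mean) / ref_sd for v in tumor_values]

# Hypothetical log-scale values for one gene.
normal = [6.0, 6.2, 5.8, 6.0]
tumor = [6.1, 8.0, 4.0]

z = z_vs_normal(tumor, normal)
```

With only a handful of normal samples, the reference standard deviation is estimated from very few values, so Z-scores computed this way can be unstable.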
Hey rin, to transform to Z-scores, you just need to do: