Hello everyone,
The RNA-seq data from TCGA contain z-scores instead of RPKM values. I want to analyze TCGA data together with GTEx data, but the problem is that the GTEx data contain RPKM values and counts. There is a tutorial explaining the z-score calculation (http://dept.stat.lsa.umich.edu/~kshedden/Python-Workshop/gene_expression_comparison.html), but I am not sure whether it works for RNA-seq or whether it is the right tutorial.
Can someone please guide me on how to calculate z-scores for genes using GTEx RNA-seq data (RPKM and count data)? If possible, please also check whether the code at the above link is correct.
Thanks
Where are you downloading your TCGA data from: the TCGA Data Portal or elsewhere? TCGA provides two types of RNA-seq data. The older pipeline is just called RNAseq, while the newer pipeline is referred to as RNAseqV2. Which one are you using? Both versions provide raw counts, plus RPKM for the old pipeline and RSEM scaled estimates for the RNAseqV2 data. These data are not z-scores. Please see this site for more details on the RNAseqV2 data: https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2
Thanks for the reply and the link.
I downloaded the data directly from COSMIC. Below is the head of the expression file:
I think the z-score is easy to understand and explain, so I want to convert the GTEx values (counts or RPKM) into z-scores. Please advise how to convert the GTEx data into z-scores; I have both the reads and counts. Below is a sample of the GTEx data:
Thanks
I don't use COSMIC much, so I researched this a little. My first thought was to calculate the z-score the way COSMIC does, but the help file they link to for more information is broken. You may want to contact them about this. Then I found this Biostars answer by @David Fredman, "TCGA: What are mRNA expression z-scores? Does TCGA have mRNA expression from controls?", which I think answers your question. It looks like COSMIC uses z-scores so that gene expression is comparable across the various platforms in TCGA; however, it is not clear whether COSMIC calculated them or took them straight from TCGA. I think z-scores are a good way to go, since you are comparing yet another platform. However, note in the response by @David Fredman that TCGA uses the distribution of the normal tissues (sometimes!) as the control distribution. Be careful comparing these z-scores to your data if you do not have a comparable control, because a mismatched control distribution will give you false positives. My suggestions would be to (1) contact COSMIC to find out how they calculated these z-scores and (2) if that doesn't work out, download the original TCGA count data and calculate the z-scores yourself, so that you know exactly what is being used as the control distribution.
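If you end up calculating the z-scores yourself, here is a minimal sketch of the idea in Python with pandas: for each gene, subtract the mean of a chosen control group and divide by that group's standard deviation. To be clear, this is not necessarily how COSMIC or TCGA compute theirs. The log2(RPKM + 1) transform, the pseudocount of 1, the function name and the toy sample/gene names are all my own assumptions for illustration, and the numbers are made up.

```python
import numpy as np
import pandas as pd


def zscore_against_reference(expr, reference_samples, pseudocount=1.0):
    """Per-gene z-scores relative to a reference (control) group.

    expr: DataFrame of RPKM values, genes as rows, samples as columns.
    reference_samples: column names that define the control distribution.
    pseudocount: added before log2 to avoid log(0) (an assumption here).
    """
    # Log-transform so highly expressed genes do not dominate the variance.
    log_expr = np.log2(expr + pseudocount)

    # Mean and sample standard deviation of each gene in the control group.
    ref = log_expr[reference_samples]
    ref_mean = ref.mean(axis=1)
    ref_std = ref.std(axis=1, ddof=1)

    # A gene with zero variance in the controls has no defined z-score.
    ref_std = ref_std.replace(0, np.nan)

    # Subtract and divide row-wise (aligned on the gene index).
    return log_expr.sub(ref_mean, axis=0).div(ref_std, axis=0)


# Toy RPKM matrix (made-up values, hypothetical gene and sample names).
expr = pd.DataFrame(
    {
        "normal_1": [1, 3],
        "normal_2": [3, 3],
        "normal_3": [7, 3],
        "tumor_1": [15, 7],
    },
    index=["GENE_A", "GENE_B"],
)

z = zscore_against_reference(expr, ["normal_1", "normal_2", "normal_3"])
print(z)
```

In this toy example GENE_B is constant across the controls, so its z-scores come out as NaN rather than infinity. Whichever samples you pass as `reference_samples` define what "normal" means, so the same choice has to be applied to every dataset you want to compare.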
Just out of curiosity, are you trying to compare GTEx and TCGA? If so, why not use only TCGA, since it already has data for matched normal-tumor samples?
Never gave it a thought. Thanks!