I want to identify differential expressed genes in TCGA. I did my anlysis as following way:
Taking BRCA dataset as an example, I downloaded RNA-seq data (RPKM version) from TCGA.
Then I used samples marked by “Primary Tumor” as cancer sample and samples marked by “Solid Tissue Normal” as normal sample.
Calculating Z score for each gene in each sample as following: z = [(value gene X in tumor Y)-(mean gene X in normal)]/(standard deviation X in normal)
Here I define genes with Zscore>=2 or Zscore<=-2 as up-regulated or down-regulated genes in one sample. If more than 10% samples in the whole dataset contain genes with Zscore>=2 or Zscore<=-2, I denfined these genes as up-regulated or down-regualted genes in BRCA dataset.
Then I find that the percentage of up-regualted genes is much higher than down-regulated genes. There are about 20% protein coding genes (number of up-regulated protein coding genes/total number of protein coding genes) which are identified as up-regulated genes but only 5% protein coding genes as down-regulated genes.
Is it normal?
Is there something wrong in my workflow?
Can you give me some sugesstions?
Thanks
Don't use RPKM (use raw counts if possible or rsem counts) and don't use Z-score for differential expression analysis (use solid statistical tools such as DESeq2/edgeR).
@OP, the point with using arbitrary thresholding like you do is that every NGS experiment has something called a mean-variance relationship. That means that genes (or regions or whatever you measure) may have high variation/enrichments because of small counts (here lowly expressed genes). These guys then often come out as differentially expressed if only looking at raw fold changes, but they rarely come out as statistically significant. Therefore, as WDC said, use a proper tool such as DESeq2 or edgeR, start with raw counts and use only genes that are significant.
Thanks, WouterDeCoster and ATpoint. I will check DESeq2/edgeR.
All I can add is that Z-scaling RPKM data does not seem like a great idea to me. Are you sure that it's not FPKM that you have obtained? There is a R package call zFPKM, which claims to be able to convert FPKM data to the Z-scale.
I would much prefer that you obtain the RSEM counts from TCGA (GDC Data Portal), normalise those in DESeq2 or EdgeR, and then Z-scale the regularised log (DESeq2) or lofCPM (EdgeR) counts.