Question

question about identifying differential expressed genes in TCGA

0

Entering edit mode

7.3 years ago

tujuchuanli ▴ 140

I want to identify differential expressed genes in TCGA. I did my anlysis as following way:

Taking BRCA dataset as an example, I downloaded RNA-seq data (RPKM version) from TCGA.
Then I used samples marked by “Primary Tumor” as cancer sample and samples marked by “Solid Tissue Normal” as normal sample.
Calculating Z score for each gene in each sample as following: z = [(value gene X in tumor Y)-(mean gene X in normal)]/(standard deviation X in normal)
Here I define genes with Zscore>=2 or Zscore<=-2 as up-regulated or down-regulated genes in one sample. If more than 10% samples in the whole dataset contain genes with Zscore>=2 or Zscore<=-2, I denfined these genes as up-regulated or down-regualted genes in BRCA dataset.

Then I find that the percentage of up-regualted genes is much higher than down-regulated genes. There are about 20% protein coding genes (number of up-regulated protein coding genes/total number of protein coding genes) which are identified as up-regulated genes but only 5% protein coding genes as down-regulated genes.

Is it normal?

Is there something wrong in my workflow?

Can you give me some sugesstions?

Thanks

TCGA • 2.4k views

ADD COMMENT • link updated 7.3 years ago by Shicheng Guo ★ 9.6k • written 7.3 years ago by tujuchuanli ▴ 140

1

Entering edit mode

Don't use RPKM (use raw counts if possible or rsem counts) and don't use Z-score for differential expression analysis (use solid statistical tools such as DESeq2/edgeR).

ADD REPLY • link 7.3 years ago by WouterDeCoster 48k

1

Entering edit mode

@OP, the point with using arbitrary thresholding like you do is that every NGS experiment has something called a mean-variance relationship. That means that genes (or regions or whatever you measure) may have high variation/enrichments because of small counts (here lowly expressed genes). These guys then often come out as differentially expressed if only looking at raw fold changes, but they rarely come out as statistically significant. Therefore, as WDC said, use a proper tool such as DESeq2 or edgeR, start with raw counts and use only genes that are significant.

ADD REPLY • link 7.3 years ago by ATpoint 90k

0

Entering edit mode

Thanks, WouterDeCoster and ATpoint. I will check DESeq2/edgeR.

ADD REPLY • link 7.3 years ago by tujuchuanli ▴ 140

1

Entering edit mode

All I can add is that Z-scaling RPKM data does not seem like a great idea to me. Are you sure that it's not FPKM that you have obtained? There is a R package call zFPKM, which claims to be able to convert FPKM data to the Z-scale.

I would much prefer that you obtain the RSEM counts from TCGA (GDC Data Portal), normalise those in DESeq2 or EdgeR, and then Z-scale the regularised log (DESeq2) or lofCPM (EdgeR) counts.

ADD REPLY • link 7.3 years ago by Kevin Blighe 90k

score 2 · Answer 1 · 2018-07-18

2

Entering edit mode

7.3 years ago

Shicheng Guo ★ 9.6k

In the step 4, for certain gene, they will be assigned to 'Up-regulation' and 'Down-regulation' at the same time.

ADD COMMENT • link 7.3 years ago by Shicheng Guo ★ 9.6k

0

Entering edit mode

Thanks, Shicheng Guo. Do you have some suggestions to these genes? I prefer to retain these genes. I consider a gene as up-regulated gene if this gene is up-regulated in more than 10% samples. Even if the same gene with down-regulation in more than 10% sample. Under this condition, I prefer to consider it as down-regulated gene in more than 10% sample. Is it reseasonable?

ADD REPLY • link 7.3 years ago by tujuchuanli ▴ 140