Question

GDC & GTEx RNA sequencing normalization problem


8.0 years ago

Ted • 0

Dear all,

I've run into a problem with the normalization of RNA sequencing count files and would like to ask for your advice.

Our goal is to identify differentially expressed genes in pancreatic cancer. Here is our procedure for data preprocessing:

1. Downloaded the pancreatic cancer HTSeq raw count data (177 cancers and 4 normals) from the GDC data portal.
2. Used the GDC RNA sequencing pipeline to process all the GTEx SRA data (fastq-dump -> STAR 2-pass -> fixmate -> HTSeq).
3. Applied TMM normalization to the GDC cancer, GDC normal, and GTEx normal count data.
4. Applied the voom transformation to the normalized counts.
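To make the last two steps concrete, here is a minimal pure-Python sketch of the idea behind them. It is not the edgeR or limma implementation: the scaling factor is an unweighted trimmed mean of log ratios (real TMM also trims on abundance and uses precision weights), and the function names `trimmed_mean_of_m` and `log_cpm` are made up for illustration. The log-CPM formula follows the one voom documents, log2((count + 0.5) / (library size + 1) * 1e6).

```python
import math

def trimmed_mean_of_m(sample, ref, trim=0.3):
    """Simplified TMM-style scaling factor for one sample against a reference.

    sample, ref: lists of raw counts for the same genes, in the same order.
    Unweighted and trimmed on M-values only (an assumption of this sketch).
    """
    n_s, n_r = sum(sample), sum(ref)
    # M-value: log2 ratio of library-size-scaled counts; skip zero counts
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, ref)
        if s > 0 and r > 0
    )
    k = int(len(m_values) * trim)          # number trimmed from each tail
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

def log_cpm(sample, factor):
    """log2 counts per million using the effective library size."""
    effective_size = sum(sample) * factor
    return [math.log2((c + 0.5) / (effective_size + 1.0) * 1e6) for c in sample]

# Toy counts (hypothetical): one gene is strongly up in the sample
ref = [100] * 10
sample = [100] * 9 + [1000]
factor = trimmed_mean_of_m(sample, ref)
values = log_cpm(sample, factor)
```

In the toy example the single high-count gene inflates the raw library size, and the trimmed factor scales the effective library size back toward that of the reference, so the nine unchanged genes get comparable log-CPM values.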

To see how well the data were normalized, we ran a PCA on the transformed data. We also plotted the density of per-gene mean/median expression across samples for the GDC cases and the GTEx normals, as well as the distributions of the gene mean/median ratios.
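For reference, the ratio check described above could be computed roughly as follows. This is a sketch under assumptions: the inputs are taken to be log2 expression values (as voom produces), so the "ratio" is a difference of group means converted back to the linear scale, and the function name `ratio_summary` and the toy genes are hypothetical.

```python
from statistics import mean, median

def ratio_summary(log_expr_a, log_expr_b):
    """Summarize per-gene differences between two sample groups.

    log_expr_a, log_expr_b: dict mapping gene -> list of log2 values,
    one value per sample. Returns (median log2 difference, linear ratio).
    """
    shared = sorted(set(log_expr_a) & set(log_expr_b))
    # Per-gene difference of group means on the log2 scale
    diffs = [mean(log_expr_a[g]) - mean(log_expr_b[g]) for g in shared]
    med = median(diffs)
    return med, 2 ** med

# Toy values only (hypothetical genes and numbers, not real data)
gdc = {"geneA": [5.0, 5.2], "geneB": [8.1, 7.9], "geneC": [2.0, 2.4]}
gtex = {"geneA": [4.1, 4.1], "geneB": [7.0, 7.0], "geneC": [1.2, 1.2]}
med_diff, ratio = ratio_summary(gdc, gtex)
```

If most genes are expected to be unchanged between the two groups, the median ratio should sit close to 1; a global shift like the one in the toy data suggests a systematic offset between the groups rather than gene-specific changes.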

[Image: PCA plot of the transformed data]

As you can see, the GDC cancer and normal samples are largely mixed together, while the GTEx normals sit apart from them. The first peaks of the mean/median density plots for GDC and GTEx are slightly mismatched, and the ratio is also shifted away from 1.

My question is: do the above observations indicate that TMM normalization is not suitable in this case, and that a large proportion of genes will be identified as differentially expressed if we carry on with the DE analysis?

Thank you very much for your help!

RNA-Seq R • 3.0k views



Hi Ted,

Clearly, the GTEx normals form a separate cluster, which looks to me like a batch effect. I am not sure that TMM does any batch correction. I think you should correct for batch before doing any comparison.

For differential expression analysis (from counts), check this post for further links:

A: RNA sequencing data batch effect removal

For FPKM-based analysis, you can use ComBat from the "sva" Bioconductor package, just to run a PCA for initial results.
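To illustrate what "batch correction" means at its simplest, here is a rough sketch that centers one gene's log-expression values within each batch. This is plain per-batch mean-centering, not ComBat (which additionally models batch variance and shrinks estimates across genes), and the function name `center_by_batch` is made up for this example. Note that centering also removes any real biological difference that coincides with batch, which matters here because the GTEx batch contains only normals.

```python
from statistics import mean

def center_by_batch(values, batches):
    """Remove per-batch mean shifts from one gene's log-expression values.

    values: log-scale expression for one gene, one value per sample.
    batches: batch label for each sample, same order as values.
    Each batch is shifted so its mean equals the overall mean.
    """
    overall = mean(values)
    batch_means = {
        b: mean(v for v, label in zip(values, batches) if label == b)
        for b in set(batches)
    }
    return [v - batch_means[b] + overall for v, b in zip(values, batches)]

# Toy values (hypothetical): the second batch sits 10 units higher
values = [1.0, 3.0, 11.0, 13.0]
batches = ["GDC", "GDC", "GTEx", "GTEx"]
corrected = center_by_batch(values, batches)
```

After centering, both batches share the same mean while the spread within each batch is unchanged, which is roughly what you would hope to see reflected in a PCA of the corrected values.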

ADD REPLY • link 8.0 years ago by Ron ★ 1.2k