Dear all,
I've encountered a problem regarding the normalization of RNA sequencing count files and would like to ask for your advice.
Basically, our goal is to study the differentially expressed genes in pancreatic cancer. Here is our procedure for data preprocessing:
Downloaded the pancreatic cancer HTSeq raw count data (177 cancers and 4 normals) from the GDC Data Portal.
Used the GDC RNA sequencing pipeline to process all the GTEx SRA data (fastq-dump -> STAR 2-pass -> fixmate -> HTSeq).
Applied TMM normalization to the GDC cancer, GDC normal, and GTEx normal count data.
Applied the voom transformation to the normalized counts.
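To make the last two steps concrete, here is a rough plain-Python sketch of TMM-style scaling factors followed by the log-CPM part of voom. It is a simplified illustration, not edgeR's calcNormFactors or limma's voom: the function names, the use of the first sample as the reference, and the rank-based trimming are assumptions made for the example, and voom's precision weights are left out. The trim fractions (30% on the log-ratios M, 5% on the average log-expression A) mirror the usual TMM defaults.

```python
import math

def tmm_factors(counts, ref=0, trim_m=0.30, trim_a=0.05):
    """Simplified TMM factors. counts: list of samples, each a list of raw
    gene counts in the same gene order. ref: index of the reference sample
    (an assumption here; edgeR picks the reference automatically)."""
    lib = [sum(s) for s in counts]
    y_r, n_r = counts[ref], lib[ref]
    log_f = []
    for y_k, n_k in zip(counts, lib):
        rows = []  # (M, A, weight) for genes expressed in both samples
        for c_k, c_r in zip(y_k, y_r):
            if c_k > 0 and c_r > 0:
                p_k, p_r = c_k / n_k, c_r / n_r
                m = math.log2(p_k / p_r)          # log fold change
                a = 0.5 * math.log2(p_k * p_r)    # average log expression
                var = (n_k - c_k) / (n_k * c_k) + (n_r - c_r) / (n_r * c_r)
                rows.append((m, a, 1.0 / var))    # inverse-variance weight
        n = len(rows)
        by_m = sorted(range(n), key=lambda i: rows[i][0])
        by_a = sorted(range(n), key=lambda i: rows[i][1])
        cut_m, cut_a = int(n * trim_m), int(n * trim_a)
        # Keep genes that survive trimming on both M and A.
        keep = set(by_m[cut_m:n - cut_m]) & set(by_a[cut_a:n - cut_a])
        w_sum = sum(rows[i][2] for i in keep)
        log_f.append(sum(rows[i][2] * rows[i][0] for i in keep) / w_sum)
    # Rescale so the factors multiply to 1.
    centre = sum(log_f) / len(log_f)
    return [2.0 ** (lf - centre) for lf in log_f]

def log_cpm(counts, factors):
    """log2 counts per million on effective library sizes, with the
    0.5 / 1.0 offsets used by voom (weights are not computed here)."""
    out = []
    for sample, f in zip(counts, factors):
        eff = sum(sample) * f
        out.append([math.log2((c + 0.5) / (eff + 1.0) * 1e6) for c in sample])
    return out
```

The idea is that the factors are estimated from the genes in the middle of the M and A distributions, so a minority of strongly changed genes should not drive the scaling. If most genes differ between two groups of samples, that assumption no longer holds.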
We wanted to see how well the data were normalized, so we ran a PCA on the transformed data. We also plotted the density of the per-gene mean/median expression across samples for the GDC cases and the GTEx normals, as well as the distributions of the gene mean/median ratios.
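For reference, the PCA check can be sketched in plain Python as below. This is only an illustration under assumptions: the function name is hypothetical, the input is taken to be a list of samples with log-expression values in the same gene order, and only the first principal component is computed (by power iteration on the sample-by-sample Gram matrix). For real data sizes a numerical library would be the practical choice.

```python
import math

def first_pc_scores(samples, iters=200):
    """Return (scores, explained) for the first principal component.
    samples: list of samples, each a list of log-expression values."""
    n, g = len(samples), len(samples[0])
    # Centre each gene across samples.
    gene_means = [sum(s[j] for s in samples) / n for j in range(g)]
    x = [[s[j] - gene_means[j] for j in range(g)] for s in samples]
    # Sample-by-sample Gram matrix X X^T (n x n).
    gram = [[sum(p * q for p, q in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]
    # Power iteration from a fixed, non-constant start vector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    eigval = sum(v[i] * sum(gram[i][k] * v[k] for k in range(n))
                 for i in range(n))
    # PC1 scores are sqrt(eigenvalue) times the Gram eigenvector.
    scores = [math.sqrt(eigval) * t for t in v]
    explained = eigval / sum(gram[i][i] for i in range(n))
    return scores, explained
```

If the samples from one source all land on one side of the first component and the rest on the other, and that component explains most of the variance, the dominant signal is the data source rather than the biology.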
As you can see, in the PCA the GDC cancer and normal samples are mixed together, while the GTEx normals are separate. The first peaks of the mean/median density plots for GDC and GTEx are slightly mismatched, and the ratio is also shifted away from 1.
My question is: do the above observations indicate that TMM normalization is not suitable in this case, and that a large portion of genes will be identified as differentially expressed if we go on to do the DE analysis?
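One possible reading of the mean/median ratio check, sketched in plain Python. The function names are hypothetical, and the inputs are assumed to be two groups of samples (for example GDC and GTEx) holding normalized expression values in the same gene order. On log2-transformed values a per-gene difference centred on 0 would be the more usual summary than a ratio centred on 1.

```python
from statistics import mean, median

def per_gene_stat(samples, stat=mean):
    """Apply stat (mean or median) to each gene across the samples."""
    return [stat(col) for col in zip(*samples)]

def gene_ratios(group_a, group_b, stat=mean):
    """Per-gene ratio of the group_a statistic to the group_b statistic.
    Genes whose group_b statistic is 0 are skipped."""
    a = per_gene_stat(group_a, stat)
    b = per_gene_stat(group_b, stat)
    return [x / y for x, y in zip(a, b) if y != 0]

# Tiny made-up example: every gene is twice as high in group_a.
group_a = [[2.0, 4.0], [4.0, 8.0]]
group_b = [[1.0, 2.0], [2.0, 4.0]]
ratios = gene_ratios(group_a, group_b)
print(median(ratios))
```

If the two groups were on the same scale and most genes were unchanged, the bulk of these ratios would be expected to sit near 1.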
Thank you very much for your help!
Hi Ted,
Clearly, the GTEx normals form a separate cluster, which looks to me like a batch effect. I am not sure that TMM does batch correction. I think you should do batch correction before making any comparison.
For differential expression analysis (from counts), check this post for further links:
A: RNA sequencing data batch effect removal
For FPKM-based analysis, you can use ComBat from the "sva" Bioconductor package, just to do a PCA for initial results.
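To illustrate what a batch adjustment does at its simplest, here is a plain-Python sketch that shifts one batch of normals so that its per-gene means match those of a reference batch of normals. This is not ComBat and not what the sva package does: it is a location-only shift, the function name is hypothetical, and it assumes log-scale expression values in the same gene order. The offset is estimated from normal samples only because in this dataset the data source is almost confounded with the condition (GTEx has normals only), so centring each batch on its own mean would also remove much of the cancer versus normal difference.

```python
from statistics import mean

def shift_to_reference_normals(ref_normals, other_normals):
    """Subtract a per-gene offset from other_normals so that their
    per-gene means equal those of ref_normals.
    Both arguments: list of samples, each a list of log-expression values.
    Returns (adjusted_samples, offset)."""
    ref_means = [mean(col) for col in zip(*ref_normals)]
    other_means = [mean(col) for col in zip(*other_normals)]
    offset = [o - r for o, r in zip(other_means, ref_means)]
    adjusted = [[v - d for v, d in zip(sample, offset)]
                for sample in other_normals]
    return adjusted, offset

# Tiny made-up example: the second batch sits 4 units higher on both genes.
ref = [[1.0, 2.0], [3.0, 4.0]]
other = [[5.0, 5.0], [7.0, 9.0]]
adjusted, offset = shift_to_reference_normals(ref, other)
print(offset)
```

With only 4 GDC normals as the reference, such an offset would be noisy, so this is best treated as an exploratory step before PCA. For the DE test on counts, the batch is usually handled as a covariate in the model rather than by altering the data.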