I see no reason not to use edgeR here. The point of TMM is exactly to avoid skewed normalization due to differences in composition. Running edgeR::calcNormFactors followed by edgeR::cpm first calculates factors that correct for composition, combines them with the total library sizes into effective library sizes, and then applies the derived size factors to the raw counts. Be sure to first make a DGEList object:
library(edgeR)

y <- DGEList(counts = your.counts)  # your.counts: raw count matrix (genes x samples)
y <- calcNormFactors(y)             # TMM factors, stored in y$samples$norm.factors
cpm(y)                              # CPM computed from the effective library sizes
...as instructed in the manual. This will make sure the norm factors are being used properly.
The reduced counts you mention are only a problem if you use a naive per-million scaling. The norm factors described above further adjust the library-size-scaled counts to account for exactly that.
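To see how the norm factors enter the calculation, you can reproduce cpm(y) by hand from the effective library sizes (a minimal sketch; y is the DGEList from above, after calcNormFactors):

## effective library sizes = raw library sizes x TMM normalization factors
eff.lib <- y$samples$lib.size * y$samples$norm.factors

## scaling the raw counts by the effective sizes reproduces cpm(y)
manual.cpm <- t(t(y$counts) / eff.lib) * 1e6
all.equal(manual.cpm, cpm(y))  # TRUE (up to numerical tolerance)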
But there is never a guarantee that the method really captures the differences if expression is globally shifted between groups. You can diagnose this with MA-plots, checking whether the bulk of genes ends up roughly centered at y = 0. You could first average all samples of one tissue and then plot tissues against each other in MA-plots to see how the normalization performed.
These concerns are not just theoretical, so let's check it on the GTEx data. Code for the plot is below; I obtained the raw counts from recount and then looked at Lung vs Pancreas, normalizing either by library size only (here called naive) or by TMM.
As you can see, the naive method fails to properly center the bulk of genes (the very blue regions; this is a density plot, so very blue means many data points), whereas TMM does the trick and corrects for the biased library composition. The top and bottom dashed lines indicate a fold change of 2 (on the log2 scale); clearly the naive method suffers from a compositional bias that is not being accounted for.
Code:
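Something along these lines reproduces the comparison (a minimal sketch rather than the exact script; it assumes the GTEx raw counts from recount are already loaded as a matrix gtex.counts, with a character vector tissue giving the tissue of each sample):

library(edgeR)

## subset to the two tissues of interest
keep   <- tissue %in% c("Lung", "Pancreas")
counts <- gtex.counts[, keep]
group  <- tissue[keep]

y <- DGEList(counts = counts, group = group)

## "naive": per-million scaling by total library size only
cpm.naive <- cpm(y, normalized.lib.sizes = FALSE, log = TRUE)

## TMM: scaling by effective library sizes (library size x norm factor)
y <- calcNormFactors(y)
cpm.tmm <- cpm(y, normalized.lib.sizes = TRUE, log = TRUE)

## average the logCPMs within each tissue, then compute M and A values
ma <- function(logcpm, group) {
  lung     <- rowMeans(logcpm[, group == "Lung"])
  pancreas <- rowMeans(logcpm[, group == "Pancreas"])
  data.frame(A = (lung + pancreas) / 2, M = lung - pancreas)
}

par(mfrow = c(1, 2))
for (nm in c("naive", "TMM")) {
  d <- if (nm == "naive") ma(cpm.naive, group) else ma(cpm.tmm, group)
  smoothScatter(d$A, d$M, main = nm,
                xlab = "A (mean logCPM)", ylab = "M (log2 fold change)")
  abline(h = 0)                  # bulk of genes should center here
  abline(h = c(-1, 1), lty = 2)  # dashed lines = 2-fold change
}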
Thanks ATpoint, does TMM normalization account for gene length in addition to sequencing depth and RNA composition?
Short answer: No, it does not :)
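If you do need length-aware values (e.g. to compare expression levels between genes within a sample), edgeR provides rpkm(), which combines the TMM effective library sizes with gene lengths. A minimal sketch, assuming gene.len is a numeric vector of gene lengths in bp matching the rows of y:

## RPKM = CPM further divided by gene length in kb (gene.len is an assumption here)
rpkm(y, gene.length = gene.len)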