I see no reason not to use edgeR here. The point of TMM is exactly to avoid skewed normalization due to differences in composition. Running edgeR::calcNormFactors followed by edgeR::cpm first calculates factors that correct for composition, combines them with the total library sizes into effective library sizes, and then applies the derived size factors to the raw counts. Be sure to first make a DGEList object:
library(edgeR)

y <- DGEList(counts = your.counts)  # your.counts: raw count matrix (genes x samples)
y <- calcNormFactors(y)             # TMM factors, stored in y$samples$norm.factors
cpm(y)                              # CPM computed from the effective library sizes
...as instructed in the manual. This will make sure the norm factors are being used properly.
The reduced counts you mention are only a problem if you use a naive per-million scaling. The norm factors described above further adjust the library-size-scaled counts to account for exactly that.
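To see how the norm factors enter the calculation, you can reproduce cpm(y) by hand from the effective library sizes (a minimal sketch; y is the DGEList from above, after calcNormFactors):

## effective library sizes = raw library sizes x TMM normalization factors
eff.lib <- y$samples$lib.size * y$samples$norm.factors

## scaling the raw counts by the effective sizes reproduces cpm(y)
manual.cpm <- t(t(y$counts) / eff.lib) * 1e6
all.equal(manual.cpm, cpm(y))  # TRUE (up to numerical tolerance)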
But there is never a guarantee that the method really captures the differences if expression is globally shifted between groups. You can diagnose this with MA-plots, checking whether the bulk of genes ends up roughly centered at y = 0. You could first average all samples of one tissue and then plot tissues against each other in MA-plots to see how the normalization performed.
These concerns are not just theoretical, so let's check it on the GTEx data. Code for the plot is below; I obtained the raw counts from recount and then looked at Lung vs Pancreas, normalizing either by library size only (here called naive) or by TMM.
As you can see, the naive method fails to properly center the bulk of genes (the very blue regions; this is a density plot, so very blue means many data points), whereas TMM does the trick and corrects for the biased library composition. The top and bottom dashed lines indicate a fold change of 2 (on the log2 scale); clearly the naive method suffers from a compositional bias that is not being accounted for.
Code:
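Something along these lines reproduces the comparison (a minimal sketch rather than the exact script; it assumes the GTEx raw counts from recount are already loaded as a matrix gtex.counts, with a character vector tissue giving the tissue of each sample):

library(edgeR)

## subset to the two tissues of interest
keep   <- tissue %in% c("Lung", "Pancreas")
counts <- gtex.counts[, keep]
group  <- tissue[keep]

y <- DGEList(counts = counts, group = group)

## "naive": per-million scaling by total library size only
cpm.naive <- cpm(y, normalized.lib.sizes = FALSE, log = TRUE)

## TMM: scaling by effective library sizes (library size x norm factor)
y <- calcNormFactors(y)
cpm.tmm <- cpm(y, normalized.lib.sizes = TRUE, log = TRUE)

## average the logCPMs within each tissue, then compute M and A values
ma <- function(logcpm, group) {
  lung     <- rowMeans(logcpm[, group == "Lung"])
  pancreas <- rowMeans(logcpm[, group == "Pancreas"])
  data.frame(A = (lung + pancreas) / 2, M = lung - pancreas)
}

par(mfrow = c(1, 2))
for (nm in c("naive", "TMM")) {
  d <- if (nm == "naive") ma(cpm.naive, group) else ma(cpm.tmm, group)
  smoothScatter(d$A, d$M, main = nm,
                xlab = "A (mean logCPM)", ylab = "M (log2 fold change)")
  abline(h = 0)                  # bulk of genes should center here
  abline(h = c(-1, 1), lty = 2)  # dashed lines = 2-fold change
}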
Thanks ATpoint, does TMM normalization account for gene length in addition to sequencing depth and RNA composition?
Short answer: No, it does not :)
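If you do need length-aware values (e.g. to compare expression levels between genes within a sample), edgeR provides rpkm(), which combines the TMM effective library sizes with gene lengths. A minimal sketch, assuming gene.len is a numeric vector of gene lengths in bp matching the rows of y:

## RPKM = CPM further divided by gene length in kb (gene.len is an assumption here)
rpkm(y, gene.length = gene.len)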