Is it more reliable to normalize a set of specific transcript abundances by the mean or median of the total coding transcript abundance and if so why?
When you mention normalization, it's important to specify the purpose. Are you aiming to compare transcript abundance within a sample? Between samples? Both? The answer may also be technology-dependent.

If this is RNA-Seq data, there is a common problem where a small number of genes soaks up the majority of sequence reads and thus skews the distribution of reads available to the other transcripts in the sample. This can make the median unstable (Bullard et al., 2010) and the mean less meaningful. For comparisons between samples, most people use the Trimmed Mean of M-values (TMM; Robinson & Oshlack, 2010).

Taking your question at face value: if you just want a rough comparison of transcripts within a sample that works most of the time for some limited purpose, I would consider the median a reasonably robust choice, but you could just as easily choose another quantile (e.g. the 75th percentile) or a "control" transcript; a quick sketch is below. Otherwise, there are a variety of methods and issues involved (Zhao et al., 2021). Perhaps you could read a little and clarify what you're trying to achieve.
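For illustration, here is a minimal base-R sketch of that within-sample case (the vector `counts` and its gene names are invented for the example, not taken from any real data):

```r
# Toy abundances for one sample; geneB plays the role of a dominant transcript.
counts <- c(geneA = 120, geneB = 45000, geneC = 33, geneD = 980, geneE = 15)

# Exclude zeros before computing the reference statistic, since a large
# fraction of undetected transcripts can drag the median toward zero.
nonzero <- counts[counts > 0]

median_scaled <- counts / median(nonzero)          # median normalization
upperq_scaled <- counts / quantile(nonzero, 0.75)  # upper-quartile (75th percentile)
mean_scaled   <- counts / mean(nonzero)            # mean is inflated by geneB
```

Comparing `mean_scaled` with `median_scaled` on this toy vector shows how a single dominant transcript shifts mean-based scaling far more than the median- or quantile-based versions, which is the instability issue mentioned above.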
Thanks a lot for this guide. I read Bullard et al., 2010 and Robinson & Oshlack, 2010 and worked from these. I want to run Weighted Gene Co-expression Network Analysis on TCGA data, but I noticed that the TSV files already contain a tpm-unstranded column for all of the genes, so this type of normalization has already been done. If you have any reservations about putting these numbers into https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/ then please let me know.
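In case it helps clarify what I mean, this is roughly how I intend to feed the values in. It is only a sketch: `tpm` below is a synthetic stand-in for the matrix assembled from the TCGA files, and the filtering cutoffs are arbitrary placeholders, not a finished pipeline.

```r
library(WGCNA)

# Synthetic genes-x-samples matrix standing in for the TCGA TPM values.
set.seed(1)
tpm <- matrix(rexp(200 * 20, rate = 0.1), nrow = 200, ncol = 20,
              dimnames = list(paste0("gene", 1:200), paste0("sample", 1:20)))

# WGCNA expects samples in rows and genes in columns; log2(TPM + 1) tames the
# heavy right tail of expression values before correlations are computed.
datExpr <- t(log2(tpm + 1))

# Drop genes expressed in fewer than half of the samples (arbitrary cutoff).
keep    <- colSums(datExpr > 0) >= 0.5 * nrow(datExpr)
datExpr <- datExpr[, keep, drop = FALSE]

# WGCNA's built-in check for zero-variance genes and problematic samples.
gsg     <- goodSamplesGenes(datExpr)
datExpr <- datExpr[gsg$goodSamples, gsg$goodGenes, drop = FALSE]
```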
What's your analytical goal?