Firstly limma, edgeR and DESeq2 do have normalization, its just not CPM/TPM/FPKM normalisation. There are three big differences between CPM/TPM/FPKM and the normalisation methods used by edgeR and DESeq2:
- TPM/FPKM normalise for gene length, and limma/edgeR/DESeq2 do not.
- CPM/TPM/FPKM are (more) susceptable to composition effects.
- edgeR/DESeq2 normalisation leaves counts on an integer scale, while CPM/TPM/FPKM puts things on a continuous scale.
Thus, we use TPM/FPKM when length matters, and we use edgeR/DEseq2 when being integer matters. As TPM/FPKM are very poor at dealing with compositional effects, we use edgeR/DESeq2/limma, all other things being equal.
First, very roughtly, we can start by saying that the number of reads observed for a gene g
in a sample s
is a product of several factors:
counts = length(g) * expression_level(g,s) * library_size(s)
We are generally interested in expression_level, but longer genes generate more counts for the same expression_level, and if you sequence 100 million reads then you will have higher counts than if you sequence 10 million reads.
When do we need integers?
Basically, we need integers any time we do any statistics that use a discrete probability distribution. The most obvious example here is differential gene expression using the negative binomial distribution, as implemented by edgeR and DESeq2. The negative binomial distribution tells us how likely it is to get X many reads from a gene given a certain expression level. I will tell us how that compares to X-1 or X+1, but it can't tell us the probability of X+0.5 or X+0.345345. And thats not just a requirement of the formula that can be overcome by rounding to the nearest integer - the numbers have to behave like they were generated by an count generating process, not just be whole numbers - this isn't the case when you divide a count by a gene length.
When do we need to correct for gene length?
If the number of reads for a gene in a sample is dependent on the length of the gene, then why do we not need to correct for gene length when doing differential gene expression? Because we are comparing two counts from the same gene. We are calculating a fold change:
fold_change = length(g) * expression_level(g,sample1) * library_size(sample1) / length(g) * expression_level(g,sample2) * library_size(sample2)`
= expression_level(g,sample1) * library_size(sample1)/expression_level(g, sample2) * library_size(sample2)
That is, the length is present on both the top and the bottom, so cancels out.
So when do we want to correct for length? When you are comparing two different genes. If I told you that I got 1000 reads from GAPDH and 2000 reads from ACTB, wouldn't know if that was because GAPDH was shorter, or because it was less highliy expressed.
Compositional effects?
Compositional effects arise because the number of reads is not actaully proportional the expression level of a gene, but the fraction of transcription that comes from a gene. if you have two genes in a sample, and say one is expressed at 499,000 units and the other at 1,000 and you sequence 1,000,000 reads, then you'll get 998,000 reads from gene 1 and 2,000 reads from gene 2. Now if you double the expression level of gene 1 to 998,000, but leave gene 2 where it is, you'll get 999,000 reads from gene 1, and only 1,000 reads from gene 2: doubling the expression of gene 1 has made it look like the expression of gene 2 has halved.
FPKM/CPM/TPM just divide by the total number of reads, and so are very suceptable to this. edgeR, DESeq2 and limma use more sophisticated methods for normalising for library size. They still suffer from this a bit, and it often introduces other assumptions, but they generally mitigate at least some of this effect.
There are plenty of papers about normalisation of RNAseq data, but you could probably do worse than starting with the original DESeq and edgeR papers.
Thank you for your kind and accurate response to my question. So, if I understand correctly, normalization methods such as CPM, FPKM, or RPKM are used to compare gene expression between different genes. However, if one wants to compare Differentially Expressed Genes (DEGs) in other samples, is it better to use raw counts?
I also have additional questions. If researchers have uploaded FPKM files in databases like GEO for DEG analysis, why do they choose this approach when they are aware that raw counts are preferable?
Moreover, if raw files (.fasta) are not available, how can I perform analysis using FPKM or TPM values? Are there any libraries or packages that can assist with this?
Lastly, could you clarify the concept of compositional effect? You provided an example to aid understanding, but I'm still unclear. In the example with two genes (Gene 1 and Gene 2), where read(a) and read(b) show different counts, and you explained that if the total reads increase, the composition may change. Is the increase in Gene 1 in read(b) compared to read(a) an example of compositional effects?
I appreciate your assistance.
For DEGs, you plug raw counts into a program like DESeq2 and it'll transform your raw counts into something that can be used to properly identify DEGs. In order to use DESeq2, you need the raw counts to begin with (you don't throw FPKMs into DESeq2).
A lot of FPKM files exist on GEO because when RNA-seq was first "invented" in 2008, FPKM was one of the first normalization methods used and became in widespread use. After much benchmarking over the next few years, people realized that it's not the correct thing to do. So, yeah, unfortunately, FPKM just stuck around and is what you'll see in old datasets. If you see FPKMs in newer datasets, that means the scientists are not bioinformaticians and therefore are not aware of the shortcomings of FPKM.
Raw FASTQ files should always be available unless protected because it's human private health information. But if you only have FPKMs, here are suggestions from the author of limma-voom: https://support.bioconductor.org/p/56275/#56299
For the compositional effects, if you need another example, see my classmate's post at: https://bentyeh.github.io/blog/20200502_gene-expression-analysis.html#naive-normalization
Your response to my post has been immensely helpful. I recently uploaded another question related to this topic. If you have the time, I would be very grateful if you could also provide an answer to this question.