Which normalization method to use: FPKM or TPM?
3
0
6.2 years ago
glady ▴ 320

Hello. I have three conditions with 3 replicates each. If I want to check the expression of a particular gene A across these conditions, which normalization would be better, FPKM or TPM? And can we use these normalized FPKM/TPM values for a differential expression study with DESeq2 / limma?

RNA-Seq ChIP-Seq • 5.3k views
ADD COMMENT
4

TPM is always recommended over FPKM / RPKM because, in the latter case, the normalized counts are not comparable across samples.

https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
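To make the difference concrete, here is a minimal sketch in plain Python. The gene lengths and counts are made up for illustration, and the function names are my own, not from any package. It only shows the order of operations: FPKM divides by library size first and then by length, while TPM divides by length first and then rescales each sample to one million.

```python
# Hypothetical toy data: raw counts for three genes in two samples.
lengths_kb = [1.0, 2.0, 4.0]  # gene lengths in kilobases (made up)
counts = {
    "sample1": [100, 200, 400],
    "sample2": [100, 200, 1200],
}

def fpkm(sample_counts, lengths_kb):
    """Fragments per kilobase per million mapped fragments."""
    per_million = sum(sample_counts) / 1e6
    return [c / per_million / l for c, l in zip(sample_counts, lengths_kb)]

def tpm(sample_counts, lengths_kb):
    """Length-normalise first, then scale so each sample sums to one million."""
    rate = [c / l for c, l in zip(sample_counts, lengths_kb)]
    scale = sum(rate) / 1e6
    return [r / scale for r in rate]

# Per-sample totals: TPM totals are fixed at 1e6, FPKM totals are not.
fpkm_sums = {s: sum(fpkm(c, lengths_kb)) for s, c in counts.items()}
tpm_sums = {s: sum(tpm(c, lengths_kb)) for s, c in counts.items()}
```

In this toy case the TPM columns both sum to one million, while the FPKM columns sum to different totals in the two samples. Neither unit corrects for composition differences between samples, which is a separate issue.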

ADD REPLY
0

I agree that TPM is preferred, but generally I would say that is due to the normalization to effective transcript length (instead of annotated length) and for other sequencing biases.

Furthermore, FPKM/RPKM are only untrustworthy in cases where you have a global shift in the length distribution of transcripts, which in my experience is quite rare.

ADD REPLY
1

In fact, FPKM / RPKM are rarely suitable for differential expression analysis, because no cross-sample normalisation is performed when deriving these units.

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.
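As an illustration of what "between sample normalization" can look like, below is a rough sketch of the median-of-ratios idea that DESeq2's size factors are based on, written in plain Python. The counts are invented, `size_factors` is a hypothetical helper name, and the real implementation handles more edge cases, so treat it as a sketch only.

```python
import math
from statistics import median

# Hypothetical raw counts: rows are genes, columns are samples.
counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [10, 400],  # one gene strongly up in sample 2
]

def size_factors(counts):
    """Median-of-ratios size factors (sketch, not the DESeq2 code itself)."""
    n_samples = len(counts[0])
    # Use only genes with non-zero counts in every sample.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Per-gene geometric mean across samples, kept on the log scale.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

sf = size_factors(counts)
# Divide each sample's counts by its size factor.
normalised = [[c / f for c, f in zip(row, sf)] for row in counts]
```

Here sample 2 is simply sequenced twice as deeply for most genes, so its size factor comes out twice that of sample 1, and the first three genes end up with equal normalised counts. The fourth gene still shows a difference, because the median is not pulled around by a single strongly changing gene.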

ADD REPLY
0

Could you clarify what you mean by that?

ADD REPLY
1
6.2 years ago

If I were you, I would normalise the raw counts in DESeq2 and then compare the conditions in a pairwise fashion, and also via ANOVA / Likelihood Ratio Test. So, with 3 conditions, you would have 4 results tables:

  • A vs B
  • B vs C
  • A vs C
  • A vs B vs C

Doing that will already give you a p-value and log [base 2] fold change for your gene, and every other gene.

You can also compare expression levels of the gene visually by using the normalised counts, or the regularised log-transformed or variance-stabilised counts.

Kevin

ADD COMMENT
0

I didn't understand the fourth comparison. How different is the fourth comparison (A vs B vs C) from the above three? What different results should we expect from it? How would I interpret its results? We might already get the list of DE genes and isoforms from the first three comparisons.

ADD REPLY
1

A vs B will find genes that differ (statistically significantly) between condition A and condition B. These statistically significantly differentially expressed genes say nothing about condition C. So, it is reasonable to assume that a proportion of these genes will have equivalent expression levels in C as they do in either A or B.

A vs B vs C, however, will, generally speaking, find those genes whose expression differs across the three conditions taken together (i.e., in at least one of them). It is essentially an ANOVA / Likelihood Ratio Test.
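A toy example may help here. The sketch below uses invented log2 expression values for a single gene and computes a plain one-way ANOVA F statistic next to two pairwise differences. This is only an analogy: DESeq2's Likelihood Ratio Test works on a negative binomial model of the raw counts, not on an F statistic, and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical log2 normalised expression for one gene, 3 replicates per condition.
gene = {
    "A": [5.0, 5.2, 4.8],
    "B": [5.1, 4.9, 5.0],
    "C": [8.0, 8.2, 7.8],  # in this toy case only C is shifted
}

def pairwise_difference(groups, x, y):
    """Difference in mean log2 expression (y minus x), i.e. a log2 fold change."""
    return mean(groups[y]) - mean(groups[x])

def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean squares."""
    values = [v for g in groups.values() for v in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

lfc_ab = pairwise_difference(gene, "A", "B")
lfc_ac = pairwise_difference(gene, "A", "C")
f_stat = anova_f(gene)
```

For this gene the A vs B difference is essentially zero, the A vs C difference is about 3 on the log2 scale, and the omnibus statistic across A, B and C is large. The pairwise tables tell you which pair differs; the omnibus test asks whether condition explains the gene's variation at all.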

ADD REPLY
0

Okay, thank you for the explanation.

ADD REPLY
1
6.2 years ago
vj ▴ 520

TPM is probably a good way to go about getting normalized counts. However, using normalised counts as a starting point for DESeq2 is a big NO. You could optionally use the fpkm function in DESeq2 to get FPKM values.

ADD COMMENT
0

For miRNA samples, is there a need to consider gene length when calculating normalized counts such as RPKM? After all, the miRNAs all have almost the same length.

ADD REPLY
0

Perhaps not. In that case you can use RPM but if you are going to use the term "RPKM" then by definition it requires length normalisation. In this case, obviously RPKM ~ n * RPM, where "n" is just a constant, making pretty much no difference.
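A quick sketch of that relationship, with made-up counts and an assumed shared length of 22 nt for every miRNA:

```python
# Hypothetical mapped read counts for four miRNAs, all assumed to be 22 nt long.
counts = [500, 1500, 3000, 5000]
length_nt = 22

total_mapped = sum(counts)
# Reads per million mapped reads.
rpm = [c / total_mapped * 1e6 for c in counts]
# RPKM additionally divides by the length in kilobases.
rpkm = [r / (length_nt / 1000) for r in rpm]

# With a single shared length, every RPKM/RPM ratio is the same constant n.
ratios = [k / m for k, m in zip(rpkm, rpm)]
n = 1000 / length_nt
```

Every ratio equals 1000 / 22, so the ranking and the relative differences between miRNAs are identical in both units.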

Vijay

ADD REPLY
0

Please keep in mind that neither FPKM nor RPKM are suitable for differential expression analysis...

ADD REPLY
0

For studying differentially expressed transcripts/isoforms, can we use the raw counts? I ask because most of the algorithms (e.g., DESeq2, limma) give a strange output when isoform raw counts are used as input for a differential expression study.

I was planning to use tximport on the isoform raw counts and then go for a DESeq2 analysis. Is this the right way?

ADD REPLY
0

If you want to gauge differential expression over isoforms, then edgeR's diffSpliceDGE() would be better than DESeq2. DEXSeq is another option.

If you want differential expression over genes, then, yes, use tximport to input the raw counts to DESeq2, where they will be normalised. This is outlined in the DESeq2 vignette.
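To show the basic idea of going from transcript-level to gene-level counts, here is a bare-bones Python sketch. It is not tximport (which is an R package and also deals with transcript lengths and offsets); the mapping, the counts, and the helper name `summarise_to_gene` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical transcript-to-gene map and estimated transcript counts for one sample.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
tx_counts = {"tx1": 120.5, "tx2": 30.0, "tx3": 75.25}

def summarise_to_gene(tx_counts, tx2gene):
    """Sum estimated transcript counts within each gene."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)

gene_counts = summarise_to_gene(tx_counts, tx2gene)
```

The two transcripts of geneA are collapsed into a single gene-level value, which is the kind of table a gene-level differential expression analysis starts from.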

ADD REPLY
0

Thank you for the comments.

ADD REPLY
0

Should the sum of the RPM column be precisely 1M for each miRNA sample? When I calculate the sum of RPM now, it is around 0.5 million.

ADD REPLY
0

Yes, by definition it should be 1M! (RPM tells you how a million sequenced reads would be distributed among your miRNAs.)

ADD REPLY
0

I divided the read counts by the total number of reads in the sample, instead of the "total number of mapped reads". I guess this was the mistake.

ADD REPLY
0

I guess by RPM, you mean RPKM.

Whatever you do, that normalization will not yield 1M. The reason is that there is a double normalization (division) by library size and gene length. If it had been only library size, the values would add up to 1M. But then you also divide by gene length, and that changes the sum so that it is no longer 1M. And this is exactly the problem with RPKM: the sum varies between samples according to both the length and the expression level of the genes.

A rough estimate is this: your RPM values sum to 0.5M, so you would need to multiply everything by 2 to reach 1M. That means the library mapping rate is only 0.5, which is a very low mapping rate for RNA-seq data and could point to a problem in the data.
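The effect of the denominator on the RPM total can be checked with a few lines of Python. The numbers below are invented (a library of 2,000,000 reads of which half map), purely to show the arithmetic:

```python
# Hypothetical library: 2,000,000 sequenced reads, of which half mapped to miRNAs.
total_reads = 2_000_000
counts = [400_000, 350_000, 250_000]  # mapped reads per miRNA
mapped_reads = sum(counts)            # 1,000,000

# RPM using the number of mapped reads as the denominator.
rpm_mapped = [c / mapped_reads * 1e6 for c in counts]
# The same calculation using all sequenced reads instead.
rpm_total = [c / total_reads * 1e6 for c in counts]

mapping_rate = mapped_reads / total_reads
```

With mapped reads as the denominator the RPM column sums to one million. With all sequenced reads as the denominator it sums to one million times the mapping rate, which is 0.5M in this made-up case.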

ADD REPLY
0
6.1 years ago

You should use the raw counts for DE analysis; see this well-written workflow.

ADD COMMENT
