Question

Normalization of RNA-seq data between different types of libraries

5

10.5 years ago

ktian.whu ▴ 50

Dear members,

I have several RNA-seq data sets from published articles, and we think integrating them may be interesting. However, they come from different types of libraries (single-end/paired-end, poly(A)+/rRNA depletion). How can I normalize them so that their expression levels can be compared directly?

I read an article that combined different types of data and normalized them only by RPKM. Is this a good method?

rpkm RNA-Seq fpkm • 6.5k views

updated 3.2 years ago by Ram 44k • written 10.5 years ago by ktian.whu ▴ 50

0

I really wonder why so many RPKM/FPKM questions have come up lately.

updated 3.2 years ago by Ram 44k • written 10.5 years ago by Michael 55k

0

Any update on this work? I am planning something similar.

5.4 years ago by Arindam Ghosh ▴ 540

Ram · Answer 1 · 2014-07-04

2

10.5 years ago

Michael 55k

FPKM/RPKM is not a normalization method. It is a unit, meant for representing some sort of molar concentration of transcripts.

In my experience, if you have very different libraries, FPKM will confound biological comparisons instead of making the libraries more comparable. Calculating FPKM (imho) does not remove technical bias; it adds an unknown proportionality factor to each library that depends on library composition. Therefore, if you have very different protocols, length-scaling the abundances will not help. Problems with using FPKM have been stated in several posts, e.g.:

Some more problems:

FPKM/RPKM has never been peer-reviewed; it was introduced as an ad hoc measure in a supplementary
One of the authors of this paper states that it should not be used because of faulty arithmetic
All reviews so far have shown it to be an inferior scale for DE analysis of genes
Length normalization is mostly dispensable (imo) in DE analysis, because a gene's length is the same in every sample being compared

Using FPKM-scaled values for your use case might be the worst possible application.
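To make the "proportionality factor based on library composition" point concrete, here is a minimal sketch using the usual FPKM definition (fragments per kilobase of transcript per million mapped fragments). The gene names, lengths, and counts are invented for illustration, and the assumption that two genes yield identical counts under both protocols is a deliberate simplification.

```python
def fpkm(counts, lengths_bp):
    """FPKM = fragments * 1e9 / (transcript length in bp * total mapped fragments)."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}


# Hypothetical transcript lengths in base pairs.
lengths_bp = {"geneA": 1000, "geneB": 2000, "nonpolyA_gene": 500}

# Hypothetical counts: geneA and geneB are captured identically by both
# protocols; only the non-polyadenylated transcript differs, because it is
# lost in the poly(A)+ library and retained in the rRNA-depleted one.
polya_counts = {"geneA": 100, "geneB": 400, "nonpolyA_gene": 0}
ribo_counts = {"geneA": 100, "geneB": 400, "nonpolyA_gene": 500}

polya_fpkm = fpkm(polya_counts, lengths_bp)
ribo_fpkm = fpkm(ribo_counts, lengths_bp)

for gene in ("geneA", "geneB"):
    ratio = polya_fpkm[gene] / ribo_fpkm[gene]
    print(gene, polya_fpkm[gene], ribo_fpkm[gene], ratio)
```

In this toy setup both shared genes come out with twice the FPKM in the poly(A)+ library, even though their counts are the same, purely because the total used in the denominator differs between protocols. In real data that factor is not known in advance.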

updated 3.2 years ago by Ram 44k • written 10.5 years ago by Michael 55k

0

Would it be valid to quantile-normalize a data frame of gene-level FPKM values obtained from Cufflinks, where the samples in the data frame come from different types of libraries?

updated 3.2 years ago by Ram 44k • written 10.3 years ago by Manvendra Singh ★ 2.2k

0

From my data, it doesn't seem that quantile normalization cures it. Of course, this is a very limited view. I can only say that the observations on our data agree with the published concerns, and we seem to observe exactly the library-dependent bias that was mentioned. In principle, all the points I made above apply even more to quantile-normalized FPKM (a non-peer-reviewed combination of two methods, etc.).
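One possible reason quantile normalization does not help can be sketched as follows. This is a plain-Python illustration written for this thread, not the code behind the observation above: quantile normalization forces every sample onto the same value distribution but keeps each gene's rank within its sample, so a gene whose rank is shifted by the library protocol stays shifted. The values are invented, and ties are not handled.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (one list of values per sample).

    Ties are ordered arbitrarily here; real implementations average them.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # gene indices, lowest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical expression values for four genes in two samples.
# Gene 0 is assumed to be protocol-sensitive: top rank in A, bottom rank in B.
sample_a = [5.0, 2.0, 3.0, 4.0]
sample_b = [1.0, 4.0, 6.0, 8.0]

norm_a, norm_b = quantile_normalize([sample_a, sample_b])
print(sorted(norm_a) == sorted(norm_b))  # the two distributions are now identical
print(norm_a[0], norm_b[0])              # gene 0 still differs between samples
```

After normalization the two samples share the same set of values, yet gene 0 is still at the top of one sample and the bottom of the other.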

updated 3.2 years ago by Ram 44k • written 10.3 years ago by Michael 55k

0

Thanks for sharing this. It helps.

updated 3.2 years ago by Ram 44k • written 10.3 years ago by Manvendra Singh ★ 2.2k

Ram · Answer 2 · 2014-07-04

1

10.5 years ago

Ming Tommy Tang ★ 4.5k

If you want to find differentially expressed genes, you need to control for the library type by including it as a factor in the design when you use DESeq or limma.

You may also use sva to remove batch effects.
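The idea of including library type as a factor can be sketched with ordinary least squares on a single gene. This is not what DESeq or limma do internally, and the sample layout, the effect sizes, and the helper names (`solve_linear`, `fit_ols`) are all hypothetical; it only shows how an additive term for library type changes the estimated condition effect.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(design, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear(xtx, xty)


# Hypothetical samples for one gene: (treated?, rRNA-depleted?, log2 expression).
# Values were generated as 5 + 1 * treated + 2 * ribo, with no noise.
samples = [
    (0, 0, 5.0), (0, 0, 5.0), (0, 0, 5.0), (0, 1, 7.0),
    (1, 0, 6.0), (1, 1, 8.0), (1, 1, 8.0), (1, 1, 8.0),
]
y = [value for _, _, value in samples]

# Naive comparison that ignores library type.
treated = [v for t, _, v in samples if t == 1]
control = [v for t, _, v in samples if t == 0]
naive_effect = sum(treated) / len(treated) - sum(control) / len(control)

# Model with intercept, condition, and library type as an additive factor.
design = [[1.0, float(t), float(r)] for t, r, _ in samples]
intercept, condition_effect, library_effect = fit_ols(design, y)

print(round(naive_effect, 6), round(condition_effect, 6), round(library_effect, 6))
```

Here the naive difference of means is 2.0, while the model that includes library type recovers the condition effect of about 1.0 used to generate the data. This only works because both conditions appear in both library types; if library type were completely confounded with condition, the two effects could not be separated.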

updated 3.2 years ago by Ram 44k • written 10.5 years ago by Ming Tommy Tang ★ 4.5k

Ram · Answer 3 · 2014-07-06

When combining these datasets, please be extremely cautious when interpreting the results, and design your analysis in such a way that bias effects are excluded.

For more information about the biases in library prep read this:

Genome Biol. 2014 Jun 30;15(6):R86. [Epub ahead of print] IVT-seq reveals extreme bias in RNA-sequencing.

Lahens NF, Kavakli IH, Zhang R, Hayer K, Black MB, Dueck H, Pizarro A, Kim J, Irizarry R, Thomas RS, Grant GR, Hogenesch JB.