Public RNA-seq data: comparing gene expression using RPKM

6.4 years ago

dle • 0

Hello,

This is my first post in this forum, and I thank you in advance for your help.

I am working with a public RNA-seq dataset that only provides RPKM values.

I am interested in comparing the expression of a few genes (4) across 3 groups, with many replicates.

I would have liked to use raw count data with edgeR. Since I cannot, could I use something like an ANOVA or a linear model to compare gene expression using the RPKM values?

Thanks in advance

david

RNA-Seq anova RPKM • 2.5k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 6.4 years ago by dle • 0

Please use Google, the search function here, and PubMed to find opinions on the suitability of RPKM for differential gene expression. This has been discussed many times before. The short answer is that RPKM is no longer recommended, as it normalizes poorly when library composition differs between samples. Use raw counts with an established pipeline such as edgeR or DESeq2.
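To make the library-composition point concrete, here is a minimal Python sketch with made-up numbers. The gene names, lengths, abundances, and sequencing depth are all hypothetical, and the sketch assumes reads are sampled in proportion to abundance times gene length at a fixed depth. It uses the standard RPKM definition.

```python
def expected_reads(abundance, lengths_bp, depth):
    """Expected reads per gene when reads are sampled proportionally
    to (abundance * gene length) at a fixed sequencing depth."""
    weights = {g: abundance[g] * lengths_bp[g] for g in abundance}
    total = sum(weights.values())
    return {g: depth * w / total for g, w in weights.items()}


def rpkm(reads, lengths_bp):
    """RPKM = reads * 1e9 / (gene length in bp * total mapped reads)."""
    total_reads = sum(reads.values())
    return {g: r * 1e9 / (lengths_bp[g] * total_reads) for g, r in reads.items()}


# Hypothetical toy transcriptome: all values below are illustrative only.
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 1000}
depth = 1_000_000  # same sequencing depth for both samples

abundance_1 = {"geneA": 10, "geneB": 10, "geneC": 10}
abundance_2 = {"geneA": 10, "geneB": 10, "geneC": 100}  # only geneC changes

rpkm_1 = rpkm(expected_reads(abundance_1, lengths_bp, depth), lengths_bp)
rpkm_2 = rpkm(expected_reads(abundance_2, lengths_bp, depth), lengths_bp)

# Ratio of RPKM between the two samples for each gene.
for gene in lengths_bp:
    print(gene, round(rpkm_2[gene] / rpkm_1[gene], 3))
```

In this toy setup geneA and geneB have identical true abundance in both samples, yet their RPKM ratio falls to about 0.31 because geneC takes a larger share of the reads in the second sample. The tenfold change in geneC itself shows up as only about 3.1.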

ADD REPLY • link 6.4 years ago by ATpoint 88k

I agree. I recommend the paper by Anders et al. for a pipeline.

ADD REPLY • link 6.4 years ago by boczniak767 ▴ 880

That paper is rather old; DESeq has since been superseded by DESeq2. See a possible workflow here.

ADD REPLY • link 6.4 years ago by ATpoint 88k

Thank you very much for your answer. I am aware this method would be a last-resort solution, but my input data only contained RPKM values (as is unfortunately the case for many public datasets).

ADD REPLY • link 6.4 years ago by dle • 0

You can download raw data from most public datasets and analyze from scratch. Where are the data from?

ADD REPLY • link 6.4 years ago by ATpoint 88k

I would recommend taking some time to research the options before posting on Biostars, although I admit that you also need first-hand experience (beyond what you can read in the literature). So, I think it is also important to run your own comparison on your own data before deciding whether FPKM expression is suitable for your analysis.

So, please do try the count-based methods, but also please take your time to critically assess your dataset to understand it (and the methods) better. You might even discover some sort of novel strategy that should increase the impact of your associated paper :)

Also, independent of the differential expression step, you may want to use log2(FPKM + 0.1) values for things like GSEA or BD-Func enrichment (as well as QC plots, such as hierarchical clustering or PCA plots). I also find having the direct expression values useful as a sort of validation, to visually inspect the differential expression results (such as in a heatmap).
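As a rough illustration of the log2(FPKM + 0.1) transform and how it might feed a QC step, here is a small dependency-free Python sketch. The sample names and FPKM values are hypothetical, and using 1 minus the Pearson correlation as a sample distance is just one common choice, not something prescribed in this thread.

```python
import math


def log2_fpkm(values, pseudocount=0.1):
    """log2(FPKM + pseudocount); the pseudocount keeps zeros finite."""
    return [math.log2(v + pseudocount) for v in values]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical FPKM values: one list per sample, same gene order in each.
fpkm_by_sample = {
    "ctrl_1": [0.0, 12.5, 250.0, 3.2],
    "ctrl_2": [0.3, 10.1, 230.0, 2.9],
    "treat_1": [5.6, 11.8, 40.0, 3.5],
}

logged = {name: log2_fpkm(vals) for name, vals in fpkm_by_sample.items()}

# 1 - Pearson correlation as a sample-to-sample distance for QC clustering.
distance = {
    (a, b): 1.0 - pearson(logged[a], logged[b])
    for a in logged
    for b in logged
}

for (a, b), d in sorted(distance.items()):
    print(f"{a:8s} {b:8s} {d:.3f}")
```

A distance table like this could be passed to a hierarchical clustering routine, and the same log-transformed values could be used for a PCA or a heatmap.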

ADD REPLY • link 6.4 years ago by Charles Warden 8.3k

score 0 · Answer 1 · 2019-02-26

There can be some situations where using FPKM values for statistical analysis might be useful. However, that strategy tends to be more conservative, and I usually recommend testing edgeR / DESeq2 / limma-voom in a benchmark for "initial" analysis for every project.

I hope to eventually put together a newer paper to explain my point more clearly (about needing to test different freely available programs, and not being able to lock down a particular analysis strategy for all projects), but that won't be in the immediate future.

So, in the meantime, perhaps these are some things that are worth checking out:

A: FPKM or rlog from DEseq2 for machine learning analysis

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

(In this case, DESeq1 was more conservative than ANOVA. You can get some idea about the performance in that particular dataset, though I expect the results to vary for each project, which is why different benchmark papers give different recommendations.)
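For the ANOVA-on-RPKM approach raised in the question, a minimal sketch of a one-way ANOVA F statistic for a single gene across three groups might look like the following. The RPKM values are hypothetical, the log2(x + 0.1) transform follows the pseudocount mentioned earlier in the thread, and the sketch stops at the F statistic to stay dependency-free. In practice a p-value would come from the F distribution, for example via `scipy.stats.f_oneway`.

```python
import math


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical RPKM values for a single gene in three groups.
rpkm_by_group = {
    "group1": [12.1, 10.4, 11.8, 13.0],
    "group2": [20.5, 18.9, 22.3, 19.7],
    "group3": [11.0, 12.6, 9.8, 10.9],
}

# Log-transform before testing, since RPKM values are strongly right-skewed.
log_groups = [
    [math.log2(v + 0.1) for v in vals] for vals in rpkm_by_group.values()
]

f_stat, df_between, df_within = one_way_anova_f(log_groups)
print(f"F = {f_stat:.2f} on ({df_between}, {df_within}) degrees of freedom")
```

With four genes tested this way, the resulting p-values would normally be adjusted for multiple testing. This remains a fallback for when raw counts are unavailable, not a replacement for edgeR, DESeq2, or limma-voom.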