Public RNA-seq data: comparing gene expression using RPKM
5.8 years ago
dle • 0

Hello,

This is my first post in this forum, and I thank you in advance for your help.

I am dealing with a public RNA-seq dataset that only provides RPKM values.

I am interested in comparing the expression of a few genes (4) across 3 groups, with many replicates.

I would have liked to use raw count data with edgeR. Failing that, could I use something like an ANOVA or a linear model to compare gene expression using RPKM values?

Thanks in advance

david

RNA-Seq anova RPKM • 2.1k views

Please use Google, the site search function, and PubMed for opinions on the suitability of RPKM for differential gene expression. This has been discussed many times before. The short answer is that it is no longer recommended, because it normalizes poorly when library composition differs between samples. Use raw counts with an established pipeline such as edgeR or DESeq2.
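To illustrate the library composition problem, here is a minimal sketch in plain Python. The gene names, abundances, and sequencing depth are made-up toy values, and `simulate_counts` is a hypothetical helper that returns expected counts rather than sampled reads. RPKM is computed with the usual formula, reads × 10^9 / (gene length in bp × total mapped reads).

```python
def simulate_counts(abundance, depth):
    """Expected read counts at a fixed depth (toy model, equal gene lengths)."""
    total = sum(abundance.values())
    return {gene: depth * a / total for gene, a in abundance.items()}


def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {
        gene: c * 1e9 / (lengths_bp[gene] * total_reads)
        for gene, c in counts.items()
    }


# Hypothetical genes, each 1 kb long.
lengths = {"geneA": 1000, "geneB": 1000, "geneC": 1000}

# True abundance of geneA and geneB is identical in both samples;
# only geneC is induced (4-fold) in sample 2.
abundance1 = {"geneA": 10, "geneB": 10, "geneC": 10}
abundance2 = {"geneA": 10, "geneB": 10, "geneC": 40}

depth = 3000  # same sequencing depth for both samples
rpkm1 = rpkm(simulate_counts(abundance1, depth), lengths)
rpkm2 = rpkm(simulate_counts(abundance2, depth), lengths)

# geneA looks down-regulated even though its true abundance did not change.
ratio_a = rpkm2["geneA"] / rpkm1["geneA"]
print(f"geneA RPKM ratio (sample2 / sample1): {ratio_a:.2f}")
```

In this toy case the induced geneC takes a larger share of the reads, so the unchanged genes appear to drop. Count-based methods such as edgeR and DESeq2 estimate normalization factors intended to be robust to this kind of shift.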


I agree. I recommend the paper by Anders et al. for a pipeline.


This one is rather old. DESeq has been superseded by DESeq2. See a possible workflow here.


Thank you very much for your answer. I am aware this method would be a last-resort solution, but my input data only contains RPKM values (as with many public datasets, unfortunately).


You can download raw data from most public datasets and analyze from scratch. Where are the data from?


I would recommend taking some time to research options before posting on Biostars, although I admit that you also need first-hand experience (beyond what you can read in the literature). So I think it is also important to run your own comparison on your own data before deciding whether FPKM expression is suitable for your analysis.

So, please do try the count-based methods, but also please take your time to critically assess your dataset to understand it (and the methods) better. You might even discover some sort of novel strategy that should increase the impact of your associated paper :)

Also, independent of the differential expression step, you may want to use log2(FPKM + 0.1) values for things like GSEA or BD-Func enrichment (as well as QC plots, like hierarchical clustering or PCA plots). I also find the direct expression values useful as a sort of validation, to visually inspect the differential expression results (such as in a heatmap).
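For reference, the log2(FPKM + 0.1) transform is a one-liner. This is only a sketch in plain Python; the function name `log2_fpkm` and the example values are made up for illustration. The 0.1 pseudocount keeps zero FPKM values finite.

```python
import math


def log2_fpkm(values, pseudocount=0.1):
    """Return log2(value + pseudocount) for each FPKM value."""
    return [math.log2(v + pseudocount) for v in values]


# Hypothetical FPKM values for one gene across four samples.
fpkm = [0.0, 0.9, 7.9, 102.3]
transformed = log2_fpkm(fpkm)

for raw, logged in zip(fpkm, transformed):
    print(f"FPKM {raw:>6.1f} -> log2(FPKM + 0.1) = {logged:.3f}")
```

The pseudocount sets a floor for unexpressed genes (log2(0.1) is about -3.32 here), so a different pseudocount changes how much low-expression noise is stretched in clustering or PCA plots.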

5.8 years ago

There can be some situations where using FPKM values for statistical analysis might be useful. However, that strategy tends to be more conservative, and I usually recommend testing edgeR / DESeq2 / limma-voom in a benchmark for "initial" analysis for every project.

I hope to eventually put together a newer paper to explain my point more clearly (about needing to test different freely available programs, and not being able to lock down a particular analysis strategy for all projects), but that won't be in the immediate future.

So, in the meantime, perhaps these are some things that are worth checking out:

A: FPKM or rlog from DEseq2 for machine learning analysis

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

(in this case, there was a situation where DESeq1 was more conservative than ANOVA, but you can get some idea about the performance in that particular dataset, even though I expect the results will vary for each project, and that is why you get different recommendations with various benchmark papers).
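As a rough sketch of what an ANOVA on RPKM/FPKM values could look like for a single gene across 3 groups, here is a plain-Python example. The expression values and group names are invented, and this is not the method used in the linked blog post. To stay dependency-free it computes the one-way ANOVA F statistic by hand and gets a p-value by permutation rather than from the F distribution; `scipy.stats.f_oneway` would give the usual parametric p-value if SciPy is available.

```python
import math
import random


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups (lists of floats)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(groups, n_perm=2000, seed=0):
    """P-value from shuffling group labels (assumption: exchangeable samples)."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical RPKM values for one gene, 4 replicates per group.
rpkm_by_group = {
    "group1": [5.1, 6.3, 4.8, 5.9],
    "group2": [5.5, 6.0, 5.2, 6.4],
    "group3": [12.4, 15.1, 11.8, 13.9],
}

# Test on the log2(RPKM + 0.1) scale rather than on raw RPKM.
log_groups = [[math.log2(v + 0.1) for v in vals] for vals in rpkm_by_group.values()]

f_obs = f_statistic(log_groups)
p_val = permutation_p_value(log_groups)
print(f"F = {f_obs:.2f}, permutation p = {p_val:.4f}")
```

With several genes tested this way, the p-values would still need a multiple-testing correction, and the composition issue with RPKM discussed above still applies.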
