Hello,
This is my first post in this forum, and I thank you in advance for your precious help.
I am working with a public RNA-seq dataset that only provides RPKM values.
I am interested in comparing the expression of a few genes (4) across 3 groups, with many replicates.
I would have liked to use raw count data with edgeR... Instead of this approach, could I use something like an ANOVA or a linear model to compare gene expression using RPKM values?
Thanks in advance
david
Please use Google, the forum search function, and PubMed to find opinions on the suitability of RPKM for differential gene expression. This has been discussed many times before. The short answer is: it is no longer recommended, because it normalizes poorly when library composition differs between samples. Use raw counts with an established pipeline such as edgeR or DESeq2.
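To make the library-composition point concrete, here is a toy sketch in Python. All numbers are made up, the function names are hypothetical, and gene lengths are assumed equal for simplicity. Gene A has the same true abundance in both samples, but in sample 2 gene D is strongly up-regulated and takes a larger share of a fixed sequencing depth, so the RPKM of gene A drops even though nothing happened to it.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (length_bp * total_mapped_reads)


def simulate_counts(abundance, depth):
    """Split `depth` reads across genes in proportion to abundance.

    Assumes all genes have the same length, so read share equals
    abundance share (a simplification for illustration only).
    """
    total = sum(abundance.values())
    return {gene: depth * a / total for gene, a in abundance.items()}


LENGTH_BP = 1000      # assumed identical for every gene
DEPTH = 1_000_000     # same sequencing depth in both samples

# Hypothetical true abundances: only gene D differs between samples.
sample1 = {"A": 10, "B": 10, "C": 10, "D": 10}
sample2 = {"A": 10, "B": 10, "C": 10, "D": 70}

counts1 = simulate_counts(sample1, DEPTH)
counts2 = simulate_counts(sample2, DEPTH)

rpkm_a1 = rpkm(counts1["A"], LENGTH_BP, sum(counts1.values()))
rpkm_a2 = rpkm(counts2["A"], LENGTH_BP, sum(counts2.values()))
fold = rpkm_a2 / rpkm_a1

print(f"Gene A RPKM, sample 1: {rpkm_a1:.0f}")
print(f"Gene A RPKM, sample 2: {rpkm_a2:.0f}")
print(f"Apparent fold change of gene A: {fold:.2f}")
```

In this toy case gene A looks 2.5-fold down purely because of gene D. Count-based pipelines such as edgeR and DESeq2 estimate normalization factors from the raw counts to account for this kind of effect, which is not possible when only RPKM values are available.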
I agree. I recommend the paper by Anders et al. for a pipeline.
This one is rather old. DESeq has been superseded by DESeq2. See a possible workflow here.
Thank you very much for your answer. I am aware this method would be a last-resort solution, but my input data only contain RPKM values (as with many public datasets, unfortunately...)
You can download raw data from most public datasets and analyze from scratch. Where are the data from?
I would recommend taking some time to research options before posting on Biostars, although I admittedly think you also need first-hand experience (beyond just what you can read in the literature). So, I think it is also important to run your own comparison with your own data before deciding whether FPKM expression is suitable for your analysis.
So, please do try the count-based methods, but also please take your time to critically assess your dataset to understand it (and the methods) better. You might even discover some sort of novel strategy that should increase the impact of your associated paper :)
Also, independent of the differential expression step, you may want to use log2(FPKM + 0.1) values for things like GSEA or BD-Func enrichment (as well as QC plots, like hierarchical clustering or PCA plots). I also find having the direct expression useful as a sort of validation to visually inspect the differential expression results (such as in a heatmap).
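As a minimal sketch of the log2(FPKM + 0.1) transform mentioned above, using only the Python standard library: the gene names and FPKM values are invented for illustration, and `log2_fpkm` is a hypothetical helper, not part of any package.

```python
import math

PSEUDOCOUNT = 0.1  # the offset from log2(FPKM + 0.1)


def log2_fpkm(value, pseudocount=PSEUDOCOUNT):
    """Log-transform one FPKM value; the pseudocount keeps zeros finite."""
    return math.log2(value + pseudocount)


# Hypothetical FPKM values: one list of samples per gene.
fpkm = {
    "geneA": [0.0, 0.9, 3.9],
    "geneB": [7.9, 15.9, 31.9],
}

log_fpkm = {
    gene: [log2_fpkm(v) for v in values]
    for gene, values in fpkm.items()
}

for gene, values in log_fpkm.items():
    print(gene, [round(v, 2) for v in values])
```

The pseudocount means an FPKM of 0 maps to log2(0.1), about -3.32, instead of negative infinity, so the transformed matrix can go straight into clustering, PCA, or a heatmap.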