Hi all,
I have RPKM values for a single sample (lung adenocarcinoma) and wish to compare it to RPKM values for a group of controls (50 TCGA normal lung samples).
Bearing in mind the one-to-many nature of the analysis, and that RPKMs are the starting point, can someone recommend the best method/software for calculating differential expression with appropriate measures of significance? At its most basic, I have calculated fold changes and Z-scores (mean- and median-based), but I am guessing this is overly simplistic.
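For concreteness, the kind of basic calculation meant here could look roughly like the sketch below. It is only an illustration under assumptions not stated above: the function name `one_vs_many_scores`, the pseudocount of 1, the log2 transform, and the MAD-based scaling for the median version are all illustrative choices, and the data are simulated.

```python
import numpy as np

def one_vs_many_scores(sample, controls, pseudocount=1.0):
    """Score one sample against a control group, gene by gene.

    sample:   shape (n_genes,), RPKM values for the single sample
    controls: shape (n_genes, n_controls), RPKM values for the controls
    The pseudocount and log2 transform are illustrative assumptions.
    """
    s = np.log2(np.asarray(sample, dtype=float) + pseudocount)
    c = np.log2(np.asarray(controls, dtype=float) + pseudocount)

    mean = c.mean(axis=1)
    sd = c.std(axis=1, ddof=1)
    median = np.median(c, axis=1)
    # MAD scaled by 1.4826 so it estimates the SD under normality
    mad = 1.4826 * np.median(np.abs(c - median[:, None]), axis=1)

    # Genes with zero spread in the controls give inf/nan rather than an error
    with np.errstate(divide="ignore", invalid="ignore"):
        z_mean = (s - mean) / sd
        z_median = (s - median) / mad

    # log2fc is the difference from the mean of the log2 control values
    return {"log2fc": s - mean, "z_mean": z_mean, "z_median": z_median}

# Simulated data: 5 genes, 50 controls; gene 0 is raised 8-fold in the sample
rng = np.random.default_rng(0)
controls = rng.lognormal(mean=3.0, sigma=0.3, size=(5, 50))
sample = np.median(controls, axis=1)
sample[0] *= 8
scores = one_vs_many_scores(sample, controls)
```

The Z-scores only say how far the sample sits from the control distribution for each gene; they are not corrected for testing many genes at once.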
All help appreciated.
I had assumed it would not be safe to take raw counts from different sources/centers and attempt differential expression analysis. Do both DESeq and edgeR attempt to correct for issues like differences in sequencing depth?
Different library sizes (due to both different sequencing depth and different ratios of mappable reads) are exactly the raison d'être for these approaches. There are several papers explaining why RPKM does not deal with that appropriately. See e.g. Differential Gene Expression Analysis - Rpkm Vs Readcount and Rnaseq Differential Expression. For RPKM inconsistencies, this blog post is a good starting point.
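To give an idea of what correcting for library size means in practice, here is a minimal sketch of the median-of-ratios idea that DESeq uses to estimate per-sample size factors (edgeR's default, TMM, is a different procedure with the same goal). This is not the DESeq code itself; the function name `median_of_ratios_size_factors` and the toy counts are made up for illustration.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample from a genes x samples count matrix.

    Each sample is compared to a pseudo-reference (the per-gene geometric
    mean across samples); the size factor is the median of those ratios.
    Genes with a zero count in any sample are skipped.
    """
    counts = np.asarray(counts, dtype=float)
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy example: the same expression profile sequenced at 1x, 2x and 4x depth
base = np.array([10, 50, 200, 1000, 30])
counts = np.column_stack([base, 2 * base, 4 * base])

size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors  # columns become comparable
```

In this toy case the three columns differ only by depth, so after dividing by the size factors they become identical. Because the median is taken over genes, a handful of very highly expressed genes has little influence on the estimate.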
Furthermore, if you suspect batch effects (e.g. a lab effect for samples coming from different centers), linear modeling in edgeR can help you correct or account for them. A large-scale RNA-sequencing effort recently published a study that dealt adequately with batch effects. If that is of interest, you could start browsing from the GEUVADIS RNA-Seq website.
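As a rough illustration of what "accounting for batch in a linear model" means, the sketch below adds a batch column to the design matrix next to the condition column. It uses ordinary least squares on simulated log-expression values for one gene, which is a deliberate simplification and not edgeR's negative binomial GLM; the variable names and effect sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
condition = np.repeat([0, 1], n // 2)  # 0 = normal, 1 = tumour
batch = np.tile([0, 1], n // 2)        # hypothetical sequencing centre

# Simulated log-expression of one gene: baseline + condition + batch + noise
true_condition, true_batch = 1.0, 2.0
log_expr = (5.0 + true_condition * condition + true_batch * batch
            + rng.normal(0.0, 0.1, n))

# Design matrix columns: intercept, condition, batch
design = np.column_stack([np.ones(n), condition, batch])
coef, *_ = np.linalg.lstsq(design, log_expr, rcond=None)

# coef[1] is the condition effect with the batch effect accounted for
condition_effect, batch_effect = coef[1], coef[2]
```

This only works when condition and batch are not confounded, i.e. each batch contains samples from more than one condition. If all samples of one condition come from a single centre, the two effects cannot be separated by any model.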