Question

Statistical tests for RPKM comparison

9.6 years ago

rpjl1230 ▴ 20

Hi, I was wondering if anyone can provide some suggestions as to what statistical tests I can use for differential gene expression with RPKM values.

I am stuck with RPKM, as DESeq and edgeR are unfortunately not an option for me. I have generated some RNA-Seq data from a dozen tissue samples collected from patients with 2 disease phenotypes (n=6 for each group), and I would like to compare my results to published RNA-Seq data from in vitro systems. Unfortunately, the authors of the published paper only provided a table of RPKM values (and only a single sample for each of the tested conditions!). They also did not deposit any of the fastq files in any database, so I cannot even redo the alignment.

In this scenario, where I have no raw counts for the published data and the group sizes are uneven, is there any way that I can still reliably compare my data to the existing data and do differential expression analysis? Any suggestions will be very helpful!

Many thanks!

RPKM differential-gene-expression • 6.2k views

updated 22 months ago by Ram 44k • written 9.6 years ago by rpjl1230 ▴ 20

RPKMs are terrible for statistics. Would it be possible to just analyse your samples and compare the resulting fold-changes/DE genes to those from the published study? That might yield nice results.
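
As a rough illustration of that idea, here is a minimal Python sketch that computes per-gene log2 fold-changes within each dataset separately and then asks how often the direction of change agrees. The gene names, the expression values, the pseudocount of 1 and the |log2FC| >= 1 filter are all made up for the example; they are assumptions, not recommendations.

```python
import math

def log2_fold_change(case, control, pseudocount=1.0):
    """Per-gene log2(case / control); the pseudocount avoids division by zero."""
    return {
        gene: math.log2((case[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in case.keys() & control.keys()
    }

def direction_agreement(lfc_a, lfc_b, min_abs_lfc=1.0):
    """Among genes changing by at least min_abs_lfc in both datasets,
    return the fraction whose fold-changes share the same sign."""
    shared = [
        gene for gene in lfc_a.keys() & lfc_b.keys()
        if abs(lfc_a[gene]) >= min_abs_lfc and abs(lfc_b[gene]) >= min_abs_lfc
    ]
    if not shared:
        return float("nan")
    same = sum(1 for gene in shared if (lfc_a[gene] > 0) == (lfc_b[gene] > 0))
    return same / len(shared)

# Hypothetical inputs: group means for the patient data,
# single RPKM values per condition for the published table.
patient_case = {"GENE_A": 30.0, "GENE_B": 2.0, "GENE_C": 10.0}
patient_ctrl = {"GENE_A": 7.0, "GENE_B": 11.0, "GENE_C": 9.0}
published_case = {"GENE_A": 63.0, "GENE_B": 0.0, "GENE_C": 3.0}
published_ctrl = {"GENE_A": 15.0, "GENE_B": 7.0, "GENE_C": 4.0}

patient_lfc = log2_fold_change(patient_case, patient_ctrl)
published_lfc = log2_fold_change(published_case, published_ctrl)
agreement = direction_agreement(patient_lfc, published_lfc)
```

Because each fold-change is computed inside a single dataset, the two studies never have to be put on a common scale; only the direction and rough size of the changes are compared.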

9.6 years ago by Devon Ryan 104k

I thought about doing FC as well, but I have another dilemma. What would you use as a cut-off? Presumably I will have to use some arbitrary cut-off, e.g. if log2FC > 2 then highly abundant. Is there any way to make it more objective?

Also, to do FC, would you:

Just do a standard FC calculation of my data relative to the published data? Or
Calculate FC from the median RPKM for each sample and then compare the perturbation of my patient data to the in vitro data? I thought about this because the patient samples have pretty high intragroup variance (some samples have a much higher/lower median RPKM than others within the same group) and I am not sure how to "normalize" properly.

Thanks again!

updated 22 months ago by Ram 44k • written 9.6 years ago by rpjl1230 ▴ 20

I am facing a similar situation, with data on GEO only being available as log2 FPKM. After you converted to TPM, how did you end up testing for differential expression?

Thanks!

8.7 years ago by acorella ▴ 30

Ram · Answer 1 · 2015-05-11

This is a very good article about the pros and cons of doing stats on FPKM, and how to calculate TPM, which is by all means a better stat when you don't have much else to work with.

https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

There are plenty of others, but that link also links to all the good ones too :)
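
For anyone who just wants the arithmetic: within a single sample, TPM is the RPKM/FPKM of each gene divided by the sum of RPKM/FPKM over all genes in that sample, times one million. A minimal Python sketch of that is below. The function names and the toy values are mine, and the `unlog2` helper assumes you know which pseudocount (if any) was added before the log2 was taken, which you would have to check against the methods of the dataset in question.

```python
def rpkm_to_tpm(rpkm_by_gene):
    """Rescale one sample's RPKM/FPKM values so that they sum to one million."""
    total = sum(rpkm_by_gene.values())
    if total <= 0:
        raise ValueError("RPKM values must sum to a positive number")
    return {gene: value / total * 1e6 for gene, value in rpkm_by_gene.items()}

def unlog2(log2_by_gene, pseudocount=0.0):
    """Undo a log2(x + pseudocount) transform; the pseudocount is an assumption."""
    return {
        gene: max(2.0 ** value - pseudocount, 0.0)
        for gene, value in log2_by_gene.items()
    }

# Toy values for one sample (hypothetical gene names).
sample_rpkm = {"GENE_A": 10.0, "GENE_B": 30.0, "GENE_C": 60.0}
sample_tpm = rpkm_to_tpm(sample_rpkm)
```

One caveat: the sum has to run over every gene that was quantified in the sample. If the published table was filtered to a subset of genes, the result is only an approximation of the true TPM.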

But to answer the higher-level question of not "how can you" but "why should you" do differential analysis, consider the following sources of noise:

your technical "hands" versus theirs (experimental noise)
in vivo vs in vitro (biological noise - this will most likely be huge)
differences in sequencing technology (paired vs single)
differences in data quality and total data (read depth, better mapping, etc)
all the statistical noise inherent in going into an analysis 'blind' without looking at a specific region/difference to begin with.

We've all got to prioritise time, and personally, if it were me, I would either do a very rudimentary analysis to get a ballpark figure for overall correlation, or not even bother downloading the external data in the first place and instead focus my efforts on doing the very best/cleanest analysis of what sounds like (n=6) some pretty good data by RNA-Seq standards :)
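
If you do go for the rudimentary route, one way to get that ballpark figure is a Spearman rank correlation over the genes shared by the two tables. The sketch below is plain Python with no dependencies; the function names and toy values are hypothetical, and it is meant as a quick sanity check, not a differential expression test.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)

def spearman_shared_genes(expr_a, expr_b):
    """Spearman correlation computed over genes present in both tables."""
    genes = sorted(expr_a.keys() & expr_b.keys())
    if len(genes) < 3:
        raise ValueError("need at least 3 shared genes")
    ranks_a = average_ranks([expr_a[g] for g in genes])
    ranks_b = average_ranks([expr_b[g] for g in genes])
    return pearson(ranks_a, ranks_b)

# Toy values (hypothetical): one of your samples vs one published column.
mine = {"GENE_A": 5.0, "GENE_B": 50.0, "GENE_C": 500.0, "GENE_D": 1.0}
published = {"GENE_A": 8.0, "GENE_B": 20.0, "GENE_C": 900.0, "GENE_E": 3.0}
rho = spearman_shared_genes(mine, published)
```

Since Spearman only uses ranks, it gives the same answer whether the inputs are RPKM, log2 RPKM or TPM within a sample, which sidesteps some of the unit problems discussed above.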