Question

Differential expression analysis using Lung cancer CPTAC RNA SEQ Data

3.1 years ago

Ezequiel • 0

Hello! I wanted to reach out to the community of experts to get some advice. Two years ago, this Resource paper on lung cancer from the CPTAC consortium came out: https://www.sciencedirect.com/science/article/pii/S0092867420307443#mmc1 , which is really a goldmine of valuable data for the field. We frequently use this kind of resource to generate and validate hypotheses. However, in this particular dataset the RNA-seq data are reported as RPKM or z-scores instead of raw counts, which precludes their use for count-based differential expression analysis (with DESeq2, for example). So my questions to the community are:

1) What would be the appropriate way of comparing RNA-seq expression between groups of samples using the RPKM values (if any)? Could I just run a Wilcoxon test or a t-test on them? I think this is not appropriate, but I want to know whether I can work with the data as is.

2) Does anybody know if there is a way of either requesting the raw counts or regenerating them from the currently published data?

3) I am sure there is a good reason for publishing RPKM instead of raw counts. Would somebody be so kind as to briefly explain its advantages? If statistics cannot be run on RPKM values, I have a hard time understanding the choice.

Thanks a lot!! I would really appreciate your input!!

Ezequiel

CPTAC RNA-SEQ Deseq2 PROTEOMICS RPKM • 1.3k views

updated 3.1 years ago by dsull ★ 7.6k • written 3.1 years ago by Ezequiel • 0

No, there is no good reason to use RPKMs. Period.

3.1 years ago by dsull ★ 7.6k

Also, I'd actually argue that the Wilcoxon test would be perfectly fine if you have lots of samples. limma and DESeq2 were designed to handle the problems associated with small sample sizes; with larger sample sizes, nonparametric tests tend to perform just as well or better.

3.1 years ago by dsull ★ 7.6k
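For anyone who wants to try the Wilcoxon route suggested above, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the function names (`rank_sum_pvalue`, `benjamini_hochberg`, `wilcoxon_de`) and the `rpkm[gene][sample]` dictionary layout are made up for this example, the p-value uses the normal approximation with tie and continuity corrections (reasonable only for large groups), and Benjamini-Hochberg is my choice of multiple-testing correction, not something specified in the thread. Because the test is rank-based, it gives the same result on RPKM and on log-transformed RPKM.

```python
import math


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) p-value.

    Uses the normal approximation with tie and continuity corrections,
    so it is only appropriate when both groups are reasonably large.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average rank shared by tied values
        for k in range(i, j + 1):
            ranks[k] = midrank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    r1 = sum(r for r, (_, grp) in zip(ranks, pooled) if grp == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value is tied: no evidence either way
        return 1.0
    z = max((abs(u1 - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def wilcoxon_de(rpkm, group_a, group_b):
    """Per-gene rank-sum test; rpkm is a hypothetical {gene: {sample: value}} dict."""
    genes = sorted(rpkm)
    pvals = [
        rank_sum_pvalue([rpkm[g][s] for s in group_a],
                        [rpkm[g][s] for s in group_b])
        for g in genes
    ]
    padj = benjamini_hochberg(pvals)
    return {g: (p, q) for g, p, q in zip(genes, pvals, padj)}
```

Calling `wilcoxon_de(rpkm, tumor_samples, normal_samples)` would return a `(p-value, adjusted p-value)` pair per gene; the two sample lists are placeholders for whatever grouping you are comparing.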

score 2 · Answer 1 · 2022-05-12

You can download the count data from GDC and you can apply your favorite downstream count-based analysis on that (DESeq2, limma-voom, etc.). The GDC website is comprehensive and takes some time to understand how to navigate, but here's the search to get you started: GDC CPTAC lung counts

You'll have to learn some of the GDC nomenclature/mapping and merge the individual samples into a single count matrix, but it is definitely worth learning to mine data directly from the GDC.
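The merge step can be sketched roughly as below. This is a hedged example, not a description of the GDC file specification: it assumes each sample is a tab-separated file with a header row, a gene identifier column called `gene_id`, and a raw-count column called `unstranded`, plus optional `#` comment lines and STAR summary rows starting with `N_`. Check those names against the files you actually download and adjust the `gene_col` and `count_col` arguments. The function names and the sample-to-path mapping are made up for illustration.

```python
import csv


def read_counts(path, gene_col="gene_id", count_col="unstranded"):
    """Read one per-sample count file into a {gene: count} dict.

    The column names are assumptions about the file layout; pass
    different ones if your files use other headers.
    """
    counts = {}
    with open(path, newline="") as handle:
        # Skip comment lines so the first remaining line is the header.
        rows = (line for line in handle if not line.startswith("#"))
        for row in csv.DictReader(rows, delimiter="\t"):
            gene = row[gene_col]
            if gene.startswith("N_"):  # summary rows, not genes
                continue
            counts[gene] = int(row[count_col])
    return counts


def merge_counts(sample_to_path):
    """Merge per-sample files into (genes, samples, matrix).

    matrix[i][j] is the count of genes[i] in samples[j]. Only genes
    present in every sample are kept.
    """
    per_sample = {s: read_counts(p) for s, p in sample_to_path.items()}
    if not per_sample:
        return [], [], []
    samples = sorted(per_sample)
    shared = set.intersection(*(set(c) for c in per_sample.values()))
    genes = sorted(shared)
    matrix = [[per_sample[s][g] for s in samples] for g in genes]
    return genes, samples, matrix
```

The resulting gene-by-sample matrix of integers is the shape that count-based tools such as DESeq2 or limma-voom expect once it is written out as a table.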

score 1 · Answer 2 · 2022-05-12

You could try Wilcoxon tests if the sample size is really large, but this is suboptimal. limma-trend on the RPKMs is theoretically possible but not recommended, because the division by gene length in RPKM distorts the mean-variance relationship that limma's entire empirical Bayes approach relies on modelling properly. If you and your lab really use these data often, then why not apply for access, download the data and process it properly? You only have to do that once, and then you have your matrix of raw counts and can re-use it in whatever scenario you want for years.

Edit: See answer by @seancho -- raw counts seem to be available.
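To make the length-division point in this answer concrete, here is the standard RPKM definition written out. The symbols are my own notation, not from the paper: $c_g$ is the number of reads assigned to gene $g$, $L_g$ is the gene length in base pairs, and $N$ is the total number of mapped reads in the sample.

```latex
\mathrm{RPKM}_g = \frac{10^{9}\, c_g}{L_g\, N}
\qquad\Longleftrightarrow\qquad
c_g = \frac{\mathrm{RPKM}_g \, L_g \, N}{10^{9}}
```

So going back from RPKM to counts needs both the gene lengths and the per-sample library sizes that were used. It also shows why equal RPKM does not mean equal precision: with $N = 2 \times 10^{7}$ and $\mathrm{RPKM}_g = 10$, a 500 bp gene corresponds to $c_g = 100$ reads, while a 5000 bp gene corresponds to $c_g = 1000$ reads.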