Hello! I wanted to reach out to the community of experts to get some advice. Two years ago this Resource paper on lung cancer from the CPTAC consortium came out: https://www.sciencedirect.com/science/article/pii/S0092867420307443#mmc1 , which is really a goldmine of valuable data for the field. We frequently use this kind of resource to generate and validate hypotheses. However, in this particular dataset the RNA-seq data are reported as RPKM or z-scores instead of raw counts, which precludes their use for count-based differential expression analysis (using DESeq2, for example). So my questions to the community are:
1) What would be the appropriate way, if any, of comparing RNA-seq expression between groups of samples using the RPKM values? Could I just run a Wilcoxon test or a t-test on them? I suspect this is not appropriate, but I want to know whether I can work with the data as is.
2) Does anybody know if there is a way of either requesting the raw counts or regenerating them from the currently published data?
3) I am sure there is a good reason for publishing RPKM instead of raw counts. Would somebody be so kind as to briefly explain its advantages? If statistics cannot be run on RPKM values, I have a hard time understanding the choice.
Thanks a lot!! I would really appreciate your input!!
Ezequiel
No, there is no good reason to use RPKMs. Period.
Also, I'd actually argue that the Wilcoxon test would be perfectly fine if you have lots of samples. Tools such as limma and DESeq2 were designed to handle problems associated with small sample sizes; with larger sample sizes, nonparametric tests tend to perform just as well or better.
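For what it's worth, here is a minimal sketch of what a per-gene Wilcoxon rank-sum comparison on RPKM values could look like in Python. It assumes you have loaded the published table into a genes x samples array and have a boolean mask marking one of the two groups; the function names (`benjamini_hochberg`, `wilcoxon_per_gene`) and the toy numbers are made up for illustration and are not from the paper. The multiple-testing correction is a plain Benjamini-Hochberg adjustment written out by hand.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (FDR control)."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # Scale each sorted p-value by n / rank.
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def wilcoxon_per_gene(rpkm, is_group_a):
    """Run one two-sided rank-sum test per gene.

    rpkm       : genes x samples array of RPKM values (hypothetical input)
    is_group_a : boolean mask over samples, True for group A
    """
    rpkm = np.asarray(rpkm, dtype=float)
    is_group_a = np.asarray(is_group_a, dtype=bool)
    a = rpkm[:, is_group_a]
    b = rpkm[:, ~is_group_a]
    pvals = np.array([
        mannwhitneyu(a[i], b[i], alternative="two-sided").pvalue
        for i in range(rpkm.shape[0])
    ])
    return pvals, benjamini_hochberg(pvals)


if __name__ == "__main__":
    # Toy data only: 3 genes, 6 samples, first 3 samples in group A.
    toy = np.array([
        [1.0, 1.2, 0.9, 5.1, 4.8, 5.5],
        [2.0, 2.3, 1.9, 2.1, 2.2, 1.8],
        [0.5, 0.7, 0.6, 0.4, 0.8, 0.65],
    ])
    mask = np.array([True, True, True, False, False, False])
    p, q = wilcoxon_per_gene(toy, mask)
    print(p, q)
```

Because the test is rank-based, log-transforming the RPKM values first would not change the p-values. With only a handful of samples per group the smallest attainable p-value is large, so this only makes sense for the "lots of samples" case described above.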