Question

DE Analysis starting from TPM matrix? Also, no replicates.

0


6.6 years ago

basch • 0

I have a matrix of TPM values from 27 different tissue types that I obtained from a database (thus, I don't have the read counts). The data comes from an RNA-seq experiment.

I want to perform a differential expression analysis, with the aim of finding a set of genes that is specific to one of those tissue types.

Is this possible? I've used DESeq2 before, but that starts from read counts, not TPM values. Since I am working with pig, there are not many databases available from which I could extract such marker genes.

Thank you in advance,

tpm RNA-Seq • 3.9k views


1


6.6 years ago

Michael 55k

Your first priority should be getting the raw data. You write that you got it from 'a database'. If the data extracted from that database is based on published data, then it should be possible to get the raw data as well, and normally it will include replicates. E.g. when retrieving summarized tissue expression from Expression Atlas, there is always a link to the original datasets and publications.


score 4 · Accepted Answer · 2018-11-20

Most tools expect raw counts, as you mention. Without replicates, any analysis will be exploratory rather than statistically sound. You can compute log2 fold changes to get an idea of which genes might be involved; applying a reasonable TPM cutoff makes sense here, to avoid inflated fold changes driven by very low expression values (the "mean-variance relationship"). Still, any result will be unreliable, so be careful about building downstream experiments on such an analysis.
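To make the fold-change idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the function name `log2_fold_changes`, the nested-dict layout of the TPM matrix, the cutoff of 1 TPM in the tissue of interest, the pseudocount of 1, and the choice to compare against the median of the other tissues. It is exploratory only and gives no measure of significance.

```python
import math
from statistics import median

def log2_fold_changes(tpm, target, min_tpm=1.0, pseudocount=1.0):
    """Exploratory log2 fold change of `target` vs. the median of all other tissues.

    tpm: {gene: {tissue: TPM}} (hypothetical layout for this sketch).
    Genes below `min_tpm` in the target tissue are dropped, so that tiny
    values cannot produce large, meaningless ratios.
    """
    result = {}
    for gene, by_tissue in tpm.items():
        target_value = by_tissue[target]
        if target_value < min_tpm:
            continue  # expression cutoff (assumed threshold)
        others = [v for t, v in by_tissue.items() if t != target]
        baseline = median(others)
        # The pseudocount keeps the ratio finite when the baseline is 0.
        ratio = (target_value + pseudocount) / (baseline + pseudocount)
        result[gene] = math.log2(ratio)
    return result

# Toy data, invented purely for the example.
example = {
    "geneA": {"liver": 120.0, "heart": 3.0, "lung": 5.0},
    "geneB": {"liver": 0.4, "heart": 0.0, "lung": 0.1},
}
lfc = log2_fold_changes(example, "liver")
ranked = sorted(lfc, key=lfc.get, reverse=True)  # most liver-enriched first
```

In this toy input, geneB is removed by the cutoff, so only geneA is ranked. Both the cutoff and the pseudocount change the ranking noticeably for lowly expressed genes, so it is worth trying a few values.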

score 3 · Accepted Answer · 2018-11-20

Given the data you have, I don't see much chance of doing a DE analysis. DESeq and edgeR both require counts. It might be possible to do something using limma with TPMs, but without replicates, you are going to struggle to get anything meaningful.

Instead, I think you'd be better off not thinking about your problem as a differential expression problem.

There are various approaches to identifying tissue-specific genes (in fact, I believe I saw something on bioRxiv recently), but a simple approach might be an outlier analysis.

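First normalise your data. An obvious approach might be rlog or vst, although both are DESeq2 transformations designed for counts, so with only TPMs a simple log2(TPM + 1) transform may be the practical substitute.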

Then for each gene calculate the mean and standard deviation of all the tissues except your tissue of interest. Then calculate a Z score for the expression of the gene in the tissue of interest using this. Convert this to a p-value using the normal distribution. You'll need to do some corrections. Ideally you'd do some sort of empirical FDR, but I can't quite think how right now. You might get away just with doing a BH correction on the p-value.
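The outlier procedure described above could be sketched roughly as follows, using only the Python standard library. Several things here are assumptions and not part of the original suggestion: the function names, the nested-dict layout of the matrix, the use of log2(TPM + 1) in place of rlog/vst (which expect counts), and a one-sided test that only looks for genes higher in the tissue of interest. The normal approximation is also an assumption, and the resulting p-values should be read as a ranking aid, not as rigorous statistics.

```python
import math
from statistics import NormalDist, mean, stdev

def tissue_outlier_pvalues(tpm, target):
    """Z score and one-sided p-value per gene for `target` vs. all other tissues.

    tpm: {gene: {tissue: TPM}} (hypothetical layout for this sketch).
    Returns {gene: (z, p)}.
    """
    results = {}
    for gene, by_tissue in tpm.items():
        # log2(TPM + 1) stands in for rlog/vst here (assumption).
        logged = {t: math.log2(v + 1.0) for t, v in by_tissue.items()}
        background = [v for t, v in logged.items() if t != target]
        if len(background) < 2:
            continue  # a standard deviation needs at least two tissues
        mu, sd = mean(background), stdev(background)
        if sd == 0:
            continue  # flat background: Z score is undefined
        z = (logged[target] - mu) / sd
        p = NormalDist().cdf(-z)  # upper-tail probability, P(Z >= z)
        results[gene] = (z, p)
    return results

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values for {gene: p}."""
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    m = len(ordered)
    adjusted = {}
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        gene, p = ordered[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[gene] = running_min
    return adjusted

# Toy data, invented purely for the example.
example = {
    "geneA": {"liver": 1023.0, "heart": 1.0, "lung": 3.0, "kidney": 7.0},
    "geneB": {"liver": 3.0, "heart": 1.0, "lung": 3.0, "kidney": 7.0},
}
stats = tissue_outlier_pvalues(example, "liver")
padj = benjamini_hochberg({g: p for g, (z, p) in stats.items()})
```

With 27 tissues the background mean and standard deviation come from 26 values per gene, so a single other tissue that also expresses the gene strongly will inflate the standard deviation and hide it; a robust alternative (median and MAD) may be worth trying.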