Question

Significant gene expression in non-normally distributed gene expression data

4.6 years ago

jabbari.parnian ▴ 30

Hi everyone

I have normalized expression data in log2-transformed format. I wanted to find significantly expressed genes in these data using a z-score, as in the article by [Hart et al], but my data are not normally distributed. I was wondering if there is another method to find the active / significantly expressed genes for each cell (column in my data set)?

Thanks

RNA-Seq DEG • 931 views

updated 4.6 years ago by i.sudbery 21k • written 4.6 years ago by jabbari.parnian ▴ 30

score 1 · Answer 1 · 2020-05-26

The zFPKM method cited relies on the assumption that the underlying expression of expressed genes is approximately log-normal (which is probably a good approximation), and that any non-normality is introduced by the sampling distribution of the measurement method and by signal from non-expressed genes.
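To make that concrete, here is a rough sketch of the idea in Python. It is not the published zFPKM implementation: I take the peak from a crude histogram rather than a proper kernel density estimate, the data are simulated, and the function names, the bin width and the -3 cutoff are all my own assumptions for illustration.

```python
import math
import random
from collections import Counter

def histogram_mode(values, bin_width=0.5):
    """Centre of the most populated bin: a crude stand-in for a density peak."""
    bins = Counter(math.floor(v / bin_width) for v in values)
    top_bin = max(bins, key=bins.get)
    return (top_bin + 0.5) * bin_width

def fit_right_half(log_expr, bin_width=0.5):
    """Fit a Gaussian using only values to the right of the peak.

    The right-hand side is assumed to be dominated by genuinely expressed
    genes, so its spread around the peak estimates the Gaussian's sd.
    """
    mu = histogram_mode(log_expr, bin_width)
    right = [v - mu for v in log_expr if v > mu]
    sd = math.sqrt(sum(d * d for d in right) / len(right))
    return mu, sd

def z_scores(log_expr, mu, sd):
    return [(v - mu) / sd for v in log_expr]

# Simulated log2 expression values (assumption, not real data):
# a log-normal "expressed" population plus a low-level background shoulder.
rng = random.Random(0)
expressed = [rng.gauss(5.0, 1.5) for _ in range(5000)]
background = [rng.gauss(-4.0, 2.0) for _ in range(1500)]
log_expr = expressed + background

mu, sd = fit_right_half(log_expr)
CUTOFF = -3.0  # assumed threshold for calling a gene "active"
n_active = sum(z >= CUTOFF for z in z_scores(log_expr, mu, sd))
print(f"peak={mu:.2f} sd={sd:.2f} active={n_active}/{len(log_expr)}")
```

The point is that only the right-hand side of the distribution is used for the fit, so the left-hand shoulder can be as non-normal as it likes. What cannot be tolerated is the expressed population itself not being log-normal.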

However, this method is probably not suitable, unmodified, for single-cell studies, which it sounds like you might be doing here? This is because the approximation to normal probably arises from averaging over many cells (or rather, is only detectable when averaging over many cells).

Genes are expressed according to an abstract regulatory level; let's call it lambda. You might like to think of lambda as the probability that a promoter fires in a particular time period. In each cell, lambda will vary slightly according to the internal state of the cell, and we can assume that the distribution of the underlying lambdas is normal (or log-normal) across the cell population.

Lambda is converted to a read count via a series of Poisson processes. When the read count is high enough, read counts averaged over a population will give you a normal distribution, and so the read count is a good enough estimate of lambda. However, in a single cell, and with low read counts, this does not hold.
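A toy simulation shows the effect. This is only a sketch under simplifying assumptions: a single Poisson step stands in for the whole series of sampling processes, lambda is log-normal across cells, and the medians (100 and 0.3 expected reads per cell) and spread are numbers I made up.

```python
import math
import random

def poisson(rng, lam):
    """Knuth's multiplication sampler; fine for the modest rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(median_rate, n_cells=2000, sigma=0.3, seed=1):
    """Correlation between log2(lambda) and log2(count + 1) across cells."""
    rng = random.Random(seed)
    # Per-cell lambda, log-normally distributed around median_rate.
    lambdas = [rng.lognormvariate(math.log(median_rate), sigma)
               for _ in range(n_cells)]
    # Observed counts: one Poisson draw per cell.
    counts = [poisson(rng, lam) for lam in lambdas]
    return pearson([math.log2(lam) for lam in lambdas],
                   [math.log2(c + 1) for c in counts])

high_r = simulate(median_rate=100.0)  # deeply sampled gene
low_r = simulate(median_rate=0.3)     # sparsely sampled, single-cell-like
print(f"high depth r={high_r:.2f}, low depth r={low_r:.2f}")
```

At high depth the log counts track log lambda closely, so they inherit its roughly normal shape. At low depth the Poisson noise swamps the variation in lambda and the counts tell you very little about it.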

There might be a way to work through the maths to work out how to estimate lambda from the read counts, but I don't know it. Imputing the missing zeros might also help, I don't know.

You could have a look at the answers given here: https://bioinformatics.stackexchange.com/questions/687/what-methods-are-available-to-find-a-cutoff-value-for-non-expressed-genes-in-rna/712#712

Particularly my answer, which might work in a single-cell context.

But in general, the problem with single-cell is that having zero counts is not good evidence of not being expressed, and I'm not sure there is any way around that.
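To put rough numbers on that, assuming pure Poisson sampling with no extra dropout (an assumption, and probably an optimistic one), the chance of seeing zero reads from a gene whose expected count in a cell is lambda is exp(-lambda):

```python
import math

def prob_zero(expected_count):
    """P(count == 0) for a Poisson variable with the given mean."""
    return math.exp(-expected_count)

# Illustrative expected counts per cell (made-up values).
for lam in (0.1, 0.5, 1.0, 3.0, 10.0):
    print(f"expected count {lam:>4}: P(zero) = {prob_zero(lam):.3f}")
```

So a gene that is expressed, but at a level that yields about one read per cell on average, will still show a zero in more than a third of cells.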