Question

Seurat: Log fold change and p value when control expression is 0

17 months ago

psm ▴ 130

Simple question, sorry if it is obvious, but I was unable to find an answer to this exact question.

I have a single cell RNAseq dataset and I'm performing differential gene expression at the cluster level, comparing transcript expression based on experimentally defined treatments (+/- IFNg). One gene in particular, CIITA, is literally absent from the dataset in the control condition, but detectable in a sizeable subset of IFNg-treated cells, in all clusters.

Differential expression analysis with both Seurat and DESeq2 gives me p-values and log2 fold changes that are significant for some clusters and not significant for others, despite obvious upregulation of CIITA in a subset of cells in each IFNg-treated cluster (5-20% of cells per cluster with non-zero expression).

My question is - when one group starts at zero, and the comparison group is non-zero, are these statistical tests valid? Clearly the log2 fold change is meaningless, as 0 expression to anything should be infinite. I'm guessing the fact that a number can even be returned for LogFoldChange reflects the offsetting of counts by a small value to eliminate Log(0) errors. But this offset probably also influenced the p value.
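To illustrate the offset effect suspected above, here is a minimal sketch. It is not the exact formula used by Seurat or DESeq2, which differs by tool and version; the function name, group means, and pseudocount values are all made up for illustration.

```python
import math

def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """log2 ratio of group means after adding the same pseudocount to both."""
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))

# Hypothetical group means: control is exactly 0, treated is small but non-zero.
mean_control = 0.0
mean_treated = 0.3

# The reported value is finite only because of the offset, and it grows
# without bound as the pseudocount shrinks toward 0.
for pc in (1.0, 0.1, 0.001):
    lfc = log2_fold_change(mean_treated, mean_control, pseudocount=pc)
    print(f"pseudocount={pc}: log2FC={lfc:.2f}")
```

Under these assumptions, the number returned says more about the chosen offset than about the biology, which is why the fold change for a gene with zero control expression is hard to interpret on its own.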

Any thoughts on how I can proceed? Are there any packages/methods that address this issue? Is it even meaningful to compare genes where expression is completely absent from the control group?

Many thanks

RNA-seq DGE • 1.7k views

updated 17 months ago by LauferVA 4.5k • written 17 months ago by psm ▴ 130

An MA plot may help you identify such lowly expressed genes with large log2 fold changes.

see also https://support.bioconductor.org/p/108491/

17 months ago by Ming Tommy Tang ★ 4.5k

score 1 · Answer 1 · 2023-06-26

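It entirely depends on what exactly you test, which flows from your understanding of the biology. Consider attempting to calculate a p-value for your data using a two-sample t-test. Recall that the variance of each sample can be written as:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

where $x_i$ is the expression value in cell $i$, $\bar{x}$ is the sample mean, and $n$ is the number of cells in the sample.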

For the IFNg-naive group, every term in the numerator of the sum in this expression will be 0 (each value equals the mean, which is itself 0), thus the variance is also 0. As a result, if one proceeded with a two-sample t-test, eventually the ratio of the sample mean to the sample variance would need to be calculated for the IFNg-naive group, which would be 0/0 (undefined).
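A small numeric sketch of this point, using made-up per-cell values (the lists `control` and `treated` and the helper `sample_variance` are hypothetical, not taken from the dataset in the question):

```python
import statistics

# Hypothetical per-cell expression values for one gene in one cluster.
control = [0.0] * 8                                 # never detected in control cells
treated = [0.0, 0.0, 0.0, 1.2, 0.0, 2.1, 0.0, 0.0]  # detected in a minority of cells

def sample_variance(values):
    """Unbiased sample variance: sum of squared deviations over (n - 1)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

var_control = sample_variance(control)
var_treated = sample_variance(treated)
print("control mean, variance:", statistics.mean(control), var_control)
print("treated mean, variance:", statistics.mean(treated), var_treated)

# A mean/variance ratio for the control group is 0/0, which Python
# refuses to evaluate for floats.
try:
    ratio = statistics.mean(control) / var_control
except ZeroDivisionError:
    ratio = None
print("control mean/variance ratio:", ratio)
```

The all-zero group contributes no variance information of its own, which is why a stable variance estimate has to come from somewhere else.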

So, instead, you must use another approach to calculate a test statistic. For expression data, many heuristics, work-arounds, and meta-analytic techniques are used to generate stable variance estimates, which is a necessity due to the combination of high variability and (in many cases) low numbers of samples (e.g. in bulk-seq). For instance, you could use the values of other genes in the study with similar characteristics to CIITA to estimate what the true dispersion of CIITA in the IFNg naive cluster is. Reading the DESeq manuscript, for instance, could help you understand this in greater detail.

More practically, obtaining significant p-values for expression differences in single-cell datasets is not difficult to do... As such, the particular technique used to get a p-value matters less, especially if visualization implies profoundly different expression between the two groups. So a simple workaround, such as a one-sample t-test, can be considered. Here, you would be assuming that the dispersion in the IFNg-naive and IFNg-treated groups is similar, and testing the likelihood of the mean difference under this assumption (about the variances of the two groups). Based on the description, it seems like this would be a conservative approach, since the true variance of the IFNg-naive group is likely to be much smaller than that of the IFNg-treated group (granted that the gene wasn't detected in ANY sample).
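A minimal sketch of that workaround, assuming the treated cells are tested against a fixed mean of 0 (the value observed in the control group). The data are simulated with made-up parameters, the helper `one_sample_t` is hypothetical, and the p-value uses a normal approximation to the t distribution so the example needs only the standard library; with a statistics package available, an exact t-based p-value would be preferable.

```python
import math
import random
import statistics

def one_sample_t(values, mu0=0.0):
    """t statistic and degrees of freedom for H0: population mean equals mu0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1

# Simulated treated cluster (made-up numbers): 200 cells, roughly 15% of
# which have non-zero expression of the gene.
rng = random.Random(0)
treated = [rng.uniform(0.5, 2.5) if rng.random() < 0.15 else 0.0 for _ in range(200)]

t_stat, df = one_sample_t(treated, mu0=0.0)

# With this many cells the t distribution is close to normal, so a normal
# approximation to the two-sided p-value is used here.
p_approx = 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t_stat)))
print(f"t = {t_stat:.2f}, df = {df}, approximate p = {p_approx:.2e}")
```

Only the treated group's variance enters the denominator here, so the zero-variance control group never causes a division by zero.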