Hello, I am trying to perform a t-test in python to estimate if the expression of a gene is statistically different between two clusters of cells in an scRNA-seq experiment.
When I do the following:
from scipy import stats as st
st.ttest_ind(counts[gene_1], counts[gene_2])
I get a p-value for some of my comparisons. In others, however, I get the following:
Ttest_indResult(statistic=nan, pvalue=nan)
A nan instead of a p-value.
Please correct me if I am wrong, but I think a t-test requires the data to be normally distributed. I have log-transformed my gene counts, which should help with that, but I still get a very zero-inflated distribution. Find below histograms of the expression of a gene in two clusters of cells for which a t-test returns a nan value:
And here is a histogram of the expression of another gene in two clusters of cells for which a t-test returns a valid p-value:
It looks like that the reason is the low expression level of the cells that express some of the gene. But I don't understand why this leads to a nan p-value.
I have a number of questions: Am I doing right using a t-test in this case? Is there another test I should be using instead? What should I do with these nan p-values, or how should I avoid them? Any help understanding what is going on would be so useful.
Thanks so much and my apologies in advance for my poor knowledge in statistics.
I would suggest that you always try to use wrappers to run statistical tests, I am pretty sure that in Python something like Scanpy has differential testing routines that you can apply. The same goes for Seurat and Bioconductor. It is also not really meaningful to test a single gene but rather test all genes and then see whether your gene survival the FDR correction. After all (sc)RNA-seq is an assay to probe all genes and not just a single one and as such measurements are not independent and it is probably not appropriate to just t-test single genes. Check Scanby if you're in python and do not implement testing yourself, too many caveats.