Hi all,
I am digging into some statistical analysis with SingleCellSignalR, a tool that predicts cellular interactions from single-cell RNA-seq data. I was wondering which statistical methods are reasonable for a two-sample test and for simple linear regression when the variable of interest is the tool's output, the regularized product. The question arises from this tool, but I think the general principle should apply to this kind of data. So if you have experience with similar data, please have a look and leave a comment.
Let me introduce SingleCellSignalR briefly. It is intended to predict ligand-receptor interactions between two cell types based on scRNA-seq profiles. The regularized product, also called the LRscore, is defined as (l * r)^0.5 / (mu + (l * r)^0.5) in the paper "SingleCellSignalR: inference of intercellular networks from single-cell transcriptomics". Here l and r are the expression of the ligand in cell type x and of the receptor in cell type y in the single-cell RNA-seq profiles, and mu is the mean of the expression matrix, i.e. the sum of all entries divided by (m rows * n columns). By construction, the LRscore ranges from 0 to 1. To control false positives, the paper sets a threshold of 0.5 on the LRscore.
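To make the definition concrete, here is a minimal Python sketch of the formula as I read it from the paper. The function names and the toy matrix are my own, not part of SingleCellSignalR, and I assume mu is positive and expression values are non-negative.

```python
import math


def matrix_mean(matrix):
    """Mean of all entries: sum of all items divided by (m rows * n columns)."""
    n_entries = sum(len(row) for row in matrix)
    return sum(sum(row) for row in matrix) / n_entries


def lr_score(l, r, mu):
    """Regularized product: (l * r)^0.5 / (mu + (l * r)^0.5).

    Assumes l >= 0, r >= 0 and mu > 0, so the result lies in [0, 1).
    """
    s = math.sqrt(l * r)
    return s / (mu + s)


# Hypothetical toy expression matrix (rows = genes, columns = cells or clusters).
toy_matrix = [[0.0, 2.0], [4.0, 2.0]]
mu = matrix_mean(toy_matrix)

# Ligand expression 4 in cell type x, receptor expression 1 in cell type y.
score = lr_score(4.0, 1.0, mu)
print(mu, score)
```

Note that the score equals 0.5 exactly when (l * r)^0.5 equals mu, so the 0.5 threshold amounts to asking whether the geometric mean of l and r exceeds the overall mean expression.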
Here is the question. Suppose I perform scRNA-seq on tumour and adjacent control samples from 8 patients. To test whether a ligand-receptor interaction L_R between cell types x and y differs between the tumour and control groups, what statistical approach would be reasonable? One could apply a Mann-Whitney test or another non-parametric test to the LRscore, but this may produce too many false positives. For example, if the expression of l and r were below 0.5, the L_R interaction could still be an error even if the test deemed it significant. Alternatively, one could set LRscore values below a threshold, say 0.5, to a small constant (a cut-off) and then run the two-sample test. In that case, what would be the appropriate method? As far as I know, the Mann-Whitney test is not suitable for bimodally distributed data. Or is there another strategy I could try?
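To show what I mean by the thresholding idea, here is a rough Python sketch. It floors LRscores below 0.5 and then compares the two groups with an exact permutation test on the difference in means. The permutation test is only a stand-in I picked for illustration, not something from the paper, and this version treats the groups as independent, so it ignores the tumour/control pairing within each patient. All values are made up.

```python
from itertools import combinations


def floor_scores(scores, threshold=0.5, floor=0.0):
    """Replace every LRscore below the threshold with a small constant."""
    return [s if s >= threshold else floor for s in scores]


def permutation_pvalue(a, b):
    """Exact two-sided permutation test on the difference in group means.

    Enumerates every way of splitting the pooled values into groups of
    size len(a) and len(b); feasible for small samples such as 8 vs 8.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so splits that tie with the observed one are counted.
        if diff >= observed - 1e-12:
            extreme += 1
        n_splits += 1
    return extreme / n_splits


# Hypothetical LRscores for one L_R pair, one value per patient.
tumour = [0.62, 0.71, 0.48, 0.66, 0.59, 0.73, 0.41, 0.68]
control = [0.35, 0.52, 0.44, 0.39, 0.55, 0.31, 0.47, 0.42]

p_raw = permutation_pvalue(tumour, control)
p_floored = permutation_pvalue(floor_scores(tumour), floor_scores(control))
print(p_raw, p_floored)
```

After flooring, many values collapse to the same constant, which creates heavy ties and the bimodal shape I am worried about.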
Another scenario concerns ordinary regression and classification. If one were to regress a response on the LRscore, or classify observations according to the LRscores of different interactions, which form of the LRscore should be used: the raw score or the thresholded version described above?
Any help would be appreciated.
I would recommend posting this on the Bioconductor Support site or on Cross Validated (the statistics Stack Exchange site).
It is a very specific question on the statistical interpretation and validity of the model, rather than the bioinformatics principles.
Thanks for the suggestion.