Hi,
I'm still a bit new to computational analysis of single-cell data, but I'm doing my best to understand why things are done the way they are.
As I understand it, when people cluster their data, they typically start with some feature selection, usually taking the most variable genes across the entire dataset, or iteratively subsetting the data and repeating the selection within each subset. The expression of these variable genes is then fit to a negative binomial distribution to estimate scaled expression values, which are then fed into dimensionality reduction and/or clustering algorithms.
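For concreteness, the feature-selection step described above could be sketched roughly as follows. This is only a minimal illustration on simulated Poisson counts, assuming a plain ranking by variance of log-transformed counts; the matrix `counts`, the cutoff `n_top`, and the ranking criterion are all made up for the example, and real pipelines use more refined dispersion estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical count matrix: 100 cells (rows) x 200 genes (columns).
# Each gene gets its own mean expression level, drawn at random.
gene_means = rng.uniform(0.5, 20, size=200)
counts = rng.poisson(lam=gene_means, size=(100, 200))

# Log-transform so that highly expressed genes do not dominate the variance.
log_counts = np.log1p(counts)

# Rank genes by variance across cells and keep the most variable ones.
gene_var = log_counts.var(axis=0)
n_top = 20  # arbitrary cutoff for this sketch
top_genes = np.argsort(gene_var)[::-1][:n_top]

# Only these genes would go on to scaling, dimensionality reduction, clustering.
selected = log_counts[:, top_genes]
print(selected.shape)
```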
I'm having a difficult time trying to understand what the purpose of fitting to a negative binomial distribution is. Is it that this takes into account relative abundances better? Please tell me if I'm on the right track or way off:
Say gene A is expressed at low levels in most cells: it's at 1 or 2 copies in some cells, but relatively high at 5 copies in a few cells. Gene B is expressed at much higher levels overall: several cells express it at 10-20 copies, while other cells express it relatively highly at 50 copies per cell.
So does fitting to a negative binomial distribution, in essence, take into account the nature of each gene's expression to provide a normalized, scaled, and centered value of 2 or 3 for both of these genes, despite the differences in overall expression? And is it fit to a negative binomial distribution because gene expression follows this distribution? I've heard this but don't know which paper showed it.
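To make the gene A / gene B example concrete, here is a small sketch of per-gene centering and scaling. It assumes plain z-scoring (subtract the gene's mean, divide by its standard deviation) rather than an actual negative binomial fit, and the count vectors are invented to match the numbers above.

```python
import numpy as np

# Made-up counts across 10 cells, following the example above.
gene_a = np.array([1, 1, 2, 1, 2, 1, 1, 2, 5, 5], dtype=float)
gene_b = np.array([10, 15, 20, 12, 18, 10, 15, 20, 50, 50], dtype=float)

def zscore(x):
    """Center a gene at 0 and scale it to unit standard deviation."""
    return (x - x.mean()) / x.std()

z_a = zscore(gene_a)
z_b = zscore(gene_b)

# The last two cells are the "relatively high" ones for both genes.
print("gene A, high cells:", z_a[-2:])
print("gene B, high cells:", z_b[-2:])
```

With these particular numbers, the high cells of both genes end up at a similar scaled value (a bit under 2), even though the raw counts differ tenfold. A model-based approach would replace the simple mean and standard deviation with quantities estimated from the fitted distribution.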
I'd appreciate any explanations or links that might clarify this more.
Thanks,
Eric
What is the mechanistic explanation for why the negative binomial is physically plausible? For binomial processes (negative or not), we have the notion of (a) a success rate, (b) the number of trials, and (c) the number of successes. In RNA-seq, what are the trials and successes? I'm just wondering: when the negative binomial gives us a probability distribution over the "number of trials given a number of successes," what are those quantities biologically? E.g., is a gene read a "success"?
I ask because the math here is easy to follow, but why this model makes sense seems totally arbitrary and is lost on me lol
I would ask a statistician about this.
I think I have a better grasp of why the negative binomial is used for modeling true counts, after a few nights' sleep.
Assume that scRNA-seq reads arise from a Bernoulli process with success probability p. Specifically, each transcript in a cell is a trial, and each read is a success of that trial, occurring with probability p. Then the number of reads k is binomially distributed: k ~ Bin(n, p), where n is the number of transcripts. We're interested in the inverse, i.e., the distribution of n given k and p. This is precisely the negative binomial distribution: n ~ NegBin(k, p).
I think this is the correct way to motivate the negative binomial. If anyone can correct me, please do!!
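One way to sanity-check the "trials and successes" picture is a quick simulation. The sketch below uses the textbook definition of the negative binomial (keep running Bernoulli trials until the k-th success and record how many trials that took), which is not quite the same as conditioning on an observed k, so treat it as an illustration of the distribution rather than a validated model of sequencing. The values of `k` and `p` are arbitrary, and note that NumPy's `negative_binomial` counts failures before the k-th success, so `k` is added back to get trials.

```python
import numpy as np

rng = np.random.default_rng(1)

k = 5      # number of successes ("reads") to wait for; arbitrary
p = 0.2    # per-trial success probability; arbitrary
n_sims = 5000

def trials_until_k_successes(k, p, rng):
    """Run Bernoulli(p) trials until k successes occur; return the trial count."""
    trials = 0
    successes = 0
    while successes < k:
        trials += 1
        if rng.random() < p:
            successes += 1
    return trials

simulated = np.array([trials_until_k_successes(k, p, rng) for _ in range(n_sims)])

# NumPy returns the number of failures before the k-th success,
# so add k to convert to the total number of trials.
from_numpy = rng.negative_binomial(k, p, size=n_sims) + k

print("simulated mean trials:", simulated.mean())
print("numpy mean trials:    ", from_numpy.mean())
print("theoretical mean k/p: ", k / p)
```

The two empirical means should both land near k/p, which at least confirms that "number of trials needed for k successes" is what the distribution describes.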
My only additional question, then, is how do we know what p is? Is it inferred from gene length and sequencing depth? I'm not sure.