N.B., while I was writing this, you replied with a link to a paper doing something completely different, so this isn't likely relevant.
As seidel said, context is important.
In these studies, the goal is prenatal diagnosis of an autosomal recessive disorder. In short, a certain amount of fetal DNA is commonly found in the circulatory system of the mother (the amount changes over the duration of the pregnancy). Since the mother, in this case, is a carrier for the disorder, we need to somehow estimate the probability that a possible overabundance of a mutant allele observed in her blood is significant, since the overabundance would come from a homozygous mutant child.
So, firstly, these are not RPKMs, but raw counts (at least in the ones I've seen, namely here). r, then, is the ratio of fetal to maternal DNA that we get from the blood draw. We measure some total number of reads informatively covering a site, Nt, comprised of mutant (Nm) and wild-type (Nw) reads. If the fetus is homozygous mutant, then we'll have Nm-Nw=r*Nt, since the entire fetal contribution will be in Nm. If, on the other hand, the fetus is heterozygous, then reads originating from it will be (more or less) evenly split between the two counts, and Nm-Nw=0. We can measure Nm-Nw, but in order to get a statistic, we need to make that a Z-score by dividing by its standard deviation.
So how, then, might we arrive at a standard deviation? Recall that we have raw counts (not RPKMs), which means that the technical variance is the same as the mean (we have Poisson variance). The standard deviation is the square root of the variance, so:
sd(Nm) = sqrt(E(Nm)) = sqrt(0.5*Nt+0.5*r*Nt) in the homozygous case and sqrt(0.5*Nt) in the heterozygous case (E(Nm) is the expected value, and I'll leave demonstrating that (Nt+r*Nt)/2 is the expected value as an exercise for you).
sd(Nw) = sqrt(E(Nw)) = sqrt(0.5*Nt-0.5*r*Nt) in the homozygous case and also sqrt(0.5*Nt) in the heterozygous case.
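As a quick check on the expected values, here is a minimal Python sketch. It assumes only what is stated above: Nm+Nw=Nt, Nm-Nw=r*Nt for a homozygous mutant fetus, and Nm-Nw=0 for a heterozygous one. The function names and the numbers (Nt=1000, r=0.1) are made up for illustration and are not from any of the papers.

```python
import math

def expected_counts(nt, r, fetus_homozygous_mutant):
    """Expected (Nm, Nw) at a site for a carrier mother.

    nt: total informative reads (Nt); r: fetal contribution such that
    Nm - Nw = r * Nt when the fetus is homozygous mutant (assumption
    taken from the text above).
    """
    # Expected difference between mutant and wild-type read counts.
    diff = r * nt if fetus_homozygous_mutant else 0.0
    # Solve Nm + Nw = Nt and Nm - Nw = diff for the two counts.
    e_nm = 0.5 * (nt + diff)
    e_nw = 0.5 * (nt - diff)
    return e_nm, e_nw

def poisson_sd(expected):
    """Standard deviation of a Poisson count (variance equals the mean)."""
    return math.sqrt(expected)

# Hypothetical numbers, purely for illustration.
nt, r = 1000, 0.1
for homozygous in (True, False):
    e_nm, e_nw = expected_counts(nt, r, homozygous)
    print(homozygous, e_nm, e_nw, poisson_sd(e_nm), poisson_sd(e_nw))
```

In both cases the two variances (the squared standard deviations) add up to Nt, regardless of r.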
We then want the pooled deviation of the difference, which we can get by simply taking the square root of the pooled variance. The pooled variance ends up just being Nt, since you just take the sum of the variances (i.e., the sum of the squared standard deviations). Thus, the denominator is sqrt(Nt).
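Putting the pieces together, the statistic is (Nm-Nw)/sqrt(Nt). The sketch below computes it from raw counts; the function names and example counts are hypothetical, and converting the Z-score to a one-sided p-value with a normal approximation is my assumption rather than something taken from the papers.

```python
import math

def z_score(nm, nw):
    """(Nm - Nw) / sqrt(Nt), with Nt = Nm + Nw, for raw read counts."""
    nt = nm + nw
    if nt <= 0:
        raise ValueError("need at least one informative read")
    return (nm - nw) / math.sqrt(nt)

def one_sided_p(z):
    """Upper-tail probability of a standard normal (assumed approximation)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical counts: 550 mutant and 450 wild-type reads at one site.
z = z_score(550, 450)
print(z, one_sided_p(z))
```

Under these definitions the expected Z-score for a homozygous mutant fetus is r*Nt/sqrt(Nt) = r*sqrt(Nt), so deeper coverage or a larger fetal contribution makes the two genotypes easier to tell apart.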
Update: As mentioned at the top, it seems that you're talking about different papers (thanks for the links) where they actually do use RPKMs. The logic is generally the same as that presented above and it should be noted that the denominator should be sqrt(RPKM_A + r*RPKM_B), since that's vaguely similar to sqrt(Nt).
This is, however, a terrible way of doing things and should never be done. They were doing pair-wise comparisons, but one should really take a couple of different dishes of ES cells at different passages and then use them as replicates to actually gauge what the variance is.
Update2: By "Poisson distribution", they mean that the RPKM measurements themselves have Poisson variance. This is simply because the technical variance of the original counts is Poisson and an RPKM is just a transformation of the original count.
maybe it's a typo?
So, probably "+" is correct?
I don't know, but in a few similar situations, I have sent emails to authors only to get a response about typos and writing errors. As you have said, the square root of a negative number would be imaginary; would that work?
Using RPKM/FPKM confounds rather than aids analysis of the biological 'truth', and I do not think that any further scaling can alleviate this. See also "Does FPKM scale incorrectly in case of unequal mapping rates?"