The question is simple. Can I use a Negative Binomial regression to model Transcripts Per Million (TPM) as the outcome/response variable? It is usually used on raw count data, but I am interested in TPM, as I need to normalize for gene length (but not interested in DE analysis).
After log transformation, TPMs should follow a normal distribution fairly well, so you could just use standard linear regression. I don't think negative binomial would be a good choice, as it is best suited to count data. Maybe a GLM with a gamma distribution would be more appropriate. Alternatively, I wonder if you could/should use negative binomial on the raw counts, with gene length and library size as covariates. The advantage of this last option is that it correctly accounts for the fact that lowly expressed genes are measured less reliably, due to the lower number of reads mapping to them.
Hi dariober, thanks for the answer. I was thinking about this alternative method you mentioned, but using gene length as an offset (not as a covariate). I have read of other people using this approach: Finding the best model for expression data (RNA-Seq counts)
"Another alternative would be to use negative binomial regression, but including the gene length as an "offset". An "offset" is an explanatory variable whose coefficient is forced to be 1, and is generally used as a normaliser. In Poisson-like models (including NB), it acts like a divisor to normalise for gene length (this is how both DESeq2 and edgeR deal with library depth in their statistical models)."
However, they do not talk about adding library size as well. What do you think I should do? Is gene length alone ok?
You might want to add some sort of variance-stabilising transform, but a straight log TPM is unlikely to show constant variance at low read counts. But negative binomial is definitely not suitable.
Hi i.sudbery, thanks. Do you mean applying VST normalization to the raw count data (not TPM) and then using the NB regression?
VST on raw counts wouldn't account for gene length, so it wouldn't be useful if you needed to compare between genes. I was thinking of some sort of variance stabilisation on the TPMs. Log itself stabilises the variance to an extent; so does the regularized log, that is, log(x + a).

I thought you should not use VST on the TPMs. So my two alternatives would be:
1) NB with raw counts + offset of gene length
2) NB? with VST on TPM counts
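The regularized log mentioned above, log(x + a), can be sketched directly; the pseudocount a is a modelling choice (a = 1 is common but arbitrary), and it keeps zero counts finite while compressing the noisy low-count range (toy numbers below):

```python
import math
import statistics

a = 1.0  # pseudocount; a common but arbitrary choice

# A noisy, lowly expressed gene across samples (illustrative counts).
low_counts = [0, 1, 2, 1, 0, 3]

# Plain log would fail on the zeros; log(x + a) stays finite.
reg_log = [math.log(x + a) for x in low_counts]

print(reg_log)
print(statistics.variance(reg_log))
```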
The VST transformation, as implemented in DESeq(2), is just one example of a variance stabilizing transform. A variance stabilizing transform is any transform that stabilises variance across values of the mean.
The VST transform in DESeq when fitType="parametric" assumes an NB distribution, but I don't think that's true if fitType="local".
The correct model, though, is probably NB with two offsets: one for the normalization factor, and one for gene length.
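Because offsets enter the linear predictor additively on the log scale, the two offsets suggested here collapse into a single combined offset, log(size factor) + log(gene length). A toy numeric check (all values illustrative):

```python
import math

# Illustrative values: per-sample size factor and per-gene length.
size_factor = 1.3    # library-size normalization factor for sample j
gene_length = 1500.0
eta = 0.8            # linear predictor X @ beta

# Two offsets simply add on the log scale, so
#   mu = exp(eta + log(s) + log(L)) = s * L * exp(eta)
offset = math.log(size_factor) + math.log(gene_length)
mu = math.exp(eta + offset)

print(mu)
```

So the fitted rate exp(eta) is normalised for both library depth and gene length at once.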
Ok, I see, thanks. One last thing about this normalization factor (which accounts for library size): I think it would not make much sense in my model, as what I am doing is:
NB regression for each sample -->
So I obtain a Beta coefficient for each individual, which is what I need. If I add a second offset as you mention, it will be the same value for all transcripts (as we are inside the same individual, the normalization factor is the same). I see the point of the first, gene-length offset, but not of the second one.
And about the VST, I am not sure whether using VST on the raw counts would bias the NB regression.
Yes. I would not use VST if you were going to use NB regression, but rather if you were going to do linear regression on log TPMs.
Ok! Sorry, and what about my first question, on the second offset?
Yeah, not necessary if you are only going to use one library.
Fine! Thank you very much for your help, I really appreciate it :D