The question is simple. Can I use a Negative Binomial regression to model Transcripts Per Million (TPM) as the outcome/response variable? It is usually used on raw count data, but I am interested in TPM, as I need to normalize for gene length (but not interested in DE analysis).
After log transformation, TPMs should follow a normal distribution fairly well, so you could just use standard linear regression. I don't think negative binomial would be a good choice, as it is best suited to count data. Maybe a GLM with a gamma distribution would be more appropriate. Alternatively, I wonder if you could/should use negative binomial on the raw counts, with gene length and library size as covariates. The advantage of this last option is that it correctly accounts for the fact that lowly expressed genes are measured less reliably, due to the lower number of reads mapping to them.
Hi dariober, thanks for the answer. I was thinking about this alternative method you mentioned, but using gene length as an offset (not as a covariate). I have read of other people using this approach: Finding the best model for expression data (RNA-Seq counts)
"Another alternative would be to use negative binomial regression, but including the gene length as an "offset". An "offset" is an explanatory variable whose coefficient is forced to be 1, and is generally used as a normaliser. In Poisson-like models (including NB), it acts like a divisor to normalise for gene length (this is how both DESeq2 and edgeR deal with library depth in their statistical models)."
However, they do not talk about adding library size as well. What do you think I should do? Is gene length alone ok?
You might want to add some sort of variance-stabilising transform, but a straight log TPM is unlikely to show constant variance at low read counts. But negative binomial is definitely not suitable.
Hi i.sudbery, thanks. Do you mean applying VST normalization to the raw count data (not TPM) and then using the NB regression?
VST on raw counts wouldn't account for gene length, so it wouldn't be useful if you needed to compare between genes. I was thinking of some sort of variance stabilisation on the TPMs. Log itself stabilises the variance to an extent; so does the regularized log, that is, log(x + a).

I thought you should not use VST on the TPMs. So my two alternatives would be:
1) NB with raw counts + offset of gene length
2) NB? with VST on TPM counts
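The regularized log mentioned above, log(x + a), can be sketched directly; the pseudocount a is a modelling choice (a = 1 is common but arbitrary), and it keeps zero counts finite while compressing the noisy low-count range (toy numbers below):

```python
import math
import statistics

a = 1.0  # pseudocount; a common but arbitrary choice

# A noisy, lowly expressed gene across samples (illustrative counts).
low_counts = [0, 1, 2, 1, 0, 3]

# Plain log would fail on the zeros; log(x + a) stays finite.
reg_log = [math.log(x + a) for x in low_counts]

print(reg_log)
print(statistics.variance(reg_log))
```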
The VST transformation, as implemented in DESeq(2), is just one example of a variance stabilizing transform. A variance stabilizing transform is any transform that stabilises variance across values of the mean.
The VST transform in DESeq when fitType="parametric" assumes an NB distribution, but I don't think that's true if fitType="local".
The correct model, though, is probably NB with two offsets: one for the normalization factor, and one for gene length.
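Because offsets enter the linear predictor additively on the log scale, the two offsets suggested here collapse into a single combined offset, log(size factor) + log(gene length). A toy numeric check (all values illustrative):

```python
import math

# Illustrative values: per-sample size factor and per-gene length.
size_factor = 1.3    # library-size normalization factor for sample j
gene_length = 1500.0
eta = 0.8            # linear predictor X @ beta

# Two offsets simply add on the log scale, so
#   mu = exp(eta + log(s) + log(L)) = s * L * exp(eta)
offset = math.log(size_factor) + math.log(gene_length)
mu = math.exp(eta + offset)

print(mu)
```

So the fitted rate exp(eta) is normalised for both library depth and gene length at once.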
Ok, I see, thanks. One last thing about this normalization factor (which accounts for library size): I think it would not make much sense in my model, as what I am doing is:
NB regression for each sample -->
So I obtain a Beta coefficient for each individual, which is what I need. If I add a second offset as you mention, it will be the same value for all transcripts (as we are inside the same individual, the normalization factor is the same). I see the point of the first, gene-length offset, but not of the second one.
And about the VST, I am not sure whether using VST on the raw counts would bias the NB regression.
Yes. I would not use VST if you were going to use NB regression, but rather if you were going to do linear regression on log TPMs.
Ok! Sorry, and what about my first question, on the second offset?
Yeah, not necessary if you are only going to use one library.
Fine! Thank you very much for your help, I really appreciate it :D