Hi,
I am analyzing RNA-seq data (counts) and trying to study the impact of conservation on gene expression. I have a data frame with one row per gene: one column, "data$counts", holds gene expression (a vector of length 10,000, one value per gene) and another column, "data$var1", holds gene conservation (a vector of the same length, 10,000). I would like to quantify the variance in gene expression explained by var1 using regression models, but I can't tell which model fits the data better.
I found many posts here suggesting using untransformed count data (not FPKM) with a negative binomial model (Model 1).
I also saw suggestions to transform the data to make it normal and run a linear regression, so I tried that too (Models 2 and 3). However, when I plot the histograms, none of the distributions look normal:
hist(data$counts) # very skewed
hist(log(data$counts)) # very skewed
hist(log(data$counts+1)) # creates two curves/distributions
Since I can't tell by visual inspection (nothing seems to fit a normal distribution), is there a way to tell which model fits the data better, e.g. using AIC or likelihoods from the regressions fitted below, or in any other way?
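To be concrete, this is the kind of comparison I had in mind, just a sketch with the fits refitted as named placeholder objects:

library(MASS)   # for glm.nb

# refit the models as named objects so the fit statistics can be extracted
model1 <- glm.nb(counts ~ var1, data = data)                           # negative binomial on raw counts
model2 <- glm(log(counts + 1) ~ var1, data = data, family = gaussian)  # gaussian on log-transformed counts

AIC(model1);    AIC(model2)      # the AIC values reported in the summaries below
logLik(model1); logLik(model2)   # log-likelihoods, if those are more appropriate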
Model 1:
summary(glm.nb(data$counts ~ data$var1))
Call:
glm.nb(formula = data$counts ~ data$var1, init.theta = 0.6950337825,
    link = log)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1858  -1.0122  -0.5362  -0.0004  13.6390

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.90055    0.01359 507.839   <2e-16 ***
data$var1    0.01474    0.01359   1.085    0.278
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.695) family taken to be 1)

    Null deviance: 9431.6  on 7797  degrees of freedom
Residual deviance: 9430.4  on 7796  degrees of freedom
AIC: 122463

Number of Fisher Scoring iterations: 1

              Theta:  0.69503
          Std. Err.:  0.00959

 2 x log-likelihood:  -122457.34300
Model 2:
Call:
glm(formula = log(data$counts + 1) ~ data$var1, family = gaussian)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-6.3919  -0.6436   0.1400   0.8445   5.6291

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.04901    0.01604  377.12   <2e-16 ***
data$var1   -0.21405    0.01604  -13.34   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 2.006264)

    Null deviance: 15998  on 7797  degrees of freedom
Residual deviance: 15641  on 7796  degrees of freedom
AIC: 27563

Number of Fisher Scoring iterations: 2
Model 3:
Call:
glm.nb(formula = data$counts ~ data$var1, init.theta = 0.7290633088,
    link = log)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.3386  -1.0042  -0.5105   0.0336  12.7096

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.85552    0.01327  516.70   <2e-16 ***
data$var1   -0.29469    0.01327  -22.21   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.7291) family taken to be 1)

    Null deviance: 9886.8  on 7797  degrees of freedom
Residual deviance: 9374.1  on 7796  degrees of freedom
AIC: 121964

Number of Fisher Scoring iterations: 1

              Theta:  0.7291
          Std. Err.:  0.0101

 2 x log-likelihood:  -121957.8830
Thank you i.sudbery! That is very clear. Yes, I was actually including gene length as a covariate to control for that, i.e. expression ~ conservation + gene_length. I don't know how that would be different, but I will read the paper you linked.
I thought I needed samples to run DESeq2 or edgeR (I don't have any sample-level data, just gene-level variables, so I would not have any "coldata: a table with information about the samples"). Can I use DESeq2 without a case/control set-up? Maybe you are suggesting to just use the normalization step in this package.
From your answer, it sounds like I should first transform the count data using DESeq2 (the histogram of vsd still looks bimodal), then get a TPM from this (how? just dividing by gene length? Would it be something like: TPM = 10^6 * (data$counts / data$gene_length) * sum(data$counts / data$gene_length) ?), then log transform. I am reading the DESeq2 manual now in case I am misunderstanding this completely.
I would either fit a negative binomial, using gene length as an offset (not a covariate; you need to force the coefficient to 1). Unfortunately, getting the gene length is not entirely trivial, as you need to know which transcripts are expressed. If you can run the original data through salmon, that would give you an effective length.
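As a rough sketch (assuming your data frame has a gene_length column holding the effective length):

library(MASS)
# offset(log(gene_length)) forces the length coefficient to be exactly 1
fit_nb <- glm.nb(counts ~ var1 + offset(log(gene_length)), data = data)
summary(fit_nb)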
Or use the VST/rlog transform from DESeq2. You need samples to do a differential expression analysis, but I don't think you need samples to apply VST/rlog, which can be applied to a raw count matrix without creating a DESeqDataSet. Once you've done the transformation, you can then convert to TPM.
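In code it would look roughly like this (a sketch; count_mat stands for a hypothetical genes-by-samples matrix of raw integer counts):

library(DESeq2)
vsd <- vst(count_mat)      # variance-stabilising transformation applied straight to the matrix
# rld <- rlog(count_mat)   # rlog alternative; slower, but on a similar log2-like scale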
To convert to TPM, first divide each gene's count by its gene length to get an FPK. Then sum all the FPKs, divide each FPK by that sum, and multiply by 1,000,000.
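In R, something like this (a sketch; it assumes data$gene_length is in kilobases, so divide by 1000 first if your lengths are in base pairs):

fpk <- data$counts / data$gene_length   # fragments per kilobase
tpm <- fpk / sum(fpk) * 1e6             # rescale so the library sums to one million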
Ok, that makes sense! If I plot the log(FPKM), it finally looks normal! So the final answer is to use this in a linear regression model. I do think I need to use log(FPKM + 1) for the log to work. I just have one more question. I imagine the extra step of "dividing each FPK by the sum of FPKs and multiplying by 1,000,000" is to normalize for sequencing depth. What is the logic behind using 1,000,000? I am asking because I have also seen other values used instead of 1 million (granted, this was on ATAC-seq data, not expression data), for example: normalization = log2( counts/sum(counts) x 100,000,000 + 1). I am just wondering what that multiplication is there for, and whether it is really necessary in this scenario. Thank you!
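For the record, this is the model I am planning to fit, as a sketch (assuming data also has the gene_length column, with TPM computed as you described):

# TPM as described above: counts per kilobase, rescaled to sum to one million
data$tpm <- with(data, {fpk <- counts / gene_length; fpk / sum(fpk) * 1e6})
fit_lm <- lm(log(tpm + 1) ~ var1, data = data)
summary(fit_lm)$r.squared   # proportion of variance in log-expression explained by conservation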
Depending on which normalisation you've used (VST or rlog), I was under the impression that there should be no zeros, and thus no need for the +1. In fact, rlog is already adding a pseudocount and taking the log.
1,000,000 is chosen because it is the value that puts the numbers on a "human scale". You could choose another scale factor, but TPM is what people understand.
Yes - I believe the rlog is mostly similar to log2(FPKM+0.1) values.
So, it is more like the log2-transformed values (rather than the linear values, which need a pseudocount added prior to log-transformation). There are some examples of comparisons between rlog and log2(counts/CPM + X) values here: http://cdwscience.blogspot.com/2019/02/variance-stabilization-and-pseudocounts.html
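If you want to see this on your own data, a quick sketch of that comparison (assuming rld comes from rlog() and fpkm is a per-gene FPKM vector for the same library):

plot(rld[, 1], log2(fpkm + 0.1),
     xlab = "rlog", ylab = "log2(FPKM + 0.1)")
abline(0, 1, col = "red")   # if the two are similar, points sit near the diagonal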