Question

Question about generalized linear model fitting

0

Entering edit mode

7.2 years ago

kanwarjag ★ 1.2k

I am analyzing a scRNAseq data trying to follow a protocol suggested in this paper- https://www.nature.com/articles/nature24489 mentioned under Computational analysis The statement reads- " Selection of variable genes was performed by fitting a generalized linear model to the relationship between the squared coefficient of variation and the mean expression level in logarithmic space, and selecting genes that deviated significantly (P < 0.05) from the fitted curve"

I have two naive questions: 1. In general GLM can be done in two data columns. However in my case there are several columns of data corresponding to single cells. How can I use the matrix for GLM. I looked into JMP and past3 documentation.

What is the best GUI based software or an R package that can fit the model for multiple columns data and out put p value for each row (gene).

Apologies, if it looks like a very naive question. However, I tried but could not overcome this bump.

Thanks

RNA-Seq • 3.8k views

ADD COMMENT • link updated 5.7 years ago by pshubhamoy ▴ 20 • written 7.2 years ago by kanwarjag ★ 1.2k

0

Entering edit mode

Hi Kevin, I am using the code you have provided. How many genes can I loop using the code you provided? I have various groups having ~90-180 DE genes. How to save and recall print(summary(model)) outputs like this? Call:

glm(formula = as.formula(formula), family = binomial(link = "logit"), 
    data = rldT_IvCFAsig_I)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2912  -1.1918   0.1993   1.1402   1.2335  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.096      2.987  -0.702    0.483
Cela2a         0.990      1.334   0.742    0.458

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 37.096  on 26  degrees of freedom
Residual deviance: 34.757  on 25  degrees of freedom
AIC: 38.757

Number of Fisher Scoring iterations: 6


Call:
glm(formula = as.formula(formula), family = binomial(link = "logit"), 
    data = rldT_IvCFAsig_I)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.419  -1.158   0.000   1.103   1.376  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -8.425      7.571  -1.113    0.266
Cel            3.188      2.836   1.124    0.261

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 37.096  on 26  degrees of freedom
Residual deviance: 33.243  on 25  degrees of freedom
AIC: 37.243

thanks

ADD REPLY • link updated 5.7 years ago by GenoMax 149k • written 5.7 years ago by pshubhamoy ▴ 20

score 5 · Answer 1 · 2017-12-16

5

Entering edit mode

7.2 years ago

Kevin Blighe 89k

Edit February 5th 2019:

This same functionality (below) can now be performed by my Bioconductor package, RegParallel

-----------------------------

The data that you have to supply to the GLM function in R will have genes as the columns and sample values as rows, with the first column as the factor / categorical variable of interest, i.e., the one that you are modelling your expression to, like this:

modeling

           CaseControlStatus   MYC    EGFR   KIF2C   CDC20
Sample 1   Case               11.39  10.62   9.75   10.34
Sample 2   Case               10.16   8.63   8.68    9.08
Sample 3   Case                9.29  10.24   9.89   10.11
Sample 4   Control            11.53   9.22   9.35    9.13
Sample 5   Control             8.35  10.62  10.25   10.01
Sample 6   Control            11.71  10.43   8.87    9.44
...

You should conduct the modeling independently over each gene and note the model coefficient and P value. Your final list of genes will then be those that pass significance at α=5%. Here is a simple loop to do this:

formula.start <- "CaseControlStatus ~ "
listModels <- list()

for (i in 2:ncol(modeling))
{
    formula <- paste(formula.start, colnames(modeling)[i], sep="")
    model <- glm(as.formula(formula), data=modeling, family=binomial(link="logit"))
    listModels[[i]] <- model
    names(listModels)[i]) <- colnames(modeling)[i]
    print(summary(model))
}

This will loop over each gene (column) in your data-set and perform the regression independently. In this simple example, each model summary will be printed to screen and each model will be stored in list object.

For further information, take a look at my other threads on creating gene signatures and generally performing regression analyses:

ADD COMMENT • link 6.0 years ago by Kevin Blighe 89k

0

Entering edit mode

i followed it from the other thread so its irrepective of the data like it can work for both microarray and rna-seq data ? The GLM which you have used to here ..

ADD REPLY • link 7.0 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

Yes, it can be used for both microarray and RNA-seq data. In both cases, the data should preferably be normally-distributed. Otherwise, you will have to modify the family parameter to glm()

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

0

Entering edit mode

okay so another question so if i use rna seq data can i used the count normalised data set which the rlog normalised or i will have to plot the data and see if its normally distributed first then proceed?

ADD REPLY • link 7.0 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

The normalised count data from RNA-seq will not follow a normal distribution - it follows a negative binomial distribution, which resembles a type of Poisson. Technically you can use these and specify a different family (DESeq2 itself just uses a generalised linear model with a different family for the purposes of deriving test statistics via the Wald test).

In your case, I think that it may be better to use the rlog counts from RNA-seq (which should follow a normal distribution but may have 'humps' here and there), and the log2 values from microarray obviously.

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Error in eval(predvars, data, env) : object 'ARID1ATAF12' not found
In addition: Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

Im running into this error "ARID1ATAF12" this thing corresponds to my first two column containing genes "ARID1A " and " TAF12 " im not sure what is going wrong i have made my data frame as you suggested

ADD REPLY • link 7.0 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

Must be some formatting error... check the contents of your model formula for the for loop indices 2 and 3.

On that note, note that my for loop goes from 2 to ncol(modelling). This is because my first column is the categorical variable that I am predicting (e.g. case/control). This may not be the case for your data.

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

0

Entering edit mode

head(MyData)
        CaseControlStatus   ARID1A     TAF12      PHC2     SCMH1     DMAP1   RAD54L   PRKAA2     PRMT6     VPS72
Sample1                WT 2367.372 150.69990 495.92550 841.92010 230.66300 34.59946 33.83058 180.68610 264.49360
Sample2                WT 2522.512  82.90774 199.33136 635.03797 137.59156 35.27989 58.21181 132.29958 170.22546
Sample3                WT 1507.287 114.27604 134.52749 520.75159 112.82951 67.98701 18.80492  82.45234  95.47112
Sample4                WT 2888.628 116.45255 229.88036 609.48542 182.99686 62.00720 18.14845 238.95458 149.72471
Sample5              TEST 1127.155 222.20356  26.06857 



formula <- "CaseControlStatus ~ "
listModels <- list()

for (i in 2:ncol(MyData))
{
  formula <- paste(formula, colnames(MyData)[i], sep="")
  model <- glm(as.formula(formula), data=MyData, family=binomial(link="logit"))
  listModels[[i]] <- model
  names(listModels)[i] <-colnames(MyData)[i]
print(summary(model))
}

Error in eval(predvars, data, env) : object 'ARID1ATAF12' not found In addition: Warning messages: 1: glm.fit: algorithm did not converge 2: glm.fit: fitted probabilities numerically 0 or 1 occurred

have a look

ADD REPLY • link 7.0 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

Oh, I now see what is happening. You are concatenating more and more text to the formula variable with each loop (most likely my fault as it wasn't specified in the previous code that I pasted).

You could try this:

formula.start <- "CaseControlStatus ~ "
listModels <- list()

for (i in 2:ncol(MyData))
{
  formula <- paste(formula.start, colnames(MyData)[i], sep="")
  model <- glm(as.formula(formula), data=MyData, family=binomial(link="logit"))
  listModels[[i]] <- model
  names(listModels)[i] <-colnames(MyData)[i]
  print(summary(model))
}

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

0

Entering edit mode

@Kevin thanks a lot these are pretty small things but since I don't have that proper grasp of R so i struggle for all these, now its looping over the whole data set ,how do i filter now based on those observed output lets say i get something even if doesn't make any real biological sense which im not sure would i still consider it or any proper way of filtering to go to the next step of analysis

ADD REPLY • link 7.0 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

Hi krushnach, yes, this simple code will just output the model. You will have to decide which values from the model that you want to retain. You should look at the information under the 'Value' header here: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/glm.html

You can also check what an R object contains using the str() function applied to the object. These are then usually access via the use of the dollar ($), e.g., model$coefficients. In this situation, the summary() function already provides useful information that can then be extracted via array subsetting ( [ ... ] ).

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k

0

Entering edit mode

@kevin "CaseControlStatus" the formula you have used in this case what is the response and the predictor variable can you explain ?

ADD REPLY • link 6.9 years ago by 1769mkc ★ 1.2k