Question

Error on Linear regression bulkRNAseq


4.0 years ago

camillab. ▴ 160

Hi!

I am trying to run a linear regression on age and sex for a bulk RNA-seq dataset of 6 human samples (3 untreated and 3 treated) covering 21011 genes. I am trying to reproduce this biostar post, and I wanted to obtain the residuals to see how age and sex impact gene expression, but I am getting a few errors. My dataset originally looks like this (the values are RPKMs; I know I should use the raw counts, but I do not have them):

# A tibble: 6 x 11
  `Column ID`  Description  `Associated Gen~ `ct1`
  <chr>        <chr>        <chr>                   <dbl>
1 ENSG0000027~ 7SK RNA [So~ 7SK                     0.702
2 ENSG0000020~ 7SK RNA [So~ 7SK (i)             87779.   
3 ENSG0000027~ 7SK RNA [So~ 7SK (ii)                0.1  
4 ENSG0000012~ alpha-1-B g~ A1BG                    7.58 
5 ENSG0000026~ A1BG antise~ A1BG-AS1                1.22 
6 ENSG0000017~ alpha-2-mac~ A2M                     4.07 
# ... with 7 more variables: ctr2 <dbl>,
#   ct3 <dbl>, t1 <dbl>,
#   t2 <dbl>, t3 <dbl>,
#   log2FC <dbl>, p_value <dbl>

Looking at the post, I realized I have to transpose and slightly modify the table, so I did:

myData <- alain[-c(10, 11)]  # remove undesired columns (log2FC, p_value)
alain2 <- t(myData)  # transpose
alain2 <- as.data.frame(alain2)  # convert into a data frame
alain2 <- alain2[-c(1, 2), ]  # remove undesired rows
colnames(alain2) <- as.character(unlist(alain2[1, ]))  # use the first row as the header
alain2 <- alain2[-1, ]  # remove the first row, which is now the header

# add sample info (treatment, sex, age)
vec <- c("untreated", "untreated", "untreated", "treatment", "treatment", "treatment")  # treatment group per sample
alain2$treatment <- vec  # add new column

vec1 <- c("M", "F", "M", "F", "F", "F")  # sex per sample
alain2$sex <- vec1  # add new column

vec2 <- c(45, 56, 46, 65, 21, 75)  # age per sample
alain2$age <- vec2  # add new column

#put treatment/sex/age at the beginning of the table
library(dplyr)
prova <- alain2 %>%
  select("treatment","sex", "age", everything())

Then I wanted to first check whether these factors (age and sex) have a statistically significant association with the treatment by testing each independently:

prova$treatment <- factor(prova$treatment, levels=c("treatment","untreated"))
prova$sex <- factor(prova$treatment, levels=c("F","M"))
prova$age <- as.numeric(prova$age)

but when I do:

summary(lm(treatment ~ age, data=prova))
summary(lm(treatment ~ sex, data=prova))
summary(lm(treatment ~ age + sex, data=prova))

I get this error message for each one:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion

So I thought that, since my dataset is unfiltered, the presence of genes with a value of 0.1 might interfere with the results. I went back to my original dataset, removed all rows that have 0.1 (alain2 <- filter_if(alain, is.numeric, all_vars((.) != 0.1))), and re-ran everything, but I still get an error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored

I don't understand what is wrong with my dataset. Also, I tried to calculate the linear regression for one specific gene (e.g. A1BG), and if I do:

summary(lm(A1BG ~ age, data=prova))

it works fine but if I do:

summary(lm(A1BG ~ sex, data=prova))
summary(lm(A1BG ~ age + sex, data=prova))

I get the error below, which suggests the problem is in the "sex" column, where I have only F or M (for female/male). It works fine if I do summary(lm(A1BG ~ treatment, data=prova)), so I don't think the problem is that the column holds characters:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

So if I check the levels I get this, but I do not understand why, instead of 2 levels for F and M, I have only NA:

'data.frame':   6 obs. of  16756 variables:
 $ treatment             : Factor w/ 2 levels "treatment","untreated": 2 2 2 1 1 1
 $ sex                   : Factor w/ 2 levels "F","M": NA NA NA NA NA NA
 $ age                   : num  45 56 46 65 21 75

Thanks for all the help!

Camilla

R linear regression residue • 1.7k views


score 3 · Accepted Answer · 2021-03-04


4.0 years ago

rpolicastro 13k

prova$sex <- factor(prova$treatment, levels=c("F","M"))

should be

prova$sex <- factor(prova$sex, levels=c("F","M"))
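To see why the original line produces only NA: factor() turns every value that is not listed in levels into NA, and the treatment labels ("treatment", "untreated") are not among c("F", "M"). With every value NA, no observed level is left once lm() drops missing rows and unused levels, which is consistent with the "2 or more levels" error. A minimal sketch, using the sample vectors from the question (the names treatment_chr, sex_chr, wrong and right are only for illustration):

```r
# Sample annotations as given in the question.
treatment_chr <- c("untreated", "untreated", "untreated",
                   "treatment", "treatment", "treatment")
sex_chr <- c("M", "F", "M", "F", "F", "F")

# Bug: the input values are treatment labels, none of which is "F" or "M",
# so every element is mapped to NA.
wrong <- factor(treatment_chr, levels = c("F", "M"))
print(wrong)

# Fix: build the factor from the sex values themselves.
right <- factor(sex_chr, levels = c("F", "M"))
print(table(right))
```

Running str(prova) after the corrected line should show the sex column with integer codes rather than NA.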

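A related thing that may be worth checking, given the "NAs introduced by coercion" warning: t() on a data frame that still contains character columns returns a character matrix, so the gene columns may end up as character (or factor, depending on the R version) rather than numeric. The sketch below uses a made-up two-gene table (expr, m and wide are hypothetical names, and the values are only for illustration) to show the effect and one way to convert the columns back:

```r
# Toy stand-in for the expression table; values are illustrative only.
expr <- data.frame(
  gene = c("A1BG", "A2M"),
  ct1  = c(7.58, 4.07),
  t1   = c(6.10, 5.20),
  stringsAsFactors = FALSE
)

# Transposing a data frame with a character column gives a character matrix.
m <- t(expr)
print(is.character(m))

# Drop the gene row, use it as column names, then convert each column to numeric.
wide <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
colnames(wide) <- unname(m[1, ])
wide[] <- lapply(wide, function(x) as.numeric(as.character(x)))
str(wide)
```

The as.character() step is there so the conversion also behaves if the columns were created as factors; converting a factor directly with as.numeric() would return the internal level codes instead of the values.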