Error on Linear regression bulkRNAseq
3.7 years ago
camillab. ▴ 160

Hi!

I am trying to run a linear regression on age and sex for a bulk RNA-seq dataset of 6 human samples (3 untreated and 3 treated) covering 21011 genes. I am trying to reproduce this biostar post and wanted to extract the residuals to see how age and sex impact gene expression, but I am getting a few errors. My dataset originally looks like this (the values are RPKMs; I know I should use the raw counts, but I do not have them):

# A tibble: 6 x 11
  `Column ID`  Description  `Associated Gen~ `ct1`
  <chr>        <chr>        <chr>                   <dbl>
1 ENSG0000027~ 7SK RNA [So~ 7SK                     0.702
2 ENSG0000020~ 7SK RNA [So~ 7SK (i)             87779.   
3 ENSG0000027~ 7SK RNA [So~ 7SK (ii)                0.1  
4 ENSG0000012~ alpha-1-B g~ A1BG                    7.58 
5 ENSG0000026~ A1BG antise~ A1BG-AS1                1.22 
6 ENSG0000017~ alpha-2-mac~ A2M                     4.07 
# ... with 7 more variables: ctr2 <dbl>,
#   ct3 <dbl>, t1 <dbl>,
#   t2 <dbl>, t3 <dbl>,
#   log2FC <dbl>, p_value <dbl>

Looking at the post, I realized I have to transpose and slightly reshape the table, so I did:

myData <- alain[-c(10,11)] # remove undesired columns (log2FC, p_value)
alain2 <- t(myData) # transpose
alain2 <- as.data.frame(alain2) # convert into a data frame
alain2 <- alain2[-c(1, 2),] # remove undesired rows (Column ID, Description)
colnames(alain2) <- as.character(unlist(alain2[1,])) # use the first row (gene names) as header
alain2 <- alain2[-1, ] # remove the first row, now used as header

# add sample info (treatment, sex, age)
vec <- c("untreated", "untreated", "untreated", "treatment", "treatment", "treatment") # treatment group per sample
alain2$treatment <- vec # add new column

vec1 <- c("M", "F", "M", "F", "F", "F") # sex per sample
alain2$sex <- vec1 # add new column

vec2 <- c(45, 56, 46, 65, 21, 75) # age per sample
alain2$age <- vec2 # add new column

# put treatment/sex/age at the beginning of the table
library(dplyr)
prova <- alain2 %>%
  select("treatment","sex", "age", everything())
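A side note on the transpose step above: t() returns a matrix, and a matrix can hold only one type, so transposing a table that still contains text columns turns every expression value into a character string. The sketch below illustrates this on a toy table; the data frame `toy` and the `t1` values are hypothetical stand-ins, not the real data. Transposing only the numeric columns keeps the genes numeric.

```r
# Toy stand-in for a few rows of the expression table
# (ct1 values taken from the post, t1 values are made up for illustration)
toy <- data.frame(
  gene = c("A1BG", "A1BG-AS1", "A2M"),
  ct1  = c(7.58, 1.22, 4.07),
  t1   = c(6.90, 1.50, 3.80),
  stringsAsFactors = FALSE
)

# Transposing the whole table: the text column forces a character matrix
m <- t(toy)
print(is.character(m))

# Transposing only the numeric columns keeps the values numeric
expr <- as.data.frame(t(toy[, -1]))
colnames(expr) <- toy$gene
print(sapply(expr, class))
```

If the gene columns have already ended up as text, converting them back (for example with as.numeric on each gene column) before calling lm() should avoid coercion warnings.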

Then I wanted to first check whether these factors (age and sex) are statistically associated with the treatment by testing each one independently:

prova$treatment <- factor(prova$treatment, levels=c("treatment","untreated"))
prova$sex <- factor(prova$treatment, levels=c("F","M"))
prova$age <- as.numeric(prova$age)

but when I do :

summary(lm(treatment ~ age, data=prova))
summary(lm(treatment ~ sex, data=prova))
summary(lm(treatment ~ age + sex, data=prova))

I get this error message for each one:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion
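For context on this message: lm() expects a numeric response, and a text column such as treatment cannot be coerced to numbers, which is what "NAs introduced by coercion" points to. The sketch below reproduces the failure on a small stand-in table (`samples` is a hypothetical name, built from the sample annotations in the post) and then fits a model with the roles swapped, which is one possible way to ask whether age differs between groups, not necessarily the approach intended in the original post.

```r
# Hypothetical stand-in holding only the sample annotations from the post
samples <- data.frame(
  treatment = c("untreated", "untreated", "untreated",
                "treatment", "treatment", "treatment"),
  age = c(45, 56, 46, 65, 21, 75),
  stringsAsFactors = FALSE
)

# A character response cannot be turned into numbers, so lm() stops;
# tryCatch() stores the error message instead of aborting the script
res <- tryCatch(
  suppressWarnings(lm(treatment ~ age, data = samples)),
  error = function(e) conditionMessage(e)
)
print(res)

# With age as the (numeric) response and treatment as a factor predictor,
# lm() has something it can fit
samples$treatment <- factor(samples$treatment,
                            levels = c("treatment", "untreated"))
fit <- lm(age ~ treatment, data = samples)
print(coef(fit))
```

With only 6 samples, any such test has very little power, so the output is best read as a sanity check rather than evidence.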

Since my dataset is unfiltered, I thought the genes with a value of 0.1 might interfere with the results, so I went back to my original dataset, removed all rows that contain 0.1 (alain2 <- filter_if(alain, is.numeric, all_vars((.) != 0.1))) and re-ran everything, but I still get an error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored

I don't understand what is wrong with my dataset. I also tried to fit the linear regression for one specific gene (e.g. A1BG), and if I do:

summary(lm(A1BG ~ age, data=prova))

it works fine but if I do:

summary(lm(A1BG ~ sex, data=prova))
summary(lm(A1BG ~ age + sex, data=prova))

I get the error below, which suggests the problem is in the "sex" column, where I have only F or M (for female/male). It works fine if I do summary(lm(A1BG ~ treatment, data=prova)), so I don't think the problem is that the column is a character:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

If I check the levels I get the output below, but I do not understand why the sex column contains only NA instead of the 2 levels F and M:

'data.frame':   6 obs. of  16756 variables:
 $ treatment             : Factor w/ 2 levels "treatment","untreated": 2 2 2 1 1 1
 $ sex                   : Factor w/ 2 levels "F","M": NA NA NA NA NA NA
 $ age                   : num  45 56 46 65 21 75

Thanks for all the help!

Camilla

R linear regression residue • 1.7k views
3.7 years ago
prova$sex <- factor(prova$treatment, levels=c("F","M"))

should be

prova$sex <- factor(prova$sex, levels=c("F","M"))
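To see why the original line produced only NA: factor() maps every value that is not listed in levels to NA, and the treatment column contains only "treatment" and "untreated", never "F" or "M". A factor that is all NA has no usable levels, hence the "contrasts can be applied only to factors with 2 or more levels" error. A minimal sketch, using a small stand-in for the annotation columns of prova (the real table also holds the gene columns):

```r
# Stand-in for the sample annotation columns of `prova` (values from the question)
prova <- data.frame(
  treatment = c("untreated", "untreated", "untreated",
                "treatment", "treatment", "treatment"),
  sex = c("M", "F", "M", "F", "F", "F"),
  age = c(45, 56, 46, 65, 21, 75),
  stringsAsFactors = FALSE
)

# Original line: no value of `treatment` matches "F" or "M", so every entry is NA
wrong_sex <- factor(prova$treatment, levels = c("F", "M"))
print(wrong_sex)

# Corrected line: build the factor from the sex column itself
prova$sex <- factor(prova$sex, levels = c("F", "M"))
print(table(prova$sex))
```

Running str(prova) after the corrected line should show sex as a factor with both levels populated.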