Hi!
I am trying to run a linear regression on age and sex for a bulk RNA-seq dataset of 6 human samples (3 untreated and 3 treated) covering 21011 genes. I am trying to reproduce this Biostars post: I want to obtain the residuals to see how age and sex affect gene expression, but I am getting a few errors. My dataset originally looks like this (the values are RPKMs; I know I should use the raw counts, but I do not have them):
# A tibble: 6 x 11
`Column ID` Description `Associated Gen~ `ct1`
<chr> <chr> <chr> <dbl>
1 ENSG0000027~ 7SK RNA [So~ 7SK 0.702
2 ENSG0000020~ 7SK RNA [So~ 7SK (i) 87779.
3 ENSG0000027~ 7SK RNA [So~ 7SK (ii) 0.1
4 ENSG0000012~ alpha-1-B g~ A1BG 7.58
5 ENSG0000026~ A1BG antise~ A1BG-AS1 1.22
6 ENSG0000017~ alpha-2-mac~ A2M 4.07
# ... with 7 more variables: ctr2 <dbl>,
# ct3 <dbl>, t1 <dbl>,
# t2 <dbl>, t3 <dbl>,
# log2FC <dbl>, p_value <dbl>
Looking at the post, I realized I have to transpose and modify the table a bit, so I did:
myData <- alain[-c(10,11)] #remove undesired columns (log2FC, p_value)
alain2 <- t(myData) #transpose
alain2 <- as.data.frame(alain2) #convert into a data frame
alain2 <- alain2[-c(1, 2),] #remove undesired rows (Column ID, Description)
colnames(alain2) <- as.character(unlist(alain2[1,])) #use the first row (gene names) as header
alain2 = alain2[-1, ] #remove the first row, now that it is the header
#add samples info (treatment, sex, age)
vec <- c("untreated", "untreated", "untreated", "treatment", "treatment", "treatment") # Create example vector
alain2$treatment <- vec #add new column
vec1 <- c("M", "F", "M", "F", "F", "F") # Create example vector
alain2$sex <- vec1 #add new column
vec2 <- c(45, 56, 46, 65, 21, 75) # Create example vector
alain2$age <- vec2 #add new column
#put treatment/sex/age at the beginning of the table
library(dplyr)
prova <- alain2 %>%
select("treatment","sex", "age", everything())
Then I wanted to first check whether these factors (age and sex) have a statistically significant influence on the treatment by testing each independently:
prova$treatment <- factor(prova$treatment, levels=c("treatment","untreated"))
prova$sex <- factor(prova$treatment, levels=c("F","M"))
prova$age <- as.numeric(prova$age)
but when I do:
summary(lm(treatment ~ age, data=prova))
summary(lm(treatment ~ sex, data=prova))
summary(lm(treatment ~ age + sex, data=prova))
I get this error message for each one:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion
So I thought that, since my dataset is unfiltered, the presence of genes with a value of 0.1 might interfere with the results, so I went back to my original dataset and removed all rows that contain 0.1 (alain2 <- filter_if(alain, is.numeric, all_vars((.) != 0.1))) and then re-ran everything, but I still get an error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
I don't understand what is wrong with my dataset. Also, I tried to fit the linear regression for one specific gene (e.g. A1BG), and if I do:
summary(lm(A1BG ~ age, data=prova))
it works fine but if I do:
summary(lm(A1BG ~ sex, data=prova))
summary(lm(A1BG ~ age + sex, data=prova))
I get the error below, which suggests the problem is in the "sex" column, where I have only F or M (for female/male). It works fine if I do summary(lm(A1BG ~ treatment, data=prova)), so I don't think the problem is that the column is a character:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
So if I check the levels I get this, but I do not understand why, instead of the 2 levels F and M, I have only NA:
'data.frame': 6 obs. of 16756 variables:
$ treatment : Factor w/ 2 levels "treatment","untreated": 2 2 2 1 1 1
$ sex : Factor w/ 2 levels "F","M": NA NA NA NA NA NA
$ age : num 45 56 46 65 21 75
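In case it helps to reproduce the problem without my full table, here is a minimal sketch with the same six samples and a single gene column. The A1BG values and the names toy and fit_msg are made up for illustration; only the treatment, sex and age vectors and the factor() calls are copied from my script above.

```r
# Toy reproduction: the A1BG values below are invented, not taken from my data
toy <- data.frame(
  treatment = c("untreated", "untreated", "untreated",
                "treatment", "treatment", "treatment"),
  sex       = c("M", "F", "M", "F", "F", "F"),
  age       = c(45, 56, 46, 65, 21, 75),
  A1BG      = c(7.58, 6.90, 8.10, 5.20, 4.80, 5.50),  # invented values
  stringsAsFactors = FALSE
)

# Same conversions as in my script above
toy$treatment <- factor(toy$treatment, levels = c("treatment", "untreated"))
toy$sex       <- factor(toy$treatment, levels = c("F", "M"))
toy$age       <- as.numeric(toy$age)

str(toy)  # inspect the column types and values

# Capture the error message instead of stopping, so the whole snippet runs
fit_msg <- tryCatch(
  {
    summary(lm(A1BG ~ sex, data = toy))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(fit_msg)
```

Running str(toy) should show whether the sex column has the same NA pattern as in prova, and fit_msg holds whatever error lm() raises for the sex model.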
Thanks for all the help!
Camilla