I am trying to correct my RNA seq data for 3 categorical variables as well as preserve the biological information of the dataset. In order to do that, I have used the removeBatchEffect
function from limma
.
I used a log2(TPM counts + 1) matrix as my input but... as you can see in the following screenshot, the corrected counts that I get have negative values. Since it doesn't make sense to have genes with a negative expression, is there any way to solve this? I don't know if the output is log2(tpm) and I just have to do +1 or unlog the corrected data... In any case, I cannot remove the genes with negative values, I would like to keep all of them if it is possible.
Does anybody know how to solve this?
Thanks very much in advance
Code:
set.seed(11344)
covariates_df <- data.frame(
Sample = paste0("A", seq(1,960)),
Age = sample(seq(18, 70), size = 960, replace = TRUE),
Group = sample(seq(1, 3), size = 960, replace = TRUE),
Sex = sample(seq(1, 2), size = 960, replace = TRUE),
WBC = sample(seq(0.5, 35, by=0.3), size = 960, replace = TRUE),
Place = sample(seq(1, 2), size = 960, replace = TRUE),
Tubes = sample(seq(1, 2), size = 960, replace = TRUE),
LibPrep = sample(seq(1, 16), size = 960, replace = TRUE)
)
rownames(covariates_df) <- covariates_df$Sample
categorical_variables <- c("Sample", "Group", "Sex", "Place", "Tubes", "LibPrep")
covariates_df[,categorical_variables] <- lapply(covariates_df[,categorical_variables],as.factor)
str(covariates_df)
'data.frame': 960 obs. of 8 variables:
$ Sample : Factor w/ 960 levels "A1","A10","A100",..: 1 112 223 334 445 556 667 778 889 2 ...
$ Age : int 40 66 37 40 25 42 46 51 45 61 ...
$ Group : Factor w/ 3 levels "1","2","3": 2 2 2 3 2 2 3 3 3 3 ...
$ Sex : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 1 2 1 ...
$ WBC : num 12.8 29.9 25.7 0.8 30.8 5.9 30.5 19.1 2.9 12.2 ...
$ Place : Factor w/ 2 levels "1","2": 1 1 2 1 2 1 2 2 1 2 ...
$ Tubes : Factor w/ 2 levels "1","2": 1 1 2 1 2 2 2 2 1 2 ...
$ LibPrep: Factor w/ 16 levels "1","2","3","4",..: 14 1 13 4 10 5 1 8 7 5 ...
dim(log2TPM_1_counts_mat)
[1] 60000 960
str(log2TPM_1_counts_mat)
num [1:60000, 1:960] 0.51 0 2.76 2.64 2.26 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:60000] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
..$ : chr [1:960] "A1" "A2" "A3" "A4" ...
### I wanted to upload the counts data but it is too huge and I didn't manage. I tried to create a small counts data, but I wasn't getting any negative values as I get in reality in order to show you my problem.
Design of the model:
design <- model.matrix(~Age + Group + Sex + WBC + Place + Tubes + LibPrep, data=covariates_df)
bio.design <- design[,1:6]
batch.design <- design[,-(1:6)]
Adjustment with removeBatchEffect
:
corrected_cts <- limma::removeBatchEffect(log2TPM_1_counts_mat,design=bio.design,covariates=batch.design)
Thanks very much for your reply.
I think I will continue with the negative values, because I don't want to transform more the data than batch correction does... but thanks for giving other alternative and explain how to do it, I really appreciate it.
What is "log1p"?
Abbreviation for log(1+x); from https://numpy.org/doc/stable/reference/generated/numpy.log1p.html