Dear all,
I am doing a DGEA with DESeq2 and data imported with tximport. I have an unbalanced dataset as reported below. With the counts and the metadata that I have I would like to answer different questions.
- I would like to look at the differential expressed genes between the different Lines (e.g. Line "C" vs. Line "15" and all the combinations) while controlling for differences in Person_ID and Location
I would like to look at the differential expressed genes between the different Locations (e.g. Location "A" vs. Location "R" and all the other combinations) while controlling for differences in Person_ID and Line
I would like to look at the differential expressed genes between two different Lines in the same Location, e.g. considering Location A: which is the difference between Line 15 and Line 20 ?
- I would like to look at the differential expressed genes between two different Locations in the same Line, e.g. considering Line 20: which is the difference between Location A and Location D ?
Here is the code that I use to build the DESeq2 object:
dir<-"path/quantif"
setwd(dir)
samples<-list.files(dir)
files <-file.path(dir,samples,"t_data.ctab")
names(files)<-substr(str_split_i(files, "/", 6), 1, 10)
all(file.exists(files))
tmp <- read.csv(files[1], sep="\t")
head(tmp)
tx2gene <- tmp[, c("t_name", "gene_name")]
head(tx2gene)
txi<-tximport::tximport(files, type = "stringtie", tx2gene = tx2gene)
head(txi$counts)
samples_meta<-read.csv(file="/path/metadata.csv", sep="," , header = TRUE)
txi$counts <- txi$counts[, rownames(samples_meta)]
dds <-DESeqDataSetFromTximport(txi, colData = samples_meta, design = ~ Line) # which design do I have to use??
dds$Line <- relevel(dds$Line, ref = "C") # do I need a reference level?
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds<-DESeq(dds)
Metadata of the object:
Sample_ID Person_ID Line Location
4 SA18082157 KK1451 15 A
7 SA18083382 KK1473N 15 S
12 SA18083387 KK1450N 15 R
14 SA18083360 KK1480N 15 S
19 SA18083365 KK1551N 15 R
25 SA18083368 KK1471N 15 A
27 SA18086317 KK1443N 15 D
32 SA18051387 KK1868N 15 D
38 SA18051384 KK1865N 15 S
41 SA18051386 KK1601N 15 A
18 SA18083364 KK1551N 18 R
23 SA18083366 KK1471N 18 A
33 SA18088686 KK1671N 18 A
36 SA18088660 KK1434N 18 R
2 SA18082155 KK1451 20 A
6 SA18083381 KK1473N 20 S
10 SA18083386 KK1450N 20 R
16 SA18083344 KK1480N 20 S
17 SA18083363 KK1551N 20 R
24 SA18082300 KK1471N 20 A
26 SA18086318 KK1443N 20 D
29 SA18086315 KK1374N 20 D
31 SA18051386 KK1868N 20 D
34 SA18086313 KK1671N 20 A
37 SA18051383 KK1865N 20 S
40 SA18051388 KK1601N 20 A
3 SA18083341 KK1451 C A
8 SA18083346 KK1473N C S
11 SA18083343 KK1450N C R
15 SA18083345 KK1480N C S
20 SA18083342 KK1551N C R
1 SA18082156 KK1451 I A
5 SA18083383 KK1473N I S
13 SA18083386 KK1480N I S
22 SA18083367 KK1471N I A
30 SA18086316 KK1374N I D
35 SA18086318 KK1671N I A
39 SA18051385 KK1865N I S
42 SA18051360 KK1601N I A
9 SA18083385 KK1450N M R
21 SA18082301 KK1471N M A
28 SA18088662 KK1443N M D
Which is the best design to answer all these questions? Do I need different designs for all the questions?
Once the design is set, how can I extract the right contrast?
sessionInfo( )
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2
[5] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 tidyverse_1.3.1
[9] ggplot2_3.4.4 DESeq2_1.40.2 SummarizedExperiment_1.30.2 Biobase_2.60.0
[13] MatrixGenerics_1.12.3 matrixStats_1.0.0 GenomicRanges_1.52.1 GenomeInfoDb_1.36.4
[17] IRanges_2.34.1 S4Vectors_0.38.2 BiocGenerics_0.46.0 tximport_1.28.0
loaded via a namespace (and not attached):
[1] gtable_0.3.4 lattice_0.22-5 tzdb_0.4.0 vctrs_0.6.4 tools_4.3.1
[6] bitops_1.0-7 generics_0.1.3 parallel_4.3.1 fansi_1.0.5 pkgconfig_2.0.3
[11] Matrix_1.6-1.1 dbplyr_2.3.4 readxl_1.4.3 lifecycle_1.0.3 GenomeInfoDbData_1.2.10
[16] compiler_4.3.1 munsell_0.5.0 codetools_0.2-19 RCurl_1.98-1.12 pillar_1.9.0
[21] crayon_1.5.2 BiocParallel_1.34.2 DelayedArray_0.26.7 abind_1.4-5 rvest_1.0.3
[26] tidyselect_1.2.0 locfit_1.5-9.8 stringi_1.7.12 grid_4.3.1 colorspace_2.1-0
[31] cli_3.6.1 magrittr_2.0.3 S4Arrays_1.0.6 utf8_1.2.3 broom_1.0.5
[36] withr_2.5.1 scales_1.2.1 backports_1.4.1 timechange_0.2.0 lubridate_1.9.3
[41] modelr_0.1.11 XVector_0.40.0 httr_1.4.7 cellranger_1.1.0 hms_1.1.3
[46] haven_2.5.3 rlang_1.1.1 Rcpp_1.0.11 glue_1.6.2 DBI_1.1.3
[51] xml2_1.3.5 reprex_2.0.2 rstudioapi_0.15.0 jsonlite_1.8.7 R6_2.5.1
[56] fs_1.6.3 zlibbioc_1.46.0