I'm testing the Monocle3 (version 0.2.0) with the following code. The main aim of this analysis is to identify differentially expressed genes for every time point.
library(monocle3)
library(dplyr) # imported for some downstream data manipulation
expression_matrix <- readRDS(url("http://staff.washington.edu/hpliner/data/packer_embryo_expression.rds"))
cell_metadata <- readRDS(url("http://staff.washington.edu/hpliner/data/packer_embryo_colData.rds"))
gene_annotation <- readRDS(url("http://staff.washington.edu/hpliner/data/packer_embryo_rowData.rds"))
cds <- new_cell_data_set(expression_matrix,
cell_metadata = cell_metadata,
gene_metadata = gene_annotation)
cds <- preprocess_cds(cds, num_dim = 50)
cds <- reduce_dimension(cds, reduction_method = "UMAP")
gene_fits <- fit_models(cds, model_formula_str = "~time.point + 0", cores = 15)
fit_coefs <- coefficient_table(gene_fits) %>%
dplyr::select(-model, -model_summary)
There I perform the regression DEG analysis with the categorical term, e.g. ~time.point + 0.
Using the q_value threshold <=0.05. I still get too many DEGs.
fit_coefs %>%
filter(q_value <= 0.05) %>%
group_by(term) %>%
summarise(n = n())
Which gives me:
# A tibble: 4 x 2
term n
<chr> <int>
1 time.point300_minutes 14324
2 time.point400_minutes 16615
3 time.point500_minutes 18557
4 time.pointmixed 18114
Notice that the DEGs are above 10K genes, which is unusually high. One would expect they are in the range of hundreds. Please advise what's the reasonable way to select the DEGs in this case.
If I'm wrong, I'd like to know how to properly interpret the DEG in this case, for example with DEG 14324, does that mean that there are 14324 genes that are specifically expressed in time.point300_minutes? That is unlikely because I checked there are around 13K genes that appear in all 4 time-points.