Speed up umap parameter optimisation?
3 months ago

Hi everyone,

I am clustering cells from scRNA-seq by UMAP after PCA. I am now optimising min.dist and n.neighbors to improve the clarity of how the clusters are represented. This involves generating >4 UMAPs. Is there a common method to speed this process up, as the dataset is large? Is it common to do this using only a fraction of the dataset, or would that not be representative?

Thanks in advance for any help!

rna-seq umap r • 826 views

Not that I am aware of. Most of the time UMAP is just visualization. Personally, I prefer to have points scattered out rather than bunched up, and I get good results with spread and min.dist of 0.75. The rest of the parameters (referring to uwot::umap and scater::runUMAP in R) I leave at their defaults.

Is it common to do this using only a fraction of the dataset, or would that not be representative?

That entirely depends on the dataset and the message you want to send as well as the narrative of the story.
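If you do try tuning on a fraction of the cells, a minimal sketch could look like the following. It assumes you call uwot::umap directly on the PCA matrix; the simulated `pcs` matrix, the 20% fraction, and the parameter grid are placeholders for illustration, not recommendations. Keep in mind that random subsampling can under-represent rare populations, so it is worth re-running the chosen settings on the full dataset.

```r
library(uwot)

set.seed(1)
# Simulated stand-in for a cells x PCs matrix such as reducedDim(sce, "PCA")
pcs <- matrix(rnorm(5000 * 20), nrow = 5000, ncol = 20)

# Keep a random 20% of cells for the sweep (fraction chosen arbitrarily)
keep <- sample(nrow(pcs), size = round(0.2 * nrow(pcs)))
pcs_sub <- pcs[keep, , drop = FALSE]

# Small grid of settings to compare on the subset
grid <- expand.grid(n_neighbors = c(15, 30), min_dist = c(0.1, 0.3))

embeddings <- lapply(seq_len(nrow(grid)), function(i) {
  set.seed(42)  # same seed for every run so only the parameters differ
  umap(pcs_sub, n_neighbors = grid$n_neighbors[i], min_dist = grid$min_dist[i])
})
names(embeddings) <- paste0("nn", grid$n_neighbors, "_md", grid$min_dist)
```

Each element of `embeddings` is a two-column matrix of coordinates for the subsampled cells, which can be plotted side by side to pick settings.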

Thanks, ATpoint!

3 months ago

Using parallelization can help, so be sure to set BPPARAM if using scater's runUMAP. It still takes a while though. I agree with ATpoint in that I tend to prefer greater spread than the defaults, though I have never gone up to min.dist of 0.75. Here's a function I use to generate a whole bunch of them, which I then viz and pretty arbitrarily pick whichever one best balances global/local structure for my needs:

library(SingleCellExperiment)
library(scater)
library(BiocParallel)

#' @param sce SingleCellExperiment object.
#' @param dim_reduc Character scalar indicating the name of the dimensionality reduction to use as input.
#' @param min_dist Numeric vector indicating parameters to sweep for min_dist UMAP parameter.
#' @param n_neighbors Numeric vector indicating parameters to sweep for n_neighbors UMAP parameter.
#' @param spread Numeric vector indicating parameters to sweep for spread UMAP parameter. 
#'   In combination with min_dist, this controls the "clumpiness" of the cells.
#' @param BPPARAM BiocParallelParam object to use for parallelization.
umap_sweep <- function(sce, dim_reduc, 
                       min_dist = c(0.01, 0.02, 0.05, 0.1, 0.2, 0.3), 
                       n_neighbors = c(10, 15, 20, 30, 40, 50),
                       spread = c(0.8, 1, 1.2),
                       BPPARAM = BiocParallel::bpparam()
                       ) {

  for (d in min_dist) {
    for (n in n_neighbors) {
      for (sp in spread) {
        message("Running UMAP with min_dist = ", d, ", n_neighbors = ", n, ", spread = ", sp)
        sce <- runUMAP(sce, n_neighbors = n, min_dist = d, spread = sp,
                       name = paste0("UMAP_m.dist", d, "_n.neigh", n, "_spread", sp), 
                       dimred = dim_reduc, ncomponents = 2, BPPARAM = BPPARAM)
      }
    }
  }

  return(sce)
}

sce <- umap_sweep(sce, dim_reduc = "PCA")

Generally, I never find those with min.dist < 0.1 to be what I'm looking for, so feel free to adjust the defaults/inputs here to play with things more (or limit them so things run faster). n_neighbors also has a relatively limited impact in comparison to the other parameters.
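Because n_neighbors matters less, one possible shortcut is to fix it, compute the nearest neighbours once, and reuse them while sweeping min_dist. The sketch below assumes you call uwot::umap directly rather than going through runUMAP. It relies on uwot's `ret_nn` and `nn_method` arguments, and the simulated `pcs` matrix and the chosen values are placeholders.

```r
library(uwot)

set.seed(1)
# Simulated stand-in for a cells x PCs matrix such as reducedDim(sce, "PCA")
pcs <- matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20)

# First run: compute the embedding and also return the neighbour data
set.seed(42)
first <- umap(pcs, n_neighbors = 30, min_dist = 0.1, ret_nn = TRUE)
nn <- first$nn$euclidean  # list holding the idx and dist matrices

# Later runs: reuse the precomputed neighbours and vary only min_dist
min_dists <- c(0.2, 0.3, 0.5)
sweep <- lapply(min_dists, function(d) {
  set.seed(42)
  umap(pcs, n_neighbors = 30, nn_method = nn, min_dist = d)
})
names(sweep) <- paste0("md", min_dists)
```

The saving comes from skipping the neighbour search on every run after the first; the layout optimisation itself still runs each time, so how much this helps will depend on the dataset.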

I suggest setting a seed.
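As a small illustration of why, here is a sketch using uwot::umap on a simulated matrix (the `pcs` matrix and parameter values are placeholders). UMAP is stochastic, so calling set.seed() with the same value immediately before each run is what makes a layout repeatable.

```r
library(uwot)

set.seed(1)
# Simulated stand-in for a cells x PCs matrix
pcs <- matrix(rnorm(500 * 10), nrow = 500, ncol = 10)

# Two runs with the same seed set immediately beforehand
set.seed(42)
run1 <- umap(pcs, n_neighbors = 15, min_dist = 0.3)

set.seed(42)
run2 <- umap(pcs, n_neighbors = 15, min_dist = 0.3)

# A third run without resetting the seed, for comparison
run3 <- umap(pcs, n_neighbors = 15, min_dist = 0.3)

same_seed_matches <- isTRUE(all.equal(run1, run2))
```

In a parameter sweep, that means calling set.seed() inside the loop before every UMAP call, not just once at the top of the script; otherwise each layout depends on how many random numbers the earlier runs consumed.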

Thanks so much Jared. Any idea off-hand whether seurat umap can be run in a parallel approach?

I read up on future.apply. Going to try the following code:

library(Seurat)
library(future.apply)
library(ggplot2)

# Set up parallel backend (adjust based on your CPU)
plan("multisession", workers = 4)

# Define grid of hyperparameters
param_grid <- expand.grid(
  n.neighbors = c(25, 50, 75),
  min.dist = c(0.1, 0.2, 0.3),
  dims = list(1:10)  # Keep dims fixed at 1:10
)

# Convert to a list format
param_list <- split(param_grid, seq(nrow(param_grid)))

# Function to run UMAP and generate plots
run_umap_plot <- function(params) {
  obj <- RunUMAP(pbmc_small, reduction = "pca", 
                 dims = unlist(params$dims), 
                 n.neighbors = params$n.neighbors, 
                 min.dist = params$min.dist)

  plot <- DimPlot(obj, reduction = "umap") + 
    ggtitle(paste0("n.neighbors: ", params$n.neighbors, 
                   ", min.dist: ", params$min.dist))

  return(list(umap_object = obj, plot = plot))
}

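# Run UMAP in parallel; future.seed = TRUE gives each iteration a
# parallel-safe, reproducible RNG stream and avoids future's RNG warning
umap_results <- future_lapply(param_list, run_umap_plot, future.seed = TRUE)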

# Extract the plots into a list
umap_plots <- lapply(umap_results, function(x) x$plot)

# Reset parallel backend
plan("sequential")

# Example: Display all plots using gridExtra
library(gridExtra)
grid.arrange(grobs = umap_plots, ncol = 3)

Yes, definitely set a seed prior to use.
