Question

Speed up umap parameter optimisation?

5 months ago

noodlejackson ▴ 40

Hi everyone,

I am clustering cells from scRNA-seq by UMAP after PCA. I am now optimising min.dist and n.neighbors to improve the clarity of how the clusters are represented. This involves generating >4 UMAPs. Is there a common method to speed this process up, as the dataset is large? Is it common to do this using only a fraction of the dataset, or would that not be representative?

Thanks in advance for any help!

rna-seq umap r • 1.3k views

ADD COMMENT • link updated 5 months ago by jared.andrews07 ★ 19k • written 5 months ago by noodlejackson ▴ 40

Not that I am aware of. Most of the time UMAP is just a visualization. Personally, I prefer to have points scattered out rather than bunched up, and I get good results with spread and min.dist of 0.75. The rest (referring to uwot::umap and scater::runUMAP in R) I leave at the defaults.

Is it common to do this using only a fraction of the dataset, or would that not be representative?

That entirely depends on the dataset and the message you want to send as well as the narrative of the story.

ADD REPLY • link 5 months ago by ATpoint 88k

Thanks ATpoint !

ADD REPLY • link 5 months ago by noodlejackson ▴ 40

score 1 · Answer 1 · 2025-02-07

Using parallelization can help, so be sure to set BPPARAM if using scater's runUMAP. It still takes a while, though. I agree with ATpoint in that I tend to prefer greater spread than the defaults, though I have never gone up to a min.dist of 0.75. Here's a function I use to generate a whole bunch of them, which I then visualize and pretty arbitrarily pick whichever one best balances global/local structure for my needs:

library(SingleCellExperiment)
library(scater)
library(BiocParallel)

#' Run UMAP over every combination of min_dist, n_neighbors and spread.
#'
#' @param sce SingleCellExperiment object.
#' @param dim_reduc Character scalar indicating the name of the dimensionality reduction to use as input.
#' @param min_dist Numeric vector indicating parameters to sweep for min_dist UMAP parameter.
#' @param n_neighbors Numeric vector indicating parameters to sweep for n_neighbors UMAP parameter.
#' @param spread Numeric vector indicating parameters to sweep for spread UMAP parameter.
#'   In combination with min_dist, this controls the "clumpiness" of the cells.
#' @param BPPARAM BiocParallelParam object to use for parallelization.
#' @return The input sce with one reducedDim entry added per parameter combination.
umap_sweep <- function(sce, dim_reduc, 
                       min_dist = c(0.01, 0.02, 0.05, 0.1, 0.2, 0.3), 
                       n_neighbors = c(10, 15, 20, 30, 40, 50),
                       spread = c(0.8, 1, 1.2),
                       BPPARAM = BiocParallel::bpparam()
                       ) {

  for (d in min_dist) {
    for (n in n_neighbors) {
      for (sp in spread) {
        message("Running UMAP with min_dist = ", d, ", n_neighbors = ", n, ", spread = ", sp)
        sce <- runUMAP(sce, n_neighbors = n, min_dist = d, spread = sp,
                       name = paste0("UMAP_m.dist", d, "_n.neigh", n, "_spread", sp), 
                       dimred = dim_reduc, ncomponents = 2, BPPARAM = BPPARAM)
      }
    }
  }

  return(sce)
}

sce <- umap_sweep(sce, dim_reduc = "PCA")
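
For the visualization step, a rough sketch like the following might help keep track of which embedding came from which parameters. The helper `parse_umap_names()` is hypothetical (it is not part of scater); it only assumes the naming scheme used in `umap_sweep()` above. The commented plotting call additionally assumes a `cluster` column in `colData(sce)`, which may be named differently in your object.

```r
# Hypothetical helper: recover the swept parameters from reducedDim names
# of the form "UMAP_m.dist<d>_n.neigh<n>_spread<sp>" (as built by umap_sweep).
parse_umap_names <- function(nms) {
  pattern <- "^UMAP_m\\.dist([0-9.]+)_n\\.neigh([0-9]+)_spread([0-9.]+)$"
  hits <- nms[grepl(pattern, nms)]  # drop non-sweep entries such as "PCA"
  data.frame(
    name = hits,
    min_dist = as.numeric(sub(pattern, "\\1", hits)),
    n_neighbors = as.integer(sub(pattern, "\\2", hits)),
    spread = as.numeric(sub(pattern, "\\3", hits)),
    stringsAsFactors = FALSE
  )
}

# Stand-in for reducedDimNames(sce), so the helper can be tried without data.
example_names <- c("PCA",
                   "UMAP_m.dist0.1_n.neigh15_spread1",
                   "UMAP_m.dist0.3_n.neigh30_spread1.2")
params <- parse_umap_names(example_names)
print(params)

# With a real object (assumes colData(sce)$cluster exists):
# params <- parse_umap_names(reducedDimNames(sce))
# plots <- lapply(params$name, function(nm) {
#   plotReducedDim(sce, dimred = nm, colour_by = "cluster") + ggplot2::ggtitle(nm)
# })
```

The resulting list of ggplot objects can then be paged through or arranged in a grid with whatever plotting layout package you prefer.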

Generally, I never find those with min.dist < 0.1 to be what I'm looking for, so feel free to adjust the defaults/inputs here to play with things more (or limit them so things run faster). n_neighbors also has a relatively limited impact in comparison to the other parameters.
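
To give a feel for how much trimming the inputs saves, here is a small sketch that counts the runs in the default grid versus a reduced one, and draws a random subset of cells to sweep on first. The cell counts, the seed, and the reduced parameter values are illustrative assumptions only, and whether a subsample is representative still depends on the dataset (rare populations can easily drop out).

```r
# Every row of the grid is one runUMAP() call in umap_sweep().
grid <- expand.grid(min_dist = c(0.01, 0.02, 0.05, 0.1, 0.2, 0.3),
                    n_neighbors = c(10, 15, 20, 30, 40, 50),
                    spread = c(0.8, 1, 1.2))
n_full <- nrow(grid)  # 6 * 6 * 3 combinations

# Drop min_dist < 0.1 and keep only two n_neighbors values (illustrative choice).
trimmed <- subset(grid, min_dist >= 0.1 & n_neighbors %in% c(15, 30))
n_trimmed <- nrow(trimmed)  # 3 * 2 * 3 combinations

message("Full grid: ", n_full, " runs; trimmed grid: ", n_trimmed, " runs")

# Random subset of cells for a quick first pass.
set.seed(1)                 # arbitrary seed, for reproducibility only
n_cells <- 50000            # assumption: replace with ncol(sce)
keep <- sort(sample(n_cells, size = 10000))

# With a real object, the quick pass could then look like:
# sce_sub <- umap_sweep(sce[, keep], dim_reduc = "PCA",
#                       min_dist = c(0.1, 0.2, 0.3),
#                       n_neighbors = c(15, 30),
#                       BPPARAM = BiocParallel::MulticoreParam(4))
```

Once a parameter combination looks reasonable on the subset, it can be rerun once on the full object rather than sweeping everything at full size.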