Even though the layers are distinct, Seurat still maintains a single features/cells validation for the assay. This means that when you manipulate individual layers like that, the object still keeps the widest feature set covering all features across all the layers, and any feature missing from an individual layer is simply treated as zero.
If you want to filter out features with low gene counts, you can build a list of features to keep and then just run:
filtered_seurat <- subset(filtered_seurat, features = keep.features)
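For what it's worth, keep.features could be built from the pooled counts along these lines; the object/assay names and the ">= 10 cells" threshold below are just placeholders:
## pool the counts.* layers and keep genes detected in at least 10 cells overall
tmp <- filtered_seurat
tmp[["RNA"]] <- JoinLayers(tmp[["RNA"]])
pooled.counts <- LayerData(tmp, assay = "RNA", layer = "counts")
n.cells.detected <- Matrix::rowSums(pooled.counts > 0)
keep.features <- names(n.cells.detected)[n.cells.detected >= 10]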
Additionally, I have a general question: how should one handle an analysis with multiple layers when filtering cells/features, assigning cell cycle scores, and identifying/removing doublets?
I run DoubletFinder on individual libraries before merging, mostly to conserve memory since I run exploratory analyses on a workstation. You can use the LayerData() function above to pull out individual counts. Here are some generic functions I use, adapted from DoubletFinder's GitHub.
## returns the expected multiplet rate (%) for a given number of recovered cells
get.10x.multiplets <- function(x){
  ## expected multiplet rates (%) by number of cells recovered
  expected.df <- data.frame(
    "Cells.Recovered" = c(500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000),
    "Multiplet.Rate" = c(0.4, 0.8, 1.6, 2.4, 3.2, 4.0, 4.8, 5.6, 6.4, 7.2, 8.0)
  )
  ## fit a linear model and interpolate to the requested library size
  lm.res <- lm(Multiplet.Rate ~ Cells.Recovered, data = expected.df)
  test.df <- data.frame("Cells.Recovered" = x)
  predict(lm.res, test.df)
}
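For example, a library with roughly 7,500 recovered cells interpolates to about a 6% expected multiplet rate:
get.10x.multiplets(7500)
## ~6.0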
## runs DoubletFinder on a single-library Seurat v5 object and returns the per-cell calls
get.10x.doublets <- function(x){
  require(DoubletFinder)
  ## standard pre-processing up to clustering
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 1000)
  x <- ScaleData(x, features = VariableFeatures(x))
  x <- RunPCA(x, features = VariableFeatures(x), verbose = FALSE)
  x <- RunUMAP(x, reduction = "pca", dims = 1:20, reduction.key = "UMAPpca_", reduction.name = "UMAP_PCA")
  x <- FindNeighbors(x, reduction = "pca", dims = 1:20)
  x <- FindClusters(x, resolution = 0.3)
  ## pK selection via parameter sweep
  x.sweep <- paramSweep(x, PCs = 1:20, sct = FALSE)
  x.sweep.stats <- summarizeSweep(x.sweep, GT = FALSE)
  x.bcmvn <- find.pK(x.sweep.stats)
  x.pK <- as.numeric(as.character(x.bcmvn$pK[which.max(x.bcmvn$BCmetric)]))
  ## expected doublet count, with and without adjustment for homotypic doublets
  homotypic.prop <- modelHomotypic(x$seurat_clusters)
  mlt.est <- get.10x.multiplets(nrow(x@meta.data))/100
  nExp_poi <- round(mlt.est*nrow(x@meta.data))
  nExp_poi.adj <- round(nExp_poi*(1-homotypic.prop))
  ## run DoubletFinder twice, reusing the pANN from the first run for the adjusted estimate
  x <- doubletFinder(x, PCs = 1:20, pN = 0.25, pK = x.pK, nExp = nExp_poi, sct = FALSE, reuse.pANN = FALSE)
  x <- doubletFinder(x, PCs = 1:20, pN = 0.25, pK = x.pK, nExp = nExp_poi.adj, sct = FALSE,
                     reuse.pANN = grep("pANN", colnames(x@meta.data), value = TRUE))
  ## collect the two classification columns plus the cell names
  doublet.df <- x@meta.data[, grep("DF.classifications", colnames(x@meta.data), value = TRUE)]
  colnames(doublet.df) <- c("Estimate", "Adj.Estimate")
  doublet.df$Cellname <- rownames(doublet.df)
  doublet.df
}
## calculate doublet probabilities using DoubletFinder - takes a while
doublet.res.list <- lapply(seurat.list, get.10x.doublets)
doublet.res <- do.call(rbind, doublet.res.list)
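seurat.list here can be your original per-library objects or, as suggested above, objects rebuilt from the counts layers of the merged object. A rough sketch of the latter, plus attaching the calls back by cell name; the object name merged and the new metadata column are placeholders:
## one Seurat object per counts layer (i.e. per library)
counts.layers <- grep("^counts", Layers(merged[["RNA"]]), value = TRUE)
seurat.list <- lapply(counts.layers, function(lyr) {
  CreateSeuratObject(counts = LayerData(merged, assay = "RNA", layer = lyr))
})
names(seurat.list) <- counts.layers
## after the lapply()/rbind() above, add the adjusted calls back to the merged object
merged$DF.Adj.Estimate <- doublet.res$Adj.Estimate[match(colnames(merged), doublet.res$Cellname)]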
If anyone has a better way or spots a mistake, please feel free to share.
I do all my filtering by defining the cells/features to keep and then running the relevant seuratobject <- subset(seuratobject, ...) command. Cell cycle scoring depends on which package you want to use, e.g. Seurat's own implementation, tricycle, etc. For Seurat's own, the vignettes on the website walk you through an example analysis.
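For example, with Seurat's bundled gene lists a minimal run looks roughly like this (object name is a placeholder; if your data layers are still split you may need to join them first):
## Seurat cell cycle scoring with the built-in S and G2/M gene lists
seuratobject <- NormalizeData(seuratobject)
seuratobject <- CellCycleScoring(
  seuratobject,
  s.features   = cc.genes.updated.2019$s.genes,
  g2m.features = cc.genes.updated.2019$g2m.genes,
  set.ident    = FALSE
)
## scores and phase calls land in the metadata
head(seuratobject[[c("S.Score", "G2M.Score", "Phase")]])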
Thank you so much, Yura, for your support! I understand, however, that I should still find the genes that satisfy the filter criterion in both layers, and not keep the ones that satisfy it in only one, right? So I would intersect the per-layer lists of features to keep before subsetting.
However, I wonder: don't I run the risk of losing a significant number of potentially important genes this way, if I only keep the genes that satisfy the filter in both layers? Or am I making the analysis harder for myself than it needs to be? Thank you so much for your help and clarification.
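For illustration, the two options I am weighing would look something like this (the layer names and the ">= 10 cells" threshold are placeholders):
## genes passing the count filter in each layer
pass.per.layer <- lapply(c("counts.1", "counts.2"), function(lyr) {
  m <- LayerData(filtered_seurat, assay = "RNA", layer = lyr)
  rownames(m)[Matrix::rowSums(m > 0) >= 10]
})
keep.strict  <- Reduce(intersect, pass.per.layer)  ## must pass in both layers
keep.lenient <- Reduce(union, pass.per.layer)      ## may pass in either layer
filtered_seurat <- subset(filtered_seurat, features = keep.strict)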