Question

snRNA-seq of the healthy human brain

1

7 months ago

AlexStar ▴ 180

I am conducting a Scanpy analysis on human brain snRNA-seq data. As part of the quality control process, I need to filter out low-quality cells, including those with an unusually low or high number of expressed genes. I am unfamiliar with the general transcriptomic characteristics of brain cells.

What would you consider appropriate thresholds for the number of genes with at least one count (i.e., expressed) in brain cells? Specifically, what cutoffs are recommended for the n_genes_by_counts QC metric (as computed by sc.pp.calculate_qc_metrics) in the Scanpy pipeline?

python scanpy anndata single-cell • 1.1k views

ADD COMMENT • link updated 7 months ago by LChart 5.0k • written 7 months ago by AlexStar ▴ 180

1

Look at the data and filter cells that are outliers relative to the bulk. Alternatively, annotate coarse cell types first and then filter within each cell type. There is no general threshold; it always depends on your data.
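
A minimal sketch of the per-cell-type filtering idea, assuming the per-cell metadata is available as a pandas DataFrame (for example `adata.obs`). The column name `cell_type`, the helper `flag_outliers_per_group`, and the cutoff of 5 median absolute deviations (MADs) are illustrative assumptions, not values from this thread.

```python
import numpy as np
import pandas as pd

def flag_outliers_per_group(obs, metric="n_genes_by_counts",
                            group="cell_type", n_mads=5.0):
    """Flag cells whose log1p(metric) lies more than n_mads MADs
    from the median of their own group (hypothetical helper)."""
    values = np.log1p(obs[metric].astype(float))
    # Median of the metric within each cell type, broadcast back to cells.
    median = values.groupby(obs[group], observed=True).transform("median")
    deviation = (values - median).abs()
    # Median absolute deviation within each cell type.
    mad = deviation.groupby(obs[group], observed=True).transform("median")
    return deviation > n_mads * mad

# Toy data: neurons express more genes than oligodendrocytes,
# and each group contains one extreme cell.
obs = pd.DataFrame({
    "cell_type": ["neuron"] * 6 + ["oligo"] * 6,
    "n_genes_by_counts": [5800, 6000, 6200, 6100, 5900, 300,
                          1900, 2000, 2100, 2050, 1950, 9000],
})
flags = flag_outliers_per_group(obs)
print(obs.assign(outlier=flags))
```

Note that a neuron with about 6,000 genes is not flagged even though that value would be extreme for an oligodendrocyte, which is the point of filtering per cell type rather than with one global cutoff.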

ADD REPLY • link 7 months ago by ATpoint 88k

0

ATpoint Yes, but generally speaking, would a cell expressing, say, 8,000 genes strike you as unusually high in your experience, particularly for brain tissue? I already have cell annotations and have reviewed several plots related to this issue, but with my current knowledge I cannot tell whether a group of cells shows abnormal gene expression. Here is a plot of n_genes_by_counts against total counts per cell. As you can see, many cells are above 8,000, so I can't simply treat them as outliers.

[Figure: scatter plot of n_genes_by_counts against total counts per cell]

ADD REPLY • link 7 months ago by AlexStar ▴ 180

0

The typical worry with a cell "expressing too many genes" is that it may be a doublet. You are generally better off using a dedicated doublet-detection tool than hard-thresholding on the number of features. You could do something like plotting the fraction of barcodes that co-express mutually exclusive lineage markers (for example, the oligodendrocyte markers MOG|MAG|MOBP together with the astrocyte markers AQP4|GFAP|ALDH1L1) as a function of genes-by-counts, but this is basically what doublet algorithms do anyway.
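
A rough sketch of that marker co-expression check, assuming raw counts for the six marker genes are available as a cells-by-genes pandas DataFrame and that genes-by-counts is a Series on the same index. The helper name `coexpression_by_bin`, the "count > 0" definition of expression, and the quantile binning are assumptions made for illustration.

```python
import pandas as pd

OLIGO = ["MOG", "MAG", "MOBP"]        # oligodendrocyte markers
ASTRO = ["AQP4", "GFAP", "ALDH1L1"]   # astrocyte markers

def coexpression_by_bin(counts, n_genes, n_bins=10):
    """Fraction of barcodes expressing at least one marker from BOTH
    lineages, per quantile bin of genes-by-counts (hypothetical helper)."""
    oligo_pos = (counts[OLIGO] > 0).any(axis=1)
    astro_pos = (counts[ASTRO] > 0).any(axis=1)
    both = (oligo_pos & astro_pos).astype(float)
    # Quantile bins of genes-by-counts; 0 is the lowest bin.
    bins = pd.qcut(n_genes, q=n_bins, labels=False, duplicates="drop")
    return both.groupby(bins).mean()

# Toy data: eight barcodes, two of the high-complexity ones carry both lineages.
cells = [f"c{i}" for i in range(8)]
counts = pd.DataFrame(0, index=cells, columns=OLIGO + ASTRO)
counts.loc["c0", "MOG"] = 3
counts.loc["c1", "AQP4"] = 2
counts.loc["c2", "MOBP"] = 1
counts.loc["c3", "GFAP"] = 5
counts.loc["c4", ["MOG", "GFAP"]] = [2, 1]
counts.loc["c5", "MAG"] = 4
counts.loc["c6", ["MOBP", "AQP4"]] = [1, 3]
counts.loc["c7", "ALDH1L1"] = 2
n_genes = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000],
                    index=cells, name="n_genes_by_counts")

frac = coexpression_by_bin(counts, n_genes, n_bins=2)
print(frac)
```

If the co-expressing fraction climbs steeply in the top bins, that would point to doublets among the high-complexity barcodes. A flat curve would suggest that the high gene counts reflect real biology or sequencing depth. Ambient RNA can also produce low-level co-expression, so a stricter expression cutoff than "count > 0" may be worth trying.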

ADD REPLY • link 7 months ago by LChart 5.0k

0

LChart I use Scrublet for doublet detection. The plot shows the data after applying Scrublet. Specifically, I removed all cells with a doublet score above 0.25, which is a very strict threshold (from what I have observed, a threshold of 0.4 is typically used).

ADD REPLY • link 7 months ago by AlexStar ▴ 180

0

Unless you have a specific concern, you can likely go ahead and proceed with the analysis. You can generate diagnostic plots showing where the different coverage (or genes_by_counts) deciles fall in PC space or UMAP space to see whether additional filtering is likely needed.
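
One way to sketch the decile diagnostic, assuming genes-by-counts is a pandas Series and the embedding (PCA or UMAP coordinates) is a NumPy array with one row per cell in the same order. The helper `decile_centroids` is hypothetical. It labels each cell with its decile and summarizes where each decile sits in the embedding.

```python
import numpy as np
import pandas as pd

def decile_centroids(n_genes, embedding, n_bins=10):
    """Assign each cell to a quantile bin of genes-by-counts and return
    (bin labels, mean embedding coordinates per bin). Hypothetical helper."""
    deciles = pd.qcut(n_genes, q=n_bins, labels=False, duplicates="drop")
    coords = pd.DataFrame(embedding, index=n_genes.index).add_prefix("dim_")
    return deciles, coords.groupby(deciles).mean()

# Toy data: 20 cells whose first embedding axis tracks gene complexity exactly.
n_genes = pd.Series(np.arange(100, 2100, 100), name="n_genes_by_counts")
embedding = np.column_stack([n_genes.to_numpy() / 1000.0,
                             np.zeros(len(n_genes))])

deciles, centroids = decile_centroids(n_genes, embedding)
print(centroids)
```

With an AnnData object, the same labels could be stored as a categorical column in `adata.obs` and passed to `sc.pl.umap(adata, color=...)` or `sc.pl.pca(adata, color=...)` to view them directly. If the top decile forms its own island or a leading PC mostly orders cells by decile, that would be a hint that more filtering or a closer look at normalization is needed.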

ADD REPLY • link 7 months ago by LChart 5.0k