24 days ago
JACKY
I am conducting a Scanpy analysis on human brain snRNA-seq data. As part of the quality control process, I need to filter out low-quality cells, including those with an unusually low or high number of expressed genes. I am unfamiliar with the general transcriptomic characteristics of brain cells.
What would you consider appropriate thresholds for the number of genes with at least one count (i.e., expressed) in brain cells? Specifically, what are the recommended thresholds for the n_genes_by_counts parameter in the Scanpy pipeline?
Look at the data and filter outliers from the bulk of cells. Alternatively, annotate crude cell types first and then apply the filters per cell type. There is no general threshold; it always depends on your data.
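To make the per-cell-type filtering suggestion concrete, here is a minimal sketch that flags cells whose gene count lies far from the median of their own cell type, using the median absolute deviation (MAD) on a log1p scale. The function name `mad_outlier_flags`, the cutoff of 5 MADs, and the log transform are all assumptions for illustration, not a recommendation from this thread. The inputs are assumed to be plain lists, for example taken from `adata.obs["n_genes_by_counts"]` and a crude cell-type annotation column.

```python
import math
from collections import defaultdict
from statistics import median


def mad_outlier_flags(values, labels, n_mads=5.0):
    """Flag values more than n_mads MADs from their group's median (log1p scale).

    values: per-cell gene counts (e.g. n_genes_by_counts)
    labels: per-cell crude cell-type labels, same length as values
    n_mads: hypothetical cutoff; tune it by looking at your own data
    """
    # Collect the cell indices belonging to each cell type.
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)

    flags = [False] * len(values)
    for indices in groups.values():
        logged = [math.log1p(values[i]) for i in indices]
        center = median(logged)
        mad = median(abs(x - center) for x in logged)
        if mad == 0:
            # No spread to measure against; leave this group unflagged.
            continue
        for i, x in zip(indices, logged):
            flags[i] = abs(x - center) > n_mads * mad
    return flags


n_genes = [3000, 3200, 3100, 2900, 3050, 12000, 800, 820, 790, 810, 805, 795]
cell_types = ["neuron"] * 6 + ["oligo"] * 6
flags = mad_outlier_flags(n_genes, cell_types)
```

Because the median and MAD are computed within each cell type, a cell type that legitimately expresses many genes is not penalized for sitting above the global bulk.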
ATpoint Yes, but generally speaking, would cells that express, for example, 8,000 genes strike you as unusually high based on your experience, particularly in the context of the brain? I already have cell annotations and have reviewed several plots related to this issue, but with my current knowledge I am unable to determine whether a group of cells exhibits abnormal gene expression. Here is a plot showing n_genes_by_counts against the total counts for the cells. As you can see, many are above 8,000, so I can't exactly consider them outliers.
The typical worry for "expressing too many genes" is that the cell may be a doublet. You are generally better off using an actual doublet-finding program than hard-thresholding on the number of features. You could do something like plotting the fraction of barcodes expressing mutually exclusive lineage markers (like MOG|MAG|MOBP + AQP4|GFAP|ALDH1L1) as a function of genes-by-counts, but this is basically what doublet algorithms do anyway.
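The marker co-expression check described above could be sketched roughly as follows. This assumes you have already derived two per-barcode booleans from the count matrix: whether any oligodendrocyte marker (MOG, MAG, MOBP) has at least one count, and whether any astrocyte marker (AQP4, GFAP, ALDH1L1) does. The function name, argument names, and the bin width of 2,000 genes are hypothetical choices for illustration.

```python
from collections import defaultdict


def coexpression_fraction_by_bin(n_genes, has_oligo, has_astro, bin_width=2000):
    """Fraction of barcodes expressing both marker sets, per genes-by-counts bin.

    Returns a dict mapping each bin's lower edge to the fraction of barcodes
    in that bin that are positive for both lineages.
    """
    totals = defaultdict(int)
    both = defaultdict(int)
    for n, oligo, astro in zip(n_genes, has_oligo, has_astro):
        lower = (n // bin_width) * bin_width  # lower edge of this barcode's bin
        totals[lower] += 1
        both[lower] += bool(oligo and astro)
    return {lower: both[lower] / totals[lower] for lower in sorted(totals)}


n_genes = [1500, 1800, 2500, 3900, 8200, 9100, 9500, 8800]
has_oligo = [True, False, True, False, True, True, False, True]
has_astro = [False, True, False, False, True, True, True, False]
fractions = coexpression_fraction_by_bin(n_genes, has_oligo, has_astro)
```

If the fraction rises sharply in the highest bins, that would suggest the high-gene-count barcodes are enriched for doublets; a flat profile would suggest they are not.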
LChart I use Scrublet for doublet detection. This plot shows the data after applying Scrublet. Specifically, I removed all cells with a doublet score above 0.25, which is a very strict threshold (typically, a threshold of 0.4 is used, based on what I have observed).
Unless you have a specific concern, you can likely go ahead and proceed with the analysis. You can generate diagnostic plots of where the different coverage deciles (or genes_by_counts deciles) fall in PC space or UMAP space to see whether you likely need to perform additional filtering.
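One way to prepare the decile diagnostic is to assign each cell a decile label from its gene count and then color the embedding by that label. The sketch below covers only the labeling step with the standard library; the function name `decile_labels` is hypothetical. Storing the result as a categorical column in `adata.obs` and passing that column name to the `color` argument of `sc.pl.umap` or `sc.pl.pca` is assumed as the follow-up step.

```python
from bisect import bisect_left
from statistics import quantiles


def decile_labels(values):
    """Return a decile index from 1 (lowest) to 10 (highest) for each value."""
    # Nine cut points that split the observed values into ten groups.
    cuts = quantiles(values, n=10, method="inclusive")
    # bisect_left counts how many cut points lie strictly below the value.
    return [bisect_left(cuts, v) + 1 for v in values]


n_genes = list(range(1, 101))  # toy stand-in for n_genes_by_counts
labels = decile_labels(n_genes)
```

If the top decile forms its own island in UMAP space rather than spreading across the annotated cell types, that is a hint that further filtering may be worthwhile.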