Question

single-cell zeros handling, imputation and filtering

0

11 weeks ago

frarodmar17 • 0

I want to analyse single-cell RNA-seq data, and I do not know whether I should filter genes based on zero counts within each experimental condition. That seems difficult, because single-cell data can be analysed from many perspectives, and filtering per condition could discard information needed later to analyse other metadata. I am also unsure whether it is necessary to impute data. I cannot find clear guidance on gene filtering and data imputation for single-cell datasets. In my case, the data are an integration of several datasets that had each been filtered individually beforehand.

zeros single-cell • 794 views

ADD COMMENT • link updated 11 weeks ago by jared.andrews07 ★ 18k • written 11 weeks ago by frarodmar17 • 0

score 2 · Answer 1 · 2025-01-15

2

11 weeks ago

jared.andrews07 ★ 18k

I have found single gene imputation to be pretty unreliable. Multiple benchmarking studies tend to conclude the same.

From the first:

In addition, we found that while some imputation methods improve detecting differentially expressed genes or discovering marker genes, they also can introduce false positive signals, sometimes driven by imbalanced cell numbers between groups (e.g., Additional file 1: Figure S2i-j). The magnitude (i.e., effect size) of differential expression (i.e., log-fold change) plays a role in the performance of the imputation methods. Most imputation methods strengthen large effect sizes compared to no imputation. However, if the original expression difference is small, then most imputation methods may smooth away the small differential signal and hence do not show clear advantage over not imputing (Fig. 3j, k).

From the second:

The results revealed that the performance of different methods varied across different datasets, suggesting that imputation may have dataset specificity. In particular, based on the experiments evaluating downstream analysis, real datasets were barely improved by most imputation methods.

Imputation/smoothing based on nearest neighbors on gene signatures for a population feels less gross and likely more reliable. I'd avoid this rabbit hole unless you have a strong reason to go down it.

Generally, it's fine to remove genes with no expression across all cells.
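As a minimal sketch of that filter, assuming a dense genes x cells count matrix held in NumPy (the function name and the toy values are hypothetical, not from any particular package):

```python
import numpy as np

def drop_unexpressed_genes(counts, gene_names):
    """Remove genes (rows) that have a zero count in every cell (column)."""
    counts = np.asarray(counts)
    # True for genes detected (count > 0) in at least one cell
    keep = (counts > 0).any(axis=1)
    kept_names = [g for g, k in zip(gene_names, keep) if k]
    return counts[keep], kept_names

# Toy genes x cells matrix (hypothetical values)
counts = np.array([
    [0, 0, 0, 0],  # geneA: never detected, will be dropped
    [1, 0, 3, 0],  # geneB
    [0, 0, 0, 2],  # geneC
])
filtered, kept = drop_unexpressed_genes(counts, ["geneA", "geneB", "geneC"])
print(kept)
```

Real single-cell matrices are usually sparse, so the same logic would need a sparse-aware row count rather than a dense array.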

ADD COMMENT • link 11 weeks ago by jared.andrews07 ★ 18k

0

Thank you very much for your answer, Jared; I understand better now. I also wanted to ask whether you would only remove genes that show no expression across all cells (or across all cells in every experimental condition), or whether you would choose a threshold. For example: keeping all genes that are expressed in at least 1% of the cells of each experimental condition.
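To make the per-condition threshold concrete, here is a rough sketch, assuming a dense genes x cells matrix and one condition label per cell. The function name and the `require` switch are hypothetical, added only to contrast "at least the threshold in every condition" with "in any one condition":

```python
import numpy as np

def genes_passing_per_condition(counts, conditions, min_frac=0.01, require="all"):
    """Boolean mask of genes detected in at least `min_frac` of the cells
    of every condition (require="all") or of any condition (require="any")."""
    counts = np.asarray(counts)
    conditions = np.asarray(conditions)
    detected = counts > 0
    passes = []
    for cond in np.unique(conditions):
        in_cond = conditions == cond
        # Fraction of this condition's cells in which each gene is detected
        frac = detected[:, in_cond].mean(axis=1)
        passes.append(frac >= min_frac)
    passes = np.vstack(passes)
    return passes.all(axis=0) if require == "all" else passes.any(axis=0)

# Toy data: 3 genes x 4 cells, two conditions (hypothetical values).
counts = np.array([
    [1, 1, 1, 0],  # detected in both conditions
    [0, 0, 2, 2],  # detected only in "treat"
    [0, 0, 0, 0],  # never detected
])
conditions = ["ctrl", "ctrl", "treat", "treat"]
mask_all = genes_passing_per_condition(counts, conditions, min_frac=0.5, require="all")
mask_any = genes_passing_per_condition(counts, conditions, min_frac=0.5, require="any")
print(mask_all.tolist(), mask_any.tolist())
```

Note that requiring the threshold in every condition removes genes that are detected in only one condition, which is the kind of information loss raised in the question.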

ADD REPLY • link 11 weeks ago by frarodmar17 • 0

0

I'm usually pretty conservative and keep all genes expressed in more than like 20 cells across the dataset. It's pretty arbitrary.
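A minimal sketch of that kind of dataset-wide cutoff, assuming a dense genes x cells count matrix; the function name, the strict "more than" comparison, and the toy matrix are illustrative assumptions:

```python
import numpy as np

def keep_genes_by_cell_count(counts, min_cells=20):
    """Boolean mask of genes detected (count > 0) in more than `min_cells` cells."""
    counts = np.asarray(counts)
    n_cells_detected = (counts > 0).sum(axis=1)
    return n_cells_detected > min_cells

# Toy matrix: 4 genes x 100 cells (hypothetical values)
n_cells = 100
counts = np.zeros((4, n_cells), dtype=int)
counts[0, :50] = 1  # detected in 50 cells
counts[1, :21] = 1  # detected in 21 cells
counts[2, :20] = 1  # detected in exactly 20 cells, so not "more than 20"
counts[3, :5] = 1   # detected in 5 cells
keep = keep_genes_by_cell_count(counts, min_cells=20)
print(keep.tolist())
```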

ADD REPLY • link 11 weeks ago by jared.andrews07 ★ 18k

0

That is around 5-10% of the dataset, no?

ADD REPLY • link 11 weeks ago by frarodmar17 • 0

0

Depends on the dataset. I usually still have ~20k+ genes after filtering.

ADD REPLY • link 11 weeks ago by jared.andrews07 ★ 18k

0

In my case, I have a dataset that results from integrating different datasets, where the cells of each dataset were previously filtered individually by nfeatures, ncounts and percentage of mitochondrial DNA (with different cutoffs for these parameters in each dataset). The final count matrix (after data integration) contains around 17k genes. What would you recommend?

ADD REPLY • link 11 weeks ago by frarodmar17 • 0

0

Again, it's pretty arbitrary, but a gene found in only ~10 cells is unlikely to be informative if you don't have some ultra-rare population of interest or something.

Pick a conservative threshold and move on; this isn't going to make or break your analysis.

ADD REPLY • link 11 weeks ago by jared.andrews07 ★ 18k

score 1 · Answer 2 · 2025-01-15

Imputation is specific to the gene expression distribution: it is a way to "average" existing gene expression to smooth over the sparsity of single-cell sequencing data.

A tool like MAGIC is based on cell neighborhoods (it diffuses expression across a nearest-neighbor graph of cells) rather than on clustering as such. If cellA has 0 expression for gene1, the tool averages gene1 expression over the surrounding cells (B, C, D...) to impute a new value for gene1 in cellA.

If gene1 does not have any expression in the surrounding cells either, it is impossible to impute a new value for gene1 in cellA unless you increase the number of neighbors.
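The neighbor-averaging idea can be illustrated with a toy sketch. This is not MAGIC itself, only plain k-nearest-neighbor averaging on a small dense cells x genes matrix; the function name, the Euclidean distance, and the values are all assumptions made for illustration:

```python
import numpy as np

def knn_smooth(expr, k=2):
    """Replace each cell's profile with the mean over itself and its k nearest
    cells (Euclidean distance). `expr` is a dense cells x genes matrix."""
    expr = np.asarray(expr, dtype=float)
    # Pairwise squared Euclidean distances between cells
    diff = expr[:, None, :] - expr[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Indices of the k + 1 closest cells (each cell is its own closest neighbor)
    nn = np.argsort(dist, axis=1)[:, :k + 1]
    # Average the neighbors' profiles: result is cells x genes
    return expr[nn].mean(axis=1)

# Toy data: rows are cells A-D, columns are gene1, gene2, gene3 (hypothetical)
expr = np.array([
    [0, 0, 5],   # cellA: gene1 not detected
    [4, 0, 5],   # cellB
    [6, 0, 5],   # cellC
    [0, 0, 50],  # cellD: far from the others
])
smoothed = knn_smooth(expr, k=2)
print(smoothed[0])
```

In this toy case cellA's gene1 value becomes the average over cells A, B and C, while gene2 stays at zero because none of the neighbors express it, which matches the limitation described above.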

I would say you can remove genes with no expression in ALL cells before running your imputation.

Here, some more reading