I want to analyse single-cell RNA-seq data, and I do not know whether I should filter genes based on zero counts in every single experimental condition, or whether it is necessary to impute data. Filtering per condition seems difficult to me, because single-cell data can be analysed from many different perspectives, and if you later want to analyse other metadata you could be losing information. I cannot find clear instructions on gene filtering and data imputation for single-cell datasets. In my case, my data consist of integrated datasets that had previously been filtered individually.
Thank you very much for your answer Jared, I understand better now. I also wanted to ask whether you would only remove genes that show no expression across all cells (or across all cells in every experimental condition), or whether you would choose a threshold. For example: keeping all genes that are expressed in at least 1% of the cells of each experimental condition.
I'm usually pretty conservative and keep all genes expressed in more than like 20 cells across the dataset. It's pretty arbitrary.
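As a rough illustration of that kind of dataset-wide cutoff, here is a minimal NumPy sketch. It assumes a dense cells x genes count matrix; the function name `genes_detected_in_min_cells` and the toy matrix are made up for the example, and the default of 20 is just the arbitrary value mentioned above.

```python
import numpy as np

def genes_detected_in_min_cells(counts, min_cells=20):
    """Boolean mask over genes (columns) detected in more than `min_cells` cells.

    `counts` is assumed to be a cells x genes matrix of raw counts.
    """
    counts = np.asarray(counts)
    # For each gene, count the cells (rows) with a non-zero count.
    cells_per_gene = (counts > 0).sum(axis=0)
    return cells_per_gene > min_cells

# Toy data: 100 cells, 3 genes (hypothetical values for illustration only).
counts = np.zeros((100, 3), dtype=int)
counts[:50, 0] = 1   # gene 0 detected in 50 cells
counts[:20, 1] = 3   # gene 1 detected in exactly 20 cells
                     # gene 2 detected in no cells

keep = genes_detected_in_min_cells(counts, min_cells=20)
filtered = counts[:, keep]   # only gene 0 survives the cutoff
```

Toolkits such as Scanpy expose a similar option (`sc.pp.filter_genes` with `min_cells`), so in practice you would probably use that rather than rolling your own.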
That is around 5-10% of the dataset, no?
Depends on the dataset. I usually still have ~20k+ genes after filtering.
In my case, I have a dataset that is the result of integrating different datasets, where cells were previously filtered individually in terms of nfeatures, ncounts and percentage of mitochondrial DNA (different values of these parameters were used in each dataset). The final count matrix (after data integration) contains around 17k genes. What would you recommend?
Again, it's pretty arbitrary, but a gene found in only ~10 cells is unlikely to be informative if you don't have some ultra-rare population of interest or something.
Pick a conservative threshold and move on, this isn't going to make or break your analysis.
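For comparison, the per-condition rule asked about earlier (at least 1% of cells in each experimental condition) could be sketched like this. It assumes a dense cells x genes count matrix and one condition label per cell; the function name `genes_passing_per_condition` and the toy data are hypothetical.

```python
import numpy as np

def genes_passing_per_condition(counts, conditions, min_frac=0.01):
    """Boolean mask over genes detected in at least `min_frac` of the cells
    of EVERY condition.

    `counts` is assumed to be cells x genes; `conditions` has one label per cell.
    """
    counts = np.asarray(counts)
    conditions = np.asarray(conditions)
    detected = counts > 0
    keep = np.ones(counts.shape[1], dtype=bool)
    for cond in np.unique(conditions):
        in_cond = conditions == cond
        # Fraction of this condition's cells in which each gene is detected.
        frac = detected[in_cond].mean(axis=0)
        keep &= frac >= min_frac
    return keep

# Toy data: 100 control and 100 treated cells, 3 genes (illustrative only).
conditions = np.array(["ctrl"] * 100 + ["treated"] * 100)
counts = np.zeros((200, 3), dtype=int)
counts[:5, 0] = 1       # gene 0: 5 ctrl cells ...
counts[100:105, 0] = 1  # ... and 5 treated cells
counts[:10, 1] = 2      # gene 1: 10 ctrl cells, no treated cells
counts[:2, 2] = 1       # gene 2: 2 ctrl cells ...
counts[100:102, 2] = 1  # ... and 2 treated cells

keep = genes_passing_per_condition(counts, conditions, min_frac=0.01)
```

Note that gene 1 is dropped here even though it is detected in 10% of the control cells. Requiring every condition to pass removes condition-specific genes, which is the information loss the question worries about, and is one reason a single dataset-wide cutoff is the more conservative choice.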