Prior to analyzing single-cell RNA-seq datasets, I typically employ a pipeline where I look at QC metrics, apply some hard filter, cluster the data, then re-examine the QC metrics and update the hard filter.
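To make that concrete, here's roughly what the first pass looks like for me as a scanpy sketch (the file name and the thresholds are placeholders I've made up, not recommendations):

```python
import scanpy as sc

adata = sc.read_10x_h5("sample_filtered_feature_bc_matrix.h5")  # hypothetical input file

# Step 1: compute QC metrics (mitochondrial genes flagged by the "MT-" prefix)
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)

# Step 2: apply an initial hard filter (placeholder cutoffs)
adata = adata[(adata.obs["n_genes_by_counts"] > 200) & (adata.obs["pct_counts_mt"] < 15)].copy()

# Step 3: standard preprocessing and clustering
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.leiden(adata)

# Step 4: re-examine QC metrics per cluster, then go back and tighten the hard filter
sc.pl.violin(adata, ["n_genes_by_counts", "pct_counts_mt"], groupby="leiden")
```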
I'm now wondering whether I should just throw out an entire crappy cluster rather than updating the filter, as making the filter more stringent will throw out cells from the "good" clusters.
Which got me thinking... should I even be applying hard filters if computational speed (and personal time) is not limiting? Why not just iteratively cluster and cull?
Curious to know what other people's thoughts and strategies are :) Thanks in advance!
I agree with Ram's points. Computational efficiency is not the reason for filtering; data quality is. To discuss your strategy further, could you define "crappy cluster" and elaborate on your criteria for calling it that?
Given your description, you should only discard a cluster if 100% of its cells fail your hard filter. Otherwise, you are removing a group of cells that differs from the remaining ones and losing biological meaning. Depending on the clustering algorithm you choose, removing a cluster that still contains good-quality cells might exaggerate the differences between the remaining cells.
Also, for the standard QC filter, you want to remove (1) doublets/triplets and empty droplets and (2) potentially cells with high mitochondrial/ribosomal reads. I am unsure why you would need to adjust the filter for (1), because those barcodes should simply be removed. In the case of (2), you might have a group of cells with high mitochondrial/ribosomal reads because those cells are under stress or dying, which may itself be biologically relevant to your experiment.
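To illustrate (1) and (2), assuming a scanpy-style workflow where `adata` holds raw counts for called cells (empty droplets are usually handled upstream by the cell caller or a tool like EmptyDrops), a rough sketch might look like this:

```python
import scanpy as sc

# (1) Doublets: scanpy's Scrublet wrapper scores each barcode and flags predicted doublets.
# Sketch only -- assumes `adata` contains raw counts for called cells.
sc.external.pp.scrublet(adata)
adata = adata[~adata.obs["predicted_doublet"]].copy()

# (2) Mitochondrial content: compute the metric, then decide whether a cutoff is
# appropriate for your question.
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)
# e.g. keep the metric for inspection, or apply a cutoff such as:
# adata = adata[adata.obs["pct_counts_mt"] < 20].copy()
```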
Thanks - I really appreciate the reply. I realize that we set the hard filter for quality as well. In my usual workflow, though, the "hard" filter is not fully determined a priori - I only set thresholds after looking at the distributions, and I do this for all QC metrics, particularly the number of features per cell and % mitochondrial reads. That's why I'm wondering whether it makes more sense to take a completely data-driven approach and let the poor-QC clusters declare themselves.
To elaborate - what I've started testing out is performing no QC filtering up front and then pre-processing/clustering the data. When I plot total counts/barcode against number of features/barcode, there already appear to be two clusters. I think I can see a population that should be filtered out, but it's hard to do with simple hard thresholds.
I find that cells with low numbers of features or high % mitochondrial reads largely end up clustering together when I run a simple clustering workflow without pre-filtering. I do see some overlapping tails between clusters, but biologically I think that makes sense. However, I'm not sure whether people agree, which is why I decided to post here :)
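For example, after clustering the unfiltered matrix I just tabulate the QC metrics per cluster and look for clusters that stand out (the sketch below assumes `adata.obs` already contains the Leiden labels and the usual scanpy QC columns):

```python
# Per-cluster QC summary on data clustered without any pre-filtering.
qc_by_cluster = (
    adata.obs
    .groupby("leiden")[["total_counts", "n_genes_by_counts", "pct_counts_mt"]]
    .median()
    .sort_values("pct_counts_mt", ascending=False)
)
print(qc_by_cluster)

# Clusters with uniformly low feature counts and high %mt are candidates to drop wholesale.
```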
Fundamentally, it comes down to the idea that single-cell experiments profile heterogeneous populations of cells. Why would "poor quality"/dead cells need to follow hard thresholds across cell types? It makes sense that there might be overlap. Since we use the data to guide our thresholding decisions anyway, there appears to be no "absolute truth" to the threshold that separates a good cell from a bad one. So why not let the data decide for us?
What do you think?
I apologize for the delayed response. I understand where you're coming from. While I lean towards consistently implementing pre-filtering, I'm open to discussing alternatives.
In my view, no preprocessing method is entirely foolproof. However, I believe that basic quality control (QC) is essential, and the steps chosen should align with the specific research question at hand. Generally, I'd advocate for the removal of empty droplets and potential doublets or triplets. Whether or not to filter by the percentage of mitochondrial/ribosomal reads largely depends on the nature of the inquiry. Firm filters for each sequencing run, especially the upper limits on nFeature and nCount, should be set based on the observed data distributions.
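As one example of a distribution-driven rule rather than fixed cutoffs, something like a MAD-based outlier filter could be used (a sketch only; the column names come from scanpy's QC metrics and the MAD multipliers are arbitrary illustrations):

```python
import numpy as np
from scipy.stats import median_abs_deviation

def mad_outlier(values, n_mads=3.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    values = np.asarray(values)
    med = np.median(values)
    mad = median_abs_deviation(values)
    return np.abs(values - med) > n_mads * mad

# Flag cells whose library size or detected-feature count is extreme,
# or whose mitochondrial fraction is unusually high.
outlier = (
    mad_outlier(np.log1p(adata.obs["total_counts"]), 5)
    | mad_outlier(np.log1p(adata.obs["n_genes_by_counts"]), 5)
    | mad_outlier(adata.obs["pct_counts_mt"], 3)
)
adata = adata[~outlier].copy()
```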
Given the intricacies of single-cell data, it's crucial to adopt logical steps to prepare it for subsequent analysis. Consider the following hypothetical scenario:
Imagine a set of highly compromised cells, termed Group A, characterized by elevated mitochondrial reads and diminished total reads. Contrast this with a set of healthy cells, Group B, that exhibit low mitochondrial reads and high total reads. Then, there's Group C, cells reacting to a specific treatment, also showing low mitochondrial reads and high total reads. Let's assume the existence of a theoretical gene, Gene X, consistently expressed in viable cells. Without filtering out the dying cells, once expression is normalized by total reads, Gene X could erroneously appear among the highly variable genes, skewing the analysis and misrepresenting the biology.
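A toy calculation (with invented numbers) shows how this could happen:

```python
import numpy as np

# Invented numbers purely to illustrate the Gene X scenario above.
# Group A: dying cells (high %mt, low total reads); Groups B/C: healthy/treated.
#                        A1    A2     B1     B2     C1     C2
total_counts = np.array([1000, 1200, 10000, 10500,  9800, 10200], dtype=float)
mito_counts  = np.array([ 600,  750,   500,   520,   490,   510], dtype=float)
gene_x       = np.array([  20,   18,   500,   530,   490,   515], dtype=float)

# Library-size normalization to 10,000 counts per cell.
gene_x_norm = gene_x / total_counts * 1e4
print(gene_x_norm.round(0))   # ~[200, 150, 500, 505, 500, 505]

# Gene X is stable in viable cells, but because Group A's libraries are small and
# dominated by mitochondrial reads, its normalized expression looks much lower there.
# Pooled together, that spread can push Gene X into the highly variable gene set
# for reasons that have nothing to do with biology.
```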
In essence, my argument emphasizes the necessity of preprocessing. Raw data can be overwhelmingly chaotic. By judiciously refining raw data, we pave the way for downstream analysis tools to function as they were designed to.
AFAIK, clusters group cells by expression similarity, not technical quality. If a cluster were to contain mostly necrotic cells, you may be able to discard that cluster, but understand that you're doing it for biological reasons.
Throwing out cells from "good" clusters = ignoring possibly insignificant GEMs from those clusters, which is not a bad thing. Don't start throwing out whole clusters unless you have an idea of why the cluster clustered together.
Thanks - totally agree on all these points. I was thinking of re-examining cluster-level QC metrics to determine whether to throw clusters out. And yes - this assumes that there are biological features that correlate with barcode quality, so that if I have a few cells with "good" metrics that cluster together with barcodes with obviously bad metrics, those cells are likely bad too. Similarly, borderline barcodes that cluster with "good QC" barcodes may be biologically "good" cells that, for whatever reason, don't quite meet the thresholds I've set.
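Conceptually, something like this (a hypothetical sketch; the per-cluster cutoffs are placeholders I'd tune by eye):

```python
# "Guilt by association": judge barcodes by the QC profile of the cluster they
# land in, rather than only by their own hard-threshold status.
per_cluster_mt = adata.obs.groupby("leiden")["pct_counts_mt"].median()
per_cluster_genes = adata.obs.groupby("leiden")["n_genes_by_counts"].median()

# Flag whole clusters whose typical cell looks low quality (placeholder cutoffs).
bad = (per_cluster_mt > 25) | (per_cluster_genes < 300)
bad_clusters = bad[bad].index

# Cells in bad clusters are dropped even if they individually pass the hard filter;
# borderline cells in good clusters are kept.
adata = adata[~adata.obs["leiden"].isin(bad_clusters)].copy()
```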
Does this seem reasonable?
I'll wait for others to chime in, but your approach seems a little restrictive to me.