Is there a more definitive way to decide on QC cutoffs for scRNA-seq? I understand low read counts (incomplete capture of the cell) and high read counts (potential doublet/multiplet) are an issue, but I've seen a lot of variation in the thresholds used across the different learning materials and in the supplementary sections of papers.
For example, see the violin plots below showing total feature counts for each sample in this study. While this screenshot was grabbed from my own analysis, the methods section of the paper that produced the raw data implied they removed all cells with feature counts < 1000 - but that would discard what looks like a very large portion of the cells in this study (and is higher than the thresholds I generally see). They also set a cutoff of 20% mitochondrial genes, which seems quite high, although this disease is known to reduce mitophagy, so that could be a rationalization for it.
I've looked through the other types of plots (scatter, etc.) and can't really see a definitive trend, so this seems rather arbitrary - am I missing something? Or is there a better visualization method I should be using to determine cutoffs? I realize I could choose ± X standard deviations as a cutoff, but that also seems somewhat arbitrary.
Thanks for any help!
In general I am against the use of hard cutoffs. It is rather the overall distribution, across all cells and within each sample, that should dictate the thresholds. 5% of reads aligning to mitochondrial genes is often used, but cell-type heterogeneity may require a higher cutoff, as you say. The same goes for cutoffs on total read (or UMI) count and on detected features. This plot is indeed heterogeneous, and I wonder why that is. The simplest explanation could be that the samples had highly uneven sequencing depth and some samples simply have large dropouts. It would be necessary to get some details on how you processed these data, though. Is this 10X data? I generally like to perform QC (and thresholding) on each sample separately, and here the data really do look heterogeneous.
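One distribution-driven alternative to hard cutoffs is to flag outliers per sample with the median absolute deviation (MAD), similar in spirit to the MAD-based QC flagging in packages like scater. A minimal numpy sketch (the `n_mads = 3` default and the log transform are conventional choices, not anything mandated by the data):

```python
import numpy as np

def mad_outlier_bounds(values, n_mads=3.0, log=True):
    """Return (lower, upper) bounds at median +/- n_mads * MAD.

    With log=True the bounds are computed on log1p-transformed
    values (sensible for counts) and mapped back to the raw scale.
    """
    x = np.log1p(values) if log else np.asarray(values, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    lo, hi = med - n_mads * mad, med + n_mads * mad
    if log:
        lo, hi = np.expm1(lo), np.expm1(hi)
    return lo, hi

def per_sample_keep_mask(counts, sample_ids, n_mads=3.0):
    """Boolean mask: True for cells that fall within the MAD bounds
    of their own sample's count distribution."""
    counts = np.asarray(counts, dtype=float)
    sample_ids = np.asarray(sample_ids)
    keep = np.zeros(counts.shape[0], dtype=bool)
    for s in np.unique(sample_ids):
        idx = sample_ids == s
        lo, hi = mad_outlier_bounds(counts[idx], n_mads=n_mads)
        keep[idx] = (counts[idx] >= lo) & (counts[idx] <= hi)
    return keep
```

Because the bounds are recomputed within each sample, a sample sequenced to lower depth keeps its own (lower) acceptable range instead of being wiped out by a global threshold.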
By the way, the image renders properly if you provide the full path to the image (incl. the suffix), e.g. by right-clicking the image on the hosting website, choosing "open in new tab", and then copying the link, something like this. I edited the link here.
What is shown in the image is simply `VlnPlot` output after creating a Seurat object from the raw GEO data, with no QC or normalization applied yet.
This was from a 10X platform, but I cannot give any information on the pre-processing done by the original researchers. Here is the Gene Expression Omnibus link I fetched the data from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135893
I could also create a similar plot with total counts if that would be useful. All of the lung scRNA-seq data I have seen is much messier than standard PBMC datasets, etc. I attributed this to the experimental difficulty of obtaining biopsies and extracting cells from this particular tissue type. I also thought the high spread/heterogeneity could be due to the extremely diverse set of cell types in the lung and to intra-patient differences (but again, this is just my guess, as I'm more of an experimentalist than a bioinformatician).
Any additional advice?
edit: my general problem is that the various online resources all seem to pick an arbitrary threshold for QC (specifically for the minimum-features metric), whereas there seem to be more resources on mitochondrial QC, doublet detection, etc.
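For the minimum-features cutoff specifically, one data-driven option is to locate the "knee" of the barcode-rank curve (cells ranked by detected features), in the spirit of the knee-point filters used by CellRanger/EmptyDrops. A rough numpy sketch using the point farthest from the line joining the curve's endpoints (one of several possible knee definitions, chosen here only for simplicity):

```python
import numpy as np

def knee_threshold(features_per_cell):
    """Rank cells by detected features (descending) and find the knee
    of the log-log rank curve: the point farthest from the straight
    line joining the first- and last-ranked cells. Returns the feature
    count at the knee, usable as a data-driven lower cutoff."""
    y = np.log10(np.sort(np.asarray(features_per_cell, float))[::-1] + 1)
    x = np.log10(np.arange(1, y.size + 1) + 1)
    # unit vector along the line through the first and last points
    p1 = np.array([x[0], y[0]])
    p2 = np.array([x[-1], y[-1]])
    d = (p2 - p1) / np.linalg.norm(p2 - p1)
    # perpendicular distance of every point to that line
    vec = np.column_stack([x, y]) - p1
    dist = np.linalg.norm(vec - np.outer(vec @ d, d), axis=1)
    return float(10 ** y[int(np.argmax(dist))] - 1)
```

This at least replaces an eyeballed number with a reproducible rule, though the result should still be sanity-checked against the violin plots per sample.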