Hi, I am new to bioinformatics and I have a few silly questions about how to define thresholds for QC metrics in Seurat.
I noticed that Seurat filters cells that have unique feature counts over 2,500 or less than 200. And I have heard that we need to apply a different QC values for each dataset, by looking at the distribution of the number of genes and the distribution of UMIs.
Then, I guess, applying a different threshold for the maximum number of genes to each data means that "the high gene count" is defined relatively for each data. For example, nFeature_RNA < 2000
is a threshold for the max value in A dataset, but nFeature_RNA < 3000
is a threshold for the max value in B dataset (i.e 2000 is not considered as high in B data).
But, I don't understand why we need to set a different upper limit for the number of genes in each data. Is something biologically wrong with the high gene count?
As a non-biology major, I don't understand why we change the upper limit for nFeature (genes) and nCount (UMIs) for each data.
I know this is pretty silly.. but if anyone can explain about the biological concept behind the numbers, I would really appreciate it. Thank you very much.
Wow.. This explanation is just amazing... I clearly understand now.. Thank you so much for the details, Jared.
You're quite welcome. And this question isn't silly at all, particularly given your background. You'll find well-written, clear questions like yours here will often be answered quickly and thoroughly. And now others with the same question will easily be able to find an explanation - this is the entire point of Biostars.
As an aside, you may also find the Orchestrating Single Cell Analysis book by some of the Bioconductor folks very helpful, even if you don't use any Bioconductor packages. It has some very useful contextual info that helps explain many analysis steps like this. It's well-written and easy to understand with many code examples that you can run directly if you'd like.
beautifully explained
This was quite helpful. I was wondering does 2500 feature counts equate to genes expressed by a cell. And if so, I thought the maximum number of genes that can be expressed by a potential cell was much higher? ( this might be a dumb question)
This isn't a dumb question, and it's actually pretty important to understand. nFeature does equate to the number of genes detected in a given cell. The number of genes expressed likely is much higher, but one must keep in mind the relatively low number of reads per cell in 10X experiments (50k is a common target). This results in many more lowly expressed transcripts not getting sequenced at all just because of the stoichiometry involved. 10X scRNA-seq is quite sparse and discrepancies are compensated for by the high number of cells profiled, the idea being that it's no big deal if you don't capture a gene in one cell, as you'll (hopefully) catch it in another. Many genes only have 1-2 reads mapped to them for a given cell. Ultimately, you end up comparing clusters/populations of cells rather than individual cells to each other, so this is okay.
Other single cell platforms (e.g. SmartSeq2) tend to more faithfully capture the transcriptome of each cell, but the tradeoff is that they profile a much lower number of cells (a few hundred).
thank you!