Question

Annotating cell types by integrating a query dataset with a reference dataset and then clustering

3.3 years ago

Kaz • 0

I have a scRNA-seq dataset (Smart-seq2), below called the query dataset, in which I want to annotate the cell types. I have a good-quality reference dataset, also generated with Smart-seq2. I'm mainly using Seurat for my analysis. Seurat provides a cell type classification tool for this purpose (well-described tutorial: https://satijalab.org/seurat/articles/integration_mapping.html, and the article: https://www.cell.com/cell/fulltext/S0092-8674(19)30559-8). Seurat also offers a tool for integrating datasets and visualizing them in the same UMAP. The problem is that the query cell type labels assigned by the classification tool don't always make sense in the integrated UMAP (some classified query cells sit next to reference cells with different labels).

One thought I've had is to integrate the two datasets as above for the integrated UMAP, then run Seurat's clustering algorithm on the integrated data and classify the query cells based on which reference cell types they cluster with. The result makes much more sense visually on the integrated UMAP. But I have not been able to find any publication or similar that uses this as a cell type classification method, and I was wondering why. Is there a big mistake in doing this? In the integration algorithm the query dataset is modified together with the reference dataset, which does not happen in the cell type classification algorithm, but I'm thinking that the integration process also removes batch effects between the query and reference datasets, which could be preferable. But I'm also thinking that there must be a good reason why Seurat doesn't suggest this as an alternative for cell type classification.

Thanks for your input.

celltypeclassification scRNAseq • 2.3k views

SingleR should help with this: https://bioconductor.org/packages/devel/bioc/vignettes/SingleR/inst/doc/SingleR.html

3.3 years ago by GenoMax 147k

Thank you for this response, I appreciate the link. However, I was not really looking for an alternative cell type classification tool (I know that there are several out there); rather, I was interested in getting some input on why my second approach is not commonly used.

3.3 years ago by Kaz • 0

score 5 · Accepted Answer · 2021-08-10

Mostly because it's typically unnecessary given that reference-based classification should yield a similar result without being subjected to potential biases introduced during the integration process.

SingleR (and presumably Seurat, I don't know as I don't use it) uses a reference dataset and asks, "Which reference sample's expression profile are this cell's counts most similar to?" It then labels the cell accordingly, assuming the answer to that question is relatively clear; SingleR will refuse to label cells that are very ambiguously scored. In most cases with SingleR, you can ignore potential batch effects since the comparison is entirely relative, provided you're willing to assume that any such effect applies to all cell types similarly. In Seurat's method, it's not really clear to me whether the integrated (adjusted) counts are being used, but I'm assuming so.
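
To make the "most similar reference profile" idea concrete, here is a minimal sketch in plain Python. It is a simplified stand-in, not SingleR's or Seurat's actual implementation: it assumes one averaged expression profile per reference label, uses a Spearman correlation as the similarity score, and refuses to label a cell when the top two scores are too close. The function names, the `min_margin` threshold, and the toy numbers are all hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def classify(cell, reference_profiles, min_margin=0.1):
    """Score one cell against every reference label (needs >= 2 labels).

    Returns (label, scores); label is None when the best and second-best
    scores differ by less than min_margin (hypothetical ambiguity rule).
    """
    scores = {label: spearman(cell, profile)
              for label, profile in reference_profiles.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    best, second = ordered[0], ordered[1]
    label = best if scores[best] - scores[second] >= min_margin else None
    return label, scores


# Toy data: five genes, two reference labels (made-up values).
reference_profiles = {"T cell": [9, 8, 1, 0, 2], "B cell": [1, 0, 9, 8, 2]}
clear_cell = [7, 9, 0, 1, 3]
ambiguous_cell = [5, 5, 5, 5, 5]

clear_label, clear_scores = classify(clear_cell, reference_profiles)
ambiguous_label, _ = classify(ambiguous_cell, reference_profiles)
```

Because the score is rank-based, a batch effect that shifts or scales all genes in a cell monotonically leaves the ranking, and therefore the label, unchanged, which is the "all relative" point made above.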

Your method is doing much the same in a more manual way, but it assumes that all clusters are homogeneous cell types, which may or may not be true. Additionally, integration methods have the rather unfortunate side effect of cramming populations together whether they're actually biologically similar or not; in my experience, they often need parameter tweaks to actually preserve unique populations. Reciprocal PCA methods (which Seurat supports) are generally more conservative and have a softer touch, so you could consider trying that if you feel this may be occurring.
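
The cluster-based labeling described in the question can be sketched as a majority vote, and reporting the purity of each cluster makes the homogeneity assumption visible. This is only an illustration in plain Python, assuming you have already exported one cluster ID per cell from the integrated object and a label for each reference cell (`None` for query cells); `label_by_cluster` and the toy data are hypothetical.

```python
from collections import Counter


def label_by_cluster(cluster_ids, ref_labels):
    """Assign each cluster the most common reference label found in it.

    cluster_ids: cluster ID for every cell (reference and query).
    ref_labels:  known label for reference cells, None for query cells.
    Returns {cluster: (majority_label, purity)}, where purity is the
    fraction of reference cells in the cluster carrying that label.
    Clusters with no reference cells get (None, 0.0).
    """
    votes = {}
    for cluster, label in zip(cluster_ids, ref_labels):
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1

    summary = {}
    for cluster in set(cluster_ids):
        counts = votes.get(cluster)
        if not counts:
            summary[cluster] = (None, 0.0)
            continue
        label, n = counts.most_common(1)[0]
        summary[cluster] = (label, n / sum(counts.values()))
    return summary


# Toy data: ten cells in three clusters (made-up values).
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
labels = ["T", "T", "T", None, "B", "NK", None, None, None, None]
summary = label_by_cluster(clusters, labels)
```

In this toy example cluster 0 is pure, cluster 1 is an even mix of two reference labels, and cluster 2 contains only query cells. Query cells in the last two clusters are exactly the ones where a cluster-level label would be misleading.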

I can't speak to why your dataset may not be performing well with Seurat's methods, though the main concern off the top of my head is that your query dataset may contain cell types not found in the reference dataset. In such cases, those cells may still be labeled, just incorrectly. I don't know if you can get the full score matrix (each cell against each potential label) out of Seurat, but if so, a closer look at it could indicate which cell types are really causing issues in your data.
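
If you can export such a score matrix, one rough way to look at it is to ask, for each predicted label, what fraction of its cells received a low best score. The sketch below is plain Python; the input is an ordinary dictionary standing in for whatever per-cell, per-label scores your tool provides, and `flag_weak_labels`, the `min_score` cutoff, and the numbers are all hypothetical.

```python
def flag_weak_labels(score_matrix, min_score=0.5):
    """Return {label: fraction of cells assigned to it with a weak best score}.

    score_matrix: {cell_id: {label: score}}, higher score = better match.
    A cell is "weak" when even its best score is below min_score.
    """
    weak_and_total = {}  # label -> [weak cells, total cells]
    for scores in score_matrix.values():
        best = max(scores, key=scores.get)
        counts = weak_and_total.setdefault(best, [0, 0])
        counts[1] += 1
        if scores[best] < min_score:
            counts[0] += 1
    return {label: weak / total
            for label, (weak, total) in weak_and_total.items()}


# Toy data: four cells scored against two labels (made-up values).
score_matrix = {
    "c1": {"T": 0.90, "B": 0.10},
    "c2": {"T": 0.95, "B": 0.05},
    "c3": {"T": 0.40, "B": 0.45},
    "c4": {"T": 0.30, "B": 0.35},
}
weak_fraction = flag_weak_labels(score_matrix)
```

A label whose cells are mostly weakly scored is a candidate for a query population that has no real counterpart in the reference.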