Hi,
I'm analyzing a PBMC single-cell dataset with 3 diseased samples and 5 healthy samples. I've integrated the samples, subset out the clusters I'm not interested in, and am now trying to annotate the remaining clusters using a combination of Azimuth and known marker genes. After subsetting, I typically normalize, find variable genes and scale the data before reclustering. In this subset, there is one population of B cells that seems to come from a single sample (not one condition, just one sample). I've tried a few alternatives, such as not normalizing after subsetting and running SCTransform after subsetting, just to see whether any of them would keep that sample from clustering separately, but it keeps happening. This has made me question what the right way is to process integrated data after subsetting. I've looked through Seurat's GitHub page, and the general consensus seems to be to either run SCTransform or, at the least, normalize and scale after subsetting. So now I'm not sure whether what I'm seeing is a batch effect or real biology, i.e. that this B cell population really does come from that one sample alone. Thoughts?
Are the healthy samples represented in roughly even proportions in every cluster? Are the disease samples also evenly represented in every cluster (apart from the troublesome one)? Focusing on your disease samples, is there any metadata that could distinguish one disease sample from the other two (age, sex, date of sample collection...)? What are the markers driving this B cell population, and could they make sense in your disease context?
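In case it helps with the proportion check, here is a minimal sketch in plain Python, assuming you have exported one cluster label and one sample label per cell from the object's metadata. The function names, the 0.9 threshold and the toy labels are all hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def sample_fractions_by_cluster(clusters, samples):
    """Return {cluster: {sample: fraction of that cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, sample in zip(clusters, samples):
        counts[cluster][sample] += 1
    fractions = {}
    for cluster, sample_counts in counts.items():
        total = sum(sample_counts.values())
        fractions[cluster] = {s: n / total for s, n in sample_counts.items()}
    return fractions


def dominated_clusters(fractions, threshold=0.9):
    """Return {cluster: sample} for clusters where one sample reaches the threshold."""
    flagged = {}
    for cluster, by_sample in fractions.items():
        top_sample = max(by_sample, key=by_sample.get)
        if by_sample[top_sample] >= threshold:
            flagged[cluster] = top_sample
    return flagged


# Toy per-cell labels (hypothetical, not taken from the dataset in this thread).
clusters = ["B", "B", "B", "T", "T", "T", "T"]
samples = ["D1", "D1", "D1", "H1", "H2", "D1", "D2"]

fractions = sample_fractions_by_cluster(clusters, samples)
flagged = dominated_clusters(fractions)
print(fractions)
print(flagged)
```

A cluster flagged here is only a candidate for closer inspection; the table alone cannot tell a batch effect from real biology.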
Yes, I'm getting a reasonable distribution of healthy and diseased cells within all cell clusters (cell types) except for this one cluster, which is made up entirely of one sample. Looking back at the metadata I have, nothing about that sample jumps out as odd, although I still need to go back and consult the person who did the experiment. I'm not familiar with the cell types in PBMCs and I'm relatively new to this kind of data. Azimuth annotated this population as B cells but, based on discussions with my group, the markers are not specific to the disease.
What are the marker genes for this specific population? Do you have an immune cell specialist in your group who could help you annotate it? Could you check the expression of genes related to stress and hypoxia?
Yes, I will discuss that with the group. We do have people who understand the biology better. Thank you.
What do quality metrics look like for that cluster? Are they better/worse than the others?
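One way to make that comparison concrete is to look at per-cluster medians of each QC metric. Below is a minimal sketch in plain Python, assuming the per-cell metrics have been exported as a list of records; the field names (`cluster`, `n_features`, `percent_mt`), the cluster labels and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_qc_by_cluster(cells, metrics):
    """Return {cluster: {metric: median value over that cluster's cells}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for cell in cells:
        for metric in metrics:
            grouped[cell["cluster"]][metric].append(cell[metric])
    return {
        cluster: {metric: median(values) for metric, values in by_metric.items()}
        for cluster, by_metric in grouped.items()
    }


# Toy per-cell QC records (hypothetical values, for illustration only).
cells = [
    {"cluster": "B_odd", "n_features": 900, "percent_mt": 9.0},
    {"cluster": "B_odd", "n_features": 1100, "percent_mt": 11.0},
    {"cluster": "B_other", "n_features": 1800, "percent_mt": 4.0},
    {"cluster": "B_other", "n_features": 2000, "percent_mt": 5.0},
    {"cluster": "B_other", "n_features": 2200, "percent_mt": 6.0},
]

qc_medians = median_qc_by_cluster(cells, ["n_features", "percent_mt"])
for cluster, summary in qc_medians.items():
    print(cluster, summary)
```

If the sample-specific cluster shows clearly fewer detected features or a higher mitochondrial fraction than the other B cell clusters, that would point towards a quality issue, though it is not proof on its own.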
I did look at nFeature, nCount and mitochondrial content. I had QC'd each sample before integration. That sample had a higher cell count than all the other samples and slightly higher mitochondrial content, which I filtered on.