I recently received feedback that the clustering on my integrated dimension reduction plot looked problematic. Specifically, the concerns were the small peripheral clusters (a splash/star pattern?) and the number of distinct clusters.
My clusters were called using 40 PCs at a resolution of 0.6.
As for the number of clusters, TCR beta VDJ segment genes were identified as strongly conserved markers in several clusters. Is it worth excluding VDJ markers from the analysis?
Any comment on the appearance of the dim plot and implications would be appreciated. Thank you!
Is this an integrated dataset? Did you run the Seurat integration routine? Otherwise it is almost certain that much of the cluster separation is due to the batch effects between the samples and time points.
Thank you ATpoint. This is integrated data. I used SCTransform. Samples 1 and 2 were replicates of the same time point. I have included the code below.
library(Seurat)

split_seurat <- SplitObject(seurat_phase, split.by = "sample")
split_seurat <- split_seurat[c("samp1_rep1", "samp2_rep2", "samp3")]
# Run SCTransform on each sample, regressing out the listed covariates
for (i in seq_along(split_seurat)) {
  split_seurat[[i]] <- SCTransform(split_seurat[[i]], vars.to.regress = c("celldif", "mitoRatio"))
}
saveRDS(split_seurat, file = "split_seurat.rds")
## Second script ##
split_seurat <- readRDS("split_seurat.rds")
# Select the most variable features to use for integration
integ_features <- SelectIntegrationFeatures(object.list = split_seurat,
                                            nfeatures = 3000)
# Prepare the SCT list object for integration
split_seurat <- PrepSCTIntegration(object.list = split_seurat,
                                   anchor.features = integ_features)
# Find best buddies (anchors) - can take a while to run
integ_anchors <- FindIntegrationAnchors(object.list = split_seurat,
                                        normalization.method = "SCT",
                                        anchor.features = integ_features)
# Integrate across conditions
seurat_integrated <- IntegrateData(anchorset = integ_anchors,
                                   normalization.method = "SCT")
As for the appearance of the plot: I am paraphrasing the feedback, since I was confused by it. I think the expectation was fewer clusters, with less delineation and less separation between them. Also, cluster 7 has satellite clusters.
To me, the sporadic clustering is reminiscent of using clonotype edit distance for dimensional reduction. I would consider removing the TCR genes not from the anchoring, but from the RunUMAP() call. Below is an example of the problem I encountered when trying to convert TCR edit distance into an assay for a Seurat object.
You can do this with:
# Negated %in% (not part of base R, so it is defined here)
`%!in%` <- Negate(`%in%`)

quietTCRgenes <- function(sc) {
  # Anchored prefixes for TCR V/D/J segment genes. No trailing '*':
  # in a regex, "^TRDV*" would also match e.g. TRDN or TRDC.
  unwanted_genes <- "^TRBV|^TRBD|^TRBJ|^TRDV|^TRDD|^TRDJ|^TRAV|^TRAJ|^TRGV|^TRGJ"
  if (inherits(x = sc, what = "Seurat")) {
    # Drop matching genes from the variable features of the RNA assay
    unwanted_genes <- grep(pattern = unwanted_genes, x = sc[["RNA"]]@var.features, value = TRUE)
    sc[["RNA"]]@var.features <- sc[["RNA"]]@var.features[sc[["RNA"]]@var.features %!in% unwanted_genes]
  } else {
    # Bioconductor scran pipelines use a vector of variable genes for DR
    unwanted_genes <- grep(pattern = unwanted_genes, x = sc, value = TRUE)
    sc <- sc[sc %!in% unwanted_genes]
  }
  return(sc)
}
seuratObj <- quietTCRgenes(seuratObj)
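A quick way to sanity-check a gene-name pattern before using it for filtering is to run it against a few known symbols. The sketch below uses base R only; the gene vector is a made-up example, and `loose` and `strict` are illustrative names. The point it shows: in a regular expression, `V*` means "zero or more V", so a pattern such as `^TRDV*` matches any symbol that starts with TRD.

```r
# Hypothetical gene symbols: three TCR V/J segment genes plus look-alikes
genes <- c("TRBV20-1", "TRBJ2-1", "TRAV1-2", "TRBC1", "TRDN", "TRADD", "CD3E")

# With a trailing '*', the last letter of each prefix becomes optional
loose <- "TRBV*|^TRDV*|^TRAV*"
# Anchored prefixes without '*' require the V/D/J letter to be present
strict <- "^TRBV|^TRBD|^TRBJ|^TRDV|^TRDD|^TRDJ|^TRAV|^TRAJ|^TRGV|^TRGJ"

loose_hits <- grep(loose, genes, value = TRUE)
strict_hits <- grep(strict, genes, value = TRUE)

print(loose_hits)   # also picks up TRBC1, TRDN and TRADD
print(strict_hits)  # only the V/J segment genes
```

Whether the constant-region genes (TRAC, TRBC1, ...) should be removed as well is a separate choice; the strict pattern here leaves them in.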
Thank you! I arrived at the same conclusion. I ended up removing the TCR genes up front, before performing normalization/integration (code below). I also found this paper, which removed VDJ genes at the find-variable-genes step (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6689255/pdf/nihms-1531727.pdf). Any comment on the best step at which to do this filtering would be appreciated. Thanks!
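As a rough illustration of what filtering up front can look like (this is not the code from the post above; the count matrix, gene names, and variable names are all made up), the idea is to drop the matching rows from the gene-by-cell counts before any normalization is run:

```r
# Toy gene-by-cell count matrix (made-up values and gene names)
counts <- matrix(
  c(5, 0, 2,
    1, 3, 0,
    0, 4, 4,
    7, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("TRBV20-1", "TRAJ12", "CD3E", "GZMB"),
                  c("cell1", "cell2", "cell3"))
)

# Anchored prefixes for TCR V/D/J segment genes
tcr_pattern <- "^TR[ABDG]V|^TR[BD]D|^TR[ABDG]J"

# Keep every row (gene) whose name does not match the pattern
keep <- !grepl(tcr_pattern, rownames(counts))
filtered <- counts[keep, , drop = FALSE]

print(rownames(filtered))
```

With a Seurat object, the same list of retained gene names could presumably be passed to `subset()` through its `features` argument before running SCTransform; that step is an assumption here and is not shown.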
Oh interesting, so instead of just removing the genes from the variable gene list in Seurat, you removed the V genes from your counts. It accomplishes the same goal, and the UMAP you have looks good. I am wondering what the effect on the integration step would be, since the V/D/J genes are generally heavily represented among the variable genes used for integration in T cell single-cell datasets.
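One way to get a feel for how heavily V/D/J genes are represented is to count them in the feature list. A minimal base R sketch, assuming a character vector of gene symbols such as the one returned by SelectIntegrationFeatures(); the `features` vector below is a hypothetical stand-in:

```r
# Stand-in for a vector of integration features
# (hypothetical symbols, for illustration only)
features <- c("TRBV20-1", "TRBV7-9", "TRAV1-2", "CD8A", "GZMK",
              "IL7R", "TRAJ12", "CCR7", "NKG7", "TRBC1")

tcr_pattern <- "^TR[ABDG]V|^TR[BD]D|^TR[ABDG]J"
is_vdj <- grepl(tcr_pattern, features)

n_vdj <- sum(is_vdj)       # number of V/D/J segment genes
share_vdj <- mean(is_vdj)  # their share of the feature list
print(c(n = n_vdj, share = share_vdj))
```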
In my experience with BCR data, I found that removing the VDJ genes from the variable gene list smoothed the UMAP, but did not prevent the clonal groups from clustering together. At the time, it made me think that there is a high degree of overlap in feature space between members within a single clonotype. But I am not sure there is much work on that in the field. The only example that comes to mind is CoNGA, which uses both expression and clonotype for embedding.
What does that mean, please elaborate?