I have been asked to analyze a single-cell dataset with the following design:
Mouse; WT vs Fib4 mutant; 4 samples (2 + 2), one male and one female per genotype.
2 sequencing runs with 10x, superloading each run with 2 biological samples labelled with a hashtag antibody:
- run #1:
  - sample 1: Male; Fib4 mutant
  - sample 2: Female; WT
- run #2:
  - sample 3: Female; Fib4 mutant
  - sample 4: Male; WT
Unfortunately, the hashtag antibody labelling turned out not to have worked well once the data were sequenced (it looked fine in the wet lab pre-sequencing, though). I've been left with 2 "samples" (12k + 6k cells), each combining cells of both sexes and both genotypes. I am trying to salvage what I can from the analysis.
As each run was composed of one male and one female sample, I used the expression of sex-specific genes (F: Xist; M: Ddx3y, Eif2s3y, Kdm5d, Uty) to assign cells to their sample of origin. Using the raw counts, I called a cell female or male if it had >0 reads for that sex's genes only; cells with reads for both gene sets, or for neither, were left unclassified. This gave 54% sex-classified cells for run #1 and 69% for run #2:
| run | both | F | M | none |
|-----|------|------|------|------|
| #1 | 130 | 2434 | 4187 | 5387 |
| #2 | 95 | 3273 | 954 | 1796 |
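A minimal sketch of the sex call described above, written in base R on a toy genes-by-cells count matrix. The matrix values, the cell names and the helper `classify_sex()` are made up for illustration; with a Seurat object the raw counts would have to be pulled out first (and `Matrix::colSums` used if they are stored as a sparse matrix).

```r
# Sex-specific genes from the post
female_genes <- c("Xist")
male_genes   <- c("Ddx3y", "Eif2s3y", "Kdm5d", "Uty")

# Toy raw counts (genes x cells); numbers are invented for the example.
# matrix() fills column by column, so each row of numbers below is one cell.
counts <- matrix(
  c(3, 0, 0, 0, 0,   # cell1: Xist only
    0, 1, 0, 2, 0,   # cell2: Y-linked genes only
    1, 0, 1, 0, 0,   # cell3: reads for both gene sets
    0, 0, 0, 0, 0),  # cell4: no reads for either set
  nrow = 5,
  dimnames = list(c(female_genes, male_genes), paste0("cell", 1:4))
)

# Hypothetical helper: label each cell "F", "M", "both" or "none"
classify_sex <- function(counts, female_genes, male_genes) {
  # Use only genes present in the matrix so a missing gene does not error
  f <- intersect(female_genes, rownames(counts))
  m <- intersect(male_genes, rownames(counts))
  has_f <- colSums(counts[f, , drop = FALSE]) > 0
  has_m <- colSums(counts[m, , drop = FALSE]) > 0
  sex <- ifelse(has_f & has_m, "both",
         ifelse(has_f, "F",
         ifelse(has_m, "M", "none")))
  setNames(sex, colnames(counts))
}

sex_call <- classify_sex(counts, female_genes, male_genes)
table(sex_call)
```

The "both" and "none" categories are kept separate on purpose: "both" cells may be doublets or carry ambient RNA, while "none" cells simply lack coverage of these genes.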
With this (VERY IMPERFECT) sex classification, and knowing which sex carries which genotype in each run, I was able to assign a genotype to those cells.
From there, I merged both sequencing runs into a single Seurat object with 18k cells classified by genotype:
| Fib4 | WT | NA |
|------|------|------|
| 7460 | 3388 | 7408 |
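The run + sex to genotype mapping can be sketched as a simple lookup. The design table below is taken from the sample layout at the top of the post; the per-cell data frame is rebuilt from the sex-classification counts purely for illustration, and the column names (`run`, `sex`, `gt`) and the helper `assign_genotype()` are assumptions.

```r
# Experimental design from the post: one male and one female sample per run
design <- data.frame(
  run      = c("run1", "run1", "run2", "run2"),
  sex      = c("M",    "F",    "F",    "M"),
  genotype = c("Fib4", "WT",   "Fib4", "WT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the genotype for each (run, sex) pair.
# Cells called "both" or "none" have no match and come back as NA.
assign_genotype <- function(run, sex, design) {
  lookup <- setNames(design$genotype, paste(design$run, design$sex))
  unname(lookup[paste(run, sex)])
}

# Per-cell metadata rebuilt from the classification table (illustration only)
cells <- data.frame(
  run = rep(c("run1", "run2"), times = c(12138, 6118)),
  sex = c(rep(c("both", "F", "M", "none"), times = c(130, 2434, 4187, 5387)),
          rep(c("both", "F", "M", "none"), times = c(95, 3273, 954, 1796))),
  stringsAsFactors = FALSE
)

cells$gt <- assign_genotype(cells$run, cells$sex, design)
table(cells$gt, useNA = "ifany")
```

With these inputs the tabulation reproduces the Fib4 / WT / NA counts given above, which is a useful sanity check that the mapping has not been flipped for one of the runs.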
The Fib4 mutants are a mouse model of frailty. Because of this, either the tissue/cell composition of the original samples or the ability of cells to survive tissue dissociation differs between genotypes. After a first round of naive clustering, I can see clear differences in the abundance of WT and Fib4 cells in several clusters, and several clusters are dominated by cells with no assigned genotype.
Because of this, I am trying to find some way to classify as many of the NA cells as possible into one of the two genotypes. What I have tried up to now is:
- Select the largest cluster with the highest number of both WT and Fib4 cells (cluster 0).
- Run FindMarkers on the cluster to detect markers that can distinguish the two genotypes.
- Use the top up- and down-regulated markers as gene lists for AddModuleScore, creating a Fib4.score and a WT.score for every cell in the whole dataset.
- Classify cells according to those 2 scores.
    library(Seurat)
    library(dplyr)
    library(ggplot2)

    # Work on cluster 0 only, with genotype ("gt") as the active identity
    ss <- subset(seu, subset = seurat_clusters == 0)
    Idents(ss) <- "gt"

    # Default test:
    # gt_cl0_markers <- FindMarkers(ss, ident.1 = "Fib4", ident.2 = "WT")
    # ROC test:
    gt_cl0_markers <- FindMarkers(ss, ident.1 = "Fib4", ident.2 = "WT",
                                  logfc.threshold = 0.25, test.use = "roc",
                                  only.pos = FALSE)

    # Top 5 markers in each direction, ranked by classification power
    gt_cl0_up <- rownames(
      gt_cl0_markers[gt_cl0_markers$avg_log2FC > 0, ] %>% top_n(5, power)
    )
    gt_cl0_down <- rownames(
      gt_cl0_markers[gt_cl0_markers$avg_log2FC < 0, ] %>% top_n(5, power)
    )

    gt_cl0_markers$dir <- ifelse(gt_cl0_markers$avg_log2FC >= 0, "up", "down")

    # One module score per direction (AddModuleScore appends "1" to the name)
    ss <- AddModuleScore(ss, features = list(gt_cl0_up),
                         name = "seu_fib4_cl0_up", assay = "RNA", slot = "data")
    ss <- AddModuleScore(ss, features = list(gt_cl0_down),
                         name = "seu_fib4_cl0_down", assay = "RNA", slot = "data")

    # Cell metadata with both scores (md was not defined in the snippet;
    # assumed here to be the metadata of the subset)
    md <- ss@meta.data

    ggplot(md, aes(x = seu_fib4_cl0_up1, y = seu_fib4_cl0_down1)) +
      geom_abline(slope = 1, linetype = "dashed") +
      geom_hline(yintercept = 0, linetype = "dashed") +
      geom_vline(xintercept = 0, linetype = "dashed") +
      geom_point(alpha = 0.25, aes(color = gt)) +
      facet_wrap(~gt) +
      theme_minimal() +
      theme(aspect.ratio = 1) +
      ggtitle("Genotype") +
      guides(color = guide_legend(override.aes = list(alpha = 1)))
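The last step of the list ("classify cells according to those 2 scores") is not shown in the code, so here is one possible sketch of it in base R: call a cell Fib4 when its "up" score exceeds its "down" score by some margin, WT in the opposite case, and leave it unassigned otherwise, then compare the calls with the sex-derived labels. The rule, the `margin` value, the helper `classify_by_scores()` and the toy scores are all assumptions for illustration, not the method actually used in the post.

```r
# Hypothetical decision rule on the two module scores
classify_by_scores <- function(up, down, margin = 0) {
  diff <- up - down
  ifelse(diff > margin, "Fib4",
         ifelse(diff < -margin, "WT", NA_character_))
}

# Toy metadata with the same column names as the module scores in the post;
# the score values are invented for the example
md_toy <- data.frame(
  seu_fib4_cl0_up1   = c(0.8, 0.1, -0.3, 0.05, 0.4, -0.2),
  seu_fib4_cl0_down1 = c(-0.2, 0.6, 0.5, 0.00, 0.1, 0.3),
  gt                 = c("Fib4", "WT", "WT", "Fib4", NA, NA),
  stringsAsFactors = FALSE
)

md_toy$gt_pred <- classify_by_scores(md_toy$seu_fib4_cl0_up1,
                                     md_toy$seu_fib4_cl0_down1,
                                     margin = 0.1)

# Check the rule on cells whose genotype is already known from the sex call
labelled  <- !is.na(md_toy$gt)
confusion <- table(truth     = md_toy$gt[labelled],
                   predicted = md_toy$gt_pred[labelled],
                   useNA     = "ifany")
confusion

# Fraction of labelled cells that received any call, and agreement among those
call_rate <- mean(!is.na(md_toy$gt_pred[labelled]))
agreement <- mean(md_toy$gt_pred[labelled] == md_toy$gt[labelled], na.rm = TRUE)
```

Checking agreement on the same cells that were used to pick the markers gives an optimistic estimate; holding out part of the labelled cells before running FindMarkers would give a fairer one.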
Unfortunately, the scores built from these cluster/genotype markers don't seem able to classify much.
They don't even classify much when applied only to the very same cluster used to find the markers.
I have tried this with both the default FindMarkers test and test.use = "roc", using the top 5, 20, and 50 markers in each direction.
Is this the right way to infer the genotype/grouping of cells with missing information? Am I doing something wrong? How should I classify cells based on these differentially expressed genes? Or am I doing everything right and simply out of luck with these samples?
Thanks, Txema