Hi guys,
My aim is to identify Differentially expressed genes from GEO183837. This dataset contains Single-cell transcriptome profiling of the human endometrium from patients with recurrent implantation failure, specifically 3 control samples and 6 RIF (diseased) samples.
I read all samples into seurat objects, merged all 9 samples. I then proceeded to do Quality filtering based of mitochondrial RNA percentage counts, and normalized the merged data, ran a PCA, and a UMAP. I checked for batch variation and observed technical variation across the 9 samples. I fixed this by splitting the data by Patients, and Normalizing, finding variable features before integrating the data by the FindIntegrationAnchors
function. I again did a UMAP and the batch effects were gone.
Now, I have a integrated Seurat dataset containing 9 samples. I have annotated my metadata such that I have a Type column with dis and ctrl (for diseased and control) as possible entries. I also have a Patient column with pat1, pat2...pat9 as entries.
My aim is to find Differentially expressed genes between Healthy and Diseased (2 conditions). I don't have cell annotations, so the approach given in the documentation example is hard to follow for me. Here is what I was thinking might work:
seurat.integrated <- SetIdent(seurat.integrated, value=seurat.integrated@meta.data$Type)
Differentially_expressed<-FindMarkers(seurat.integrated, ident.1="ctrl",ident.2 = "dis",
logfc.threshold = 0.25, test.use = "wilcox")
Logic: Directly brute force the FindMarkers
function to find differentially expressed genes across diseased and control conditions.
Is this the right approach, or what else can I do to get the right genes?
Best,
Prietto
You can simply do
Idents(seurat.integrated) <- "Type"
instead ofseurat.integrated <- SetIdent(seurat.integrated, value=seurat.integrated@meta.data$Type)
. UsingFindMarkers
function to find differentially expressed genes across diseased and control conditions is alright but pseudobulk approach is highly recommended to find DE genes. See this and this.Thank you. I tried doing a pseudobulk analysis by doing the following:
I have 9 patient ids in a column
Patient
in my metadata, and a condition (dis or ctrl) in aType
column in my metadata.I ran the below code to get an AggregateExpressionMatrix
Next, I manipulated the resultant matrix to a format compatible with DESeq2 by using the below code:
This gave me a matrix cts with 9 patient ids as column names and all genes as row labels. I also generated a colData matrix with patient ids as rows and condition as column. Now, I fed this data to a DESeq object.
I filtered the genes using a adjusted pvalue cutoff of 0.05, but only a single gene is being reported. I think I must have messed up the analysis somewhere, but am confident about the processing and integration of the dataset. Where have I gone wrong/what should I do differently?
Best, Prietto
Your code
keep <- rowSums(counts(dds)) >=10
is not correct, please change it to-This requires genes to have B or more samples with counts of A or more. It therefore filters out genes that have less than B samples with counts of A or more.