Hi Everyone!!
Background: I have a unique single-cell dataset produced in house, in which only a small number of genes are expressed (an early stage of the developmental cycle). We have control and drug-treated scRNA-seq samples and, after differential expression analysis, I would like to perform GO term analysis. However, I am unsure how to build an appropriate background (universe gene set) and which methodology is appropriate for the GO term analysis.
Method 1 for Background creation
Find the genes that are detected in at least one cell in control and, separately, in treatment, then take the union of the two gene ID lists.
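To make Method 1 concrete, here is a minimal sketch on a toy count matrix. The objects `counts` and `sample` are made up for illustration; with a real Seurat object the matrix would come from the RNA assay and the labels from the metadata, and for a sparse matrix `Matrix::rowSums` would take the place of the base function.

```r
# Toy genes x cells count matrix (hypothetical data: 5 genes, 6 cells)
counts <- matrix(c(0, 0, 0, 0, 0, 0,   # geneA: never detected
                   1, 0, 0, 0, 0, 0,   # geneB: one control cell
                   0, 0, 0, 0, 2, 0,   # geneC: one treated cell
                   3, 1, 2, 4, 1, 2,   # geneD: every cell
                   0, 0, 0, 0, 0, 0),  # geneE: never detected
                 nrow = 5, ncol = 6, byrow = TRUE,
                 dimnames = list(paste0("gene", LETTERS[1:5]),
                                 paste0("cell", 1:6)))

# Sample label of each cell (hypothetical)
sample <- c("ctrl", "ctrl", "ctrl", "treat", "treat", "treat")

# Genes detected (count > 0) in at least one cell of each sample
is_ctrl  <- sample == "ctrl"
is_treat <- sample == "treat"
detected_ctrl  <- rownames(counts)[rowSums(counts[, is_ctrl,  drop = FALSE] > 0) >= 1]
detected_treat <- rownames(counts)[rowSums(counts[, is_treat, drop = FALSE] > 0) >= 1]

# Method 1 universe: union of the two gene ID lists
universe_m1 <- union(detected_ctrl, detected_treat)
print(universe_m1)
```

Because a single cell is enough for a gene to qualify, this universe is essentially every gene with a non-zero count anywhere in the dataset.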
Method 2 for Background creation
Find genes that are expressed in at least 10% of cells in control or in at least 10% of cells in treatment. For this, I am calculating the proportion of cells expressing each gene in each sample, i.e. in control and treatment separately, using this function:
per.gene.per.sample.pct <- function(sobj, sample_col) {
  ## Raw count matrix (genes x cells); the @counts slot assumes a Seurat v3/v4 Assay object
  counts <- sobj[["RNA"]]@counts
  ## Sample label of each cell, and the set of samples
  cell_sample <- as.character(sobj@meta.data[, sample_col])
  samples <- sort(unique(cell_sample))
  ## Fraction of cells in each sample where the gene is detected (count > 0)
  tt <- sapply(samples, function(s) {
    in_sample <- cell_sample == s
    Matrix::rowSums(counts[, in_sample, drop = FALSE] > 0) / sum(in_sample)
  })
  ## Genes as rows, samples as columns
  rownames(tt) <- rownames(counts)
  return(tt)
}
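The function above only returns the per-sample fractions; the 10% rule is applied afterwards. A minimal sketch of that step, using a made-up `pct` matrix in place of the real function output (the values and the name `min_frac` are my own, for illustration only):

```r
# Hypothetical output of per.gene.per.sample.pct(): fraction of cells
# expressing each gene (rows) in each sample (columns)
pct <- matrix(c(0.02, 0.01,   # geneA: below the cutoff in both samples
                0.15, 0.03,   # geneB: passes in ctrl only
                0.04, 0.30,   # geneC: passes in treat only
                0.80, 0.75),  # geneD: passes in both
              ncol = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                              c("ctrl", "treat")))

min_frac <- 0.10  # the 10% cutoff from Method 2

# Keep a gene if it reaches the cutoff in at least one sample
keep <- rowSums(pct >= min_frac) >= 1
universe_m2 <- rownames(pct)[keep]
print(universe_m2)
```

Note that the fractions here are computed over all cells of a sample, so a gene expressed in most cells of one small cluster can still fall below the cutoff.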
Also, for the GO term analysis, should I combine the differentially expressed genes from all clusters and run a single analysis, or should I run the analysis separately for each cluster? The eventual aim is to find out which pathways were perturbed by the treatment, not to identify the cell types in the single-cell data.
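To show the two options I am weighing, here is a small sketch using a plain one-sided hypergeometric (over-representation) test via `stats::phyper`. Everything in it is hypothetical: the helper `ora_p`, the toy universe of 100 genes, the single "term" of 10 genes, and the DE lists. It is only meant to show how pooled and per-cluster tests differ mechanically, not to recommend either.

```r
# One-sided hypergeometric p-value for one term (hypothetical helper)
ora_p <- function(de_genes, term_genes, universe) {
  # Restrict both lists to the background universe
  de   <- intersect(de_genes, universe)
  term <- intersect(term_genes, universe)
  k <- length(intersect(de, term))  # DE genes annotated to the term
  # P(X >= k) when drawing length(de) genes from the universe
  phyper(k - 1, length(term), length(universe) - length(term),
         length(de), lower.tail = FALSE)
}

# Toy inputs (all made up)
universe   <- paste0("g", 1:100)
term_genes <- paste0("g", 1:10)
de_by_cluster <- list(
  c0 = paste0("g", c(1:5, 50:54)),  # 5 of 10 DE genes are in the term
  c1 = paste0("g", 60:69)           # no DE gene is in the term
)

# Option A: one test per cluster
p_per_cluster <- sapply(de_by_cluster, ora_p,
                        term_genes = term_genes, universe = universe)

# Option B: pool the DE genes from all clusters, then test once
p_pooled <- ora_p(unique(unlist(de_by_cluster)), term_genes, universe)

print(p_per_cluster)
print(p_pooled)
```

In this toy case the signal from cluster c0 is diluted when the DE genes of c1 are pooled in, which is part of why I am unsure which route is appropriate.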
PS: When I try Method 2 for the background, I am left with very few genes compared to the number of DE genes that I found using Seurat! I understand why: the DE genes are calculated per cluster, and a gene need not be expressed in every cluster. But does that mean I should build a separate background gene set for each cluster, based on the genes expressed in that cluster?
On the other hand, Method 1 gives me an exorbitant number of genes, and I think that is not ideal either.
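The per-cluster background I am asking about in the PS would look roughly like this. It is a sketch on toy data: `counts`, `cluster` and `min_frac` are hypothetical, and I am assuming the same 10% cutoff applied within each cluster instead of within each sample.

```r
# Toy genes x cells count matrix (hypothetical data: 4 genes, 6 cells)
counts <- matrix(c(1, 2, 1, 0, 0, 0,   # geneA: cluster c0 only
                   0, 0, 0, 3, 1, 2,   # geneB: cluster c1 only
                   1, 0, 0, 0, 0, 1,   # geneC: one cell in each cluster
                   0, 0, 0, 0, 0, 0),  # geneD: never detected
                 nrow = 4, ncol = 6, byrow = TRUE,
                 dimnames = list(paste0("gene", LETTERS[1:4]),
                                 paste0("cell", 1:6)))

# Cluster label of each cell (hypothetical)
cluster <- c("c0", "c0", "c0", "c1", "c1", "c1")

min_frac <- 0.10  # same 10% cutoff, applied within each cluster

# For each cluster, keep genes detected in at least min_frac of its cells
universe_by_cluster <- lapply(split(seq_len(ncol(counts)), cluster), function(cells) {
  frac <- rowMeans(counts[, cells, drop = FALSE] > 0)
  rownames(counts)[frac >= min_frac]
})

print(universe_by_cluster)
```

Each cluster's DE genes would then be tested against that cluster's own universe, so a gene that is only expressed in one cluster is no longer excluded from the background.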