I currently have a heatmap (which is k-means clustered) of the top 1000 most variable genes, but I want to be able to narrow it down to about 100-200. The reason why it's set so high is that when I want to analyse each cluster, I end up filtering out any genes that are not considered significantly differentially expressed genes (e.g. padj < 0.05, log2foldchange < 0 or > 0). However when I do this, I get left with a small amount of genes which isn't ideal for when I'm doing GO-term analysis - hence why I start off with a larger number (like 1000), so that even after filtering there's enough genes in the cluster for me to analyse.
I was wondering if there is a way to filter out the genes beforehand, so that all the genes in the heatmap are already significantly expressed and I won't have to filter them out afterwards - that is, all the genes in the heatmap will actually be used for downstream analyses. Hope that makes sense! Also open to any recommendations and suggestions on how to improve my approach.
Here is my code:
#heatmap
topVarGenes <- head(order(rowVars(assay(rld)), decreasing = TRUE), 1000)
#clustered heatmap
set.seed(1234)
k <- pheatmap(assay(rld)[topVarGenes,], scale="row",kmeans_k = 4)
clusterDF <- as.data.frame(factor(k$kmeans$cluster))
colnames(clusterDF) <- "Cluster"
OrderByCluster <- assay(rld)[topVarGenes,][order(clusterDF$Cluster),]
pheatmap(OrderByCluster,
scale="row",annotation_row = clusterDF,
show_rownames = FALSE,cluster_rows = FALSE)
acute_hi <- rownames(clusterDF[clusterDF$Cluster == 1,,drop=FALSE])
#DEGs
resSigind = res[ which(res$padj < 0.05 & res$log2FoldChange > 0), ]
resSigrep = res[ which(res$padj < 0.05 & res$log2FoldChange < 0), ]
resSig = rbind(resSigind, resSigrep)
#filtering out any genes from the heatmap which are not significantly different (e.g. do not overlap with DEGs)
acutehi_cluster <- resSig[acute_hi,]
Hi António, Thanks for your response. Based on your suggestion I've done this:
However when I run this, I get the gpar fill error - "Error in check.length("fill") : 'gpar' element 'fill' must not be length 0". I understand it's something to do with matching the annotation columns of the matrix to the row names of the rownames of the dataframe, however am not sure how to go about it in my situation.
Hi ccha97,
What I was suggesting was slightly different from what you've tried (I think so).
So, if I understood you retrieve the DEG genes from the
rlog
transformed data, and then you subset this table to the genes that are most variable. Is this correct?If so, I believe the problem is that when you subset the table for DEG only, many/some genes found as top 1000 most variable will not be among the DEG genes, therefore, when you subset the table again to the most variable genes, it gives you that error. Though I'm not sure if the error is related with this problem.
So, try the following to see if the error disappears:
This is what I've suggested in first place. Notice that despite you're retrieving the top 1000 most variable genes, the outside
head()
function will only select the first 10 or so. This essentially will subset yourrlog
data to the DEG genes only, and only then will search among these for the most variable genes. Not sure if you've at least 1000 DEG genes (you need to have since you're specifying the top 1000).Then just follow the same code as you posted in your question. Let me know if this works.
António
Thank you for your well-written explanation - your line of code worked well and it explains what I wanted to do with my data!
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Sorry for re-bringing up this issue, but I was double checking the genes in the cluster and it appears some of the genes on the heatmap aren't actually significant, which means they're still being expressed on the heatmap (I've checked whether their ENTREZ IDs are included in allSig_genes, and they are not). I've also double checked to make sure allSig_genes is correctly defined, and it looks right so it's likely the problem may be coming from somewhere else? I checked assay(rld)[topVarGenes,] and this appears to still have insignificant genes
It seems odd to me.
What is the result of:
António
I agree, I don't think it's selecting from the significant gene list, but it is different from the heatmap I originally had so I'm not sure.
Can you post here the code that you're using, from the first line to the last line?
Yes, now I understand what is wrong here. What is happening is that when you do
assay(rld)[allSig_genes,]
, you are sub setting the data to the significant genes right. So, thetopVarGenes
variable contains not gene ids, but vector positions of the most variable genes among the significant genes. However, when you give again the mainrlog
data, i.e.,rld
object, to thepheatmap
when you filter bytopVarGenes
is not correct because therld
object contains both, significant and not significant, andtopVarGenes
is the positions of the most variable genes among the significant only. So, in practice to solve this problem you might filter therld
object to the significant genes, and only then apply obtain thetopVarGenes
and thepheatmap
on the samerld
object filtered only to significant genes in both cases. In this case,topVarGenes
will be a numeric no. with the positions of the most variable genes to filter, in a table with significant genes only (in this case will match). I created this below asrld_sign <- assay(rld)[allSig_genes,]
. I think now, this code will work. Please let me know, if it works.Run this code:
Thanks António, I think assigning the variable rld_sign makes more sense. The only issue I get when running this code is I get the gpar error again when running the last line (pheatmap).
Sorry for the late reply. Did you see this issue: Pheatmap error: 'gpar' element 'fill' must not be length 0 , what to do about that in an assymmetrical matrix . It might help. It is not clear for me what can be causing this problem.
António