Question

Cannot differentiate between groups with heatmap

0

Entering edit mode

3.0 years ago

bart ▴ 50

I'm using deseq2 for DEA but when I create a heatmap with only DEGs, it is not possible to differentiate between people from 2 different groups (cancer vs controls). I've used the following lines in R:

#create deseq object, normalize, use pre-filtering to remove genes with <5 counts in >90% of samples
dds<-DESeqDataSetFromMatrix(df,colData =group, design = ~group)
dds <- estimateSizeFactors(dds)
badgenes<-names(which(apply(counts(dds), 1, function(x){sum(x < 5)}) > 0.9 * ncol(dds)))
ddsFiltered <- dds[which(!rownames(dds) %in% badgenes), ]
#perform deseq analysis, prevent deseq from inserting p-adj values which are NA, insert p-adj values, subset all DEGs 
ddsFiltered<-DESeq(ddsFiltered)
res<-results(ddsFiltered, cooksCutoff=FALSE, independentFiltering=FALSE)
filtered<-counts(ddsFiltered) 
filtered<-as.data.frame(filtered)
filtered<-filtered%>%mutate(padj=res$padj)
all_diff_genes <-subset(filtered,filtered$padj<0.05)
#create heatmap with only the DEGs
rld <- vst(ddsFiltered, blind=FALSE)
de<- rownames(res[res$padj<0.05, ])
de_mat <- assay(rld)[de,]
pheatmap(t(scale(t(de_mat))),show_rownames = F,show_colnames = F,annotation_col =group)

I thought that it might be caused by the fact that factors such as age and gender are not present in the design formula, however, adding these does not improve clustering and reduces the amount of DEGs that are found from 2k using only 'group' (=cancer or healthy) to 32 DEGs. Also, clustering does not improve by adding factors. At this point I'm out of ideas of what could cause this clustering problem. Does anyone have an idea? Thanks in advance!

enter image description here

deseq2 heatmap • 2.5k views

ADD COMMENT • link updated 2.7 years ago by ATpoint 86k • written 3.0 years ago by bart ▴ 50

score 2 · Answer 1 · 2022-01-11

2

Entering edit mode

3.0 years ago

swbarnes2 14k

I'm not sure you have a clustering problem. I think your samples really do not cluster by type. Sometimes that's what the data is.

ADD COMMENT • link 3.0 years ago by swbarnes2 14k

1

Entering edit mode

Exactly. Sometimes the results don't support the hypothesis, no matter how much data torturing we apply.

ADD REPLY • link 3.0 years ago by Mensur Dlakic ★ 28k

score 2 · Answer 2 · 2022-01-12

It looks to me like there is definitely another source of variation in the data that is driving the clustering. In the below image, I have divided the samples into 7 clusters (labeled 1-7) based on the breaks visible in the heatmap, and the genes into 4 clusters (A-D).

Image shows a clustered heatmap with row and column dendrograms. The columns and rows are divided into clusters by hand drawn red lines

Note that each group is either predominantly group 1 or group 2. So your samples are clustering to some extent by group. For example Group A genes are generally higher in the predominantely group 1 samples than the adjacent group 2 samples: A is higher in cluster 2 than cluster 3, and higher in cluster 4 and 6 then cluster 5 (and possibly 7). But the samples are also clustering into by something else, which I've labeled supercluster I (clusters 2 & 3 ) and supercluster II (clusters 5, 5, 6 and 7). So cluster A genes are generally higher in I than they are in II. Cluster B genes are also higher in supercluster I compared to supercluster II, but within each supercluster are higher within group 1 samples than group 2 samples (i.e .cluster 2 vs cluster 3 and cluster 4&6 vs cluster 5&7). Genes in cluster D seem not to DE by group at first glance, being uniformally low in supercluster I and high in supercluster II. They are probably DE because of an imbalance of group 1 and group 2 samples between the two superclusters.

If you look at your PCA, while group 1 and group 2 do seem to separate on axis 2 somewhat, you appear to have two groups of group 1 samples separating on axis 1.

All this suggests to me that you have some sort of confounding factor in the data. Might be a batch effect, or a cancer subtype? Or perhaps the patients are from two or more different populations?

The other take away, is that looking at that heatmap, I'd be pretty confident that gene groups A and B really were DE between cancer and normal, but I might need more convincing for groups C and D.

score 1 · Answer 3 · 2022-01-11

1

Entering edit mode

3.0 years ago

shiyang_bio ▴ 170

I'd suggest to use only group as coefficient for Deseq2. And when you select DE genes, considering padj (<0.05) and logFC (e.g. absolute value > 1) is better than considering padj only. You can test different cutoffs. I guess when using only top 200-500 genes will generate a heatmap looks better.

ADD COMMENT • link 3.0 years ago by shiyang_bio ▴ 170

0

Entering edit mode

thanks for your suggestions! I will try them and report back if I can see improvements

ADD REPLY • link 3.0 years ago by bart ▴ 50

i.sudbery · Answer 4 · 2022-01-12

1

Entering edit mode

3.0 years ago

ATpoint 86k

One thing why you have poor "visual" separation is because the color scale even after standardization is still dominated by outliers. Try (for visualization) trim the data back using quantiles, e.g. scale all values beyond its 99th percentile back to the 99th percentile, and all below the 1st percentile to exactly the 1st percentile.

I usually use:

scale_by_quantile <- function (Counts, lower = 0, upper = 1){

    if (class(Counts)[1] != "matrix" & class(Counts)[1] != "data.frame") {
      stop("Counts must be a matrix or data.frame")
    }
    if (lower == 0 & upper == 1) 
      return(Counts)
    if (class(Counts)[1] == "data.frame") {
      cnames <- colnames(Counts)
      Counts <- as.matrix(Counts)
      colnames(Counts) <- cnames
    }
    if (upper < 1) {
      qt.upper <- as.numeric(quantile(Counts, upper, na.rm = TRUE))
      Counts[Counts > qt.upper] <- qt.upper
    }
    if (lower > 0) {
      qt.lower <- as.numeric(quantile(Counts, lower, na.rm = TRUE))
      Counts[Counts < qt.lower] <- qt.lower
    }
    return(Counts)

}

scale_by_quantile(your_scaled_counts, .01, .99)

ADD COMMENT • link updated 3.0 years ago by i.sudbery 20k • written 3.0 years ago by ATpoint 86k

0

Entering edit mode

Also, talking about outliers reminds me to ask why cooksCutoff=FALSE is used? If cooksCutoff=TRUE leads to far fewer genes, it could simply be that those genes arn't genuinely different between cancer and normal, but the DE is being driven by a small number of outliers.

ADD REPLY • link 3.0 years ago by i.sudbery 20k

0

Entering edit mode

Hi, sorry for this very late response. Thanks for your suggestion. I'm actually still having this problem. Changing the definition of DE genes to FDR<0.05 AND logfc>1 or <-1 improved the heatmap a lot (first image). Just for my understanding: with scaling being a problem and outliers determining visual separation, you mean that for example a fold change of 10 gets a red color while it would be better if a fold change of >2 should also get a red color, right? Because of this many genes do not seem to be actually upregulated or downregulated because they do not have a bright red or bright blue color, correct?

I changed my code lines as follows (I take it with 'counts' you mean the counts after the vst/variance stabilizing step right?):

rld <- vst(ddsFiltered, blind=FALSE) #variance stabilizing 
de<- rownames(res[res$padj<0.05&(res$log2FoldChange>=1|res$log2FoldChange<=-1), ]) #choosing DE genes
de_mat <- assay(rld)[de,] #only choosing the DE genes from the entire count object
pheatmap(scale_by_quantile(de_mat, .01, .99),show_rownames = F,show_colnames = F,annotation_col =group)

There is more visual separation now but it is strange to me that none of the genes seem to be downregulated, they are just more or less upregulated (second image). Do you have an idea of what is going wrong here? FYI: GBM = cancer, AC = healthy donor

Also, would it not be better to just change the colors to FC <-2 is bright blue and FC>2 is bright red?

enter image description here

ADD REPLY • link 2.7 years ago by bart ▴ 50

0

Entering edit mode

As far as I can see, you do have genes that are both up and down - i.e. genes that are higher on the left hand side and genes that are higher on the righthand side.

enter image description here

The fact you have no negative numbers on the scale, I think is because you've changed how you are scaling the data. In your first call you use scale which both centres and scales the data by the mean and standard devidation. Here you use scale_by_quantile as recommended with ATPoint, but I think this use trimms off outliers. You might want to use scale after scale_by_quartile

ADD REPLY • link 2.7 years ago by i.sudbery 20k

0

Entering edit mode

I would use scale before the scale_by_quantile, so first standardize the data and then remove some outlier to avoid them dominating the color scale. For a perfectly fair clustering result it might make sense to run the hclust first on the scaled data and then only for the plotting apply the outlier trimming. The trimming is really just for making it look nicer to the eye, so you can visually see better separation.

ADD REPLY • link 2.7 years ago by ATpoint 86k