Question

How to identify which genes are responsible for the different cluster without PCA

1

4.2 years ago

camillab. ▴ 160

Hi,

I hope this is not a stupid question, but I have done hierarchical clustering (Euclidean distance matrix + complete linkage method) on a subset of genes (8000) in my bulk RNA-seq samples (40), and I found that 13 samples do not cluster as expected/predicted. I also ran a PCA and, in line with the hierarchical clustering, those samples cluster far apart from the others.

Is there any way (or R package) I can identify which genes are responsible for the different cluster without using the PCA (eg., the identification of the loadings)?

As a practical example, in the dendrogram from this site (dendo), what makes the purple samples (7-13-16) differ from the red ones, and what makes the red + purple cluster sit on a different branch/arm from the blue-green samples?

I guess there are genes that would make all the samples cluster together and genes that are very different, which would make the samples cluster far apart, and this could potentially be observed in terms of macro-differences (red/purple vs green/blue) or micro-differences (red vs purple).

thank you in advance

Camilla

R hclust clustering bulkRNAseq • 2.0k views

ADD COMMENT • link updated 4.2 years ago by Friederike 9.0k • written 4.2 years ago by camillab. ▴ 160

1

I wouldn't use Euclidean distance in such a high dimensional space because it's most likely subject to distance concentration. If you nonetheless manage to get a good clustering it probably means you have a strong signal contributed by a limited number of genes. You should be able to identify them by looking at which ones contribute the most to the distances between cluster centres, i.e. rank the genes by the squared differences of their means in each cluster. Alternatively, you could train a classifier to predict membership to each cluster (using cluster membership as given labels) and examine the weights associated with each gene.
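
A minimal sketch of the ranking described above, on synthetic data. The matrix `expr`, the gene and sample names, and the label vector `cluster` are all hypothetical; the labels are assumed to be known already (for example from cutting the tree). Each gene's score is the squared difference between its mean in the two clusters, which is its share of the squared Euclidean distance between the cluster centres.

```r
# Synthetic example: 10 samples (rows) x 6 genes (columns); all names are hypothetical.
set.seed(1)
expr <- matrix(rnorm(60), nrow = 10,
               dimnames = list(paste0("s", 1:10), paste0("gene", 1:6)))
# Give gene3 a strong shift in the last five samples so that it separates the groups.
expr[6:10, "gene3"] <- expr[6:10, "gene3"] + 10
cluster <- rep(c(1, 2), each = 5)   # cluster labels, assumed already known

# Centre (per-gene mean) of each cluster.
centroid1 <- colMeans(expr[cluster == 1, , drop = FALSE])
centroid2 <- colMeans(expr[cluster == 2, , drop = FALSE])

# Per-gene contribution to the squared Euclidean distance between the centres.
contribution <- (centroid1 - centroid2)^2
ranked <- sort(contribution, decreasing = TRUE)
head(ranked)
```

The contributions sum to the squared distance between the two centres, so dividing `ranked` by `sum(ranked)` gives each gene's fraction of that distance.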

ADD REPLY • link 4.2 years ago by Jean-Karim Heriche 27k

0

ones contribute the most to the distances between cluster centres

That's exactly what I wanted to do, but I can't figure out how to get this information from hclust. Do you know of any link I can look at to understand how to do it? Thank you for your answer!

ADD REPLY • link 4.2 years ago by camillab. ▴ 160

0

To get the clusters, you need to cut the tree generated by hclust, for example with the cutree function.
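
As a small illustration on synthetic data (the matrix `x` and its sample names are made up), `cutree` returns one cluster label per sample, and `table()` on that vector only reports how many samples fall in each cluster:

```r
# Synthetic data: 8 samples (rows) x 4 genes (columns); names are hypothetical.
set.seed(42)
x <- rbind(matrix(rnorm(16, mean = 0), nrow = 4),
           matrix(rnorm(16, mean = 20), nrow = 4))
rownames(x) <- paste0("sample", 1:8)

hc <- hclust(dist(x), method = "complete")
membership <- cutree(hc, k = 2)   # named integer vector: one cluster label per sample
membership
table(membership)                 # number of samples in each cluster
```

The membership vector carries no gene-level information by itself; it is meant to be used as an index into the expression matrix.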

ADD REPLY • link 4.2 years ago by Jean-Karim Heriche 27k

0

Hi (again),

I tried the cutree function (cutting the dendrogram into 5 clusters), but I only got the result below, with no information about which genes contribute most, apart from, I guess, that the samples in cluster 1 contribute most:

clu.k5
 1  2  3  4  5 
46  1  1  1  1

Where am I making a mistake? Here is my dataset:

# A tibble: 6 x 51
  gene  `4_MU` `16_MU` `21_MU`  `0c` `0c_bs1_2` `0c_bs2`
  <chr>  <dbl>   <dbl>   <dbl> <dbl>      <dbl>    <dbl>
1 A4GA~  0.382   0.176   0.316  5.34       4.47     10.0
2 AAAS   3.13    5.22    5.02  28.8       24.2      19.9
3 AACS  21.2    19.7    16.9   13.3       14.0      13.1
4 AAGAB 14.7    22.7    18.8   35.3       37.5      45.4
5 AAK1  17.1    12.5    18.6   16.1       15.1      20.9
6 AAMP  63.8    72.7    65.7   23.4       19.9      16.6
# ... with 44 more variables: `24c` <dbl>,

and here is my script:

library(dplyr)  # provides %>%
library(tidyr)  # provides drop_na()

# tidy the dataset
df1 <- df %>% drop_na()   # remove rows with NA from the merged file
rnames <- df1$gene        # keep the gene symbols
df1 <- df1[-c(1)]         # drop the gene symbol column
df2 <- as.matrix(df1)
rownames(df2) <- rnames   # assign gene symbols as row names
df3 <- t(df2)             # transpose: samples in rows, genes in columns
df4 <- scale(df3)         # centre and scale each gene

# hierarchical clustering
d <- dist(df4)            # Euclidean dissimilarity matrix between samples
hc <- hclust(d, method = "complete")
plot(hc)

# cut into 5 clusters
clu.k5 <- cutree(hc, k = 5)
rect.hclust(hc, k = 5, border = "green")

ADD REPLY • link 4.2 years ago by camillab. ▴ 160

1

cutree returns a vector of cluster memberships. You then need to extract the data for each cluster with e.g. for cluster 1 df4[clu.k5==1, ]
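
A sketch of how that extraction could feed a per-gene comparison between any two clusters, under some assumptions: `mat` is a synthetic stand-in for df4 (samples in rows, genes in columns), and `centroid` and `rank_genes` are hypothetical helper names, not functions from any package. `drop = FALSE` matters when a cluster holds a single sample, because otherwise the subset collapses to a vector and `colMeans` fails.

```r
# Synthetic stand-in for df4: samples in rows, genes in columns (hypothetical names).
set.seed(7)
mat <- matrix(rnorm(9 * 5), nrow = 9,
              dimnames = list(paste0("s", 1:9), paste0("g", 1:5)))
mat[5:8, "g2"] <- mat[5:8, "g2"] + 30   # g2 separates samples 5-8 from the rest
mat[9, "g4"]   <- mat[9, "g4"] + 60     # s9 is an outlier driven by g4

clu <- cutree(hclust(dist(mat), method = "complete"), k = 3)

# Per-gene centre of one cluster; drop = FALSE keeps a one-sample
# cluster as a matrix so colMeans() still works.
centroid <- function(m, membership, k) {
  colMeans(m[membership == k, , drop = FALSE])
}

# Rank genes by their squared centre difference between clusters a and b.
rank_genes <- function(m, membership, a, b) {
  sort((centroid(m, membership, a) - centroid(m, membership, b))^2,
       decreasing = TRUE)
}

# Compare the cluster containing s1 with the clusters containing s5 and s9.
top_1_vs_5 <- rank_genes(mat, clu, clu[["s1"]], clu[["s5"]])
top_1_vs_9 <- rank_genes(mat, clu, clu[["s1"]], clu[["s9"]])
head(top_1_vs_5, 3)
head(top_1_vs_9, 3)
```

Calling the helper on different pairs of clusters, or on memberships cut at different values of k, gives one ranking for a coarse split and another for a finer split without having to remove samples.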

ADD REPLY • link 4.2 years ago by Jean-Karim Heriche 27k

0

Is there a specific reason why you do not want to do PCA? It sounds like a good job for PCA. You can also visualize a loadings plot using the PCAtools package. It is pretty easy to make.

ADD REPLY • link 4.2 years ago by ashish ▴ 680

0

I did it, but I would like to be able to discriminate between differences across all samples (which I can do with the loadings in the PCA) and differences between specific groups, and I cannot do the latter with PCA. With the PCA in my example above, I can find what I called macro-differences (red/purple vs green/blue) but not micro-differences (red vs purple) without removing samples, i.e. without changing the result of the PCA. I don't know if that makes sense.

ADD REPLY • link 4.2 years ago by camillab. ▴ 160

score 0 · Answer 1 · 2020-10-19

0

4.2 years ago

Friederike 9.0k

Is there any way (or R package) I can identify which genes are responsible for the different cluster without using the PCA

Yes, DESeq2, edgeR and limma are the most popular tools for this, i.e. for comparing replicates of specific groups of cells/samples to each other. All details can be found here, but for a less involved analysis you could also give pcaExplorer a shot, which will take care of many of the details for you.

ADD COMMENT • link 4.2 years ago by Friederike 9.0k

0

Do I need raw reads to use DESeq2, edgeR and limma?

ADD REPLY • link 4.2 years ago by camillab. ▴ 160

0

yes, that's correct.

ADD REPLY • link 4.2 years ago by Friederike 9.0k