Clustering and extracting gene IDs with same expression profiles
6.7 years ago
lessismore ★ 1.4k

Hey all,

I need to cluster more than 1K genes based on their expression profiles, and then extract only the genes from clusters that share the same expression profile. My first attempt was to choose the optimal K with the gap statistic, run kmeans, label my expression atlas with the cluster identifier, and then extract the genes of a specific cluster by subsetting the data frame. This didn't work because kmeans groups genes together even when their expression profiles differ.

Do you have any suggestions for how to accomplish this?
To summarize:

  1. I want to cluster a big expression atlas
  2. then extract the gene IDs with similar expression profiles

Thanks in advance

P.S.

I saw this post https://support.bioconductor.org/p/93424/ about pheatmap, which would be ideal, but I cannot figure out how to get at the cluster identifiers so that I can use the cutree function.

clustering R expression profiles • 2.5k views

Hi,

In the example from the Bioconductor post, the object returned by cutree holds the cluster identifier for each gene (an integer from 1 to k, where k is the number of clusters you asked cutree for).
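
To make that concrete, here is a minimal sketch on simulated data using base R's hclust and cutree. The matrix `expr`, the gene names, and the choice of correlation distance with average linkage are all assumptions for illustration, not something taken from the linked post. As far as I know, the object returned by pheatmap exposes the row dendrogram as `tree_row`, so passing that to cutree should give the same kind of result.

```r
# Simulated data (hypothetical): 60 genes x 6 samples, three made-up patterns.
set.seed(1)
patterns <- rbind(
  up   = c(1, 2, 3, 4, 5, 6),
  down = c(6, 5, 4, 3, 2, 1),
  peak = c(1, 3, 6, 6, 3, 1)
)
expr <- patterns[rep(1:3, each = 20), ] + matrix(rnorm(60 * 6, sd = 0.3), nrow = 60)
rownames(expr) <- paste0("gene", 1:60)
colnames(expr) <- paste0("sample", 1:6)

# Correlation distance groups genes by the shape of their profile.
gene_dist <- as.dist(1 - cor(t(expr)))
gene_tree <- hclust(gene_dist, method = "average")

# cutree returns a named integer vector: names are gene IDs,
# values are cluster numbers from 1 to k.
gene_clusters <- cutree(gene_tree, k = 3)

# Gene IDs that fall in the same cluster as gene1.
same_as_gene1 <- names(gene_clusters)[gene_clusters == gene_clusters["gene1"]]
print(table(gene_clusters))
```

Because the result is just a named vector, extracting the gene IDs for any cluster is a matter of subsetting `names(gene_clusters)`.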

6.7 years ago
Jake Warner ▴ 840

Hi, you could keep your kmeans approach and then score the individual genes by comparing each one to the centroid of its cluster.
First get the centroids:

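# function to find the centroid of cluster i:
# the column means over all genes (rows) assigned to that cluster
clust.centroid = function(i, dat, clusters) {
  ind = (clusters == i)
  # drop = FALSE keeps a one-gene cluster as a one-row table so colMeans still works
  colMeans(dat[ind, , drop = FALSE])
}
kClustcentroids <- sapply(levels(factor(clusterdata$cluster)), clust.centroid, scaledata, clusterdata$cluster)

where clusterdata is the result of kmeans and scaledata is your expression data frame. The result, kClustcentroids, has one column per cluster (named by cluster number) and one row per sample.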

Then compare the genes to the cluster cores:

# get just the genes in cluster 2
K2 <- scaledata[clusterdata$cluster == 2, ]
# get the cluster 2 core (centroids are stored one per column)
core <- kClustcentroids[, "2"]

# correlate each gene with the core
corscore <- function(x) { cor(x, core) }
score <- apply(K2, 1, corscore)

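The scores tell you how closely each gene matches the cluster core. They are Pearson correlations, so in principle they range from -1 to 1, although genes assigned to the cluster will usually score between 0 and 1. Here's an example of plotting them: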

[Image: example plot of the gene scores]

Then you could just take the genes with a score above a certain cutoff (like 0.75). Complete workflow here
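
For anyone who wants to try the whole idea end to end, here is a rough sketch on simulated data. The simulated matrix, the row scaling step, `centers = 3`, `nstart = 25`, and the helper names `cluster_ids` and `core_genes` are my own assumptions for illustration; only the centroid and correlation steps follow the snippets above, and 0.75 is just the example cutoff.

```r
set.seed(42)
# Simulated stand-in for the expression atlas (hypothetical): 90 genes x 8 samples.
patterns <- rbind(
  c(1, 2, 3, 4, 5, 6, 7, 8),
  c(8, 7, 6, 5, 4, 3, 2, 1),
  c(1, 4, 7, 8, 8, 7, 4, 1)
)
expr <- patterns[rep(1:3, each = 30), ] + matrix(rnorm(90 * 8, sd = 1.5), nrow = 90)
rownames(expr) <- paste0("gene", 1:90)

# Scale each gene (row) to mean 0 and sd 1 so clustering reflects profile shape.
scaledata <- t(scale(t(expr)))

clusterdata <- kmeans(scaledata, centers = 3, nstart = 25)

# Centroid of cluster i = column means over the genes assigned to it.
clust.centroid <- function(i, dat, clusters) {
  colMeans(dat[clusters == i, , drop = FALSE])
}
cluster_ids <- sort(unique(clusterdata$cluster))
kClustcentroids <- sapply(cluster_ids, clust.centroid, scaledata, clusterdata$cluster)
colnames(kClustcentroids) <- cluster_ids  # columns = clusters, rows = samples

# Correlate every gene with the centroid of its own cluster.
score <- sapply(seq_len(nrow(scaledata)), function(g) {
  cor(scaledata[g, ], kClustcentroids[, as.character(clusterdata$cluster[g])])
})
names(score) <- rownames(scaledata)

# Keep only the genes that track their cluster core closely.
cutoff <- 0.75  # example threshold; tune it for your data
keep <- score >= cutoff
core_genes <- split(names(score)[keep], clusterdata$cluster[keep])
print(lengths(core_genes))
```

`core_genes` is a list with one character vector of gene IDs per cluster, so the genes that were lumped into a cluster without really matching its profile are simply left out.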

Good luck!
