Hello,
I'm completely new to bioinformatics and machine learning. I have a dataframe of pre-processed data where rows are genes and columns are samples (column 1 is the probe ID; the remaining columns are cancer samples and normal samples). I want to use kmeans in R to cluster my data by sample, with 2 as the initial number of clusters. After some research on k-means clustering I came up with the code below, which seems to work, but since I'm new to this I'm not sure whether it is correct. I also want to draw a line chart showing the profile of the 2 clusters using the center of each cluster, but I don't know how to do that. Could someone offer some guidance or examples of how to do this properly? Thank you!
clustering <- kmeans(df[, 2:21], 2)
clustering$cluster
new <- cbind(df, cluster = clustering$cluster)
View(new)
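One point worth checking in the code above: kmeans() clusters the rows of its input, so with genes as rows it groups genes rather than samples. A minimal sketch of clustering the samples instead, and drawing the two cluster centers as line profiles, might look like the following. The data frame is simulated (50 probes, 20 samples, hypothetical column names), and nstart = 25 is an arbitrary choice rather than anything stated in this thread.

```r
set.seed(1)

# Simulated stand-in for the real data: column 1 is the probe ID,
# columns 2-21 are samples (10 "cancer", 10 "normal"). All names are hypothetical.
n_genes <- 50
expr <- matrix(rnorm(n_genes * 20), nrow = n_genes)
expr[1:25, 1:10] <- expr[1:25, 1:10] + 4  # shift half the genes in the first 10 samples
colnames(expr) <- c(paste0("cancer", 1:10), paste0("normal", 1:10))
df <- data.frame(probe = paste0("probe", seq_len(n_genes)), expr)

# kmeans() clusters rows, so transpose to put samples in rows
sample_mat <- t(as.matrix(df[, 2:21]))
colnames(sample_mat) <- df$probe

clustering <- kmeans(sample_mat, centers = 2, nstart = 25)

# One cluster label per sample
sample_clusters <- data.frame(sample = rownames(sample_mat),
                              cluster = clustering$cluster)

# clustering$centers is a 2 x n_genes matrix: one mean profile per cluster.
# matplot() draws one line per column, so transpose to genes x clusters.
matplot(t(clustering$centers), type = "l", lty = 1, col = c("red", "blue"),
        xlab = "Gene (probe index)", ylab = "Cluster center (mean expression)")
legend("topright", legend = paste("Cluster", 1:2),
       col = c("red", "blue"), lty = 1)
```

With many genes the x-axis gets crowded, so ordering the genes (for example by the difference between the two centers) before plotting may make the profiles easier to read.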
Thank you so much, Kevin, that's very helpful!
Hi Kevin,
I actually have 100 genes of interest and would like to use them to classify the samples. Since I want to classify samples, the matrix should have samples as rows and genes as columns. Am I right?
I cannot recall, but, irrespective, you just need to transpose the matrix via the t() function to get what you want.
Would you actually use those counts (as in transformed CPM), or would you rather conduct feature scaling using the z-score prior to clustering (as in base::scale)? Thank you!
For k-means, I would use the normalised + transformed expression levels, i.e., log2(CPM + pseudocount), or, indeed, the Z-scaled version of these, i.e., scale(log2(CPM + pseudocount)).
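As a rough illustration of the two options, the sketch below computes log2(CPM + pseudocount) from a simulated count matrix and then Z-scales it before running k-means on the samples. The counts, the pseudocount of 1, and the hand-rolled CPM calculation are all assumptions made for the example; in practice the CPM values would come from whatever normalisation was already applied. Note that scale() standardises columns, so this sketch applies it after transposing, so that each gene is centred and scaled across samples; that is one reading of the advice, not something the thread spells out.

```r
set.seed(42)

# Simulated raw counts: 100 genes (rows) x 20 samples (columns); hypothetical data
counts <- matrix(rnbinom(100 * 20, mu = 200, size = 5), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("sample", 1:20)))

# Counts per million, computed by hand from each sample's library size
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Option 1: normalised + transformed expression (pseudocount of 1 is an assumption)
log_cpm <- log2(cpm + 1)

# Option 2: Z-scaled version. scale() works column-wise, so transpose first:
# rows become samples, columns become genes, and each gene gets mean 0, sd 1.
z <- scale(t(log_cpm))

# Cluster the samples (rows) on either matrix
km_log <- kmeans(t(log_cpm), centers = 2, nstart = 25)
km_z   <- kmeans(z, centers = 2, nstart = 25)

# Cross-tabulate the two assignments to see how much the scaling changes them
table(log = km_log$cluster, z = km_z$cluster)
```

Z-scaling gives every gene equal weight in the Euclidean distance, whereas the unscaled log2 values let highly variable genes dominate, so the two runs need not agree.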