Question

K-Means Clustering

1

11.2 years ago

Maria ▴ 10

I want to do k-means clustering on 5 ChIP-seq samples (a time series). I counted the number of tags in 1 kb windows genome-wide for each sample, so my input looks like this:

chr1  343000   344000   23    43  5   78    45
.
.
.

I would like to find the regions that gain the histone mark signal faster or slower. I have never done clustering and have some basic questions. What should my input be? If I provide a numeric matrix of tag counts (5 columns), how can I keep the coordinates during clustering?

PS: It would be a great help if somebody could show me a step-by-step tutorial on this kind of analysis.

• 4.4k views

ADD COMMENT • link updated 11.2 years ago by Alex Reynolds 36k • written 11.2 years ago by Maria ▴ 10

score 7 · Answer 1 · 2013-09-21

7

11.2 years ago

Devon Ryan 104k

The input would be the 5 columns of counts, or any similar metric that you want to use. You can keep the coordinates either by making them the row names (so "chr1:343000-344000" for the first row) or by subsetting the data frame to the count columns when you pass it to kmeans. The output from the kmeans function (such as $cluster) is in the same order as the input, so you don't have to worry about things getting rearranged. There are a number of nice tutorials on the web, such as this one here. BTW, you might try using something like seqMINER, which can do the clustering for you (though I've never used it).
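A minimal sketch of the row-name approach in R, in case it helps. The data frame name `counts`, the column names `t1` to `t5`, and the choice of 2 clusters are assumptions for illustration. Only the first row comes from the question; the other rows are made up.

```r
# Hypothetical input: row 1 is taken from the question, rows 2-6 are invented.
counts <- data.frame(
  chr   = rep("chr1", 6),
  start = c(343000L, 344000L, 345000L, 346000L, 347000L, 348000L),
  end   = c(344000L, 345000L, 346000L, 347000L, 348000L, 349000L),
  t1 = c(23,  2, 60, 25,  1, 55),
  t2 = c(43,  3, 62, 40,  2, 58),
  t3 = c( 5, 20, 61,  8, 25, 60),
  t4 = c(78, 50, 59, 80, 55, 61),
  t5 = c(45, 90, 63, 50, 95, 57)
)

# Keep only the count columns for clustering and store the
# coordinates as row names so they travel with the matrix.
mat <- as.matrix(counts[, c("t1", "t2", "t3", "t4", "t5")])
rownames(mat) <- sprintf("%s:%d-%d", counts$chr, counts$start, counts$end)

# k-means starts from random centers, so fix the seed for reproducibility.
# centers = 2 is an arbitrary choice for this toy example.
set.seed(1)
km <- kmeans(mat, centers = 2, nstart = 10)

# km$cluster is in the same order as the input rows, so it can be
# attached straight back onto the original data frame.
counts$cluster <- km$cluster
print(counts)
```

Because the cluster assignments come back in input order, the last step is just a column assignment; no merging or re-sorting is needed. The number of clusters is something you would have to choose for your own data.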

ADD COMMENT • link 11.2 years ago by Devon Ryan 104k

1

seqMINER will do the job for you. You provide a BED file of coordinates for the regions (X-axis) and a BED file of mapped reads for each condition, so five in your case. You can alter the bin length in the options. It is recommended that the reads be normalised within seqMINER using the linear-normalisation setting.

ADD REPLY • link 11.2 years ago by Ian 6.1k