I want to run k-means clustering on five ChIP-seq samples (a time series).
I counted the number of tags in 1 kb windows genome-wide for each sample, so my input looks like this:
chr1 343000 344000 23 43 5 78 45
.
.
.
I would like to find the regions that gain the histone mark signal faster or slower.
I have never done clustering before and have some basic questions. What would my input be? If I provide a numeric matrix of tag counts (5 columns), how can I keep the coordinates during clustering?
PS: It would be a great help if somebody could point me to a step-by-step tutorial on this kind of analysis.
The input would be the 5 columns of counts, or any similar metric that you want to use. You can keep the coordinates either by making them the row names (so "chr1:343000-344000" for the first row) or by subsetting the data frame to just the count columns when you give it to kmeans. The output from the kmeans function (such as $cluster) is in the same order as the input, so you don't have to worry about things getting rearranged. There are a number of nice tutorials on the web, such as this one here. BTW, you might try using something like seqMINER, which can do the clustering for you (though I've never used it).
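A minimal sketch of that approach in R, in case it helps. The toy table, the column names (t1 to t5) and the choice of 2 clusters are assumptions for illustration only; only the first row comes from the question, and in practice you would load your own file with read.table().

```r
# Toy table in the same layout as the question: chr, start, end, 5 counts.
# Only the first row is from the question; the rest is made up.
# Column names t1..t5 are hypothetical labels for the five time points.
counts <- data.frame(
  chr   = rep("chr1", 6),
  start = c(343000L, 344000L, 345000L, 346000L, 347000L, 348000L),
  end   = c(344000L, 345000L, 346000L, 347000L, 348000L, 349000L),
  t1 = c(23,  2, 60,  5, 20, 70),
  t2 = c(43,  3, 62,  6, 45, 65),
  t3 = c( 5,  1, 58,  4, 10, 72),
  t4 = c(78,  4, 61,  3, 70, 68),
  t5 = c(45,  2, 63,  5, 50, 71),
  stringsAsFactors = FALSE
)

# Keep the coordinates as row names, e.g. "chr1:343000-344000".
rownames(counts) <- paste0(counts$chr, ":", counts$start, "-", counts$end)

# Hand only the numeric count columns to kmeans().
mat <- as.matrix(counts[, c("t1", "t2", "t3", "t4", "t5")])

set.seed(1)                                  # k-means starts from random centers
km <- kmeans(mat, centers = 2, nstart = 10)  # 2 clusters is an arbitrary choice

# km$cluster is in the same row order as 'mat' and carries its row names,
# so it can be attached straight back to the coordinate table.
counts$cluster <- km$cluster
print(counts[, c("chr", "start", "end", "cluster")])
```

The number of clusters has to be chosen by you; it is worth trying a few values of centers and looking at km$centers to see what temporal profile each cluster represents.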
seqMINER will do the job for you. You provide a BED file of coordinates for the regions (X-axis) and a BED file of mapped reads for each condition, so five in your case. You can alter the bin length in the options. It is recommended that the reads be normalised within seqMINER using the linear-normalisation setting.