TSS Distance vs ChIP-seq Tag Density K-Means Clustering
11.6 years ago
kanwarjag ★ 1.2k

I am trying to perform k-means clustering on TSS distance vs ChIP-seq tag density. My aim is to generate a heat map like the one shown in Fig. 2E of http://www.ncbi.nlm.nih.gov/pubmed/18992931

I generated tag densities using HOMER for 1000 bp on either side of the TSS (2 kb total) in 50 bp bins. This gave me a matrix, which I load into Cluster 3 to perform k-means clustering; I can also use other tools for the clustering. However, all of them freeze and complain about memory. I am not very good with command-line tools. Having said that, I think one solution is to reduce the data in the matrix generated by HOMER. I have tried the filter tools in Cluster 3 but failed to reduce the data. Could someone suggest how I can reduce the data before performing k-means clustering? My reads are 50 bp and this factor is tightly bound around the TSS, so I selected 1 kb on either side of the TSS.

Thanks

map chipseq • 4.6k views

What are the specs of the machine you are running the software on? If every program complains about memory, an easy fix is to add memory ;-) There are many solutions around for clustering big data. 2 kb split into 50 bp bins gives 40 regions. How many TSSs are in your matrix?


(See also: Why does the Homer tool find TSS sites for so many (41,478) genes?) HOMER reports 41,478 mapped TSSs, so the matrix has 41,478 rows and 43 columns when I use 1000 bp on either side of the TSS with 5 bp bins. I am using Windows 7, 64-bit, with 12 GB RAM and an i7 CPU. I also have access to an iMac.
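For scale, a matrix with these dimensions is small compared with 12 GB of RAM. A quick back-of-the-envelope check in R, assuming every cell is stored as an 8-byte double (the variable names are illustrative only):

```r
# Dimensions reported in the thread: 41,478 TSSs x 43 columns
n_rows <- 41478
n_cols <- 43

# Assumption: values are stored as doubles (8 bytes each)
bytes <- n_rows * n_cols * 8
mb <- bytes / 1024^2

print(round(mb, 1))  # size in MiB, ignoring the small object header
```

This suggests the freezes are more likely caused by how a given GUI tool handles the data than by the raw size of the matrix.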

11.6 years ago
David ▴ 740

k-means algorithms are very efficient, and there should be no problem clustering your data. I just ran it in R on a fake data set of the same size and it took only a few seconds.

> mat <- matrix(rnorm(n=43*41000), ncol=43)
# dimensions
> dim(mat)
[1] 41000    43
> r <- kmeans(mat, centers=3)
Warning message:
did not converge in 10 iterations
> r
K-means clustering with 3 clusters of sizes 13868, 13492, 13640

Cluster means:
[...]
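The convergence warning in the run above comes from the default cap of 10 iterations in `kmeans`. A minimal sketch of the same fake-data example with a higher cap and several random starts; the seed and the values chosen for `iter.max` and `nstart` are arbitrary assumptions, not tuned settings:

```r
set.seed(42)  # k-means starts from random centers; fix the seed for reproducibility

# Same fake data as above: 41,000 rows x 43 columns of random noise
mat <- matrix(rnorm(n = 43 * 41000), ncol = 43)

# iter.max raises the iteration cap (default is 10);
# nstart runs several random initialisations and keeps the best solution
r <- kmeans(mat, centers = 3, iter.max = 100, nstart = 5)

table(r$cluster)  # number of rows assigned to each cluster
r$iter            # iterations used by the best run
```

On real tag-density data the clusters should be far less balanced than on random noise, since noise has no structure for k-means to find.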

Anyway, my main concern here is that the protein in your ChIP-seq data appears to bind almost every TSS in the genome... Maybe you are dealing with DNase data... or the peak-calling algorithm had a problem.
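One way to both reduce the data and check how many TSSs actually carry signal is to drop low-coverage rows before clustering. The sketch below uses simulated counts as a stand-in for the real matrix; the matrix `dens`, the cutoff `min_total`, and the simulated peak shape are all hypothetical and would need to be replaced with your own data and a threshold that suits it:

```r
set.seed(1)

# Simulated stand-in for a tag-density matrix (rows = TSSs, columns = bins)
n_tss <- 2000
pos <- seq(-1000, 1000, by = 50)   # 41 bin centres around the TSS
n_bins <- length(pos)

# Assumption for the simulation: 25% of TSSs carry a central peak
bound <- seq_len(n_tss) <= n_tss / 4
n_bound <- sum(bound)
peak <- 20 * exp(-(pos / 200)^2)

# Background counts everywhere, plus peak counts on the bound rows
dens <- matrix(rpois(n_tss * n_bins, lambda = 1), nrow = n_tss)
extra <- matrix(rpois(n_bound * n_bins, lambda = rep(peak, each = n_bound)),
                nrow = n_bound)
dens[bound, ] <- dens[bound, ] + extra

# Filter: keep only rows whose total tag count reaches a (hypothetical) cutoff
min_total <- 100
keep <- rowSums(dens) >= min_total
filtered <- dens[keep, , drop = FALSE]

# Cluster the remaining rows and order them by cluster for a heat map
km <- kmeans(filtered, centers = 3, iter.max = 100, nstart = 5)
ordered <- filtered[order(km$cluster), , drop = FALSE]

sum(keep)      # how many TSSs passed the filter
dim(ordered)   # rows kept x bins
```

The `ordered` matrix can then be drawn with, for example, `image(pos, seq_len(nrow(ordered)), t(ordered))`. If nearly every row passes a reasonable cutoff on the real data, that would support the concern that the signal is not specific to bound TSSs.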
