Question

Tool For Binning Windowbed Output For K-Means Clustering

1

11.3 years ago

bede.portz ▴ 540

I have mapped high-resolution ChIP-seq data to transcription start sites (TSSs) using windowBed. I now want to bin the data, in bin sizes of my choosing, relative to the TSSs so that I can generate heat maps and do k-means clustering on the data.

What tools exist for doing this?

Thanks!

bedtools chip-seq clustering heatmap • 3.4k views

score 3 · Answer 1 · 2013-09-24

11.3 years ago

vj ▴ 520

You can take a look at seqMINER, a standalone program.

+1 You could simply enter the coordinates of the TSSs and then the mapped reads from your ChIP-seq data.

Reply • 11.3 years ago by Ian 6.1k

Istvan Albert · Answer 2 · 2013-10-01

Update: HOMER can carry out this procedure very quickly. In a single command it can align reads as a BED file to the TSS (or other features) and generate histograms or a matrix suitable for clustering. The window around the TSS can be specified, as can the bin size.

The specific feature of HOMER that accomplishes this is annotatePeaks.pl

usage:

annotatePeaks.pl tss ~/pathToGenome -size <range around TSS> -hist <bin size> -ghist -p <Read/Peak file> > output.txt

Here, tss specifies a TSS-centric analysis; the genome path points to the genome as downloaded and indexed via the configureHomer.pl script from the command line; -size sets the total width of the window around the TSS into which reads will be mapped (i.e. 1000 is 500 bp upstream and 500 bp downstream of the TSS); -hist sets the bin size; -p specifies the read/peak file, here in BED format; and -ghist provides a gene-by-gene histogram (i.e. a matrix that can be sorted or clustered by other programs).
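To make the window and bin size parameters concrete, here is a minimal Python sketch of the kind of per-gene binning described above. It is not HOMER's implementation, and the function name bin_reads_around_tss, the input layout, and the toy coordinates are all assumptions made up for illustration. It simply counts read positions in fixed-width bins relative to each TSS, flipping the sign for genes on the minus strand.

```python
from collections import defaultdict


def bin_reads_around_tss(tss, reads, size=1000, bin_size=10):
    """Return {gene: [count per bin]} for reads within size/2 bp of each TSS.

    tss   -- dict mapping gene name to (chrom, position, strand); hypothetical layout
    reads -- iterable of (chrom, position) pairs, e.g. read midpoints
    """
    if size % 2 or size % bin_size:
        raise ValueError("size must be even and a multiple of bin_size")
    half = size // 2
    n_bins = size // bin_size

    # Group TSSs by chromosome so a read is only compared to genes on its chromosome.
    by_chrom = defaultdict(list)
    for gene, (chrom, pos, strand) in tss.items():
        by_chrom[chrom].append((gene, pos, strand))

    matrix = {gene: [0] * n_bins for gene in tss}
    for chrom, pos in reads:
        for gene, tss_pos, strand in by_chrom.get(chrom, []):
            # Signed offset: negative means upstream with respect to the gene's strand.
            offset = pos - tss_pos if strand == "+" else tss_pos - pos
            if -half <= offset < half:
                matrix[gene][(offset + half) // bin_size] += 1
    return matrix


# Toy data (made up): two genes and five read positions.
tss = {"geneA": ("chr1", 1000, "+"), "geneB": ("chr1", 5000, "-")}
reads = [("chr1", 990), ("chr1", 1004), ("chr1", 1007), ("chr1", 5012), ("chr2", 1000)]
matrix = bin_reads_around_tss(tss, reads, size=40, bin_size=10)
for gene, row in matrix.items():
    print(gene, row)
```

Each row of the result is one gene and each column is one bin, which is the shape a heat map or k-means routine expects. The nested loop is fine for a toy example but would be slow on real data, where a sorted or interval-based lookup would be a better choice.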

I hope this helps.

Bede

score 0 · Answer 3 · 2013-09-24

In another thread, "Annotatepeaks Function From Homer", someone suggested using AnnotateGenomicRegions. I'm not sure if you can use it via the command line. At the very least it appears to be a very quick and easy way to obtain the gene annotation for each read; you can then sort the results and count the number of times each gene appears, using your favorite language. At first glance, I don't see how one would bin specific regions of the gene with this program.

edit: For anyone who doesn't know how to count the number of unique genes, I might use an awk script...

 awk '{ print $2 }' annotations_1380035469204_3516.txt | sort | uniq -c | sort -n

The "annotations_1380035469204_3516.txt" file would be the output of AnnotateGenomicRegions; the command takes the first annotation (second column) and counts how often each value occurs. I've noticed that sometimes there are multiple annotations for a given read, so you will have to think about what to do in those cases. Good luck!
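If you would rather do the counting in a script, the same idea can be sketched in Python. This assumes, like the awk command, that the annotation is the second whitespace-separated field; the function name count_second_column and the example lines are made up for illustration and are not real AnnotateGenomicRegions output.

```python
from collections import Counter


def count_second_column(lines):
    """Count how often each value appears in the second whitespace-separated field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Unlike the awk one-liner, lines with fewer than two fields are skipped.
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


# Made-up lines standing in for the annotation file.
example = ["read1 GENE1 extra", "read2 GENE2", "read3 GENE1", "malformed"]
counts = count_second_column(example)
for gene, n in counts.most_common():
    print(n, gene)
```

For a real file you could pass an open file handle instead of the list, e.g. count_second_column(open(path)), since the function only iterates over lines.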

score 0 · Answer 4 · 2013-09-25

11.3 years ago

bede.portz ▴ 540

I have installed seqMINER, but the documentation doesn't explain how to alter bin sizes. I want to map the data to TSSs in small bins (5-10 bp).

Any advice on usage?

Thanks.

There is an option under Tools --> Options --> Clustering options --> Wiggle step. See if changing that helps.

Reply • 11.3 years ago by vj ▴ 520