I have data from the tool ChromHMM. It splits the genome (or a single chromosome) into fixed-size bins (say 200 bp) and assigns each bin a state. For those interested, the assignment is based on an observation sequence made up of the combinations of histone modifications present in each bin; see the Ernst/Kellis 2009 paper.
The data look like the following (in a .bed file).
Dense file:
chr10 0 3020800 13 0 . 0 3020800 255,255,204
chr10 3020800 3021600 16 0 . 3020800 3021600 102,153,51
chr10 3021600 3022200 13 0 . 3021600 3022200 255,255,204
chr10 3022200 3022600 6 0 . 3022200 3022600 0,102,0
chr10 3022600 3033600 13 0 . 3022600 3033600 255,255,204
chr10 3033600 3034200 2 0 . 3033600 3034200 0,153,204
chr10 3034200 3034400 6 0 . 3034200 3034400 0,102,0
chr10 3034400 3036800 13 0 . 3034400 3036800 255,255,204
chr10 3036800 3037200 1 0 . 3036800 3037200 0,0,255
chr10 3037200 3040800 13 0 . 3037200 3040800 255,255,204
Or the alternative file:
chr10 0 3020800 E13
chr10 3020800 3021600 E16
chr10 3021600 3022200 E13
chr10 3022200 3022600 E6
chr10 3022600 3033600 E13
chr10 3033600 3034200 E2
chr10 3034200 3034400 E6
chr10 3034400 3036800 E13
chr10 3036800 3037200 E1
chr10 3037200 3040800 E13
Basically, the first line says that all the 200 bp bins from position 0 to 3020800 were assigned state 13 (~15,000 bins). I also have another file that gives the state per bin, but it is an incredibly large file. That file looks like this:
cell_MB chr10
MaxState E
13
13
13
13
13
13
13
13
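To make the relationship between the segment files and the per-bin file concrete, here is a minimal Python sketch that expands segments back into 200 bp bins. It assumes whitespace-separated columns with the state in column 4 (true for both the dense and the alternative file above); the names `parse_segments` and `expand_to_bins` are my own for illustration and do not come from ChromHMM.

```python
BIN_SIZE = 200  # assumed bin width in bp


def parse_segments(lines):
    """Parse 'chrom start end state ...' lines into (chrom, start, end, state) tuples.

    Only the first four columns are used, so both the dense file
    (state '13') and the alternative file (state 'E13') are accepted.
    """
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        segments.append((fields[0], int(fields[1]), int(fields[2]), fields[3]))
    return segments


def expand_to_bins(segments, bin_size=BIN_SIZE):
    """Yield one (chrom, bin_start, bin_end, state) tuple per bin in each segment."""
    for chrom, start, end, state in segments:
        for bin_start in range(start, end, bin_size):
            yield chrom, bin_start, min(bin_start + bin_size, end), state


if __name__ == "__main__":
    example = [
        "chr10 0 3020800 E13",
        "chr10 3020800 3021600 E16",
    ]
    n_bins = sum(1 for _ in expand_to_bins(parse_segments(example)))
    print(n_bins)
```

Because the generator yields bins lazily, the per-bin view never has to be held in memory, which is the reason I would rather work from the segment files than from the large per-bin file.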
What I want to do:
- Calculate the distance from each bin to the nearest gene and retrieve the gene information. I will then use DAVID to perform GO analysis.
- Calculate the percentage of bins (for a fixed state, say E13) that are within 2 kb of a TSS.
It is the second bullet point that is more important to me.
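To be precise about the second bullet, this is a minimal Python sketch of the calculation I have in mind. It is only an illustration under my own assumptions: TSS positions are supplied as a sorted list per chromosome (the coordinate in the example is made up), intervals are half-open as in BED, and a bin counts as "within 2 kb" if the gap between the bin and the nearest TSS is at most 2000 bp. The function names are hypothetical.

```python
from bisect import bisect_left


def distance_to_nearest(sorted_tss, start, end):
    """Distance in bp from the half-open bin [start, end) to the nearest TSS.

    Returns 0 if a TSS falls inside the bin and None if there are no TSSs.
    `sorted_tss` must be sorted in ascending order.
    """
    if not sorted_tss:
        return None
    i = bisect_left(sorted_tss, start)
    best = None
    if i < len(sorted_tss):  # nearest TSS at or after the bin start
        pos = sorted_tss[i]
        best = 0 if pos < end else pos - (end - 1)
    if i > 0:  # nearest TSS before the bin start
        upstream = start - sorted_tss[i - 1]
        best = upstream if best is None else min(best, upstream)
    return best


def fraction_near_tss(segments, tss_by_chrom, state, window=2000, bin_size=200):
    """Fraction of bins with the given state lying within `window` bp of a TSS."""
    total = near = 0
    for chrom, start, end, seg_state in segments:
        if seg_state != state:
            continue
        tss = tss_by_chrom.get(chrom, [])
        for bin_start in range(start, end, bin_size):
            total += 1
            d = distance_to_nearest(tss, bin_start, min(bin_start + bin_size, end))
            if d is not None and d <= window:
                near += 1
    return near / total if total else float("nan")


if __name__ == "__main__":
    # Segments taken from the alternative file above; the TSS is hypothetical.
    segments = [
        ("chr10", 3021600, 3022200, "E13"),
        ("chr10", 3022200, 3022600, "E6"),
        ("chr10", 3022600, 3033600, "E13"),
    ]
    tss_by_chrom = {"chr10": [3022000]}
    print(fraction_near_tss(segments, tss_by_chrom, "E13"))
```

Multiplying the returned fraction by 100 gives the percentage. The same `distance_to_nearest` helper would also cover the first bullet if gene start coordinates were passed in place of TSS positions.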
Does anyone know of a tool or an R package that does this? As I mentioned, I have the data in three different formats, so any tool or package that accepts one of these files would be awesome.
I ask here because I have previously spent months coding something from scratch in R, only to realize that an R package already existed that did exactly what I had coded.