Starting from a few tens of thousands of ChIP-seq peaks that contain transcription factor recognition sequences identifiable with a motif search program, how can one represent these words as a network cloud in which each node represents a distinct word found in a peak, node size represents the number of times the word is found, and edges represent the DNA sequence distance (number of mismatches) between two words? Is there any software to produce such a network?
The word length I am expecting would be typical of TFs as found in JASPAR, for example 10-mers.
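To make the intended structure concrete, here is a minimal sketch in plain Python of the network described above: nodes are distinct words, node size is the occurrence count, and edges connect words within a mismatch cutoff. The function names, the `max_mismatches` cutoff, and the example 10-mers are all hypothetical and only illustrate the idea; the sketch assumes the words have already been extracted from the peaks and are all the same length.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def build_word_network(words, max_mismatches=1):
    """Return (node_sizes, edges) for a list of equal-length words.

    node_sizes maps each distinct word to its occurrence count;
    edges lists (word_a, word_b, mismatches) for pairs within the cutoff.
    """
    node_sizes = Counter(words)
    edges = []
    # Compare every pair of distinct words (quadratic in the number of words).
    for a, b in combinations(sorted(node_sizes), 2):
        d = hamming(a, b)
        if d <= max_mismatches:
            edges.append((a, b, d))
    return node_sizes, edges


# Hypothetical motif hits, one word per peak (made-up example data).
hits = ["TGACTCAGCA", "TGACTCAGCA", "TGACTCATCA", "TGAGTCAGCA", "CCCCTTTTGG"]
sizes, edges = build_word_network(hits, max_mismatches=1)
```

The resulting node counts and edge list could then be written out as a table and loaded into a graph library or network viewer for layout. The all-pairs comparison is quadratic, so with many thousands of distinct words it may need a smarter neighbour search.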
How long would your words be? Typically you generate these motifs either de novo, by comparing ~100-mers under every peak summit, or by using a pre-existing database of motifs (e.g. JASPAR). The motifs themselves are not very long (<10 bp) and are stored in the database as position weight matrices (PWMs). So would your edges be a measure of the differences between these PWMs, or of the differences between the 100-mers?