Hi folks!
I have 1000+ binary vectors of the same length. Each position corresponds to a variant in the human genome, and the value (0 or 1) indicates whether that variant falls within a given functional annotation (e.g. transcription factor binding sites, protein-coding regions, methylation sites, etc.). Thus, the vectors differ in the order, frequency, and distribution of 0s and 1s. [The annotations are based on the ENCODE data, the Roadmap Epigenomics data, and other studies, and were downloaded with the Garfield package.]
Example vectors (with a length of 40 positions):
Transcription factor 1 binding sites: 1111111111101000000000000000000000000000
Protein coding region: 0000000000000000000000000000001111111111
DNA methylation sites: 1111111111111000000100000000000000000000
H3K36me3 sites: 0000011100000111100000001011100000000000
My question is: is there a way to calculate the similarity between these vectors? Would this be a simple correlation? Or maybe there is a tool that allows me to do this? I've been stuck on this for a while, and I thought that maybe you guys could share your thoughts, or redirect me to a relevant post?
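To make the question concrete, here is a rough sketch of the kind of pairwise comparison I have in mind, using the four example vectors above. The choice of the Jaccard index and the phi coefficient (Pearson correlation for two binary variables) is just an assumption on my part, and the function names are made up for illustration. I don't know whether these are the right measures for annotation data.

```python
from itertools import combinations
from math import sqrt, nan

# The four example vectors from above, stored as 0/1 strings.
vectors = {
    "TF1 binding": "1111111111101000000000000000000000000000",
    "Protein coding": "0000000000000000000000000000001111111111",
    "DNA methylation": "1111111111111000000100000000000000000000",
    "H3K36me3": "0000011100000111100000001011100000000000",
}


def _counts(a, b):
    """Return the 2x2 contingency counts (n11, n10, n01, n00) for two 0/1 strings."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    n11 = sum(x == "1" and y == "1" for x, y in zip(a, b))
    n10 = sum(x == "1" and y == "0" for x, y in zip(a, b))
    n01 = sum(x == "0" and y == "1" for x, y in zip(a, b))
    n00 = len(a) - n11 - n10 - n01
    return n11, n10, n01, n00


def jaccard(a, b):
    """Jaccard index: shared 1s divided by positions where either vector is 1."""
    n11, n10, n01, _ = _counts(a, b)
    union = n11 + n10 + n01
    # Assumed convention: two all-zero vectors get a similarity of 0.0.
    return n11 / union if union else 0.0


def phi(a, b):
    """Phi coefficient, i.e. Pearson correlation of two binary vectors."""
    n11, n10, n01, n00 = _counts(a, b)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Undefined when either vector is constant (all 0s or all 1s).
    return (n11 * n00 - n10 * n01) / denom if denom else nan


def pairwise_similarity(named_vectors, metric):
    """Apply `metric` to every unordered pair of named vectors."""
    return {
        (name_a, name_b): metric(named_vectors[name_a], named_vectors[name_b])
        for name_a, name_b in combinations(named_vectors, 2)
    }
```

Calling `pairwise_similarity(vectors, jaccard)` or `pairwise_similarity(vectors, phi)` would give one score per pair of annotations. For 1000 vectors that is roughly 500,000 pairs, so I suspect a vectorized or dedicated tool would be needed in practice.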
Many thanks!
P.S.: Garfield calculates the enrichment of these annotations in GWAS results, but not the enrichment of these annotations in other annotations.