I currently have a dataframe that lists which gene clusters (GCFs) occur in which genomes. It is stored as a well-formatted tab-delimited file that looks basically like this (example):
Gene Cluster    Genome
---------------------------------------
GCF3372         Streptomyces_hygroscopicus
GCF3371         Streptomyces_xiamenensis
Based on this, I want to measure the occurrence of each GCF per genome, as well as the co-occurrence of a given GCF with others across genomes.
What type of visualization would anyone suggest for this kind of data? The best option I could think of is a heatmap.
The suitable data structure depends on how many genomes and how many gene clusters you have. Would a hash table or a dictionary work (genome as key and GCFs as values)? I'd suggest the hash package in R or a dictionary in Python.
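As a rough sketch of the dictionary idea, assuming the file has exactly two tab-separated columns (GCF, then genome) and no header row: the function name `load_genome_to_gcfs` and the sample rows are made up for illustration, and the third row is invented so that one genome carries two GCFs.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the same layout as the question's file.
# The third row is invented for illustration only.
SAMPLE = (
    "GCF3372\tStreptomyces_hygroscopicus\n"
    "GCF3371\tStreptomyces_xiamenensis\n"
    "GCF3371\tStreptomyces_hygroscopicus\n"
)


def load_genome_to_gcfs(handle):
    """Map each genome to the set of GCFs observed in it."""
    genome_to_gcfs = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        gcf, genome = row
        genome_to_gcfs[genome].add(gcf)
    return dict(genome_to_gcfs)


# With a real file, pass an open file handle instead of io.StringIO.
genome_to_gcfs = load_genome_to_gcfs(io.StringIO(SAMPLE))

for genome, gcfs in sorted(genome_to_gcfs.items()):
    print(genome, sorted(gcfs))
```

Using a set per genome means a GCF listed twice for the same genome is only stored once, which is probably what you want for presence/absence.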
Alternatively, use a binary matrix with genomes as rows and GCFs as columns: two GCFs that co-occur in a genome would both have the value 1 in the row for that genome (the matrix would be sparse, I assume). I'd suggest the pandas package in Python.
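A minimal pandas sketch of the binary matrix, under some assumptions: the column names `gcf` and `genome` are my own choice, and the third row is invented so that two GCFs share a genome. For a real file you could load the table with something like `pd.read_csv(path, sep="\t", names=["gcf", "genome"])`, adjusted for whether it has a header line.

```python
import pandas as pd

# Hypothetical rows in the question's layout; the third row is invented.
df = pd.DataFrame(
    {
        "gcf": ["GCF3372", "GCF3371", "GCF3371"],
        "genome": [
            "Streptomyces_hygroscopicus",
            "Streptomyces_xiamenensis",
            "Streptomyces_hygroscopicus",
        ],
    }
)

# Genomes as rows, GCFs as columns. crosstab counts pairs, so clip at 1
# to keep the matrix binary even if a (genome, GCF) pair is repeated.
presence = pd.crosstab(df["genome"], df["gcf"]).clip(upper=1)

# GCF-by-GCF co-occurrence: entry (a, b) is the number of genomes that
# contain both a and b; the diagonal is the number of genomes per GCF.
cooccurrence = presence.T.dot(presence)

print(presence)
print(cooccurrence)
```

Either `presence` or `cooccurrence` can then be handed to a heatmap function (for example `seaborn.heatmap`), and a 2x2 contingency table for any pair of GCFs can be built from two columns of `presence` with `pd.crosstab`.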
Thanks for your input on this. I will try building a binary matrix with the packages mentioned. I mostly just want a binary matrix so I can eventually plot a heatmap and run some statistical analysis, even something as simple as a chi-squared test to measure the correlation.
Thanks again @Fatima :D