I have found orthogroups (with OrthoFinder) in the full archaeal proteomes of the genus Halorubrum. As a result I have a dataframe with the number of proteins in each orthogroup for each organism (orthogroups in rows, species in columns), in which I have changed every value greater than 1 to 1 to obtain phyletic patterns. In the end I have this dataframe:
Some of the organisms are thermophilic (['aethiopicum', 'coriense', 'tebenquichense', 'vacuolatum', 'lipolyticum', 'saccharovorum', 'terrestre', 'salsamenti', 'yunnanense', 'sodomense', 'distributum', 'aidingense', 'arcis']) and the rest are non-thermophilic. The question is: how should I analyze this data if my goal is to find differences between the phyletic patterns of thermophiles and those of non-thermophiles?
I tried to calculate the Jaccard index for every orthogroup:
ogroups_patterns['J'] = ogroups_patterns_terms.sum(axis = 1, numeric_only = True) / ogroups_patterns.sum(axis = 1, numeric_only = True)
where ogroups_patterns_terms is a dataframe with phyletic patterns as in the screenshot above, but for thermophiles only.
But I have no idea whether this is the correct way to calculate this index in this case. Maybe allowing zeros in the formula would be a good idea, but I'm not sure how to code it. Any tip would be extremely helpful; I'm really stuck at this part and have no idea what to do or how to code it. Many thanks in advance!
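For reference, the ratio above is the share of thermophiles among the species that carry an orthogroup. The textbook Jaccard index divides the intersection by the union, so it also counts thermophiles in which the orthogroup is absent (the zeros). Below is a minimal sketch of one way to code that, not a definitive method. The toy table, the species names speciesX and speciesY, and the helper jaccard_vs_group are hypothetical, and the sketch assumes every column is a 0/1 species column.

```python
import pandas as pd

# Hypothetical toy phyletic patterns: rows are orthogroups,
# columns are species, 1 = present, 0 = absent.
ogroups_patterns = pd.DataFrame(
    {
        "aethiopicum": [1, 1, 0, 1],
        "coriense": [1, 1, 0, 0],
        "arcis": [1, 0, 0, 1],
        "speciesX": [1, 0, 1, 0],  # hypothetical non-thermophile
        "speciesY": [1, 0, 1, 1],  # hypothetical non-thermophile
    },
    index=["OG0", "OG1", "OG2", "OG3"],
)
thermophiles = ["aethiopicum", "coriense", "arcis"]


def jaccard_vs_group(patterns, group):
    """Per-orthogroup Jaccard index between the set of species that
    carry the orthogroup and the set of species listed in `group`."""
    in_group = patterns.columns.isin(group)  # boolean mask over columns
    present = patterns.astype(bool)
    # Species in the group that carry the orthogroup (intersection).
    both = present.loc[:, in_group].sum(axis=1)
    # Species outside the group that carry the orthogroup.
    only_present = present.loc[:, ~in_group].sum(axis=1)
    # Species in the group that lack the orthogroup (the zeros).
    only_group = in_group.sum() - both
    union = both + only_present + only_group
    return both / union


jaccard = jaccard_vs_group(ogroups_patterns, thermophiles)
print(jaccard)
```

A value of 1 would mean the orthogroup is present in all thermophiles and in no other species; a value of 0 means no thermophile carries it.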
Thank you very much, your answer was extremely helpful!
Please accept the answer (green check mark) to provide closure for this thread.