How should I analyze differences between phyletic patterns?
9 months ago
Igor • 0

I have identified orthogroups (with OrthoFinder) in the complete proteomes of archaea from the genus Halorubrum. The result is a dataframe with the number of proteins in each orthogroup for each organism (orthogroups in rows, species in columns). To turn the counts into phyletic patterns, I replaced every value greater than 1 with 1. In the end I have this dataframe, 'ogroups_patterns' (screenshot: phyletic patterns of genus Halorubrum).

There are thermophilic (['aethiopicum', 'coriense', 'tebenquichense', 'vacuolatum', 'lipolyticum', 'saccharovorum', 'terrestre', 'salsamenti','yunnanense', 'sodomense', 'distributum', 'aidingense', 'arcis']) and non-thermophilic organisms. The question is: how should I analyze this data if my goal is to find how the phyletic patterns of thermophiles differ from those of non-thermophiles?

I tried to calculate the Jaccard index for every orthogroup:

ogroups_patterns['J'] = ogroups_patterns_terms.sum(axis = 1, numeric_only = True) / ogroups_patterns.sum(axis = 1, numeric_only = True)

where ogroups_patterns_terms is a dataframe with phyletic patterns as in the screenshot above, but for thermophiles only.

But I don't know whether this is the correct way to calculate the index in this case. Maybe allowing zeros in the formula would be a good idea, but I'm not sure how to code it. Any tip would be extremely helpful; I'm really stuck at this part and have no idea what to do or how to code it. Many thanks in advance!
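For reference, the ratio above is the fraction of an orthogroup's carriers that are thermophilic, and it is undefined (0/0) for an orthogroup that is absent everywhere. A set-based Jaccard index between "species carrying the orthogroup" and "thermophilic species" could be sketched as follows. The toy matrix and the species names `sp_A` and `sp_B` are hypothetical, and this is only one possible reading of the index, not necessarily the intended one.

```python
import pandas as pd

# Toy presence/absence matrix: orthogroups in rows, species in columns.
# The values and the names sp_A / sp_B are made up for illustration.
ogroups_patterns = pd.DataFrame(
    {
        "coriense": [1, 1, 0, 0],
        "arcis": [1, 1, 0, 0],
        "sp_A": [1, 0, 1, 0],  # hypothetical non-thermophile
        "sp_B": [1, 0, 1, 0],  # hypothetical non-thermophile
    },
    index=["OG0", "OG1", "OG2", "OG3"],
)
thermophiles = ["coriense", "arcis"]

# Boolean label per species (column): True if thermophilic.
is_thermo = ogroups_patterns.columns.isin(thermophiles)

# Boolean presence matrix, same shape as the dataframe.
present = ogroups_patterns.to_numpy().astype(bool)

# |A and B|: species that carry the orthogroup and are thermophilic.
intersection = (present & is_thermo).sum(axis=1)
# |A or B|: species that carry the orthogroup or are thermophilic.
# This is never zero as long as at least one thermophile is in the table.
union = (present | is_thermo).sum(axis=1)

jaccard = pd.Series(intersection / union, index=ogroups_patterns.index, name="J")
print(jaccard)
```

With this definition, an orthogroup found in exactly the thermophiles scores 1, and one found only in non-thermophiles (or nowhere) scores 0, so zeros in the pattern are handled without special cases.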

phylogeny thermophiles proteoms phyletic-patterns • 565 views
9 months ago
Mensur Dlakic ★ 28k

If you want to cluster by organisms, I suggest you transpose the matrix so the organisms are in rows and genes in columns. Then you can apply any of the dimensionality reduction methods (PCA, t-SNE, UMAP) to reduce the dataset to 2 or 3 dimensions. If your initial hypothesis is correct, thermophiles and non-thermophiles will be in separate groups.
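A minimal sketch of the transpose-then-reduce idea, using PCA from scikit-learn. The random toy matrix, the species names and the step that drops constant orthogroups are assumptions made for illustration; the real input would be the presence/absence dataframe from the question.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data: 50 orthogroups (rows) x 8 species (columns), values 0/1.
rng = np.random.default_rng(0)
species = [f"sp_{i}" for i in range(8)]
patterns = pd.DataFrame(
    rng.integers(0, 2, size=(50, len(species))),
    index=[f"OG{i}" for i in range(50)],
    columns=species,
)

# Transpose so that organisms are in rows and orthogroups in columns.
by_species = patterns.T

# Orthogroups present in all species or in none carry no signal; drop them.
informative = by_species.loc[:, by_species.nunique() > 1]

# Project each organism onto the first two principal components.
coords = PCA(n_components=2).fit_transform(informative)
pca_df = pd.DataFrame(coords, index=informative.index, columns=["PC1", "PC2"])
print(pca_df)
```

Plotting PC1 against PC2 and colouring the points by the thermophile label would then show whether the two groups separate.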

If you want to cluster by genes rather than by organisms, you don't need to transpose the matrix. In that case you are likely to get many more than just two clusters.

Generally speaking, any clustering method can work with the data you have, although you may need to sparsify it by converting zeros to missing values. There are many clustering techniques available in Python, specifically in the scikit-learn package:

https://scikit-learn.org/stable/modules/clustering.html
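As one illustration among the many options on that page, here is a small sketch that clusters organisms with k-means. The choice of k-means, the two-cluster setting, and the toy species `A1`..`B3` are assumptions for the example, not a recommendation for this particular dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical presence/absence patterns: orthogroups in rows, species in columns.
patterns = pd.DataFrame(
    {
        "A1": [1, 1, 1, 0, 0, 0, 1, 1],
        "A2": [1, 1, 1, 0, 0, 0, 1, 0],
        "A3": [1, 1, 0, 0, 0, 0, 1, 1],
        "B1": [0, 0, 0, 1, 1, 1, 1, 0],
        "B2": [0, 0, 0, 1, 1, 1, 1, 1],
        "B3": [0, 0, 1, 1, 1, 0, 1, 0],
    },
    index=[f"OG{i}" for i in range(8)],
)

# Cluster organisms, so organisms go in rows.
X = patterns.T

# Two clusters, matching the thermophile / non-thermophile hypothesis.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
clusters = pd.Series(labels, index=X.index, name="cluster")
print(clusters)
```

Comparing the resulting cluster labels with the known thermophile list (for example with `pd.crosstab`) gives a quick check of whether the clusters follow the phenotype.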

Food for thought:


Thank you very much, your answer was extremely helpful!


Please accept the answer (green check mark) to provide closure for this thread.
