Hi there,
Thanks so much for all of your help so far. I recently completed a successful OMA standalone run and am now digging into the results using the PyHam Python package. One thing I have noticed is that there seems to be a mismatch between the gene IDs listed by OMA in the Map-SeqNum-ID.txt file and the gene IDs stored in the HOG OrthoXML file.
For example, I have been trying to access information about a particular sequence listed as below in the mapping file:
branchiostoma_floridae 8460 XP_002593948.1 hypothetical protein BRAFLDRAFT_98245 [Branchiostoma floridae]
I have loaded my species tree and OrthoXML file into PyHam and have tried querying by gene ID, like so:
gene_8460 = metazoa_ham.get_gene_by_id(8460)
print(gene_8460.get_dict_xref())
This returned: {'id': '8460', 'protId': 'oki.206.4.t1'}
which is obviously not the gene I was actually trying to query.
I tried searching in reverse to confirm, i.e. using the external gene name:
test_gene = metazoa_ham.get_genes_by_external_id('XP_002593948.1 hypothetical protein BRAFLDRAFT_98245 [Branchiostoma floridae]')[0]
print(test_gene.get_dict_xref())
Which returned: {'id': '137320', 'protId': 'XP_002593948.1 hypothetical protein BRAFLDRAFT_98245 [Branchiostoma floridae]'}
So it appears that the external ID matches what is expected, but this sequence is stored as 8460 in the mapping file and as 137320 in the OrthoXML. I tested several other sequences in this way and got similar results.
My main question is: is this ID mismatch a symptom of something having gone wrong during the run, or is it expected behavior? As long as I have some way of accurately querying my sequences of interest to get their root-level HOGs etc., is there any reason to worry about it?
Additionally, is there some way to get PyHam to write out a list of OrthoXML gene IDs and their associated external IDs, so I don't always need to use the cumbersome external sequence names?
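To make clear what kind of table I am after, here is a minimal sketch of a workaround that does not go through PyHam at all. It reads the OrthoXML file directly with Python's standard library, assuming the file follows the usual OrthoXML layout in which each species element contains gene elements carrying "id" and "protId" attributes. The function name write_gene_table and the output column names are my own placeholders, not part of PyHam or OMA.

```python
import csv
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip any XML namespace, e.g. '{http://orthoXML.org/2011/}gene' -> 'gene'."""
    return tag.rsplit("}", 1)[-1]


def write_gene_table(orthoxml_path, out_path):
    """Write one TSV row per <gene> element: species name, OrthoXML id, protId.

    Hypothetical helper (not a PyHam function). Returns the number of genes written.
    """
    n_rows = 0
    root = ET.parse(orthoxml_path).getroot()
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["species", "orthoxml_id", "prot_id"])
        for species in root.iter():
            if _local(species.tag) != "species":
                continue
            # <geneRef> elements live under <groups>, so only real <gene>
            # definitions are picked up here.
            for gene in species.iter():
                if _local(gene.tag) == "gene":
                    writer.writerow(
                        [species.get("name"), gene.get("id"), gene.get("protId")]
                    )
                    n_rows += 1
    return n_rows
```

I would call it as write_gene_table("my_hogs.orthoxml", "orthoxml_gene_ids.tsv"), where both file names are placeholders for my actual paths. If PyHam already has a built-in way to do this, I would much rather use that.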
Thank you once again!
Sally Chang