Question

HOG without numbering in the genome of the ancestor reconstructed using the pyHAM

0

Entering edit mode

3.6 years ago

Maksim Nesterenko ▴ 20

Dear colleagues,

Recently I successfully complete OMA standalone run and now I am trying to analyze results using the pyHam python package. The main task for me is to reconstruct the genome of the ancestor of the analyzed species. After executing the following commands, I get information that in the model of the ancestor genome are 10372 genes:

ham_analysis = pyham.Ham(nwk_file, orthoxml_file, use_internal_name=True)

ancestral_genome = ham_analysis.get_ancestral_genome_by_name(ancestral_genome_name)

ancestral_genes = ancestral_genome.genes

print(len(ancestral_genes))

I am a little alarmed by two facts: 1) In ancestral_genes, many elements (HOG) have this designation <HOG ()>, and do not have a number, in contrast to, for example, <HOG (17899)> or <HOG (17900)>. Is it mistake if previously the sequences assigned to <HOG ()> were included in numbered orthogroups? Or am I missing something? The same designations are found even in the tutorial (https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html) [In [37]].

2) However, if we extract information about the descendant genes in these <HOG ()>, we will see that some of these groups of orthologs contain only 1 or 2 sequences of modern species. Correct me, please, if I am mistaken, but how can it be considered that the ancestor had a gene if it is present only in 1 or 2 species out of, for example, 11 analyzed? Is it worth specifying a certain threshold for the number of modern species in which a gene must be present to reconstruct the ancestor's genome? And how to do it most efficiently and correctly?

I would be grateful for any help!

HOG OMA ancestor pyHAM genome • 1.8k views

ADD COMMENT • link updated 3.1 years ago by jm440 ▴ 10 • written 3.6 years ago by Maksim Nesterenko ▴ 20

0

Entering edit mode

How did you extract information about the descendent genes in HOGs without numbers <HOG()> ? I have a ton of these but I can't find a way to do it.

ADD REPLY • link 3.2 years ago by jm440 ▴ 10

1

Entering edit mode

you can use the method get_all_descendant_genes() of the Hog objects to retrieve the extant genes.

ADD REPLY • link 3.1 years ago by Adrian Altenhoff ★ 1.1k

0

Entering edit mode

Thank you, I think I might be having a syntax issue with get_all_descendant_genes()

>>> hogs1=vertical_compare.get_lost()
>>> hogs1
{<HOG()>, <HOG()>, <HOG()>}
>>> hogs1.get_all_descendant_genes()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'set' object has no attribute 'get_all_descendant_genes'

ADD REPLY • link 3.1 years ago by jm440 ▴ 10

1

Entering edit mode

you need to iterate over all hogs in the hog1 set (hogs1 is a set of lost hogs).

for hog in hogs1:
    hog.get_all_descendant_genes()

ADD REPLY • link 3.1 years ago by Adrian Altenhoff ★ 1.1k

0

Entering edit mode

ahh that makes sense. Thank you!

ADD REPLY • link 3.1 years ago by jm440 ▴ 10

score 0 · Answer 1 · 2021-12-06

Dear Maksim,

Point #1: Only root HOGs are labelled with ID in the orthoxml. While printing a root HOG will display <HOG (17900)> printing a sub HOG (an HOG at a lower taxonomic range) will display <HOG ()>.

If you need to identifying all HOGs with unique ids then you should use built-in python object id

id(hog_object)

Point #2: Without having the reference species tree I can't help more about this 2/11 species coverage but what I can say is that such scenario is possible. it usually better to think in number of gene loss rather that # of species not covered by the HOG.

Nevertheless, in the OMA parameter file you can change the threshold for merging HOGs; in your case you can see that as as much HOGs need to be complete in term of species coverage to be clustered together.

MinEdgeCompletenessFraction := 0.65;

I hope I answered all your questions, Clement