Biopython - How to break down lineage from either nucleotide db or taxonomy db?
1
0
Entering edit mode
2.6 years ago
Nico ▴ 20

Hello Biostars,

I'm trying to break down the lineage of a genbank entry into it's ranks (phylum, genus, species, strain). I'm not a biologist so I apologize in advance if any terms are misused.

I got to the point where I can talk to genbank and either get the taxonomy info via the nucleotide db record:

with Entrez.efetch(db='nucleotide', id=accession, rettype='gb') as handle: 
    record = SeqIO.read(handle, "gb")
    genbank_organism_taxonomy = record.annotations["taxonomy"]

This gives me a list, for example:

['Bacteria', 'Cyanobacteria', 'Pseudanabaenales', 'Oculatellaceae', 'Tildeniella']

Which is great to read as a human, but there are no keys associated with the values, so I cannot easily tell which value is a kingdom, genus, species, etc. Does anyone know if I could figure this out by counting items? And if so, what is what in the order?

If that is not the way, I also found that I could retrieve the taxa ID of the organism, and use that to search in the taxonomy db:

def get_taxonomy(tax_id): #uses TaxaID to get the taxa 
with Entrez.efetch(db='taxonomy', id=tax_id, retmode='xml') as handle:
    record = Entrez.read(handle, validate=False)
    return record

In this case, the record holds way more info and I see some ranking under 'LineageEx':

'LineageEx': [{'TaxId': '131567', 'ScientificName': 'cellular organisms', 'Rank': 'no rank'}, {'TaxId': '2', 'ScientificName': 'Bacteria', 'Rank': 'superkingdom'}, {'TaxId': '1783272', 'ScientificName': 'Terrabacteria group', 'Rank': 'clade'}, {'TaxId': '1798711', 'ScientificName': 'Cyanobacteria/Melainabacteria group', 'Rank': 'clade'}, {'TaxId': '1117', 'ScientificName': 'Cyanobacteria', 'Rank': 'phylum'}, {'TaxId': '2881377', 'ScientificName': 'Pseudanabaenales', 'Rank': 'order'}, {'TaxId': '2303507', 'ScientificName': 'Oculatellaceae', 'Rank': 'family'}, {'TaxId': '2303519', 'ScientificName': 'Tildeniella', 'Rank': 'genus'}]

This seems to be what I need, but I don't know how to extract this info to break it down into either a dictionary or an object where I have a series of Rank: value.

This might be a Python ignorance issue more than a bio issue, but I figured this would be the place to ask. I'm happy using either option,

Thank you in advance for your help!

python genbank biopython • 1.5k views
ADD COMMENT
2
Entering edit mode

The TaxID is definitely the way I would approach this. If you're OK going outside biopython, I'd suggest taking a look at ETE3, since it has a whole module for exactly this kind of thing.

ADD REPLY
0
Entering edit mode

I'll look into it! I'm using biopython because is what I knew about, but ETE3 is also Python, which is the aspect I care about. Thanks!

ADD REPLY
1
Entering edit mode
2.6 years ago

The proper way to do this is with the bio package:

bio taxon 2303519 --lineage

prints:

no rank,131567,cellular organisms
  superkingdom,2,Bacteria
    clade,1783272,Terrabacteria group
      clade,1798711,Cyanobacteria/Melainabacteria group
        phylum,1117,Cyanobacteria
          order,1890424,Synechococcales
            family,2303507,Oculatellaceae
              genus,2303519,Tildeniella

see:

https://www.bioinfo.help/index.html

the tool is developed in Python and it has a very elegant and simple database that you can quickly traverse if the above is insufficient.

You could also call the API that generates the above.

ADD COMMENT
0
Entering edit mode

I'm struggling to figure out how to merge this into my script. I see how I would run it from a terminal but I don't see any mentions of an API in the docs. I tried googling it but "bio" and "python" usually bring up biopython, and "bioinfo" brings up unrelated bioinformatics things. Do you happen to know any resources I could read about the API or how to use it as a python module perhaps?

Thanks again.

ADD REPLY
0
Entering edit mode

The API is not well documented :-) but in return, I designed it in what I like to call "ludicrously" simple ;-)

the "database" behaves like a dictionary (it is backed by an SQLite database behind the scenes). That dictionary is keyed by the taxid.

https://github.com/ialbert/bio/blob/master/biorun/taxon.py#L256

with it you can write:

from biorun.taxon import get_data
names, graph = get_data()
print(names['2697049'])

it will print:

['Severe acute respiratory syndrome coronavirus 2', 'no rank', '', '694009', 0]

The fourth element is the parent taxid for the current taxid.

parent_taxid = names['2697049'][3]  
print(names[parent_taxid])

that prints:

['Severe acute respiratory syndrome-related coronavirus', 'species', '', '2509511', 0]

To find the lineage navigate up the tree. At the top, the parent is the same as the taxid. That's it

If you want to see that done in bio see this print_lineage function:

https://github.com/ialbert/bio/blob/master/biorun/taxon.py#L230

you would use that as:

from biorun.taxon import print_lineage
taxid = '2697049'
print_lineage(taxid, names=names)

that will print:

superkingdom, 10239, Viruses
  clade, 2559587, Riboviria
    kingdom, 2732396, Orthornavirae
      phylum, 2732408, Pisuviricota
        class, 2732506, Pisoniviricetes
          order, 76804, Nidovirales
            suborder, 2499399, Cornidovirineae
              family, 11118, Coronaviridae
                subfamily, 2501931, Orthocoronavirinae
                  genus, 694002, Betacoronavirus
                    subgenus, 2509511, Sarbecovirus
                      species, 694009, Severe acute respiratory syndrome-related coronavirus
                        no rank, 2697049, Severe acute respiratory syndrome coronavirus 2

make your own function based on print_lineage, instead of printing do something else

Note: the default object is not a "real" dictionary it just looks like one. Don't perform operations that need to access the entire dictionary for example traverse all fields. Each key access is a database query. The example code below will take a substantial time to finish because while each access is in milliseconds there are over 2 million entries in the taxonomical database, each print will take milliseconds

for name in names:
    print (name)

If you need to operate on the entire dictionary pass the preload=True in the get_data(preload=True). The preload will take about 10 seconds to materialize the entire thing into a "real" dictionary that will then operate at in-memory speeds.

ADD REPLY

Login before adding your answer.

Traffic: 2533 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6