Hello Biostars,
I'm trying to break down the lineage of a GenBank entry into its ranks (phylum, genus, species, strain). I'm not a biologist, so I apologize in advance if any terms are misused.
I got to the point where I can talk to genbank and either get the taxonomy info via the nucleotide db record:
from Bio import Entrez, SeqIO

with Entrez.efetch(db='nucleotide', id=accession, rettype='gb') as handle:
    record = SeqIO.read(handle, "gb")
genbank_organism_taxonomy = record.annotations["taxonomy"]
This gives me a list, for example:
['Bacteria', 'Cyanobacteria', 'Pseudanabaenales', 'Oculatellaceae', 'Tildeniella']
Which is great to read as a human, but there are no keys associated with the values, so I cannot easily tell which value is a kingdom, genus, species, etc. Does anyone know if I could figure this out by counting items? And if so, what is what in the order?
If that is not the way, I also found that I could retrieve the taxon ID (TaxID) of the organism and use that to search the taxonomy db:
def get_taxonomy(tax_id):  # uses the TaxID to fetch the taxonomy record
    with Entrez.efetch(db='taxonomy', id=tax_id, retmode='xml') as handle:
        record = Entrez.read(handle, validate=False)
    return record
In this case, the record holds way more info and I see some ranking under 'LineageEx':
'LineageEx': [{'TaxId': '131567', 'ScientificName': 'cellular organisms', 'Rank': 'no rank'}, {'TaxId': '2', 'ScientificName': 'Bacteria', 'Rank': 'superkingdom'}, {'TaxId': '1783272', 'ScientificName': 'Terrabacteria group', 'Rank': 'clade'}, {'TaxId': '1798711', 'ScientificName': 'Cyanobacteria/Melainabacteria group', 'Rank': 'clade'}, {'TaxId': '1117', 'ScientificName': 'Cyanobacteria', 'Rank': 'phylum'}, {'TaxId': '2881377', 'ScientificName': 'Pseudanabaenales', 'Rank': 'order'}, {'TaxId': '2303507', 'ScientificName': 'Oculatellaceae', 'Rank': 'family'}, {'TaxId': '2303519', 'ScientificName': 'Tildeniella', 'Rank': 'genus'}]
This seems to be what I need, but I don't know how to extract this info to break it down into either a dictionary or an object where I have a series of Rank: value.
This might be a Python ignorance issue more than a bio issue, but I figured this would be the place to ask. I'm happy using either option.
Thank you in advance for your help!
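On the pure-Python side, here is one possible sketch of turning 'LineageEx' into a rank-to-name dictionary. The helper name lineage_to_ranks is made up for illustration, and the sample data is just the LineageEx pasted in the question, so nothing here talks to NCBI. It assumes that Entrez.read returns a list of records for the taxonomy db, so with a live result the lineage would sit at record[0]['LineageEx'].

```python
# Hypothetical helper (the name is illustrative, not part of Biopython):
# turn a LineageEx list into a {rank: scientific name} dictionary.
def lineage_to_ranks(lineage_ex, skip=("no rank", "clade")):
    """Map each named rank to its scientific name.

    Ranks listed in `skip` are dropped because they can occur several
    times in one lineage and would overwrite each other as dict keys.
    """
    return {
        str(entry["Rank"]): str(entry["ScientificName"])
        for entry in lineage_ex
        if entry["Rank"] not in skip
    }


# Sample data copied from the LineageEx shown in the question.
lineage_ex = [
    {"TaxId": "131567", "ScientificName": "cellular organisms", "Rank": "no rank"},
    {"TaxId": "2", "ScientificName": "Bacteria", "Rank": "superkingdom"},
    {"TaxId": "1783272", "ScientificName": "Terrabacteria group", "Rank": "clade"},
    {"TaxId": "1798711", "ScientificName": "Cyanobacteria/Melainabacteria group", "Rank": "clade"},
    {"TaxId": "1117", "ScientificName": "Cyanobacteria", "Rank": "phylum"},
    {"TaxId": "2881377", "ScientificName": "Pseudanabaenales", "Rank": "order"},
    {"TaxId": "2303507", "ScientificName": "Oculatellaceae", "Rank": "family"},
    {"TaxId": "2303519", "ScientificName": "Tildeniella", "Rank": "genus"},
]

ranks = lineage_to_ranks(lineage_ex)
print(ranks["genus"])        # Tildeniella
print(ranks.get("species"))  # None: this lineage stops at genus

# With a live result from get_taxonomy() this would presumably be:
#   ranks = lineage_to_ranks(get_taxonomy(tax_id)[0]["LineageEx"])
```

Using .get() for ranks that may be missing avoids a KeyError, since not every lineage has every rank. As far as I can tell, LineageEx only lists the ancestors, so the organism's own name and rank would have to be read from the top-level 'ScientificName' and 'Rank' fields of the same record.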
The TaxID is definitely the way I would approach this. If you're OK going outside biopython, I'd suggest taking a look at ETE3, since it has a whole module for exactly this kind of thing.
I'll look into it! I'm using biopython because it's what I knew about, but ETE3 is also Python, which is the aspect I care about. Thanks!
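For anyone landing here later, a rough sketch of what the ETE3 route could look like. It relies on three NCBITaxa methods (get_lineage, get_rank, get_taxid_translator); the wrapper name ranks_from_taxid and its skip parameter are hypothetical. The NCBITaxa object is passed in as an argument, so the function itself does not import ete3.

```python
# Hypothetical wrapper (the name and the `skip` parameter are illustrative).
# `ncbi` is expected to be an ete3.NCBITaxa instance.
def ranks_from_taxid(ncbi, taxid, skip=("no rank", "clade")):
    """Return {rank: scientific name} for the lineage of `taxid`."""
    lineage = ncbi.get_lineage(taxid)             # list of TaxIDs, root first
    rank_of = ncbi.get_rank(lineage)              # {taxid: rank}
    name_of = ncbi.get_taxid_translator(lineage)  # {taxid: scientific name}
    # Drop ranks that can repeat within one lineage, since they would
    # overwrite each other as dictionary keys.
    return {
        rank_of[t]: name_of[t]
        for t in lineage
        if rank_of[t] not in skip
    }


# Usage sketch (requires `pip install ete3`; the first NCBITaxa() call
# downloads the NCBI taxonomy dump and builds a local database):
#
#   from ete3 import NCBITaxa
#   ranks = ranks_from_taxid(NCBITaxa(), 2303519)  # TaxID of Tildeniella in the question
#   print(ranks.get("genus"))
```

Unlike LineageEx, the lineage returned here includes the queried TaxID itself, so a species-level TaxID should also yield a 'species' entry.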