Hey guys, I'm new to Python and to bioinformatics in general.
I'm currently working on a project that requires translating information from two Excel files (each with a column for species/common name) into taxonomy IDs. Since the original species/common names are not always accurate, I found a function online that finds the best-matching correct species name. There is also a function that translates a species name into a taxonomy ID. Both functions are part of ETE3.
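Before any lookup, the name columns from the two spreadsheets have to end up in one clean list. Here is a minimal sketch of that step in plain Python, so it should work on any iterable, including a pandas column. The helper name `collect_names` is made up for illustration and is not part of pandas or ETE3.

```python
def collect_names(*columns):
    """Merge several columns of names into one de-duplicated list.

    Non-string cells (for example NaN from empty Excel cells) and blank
    strings are skipped. Duplicates are detected case-insensitively and
    the first spelling seen is kept.
    """
    seen = set()
    names = []
    for column in columns:
        for value in column:
            if not isinstance(value, str):
                continue  # skip NaN, None, numbers
            cleaned = " ".join(value.split())  # trim and collapse whitespace
            if not cleaned:
                continue  # skip empty cells
            key = cleaned.lower()
            if key not in seen:
                seen.add(key)
                names.append(cleaned)
    return names
```

With pandas this might be called as `collect_names(pd.read_excel("file1.xlsx")["species"], pd.read_excel("file2.xlsx")["species"])`, where the file names and the `species` column name are placeholders for whatever the real spreadsheets use.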
I don't know which values/variables go into the functions (see the calls at the end of this post) to get a result.
My current code in Python (Visual Studio Code, after activating Anaconda) is:
import pandas as pd
import numpy as np
import ete3
pip install ncbi-taxonomist
(this prints "Note: you may need to restart the kernel to use updated packages.")
from ete3 import NCBITaxa
ncbi = NCBITaxa()
ncbi.update_taxonomy_database()
def get_fuzzy_name_translation(self, name, sim=0.9):
    '''
    Given an inexact species name, returns the best match in the NCBI database of taxa names.
    :argument 0.9 sim: Min word similarity to report a match (from 0 to 1).
    :return: taxid, species-name-match, match-score
    '''
    import sqlite3.dbapi2 as dbapi2
    _db = dbapi2.connect(self.dbfile)
    _db.enable_load_extension(True)
    module_path = os.path.split(os.path.realpath(__file__))[0]
    _db.execute("select load_extension('%s')" % os.path.join(module_path,
                "SQLite-Levenshtein/levenshtein.sqlext"))
    print("Trying fuzzy search for %s" % name)
    maxdiffs = math.ceil(len(name) * (1-sim))
    cmd = 'SELECT taxid, spname, LEVENSHTEIN(spname, "%s") AS sim FROM species WHERE sim<=%s ORDER BY sim LIMIT 1;' % (name, maxdiffs)
    taxid, spname, score = None, None, len(name)
    result = _db.execute(cmd)
    try:
        taxid, spname, score = result.fetchone()
    except TypeError:
        cmd = 'SELECT taxid, spname, LEVENSHTEIN(spname, "%s") AS sim FROM synonym WHERE sim<=%s ORDER BY sim LIMIT 1;' % (name, maxdiffs)
        result = _db.execute(cmd)
        try:
            taxid, spname, score = result.fetchone()
        except:
            pass
        else:
            taxid = int(taxid)
    else:
        taxid = int(taxid)
    norm_score = 1 - (float(score)/len(name))
    if taxid:
        print("FOUND! %s taxid:%s score:%s (%s)" %(spname, taxid, score, norm_score))
    return taxid, spname, norm_score
and
def get_name_translator(self, names):
    """
    Given a list of taxid scientific names, returns a dictionary translating them into their corresponding taxids.
    Exact name match is required for translation.
    """
    name2id = {}
    #name2realname = {}
    name2origname = {}
    for n in names:
        name2origname[n.lower()] = n
    names = set(name2origname.keys())
    query = ','.join(['"%s"' %n for n in six.iterkeys(name2origname)])
    cmd = 'select spname, taxid from species where spname IN (%s)' %query
    result = self.db.execute('select spname, taxid from species where spname IN (%s)' %query)
    for sp, taxid in result.fetchall():
        oname = name2origname[sp.lower()]
        name2id.setdefault(oname, []).append(taxid)
        #name2realname[oname] = sp
    missing = names - set([n.lower() for n in name2id.keys()])
    if missing:
        query = ','.join(['"%s"' %n for n in missing])
        result = self.db.execute('select spname, taxid from synonym where spname IN (%s)' %query)
        for sp, taxid in result.fetchall():
            oname = name2origname[sp.lower()]
            name2id.setdefault(oname, []).append(taxid)
            #name2realname[oname] = sp
    return name2id
All of this code runs fine. My problem is figuring out which values/variables should replace the ?'s below so that an inaccurate species name is turned into an accurate one:
from ete3 import NCBITaxa
ncbi = NCBITaxa()
fuzzy_name = ncbi.get_fuzzy_name_translation(?,?,?)
print (dog?,0.9?)
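Going by the signature `get_fuzzy_name_translation(self, name, sim=0.9)`, `self` is filled in automatically when the method is called on an instance (`NCBITaxa()`, with parentheses), so only the name and optionally `sim` should need to be passed. Here is a minimal sketch under that assumption. `load_ncbi` and `best_match` are hypothetical helper names, and the fuzzy search relies on ETE3 loading its bundled SQLite Levenshtein extension, which may not work on every installation.

```python
def load_ncbi():
    """Create an ETE3 NCBITaxa instance (the first call downloads the taxonomy database)."""
    from ete3 import NCBITaxa  # imported lazily so only this helper needs ete3
    return NCBITaxa()


def best_match(ncbi, name, sim=0.9):
    """Return the closest NCBI name for an inexact name.

    `ncbi` is an NCBITaxa instance. The method returns a 3-tuple
    (taxid, matched name, normalized score); taxid is None when no
    name is within the allowed edit distance.
    """
    taxid, spname, score = ncbi.get_fuzzy_name_translation(name, sim)
    return {"query": name, "taxid": taxid, "match": spname, "score": score}
```

Usage would then be along the lines of `ncbi = load_ncbi()` followed by `print(best_match(ncbi, "Homo sapeins"))`, with a lower `sim` (for example 0.8) if nothing is found.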
I'd also like to know how to get taxonomy IDs using:
from ete3 import NCBITaxa
ncbi = NCBITaxa()
taxid_name = ncbi.get_name_translator(?)
print (?)
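For the exact lookup, `get_name_translator(self, names)` takes a single list of names and, per its docstring, returns a dictionary mapping each matched name to its taxids. Here is a minimal sketch of calling it and keeping track of the names that found no exact match. `names_to_taxids` is a hypothetical helper name, and whether common names such as "dog" are matched depends on what the local database stores.

```python
def names_to_taxids(ncbi, names):
    """Translate exact names to taxids.

    Returns (found, missing): `found` maps each matched name to a list
    of taxids (a name can map to more than one), `missing` lists the
    names with no exact match, in their original order.
    """
    names = list(names)
    found = ncbi.get_name_translator(names)  # dict: name -> [taxid, ...]
    found_lower = {key.lower() for key in found}
    missing = [n for n in names if n.lower() not in found_lower]
    return found, missing
```

With `ncbi = NCBITaxa()`, a call such as `found, missing = names_to_taxids(ncbi, ["Homo sapiens", "Mus musculus"])` should give the taxid dictionary in `found`, and anything left in `missing` would be a candidate for the fuzzy search.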
I ran
help(get_fuzzy_name_translation)
help(get_name_translator)
and got
Help on function get_fuzzy_name_translation in module __main__:
get_fuzzy_name_translation(self, name, sim=0.9)
    Given an inexact species name, returns the best match in the NCBI database of taxa names.
    :argument 0.9 sim: Min word similarity to report a match (from 0 to 1).
    :return: taxid, species-name-match, match-score
Help on function get_name_translator in module __main__:
get_name_translator(self, names)
    Given a list of taxid scientific names, returns a dictionary translating them into their corresponding taxids.
    Exact name match is required for translation.
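The `sim` argument is easier to reason about with numbers. In the pasted source, the allowed number of edits is `ceil(len(name) * (1 - sim))` and the returned score is `1 - edits / len(name)`. The sketch below only replays that arithmetic. `max_edits` and `normalized_score` are hypothetical names, not part of ETE3.

```python
import math


def max_edits(name, sim=0.9):
    """Largest edit distance still reported as a match for this name and sim."""
    return math.ceil(len(name) * (1 - sim))


def normalized_score(name, edits):
    """Score returned for a match that is `edits` edits away from `name`."""
    return 1 - (float(edits) / len(name))
```

For the 12-character string "Homo sapeins" with the default `sim=0.9`, `max_edits` gives 2, so a name two edits away can still be returned. Short names leave little room: for a 3-character name like "dog" the same setting allows a single edit.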
I apologize for the long post and the poor code formatting; I tried to present the information as clearly as possible.
Any pointers would be great! I'm working on it every day to try to figure it out.