Hi all,
I have many gene names that I'm trying to map to Entrez IDs.
Right now I use Biopython's Entrez.esearch function to query them one by one, but this takes a long time for 30,000 gene names, and ideally I would like it to be faster. I assume it would be faster to query all 30,000 at once instead of making 30,000 separate queries.
This is my current implementation:
from Bio import Entrez

for line in f:  # f is the already-open input file
    fields = [item.strip('"') for item in line.strip().split()]
    if not fields:
        continue  # skip blank lines
    gene = fields[0]
    # Search NCBI Gene for a human gene with this name
    handle = Entrez.esearch(db="gene", term="Homo sapiens[orgn] AND " + gene + "[Gene Name]")
    record = Entrez.read(handle)
    handle.close()
    # Take the first hit, or None if the search returned nothing
    gene_id = record["IdList"][0] if record["IdList"] else None
This works, but I would like a better solution. Is there a better way to approach this?
Kind regards, Julian
convert gene name to entrez id
The correct term for the identifiers you're calling "gene names" is HGNC Gene Symbols. HGNC also has "Gene Names", which are more like descriptions than one-word symbols.
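On the batching question itself, one option is to OR several symbols together into a single esearch term and then use esummary to work out which returned ID belongs to which symbol. The following is only a minimal sketch under some assumptions: the helper names (chunked, build_term, lookup_batch) are made up for illustration, the batch size and retmax value are arbitrary, and I'm assuming the esummary record for db="gene" exposes each hit under record["DocumentSummarySet"]["DocumentSummary"] with a "Name" field and a "uid" attribute. Check that against your Biopython version before relying on it.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_term(symbols, organism="Homo sapiens"):
    """Build one esearch term that ORs several gene symbols together."""
    names = " OR ".join(f"{s}[Gene Name]" for s in symbols)
    return f"{organism}[orgn] AND ({names})"


def lookup_batch(symbols, email):
    """Hypothetical helper: map one batch of symbols to Entrez gene IDs.

    Assumes the esummary layout described above; symbols with no
    matching summary are simply absent from the returned dict.
    """
    from Bio import Entrez  # imported here so the helpers above work without Biopython

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="gene", term=build_term(symbols),
                            retmax=len(symbols) * 5)  # arbitrary headroom
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return {}

    handle = Entrez.esummary(db="gene", id=",".join(ids))
    record = Entrez.read(handle)
    handle.close()

    wanted = {s.upper(): s for s in symbols}
    mapping = {}
    for doc in record["DocumentSummarySet"]["DocumentSummary"]:
        name = str(doc["Name"]).upper()
        if name in wanted:
            # keep the first ID seen for each symbol
            mapping.setdefault(wanted[name], doc.attributes["uid"])
    return mapping
```

Usage would look something like looping over chunked(all_symbols, 100) and merging the dict that lookup_batch returns for each batch. Note that the search may also hit genes through aliases, and those would not be matched by the simple comparison on "Name" here, so any symbol missing from the result could be retried individually.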