Finding aliases and acronyms using entrez?
2.1 years ago
limitless ▴ 10

Hello everyone,

I was wondering if I could get some help using Entrez in Python. I currently have a list of proteins and their acronyms, and I would like to find a way to tell Python to equate the names to the acronyms. I thought of using NCBI's database for that. Does anyone here know how I could approach this and how I would go about coding it? I figured I could search each protein, but I'm not sure how to find the aliases or what to do after that.

Thank you

Do not delete posts that have received feedback - this one even has an answer that works really well. To acknowledge a post as resolved, accept the answer(s) that work for you (I've done it for you this time). Deleting a post is not something you should do unless it was a mistake to ask the question in the first place (inappropriate content or confidential information etc.)

2.1 years ago
GenoMax 147k

You can use Entrez Direct (EDirect) for this (there is a Python version, if you want that specifically):

$ esearch -db gene -query "TP53 [gene]" | efetch -format ft 

1. TP53
Official Symbol: TP53 and Name: tumor protein p53 [Homo sapiens (human)]
Other Aliases: BCC7, BMFS5, LFS1, P53, TRP53
Other Designations: cellular tumor antigen p53; antigen NY-CO-13; mutant tumor protein 53; phosphoprotein p53; transformation-related protein 53; tumor protein 53; tumor supressor p53
Chromosome: 17; Location: 17p13.1
Annotation: Chromosome 17 NC_000017.11 (7668421..7687490, complement)
MIM: 191170
ID: 7157
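
For the Python route, the same lookup can be sketched with Biopython's Bio.Entrez against the gene database. This is only a minimal sketch under a few assumptions: Biopython is installed, the Gene document summary exposes `Name` and `OtherAliases` fields, and the helper names and the email placeholder are hypothetical, not part of any library.

```python
def alias_map_from_summaries(summaries):
    """Map every alias (and the official symbol) to the official symbol.

    `summaries` is an iterable of dict-like Gene document summaries with a
    'Name' key and a comma-separated 'OtherAliases' key (assumed layout).
    """
    mapping = {}
    for doc in summaries:
        symbol = doc["Name"]
        mapping[symbol.upper()] = symbol
        for alias in doc.get("OtherAliases", "").split(","):
            alias = alias.strip()
            if alias:
                # Keep the first symbol seen if two genes share an alias
                mapping.setdefault(alias.upper(), symbol)
    return mapping


def fetch_gene_summaries(symbol, organism="Homo sapiens", email="you@example.org"):
    """Query NCBI Gene for `symbol` and return its summaries (needs network)."""
    from Bio import Entrez  # lazy import so the parser above runs without Biopython

    Entrez.email = email  # placeholder address; NCBI asks for a real contact
    term = f"{symbol}[gene] AND {organism}[orgn]"
    handle = Entrez.esearch(db="gene", term=term, retmax=5)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return []
    handle = Entrez.esummary(db="gene", id=",".join(ids))
    result = Entrez.read(handle)
    handle.close()
    return result["DocumentSummarySet"]["DocumentSummary"]


# Offline example using the aliases shown in the output above
example = [{"Name": "TP53", "OtherAliases": "BCC7, BMFS5, LFS1, P53, TRP53"}]
print(alias_map_from_summaries(example)["LFS1"])  # TP53
```

With network access, `alias_map_from_summaries(fetch_gene_summaries("TP53"))` should give a dictionary that you can use to normalise each name in your list to its official symbol.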

Thank you so much, this is beyond perfect!


Sorry to bother you again, but I keep trying to do this, and I am not getting any of the aliases.

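from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address

# Search the nucleotide database for records annotated with the TP53 gene
handle = Entrez.esearch(db="nucleotide", term="TP53[Gene]", retmax=3)
rec_list = Entrez.read(handle)
handle.close()
print(rec_list["Count"])
print(len(rec_list["IdList"]))
print(rec_list)

# Fetch the matching records in GenBank format and parse them
id_list = rec_list["IdList"]
handles = Entrez.efetch(db="nucleotide", id=id_list, rettype="gb")

recs = list(SeqIO.parse(handles, "gb"))
handles.close()
print(recs)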

and this is what I get:

[SeqRecord(seq=Seq('CTCCTTGGTTCAAGTAATTCTCCTGCCTCAGACTCCAGAGTAGCTGGGATTACA...AAT'), id='NG_017013.2', name='NG_017013', description='Homo sapiens tumor protein p53 (TP53), RefSeqGene (LRG_321) on chromosome 17', dbxrefs=[]), SeqRecord(seq=Seq('TTTCCCCTCCCACGTGCTCACCCTGGCTAAAGTTCTGTAGCTTCAGTTCATTGG...AAA'), id='NM_001127233.1', name='NM_001127233', description='Mus musculus transformation related protein 53 (Trp53), transcript variant 2, mRNA', dbxrefs=[]), SeqRecord(seq=Seq('TTTCCCCTCCCACGTGCTCACCCTGGCTAAAGTTCTGTAGCTTCAGTTCATTGG...AAA'), id='NM_011640.3', name='NM_011640', description='Mus musculus transformation related protein 53 (Trp53), transcript variant 1, mRNA', dbxrefs=[])]

Looks like this follow-up about converting the command route to Python was moved over to post 'using efetch in Python'. Just tagging this here for anyone following along.

2.0 years ago
MirianT_NCBI ▴ 760

Hi, you can use NCBI Datasets for that query.

If you install datasets using conda, both datasets and dataformat will be installed. If you prefer to download the binaries yourself, just be sure to download both.

For example: TP53

datasets summary gene symbol tp53 --as-json-lines | \
dataformat tsv gene --fields symbol,gene-id,ensembl-geneids,name-id,omim-ids,swissprot-accessions,synonyms,replaced-gene-id

The datasets command will retrieve metadata information for the requested symbol. The default taxon here is human. If you want information for another species (dog, for example), you should add the flag --taxon dog to the first command. You also have options to retrieve metadata by accession or gene-id.
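
If you want to drive the two commands from Python, one way is to build the argument lists and pipe them with subprocess. This is a rough sketch, assuming both `datasets` and `dataformat` are on your PATH; the function names and the shortened field list are hypothetical choices for illustration, and only the flags shown in this answer are used.

```python
import shutil
import subprocess

# Shortened field list for illustration; any dataformat gene fields can be used
FIELDS = "symbol,gene-id,synonyms"


def datasets_command(symbol, taxon=None):
    """Build the `datasets summary gene symbol ...` argument list."""
    cmd = ["datasets", "summary", "gene", "symbol", symbol, "--as-json-lines"]
    if taxon:
        # Human is the default taxon; pass e.g. "dog" for another species
        cmd += ["--taxon", taxon]
    return cmd


def gene_tsv(symbol, taxon=None, fields=FIELDS):
    """Run datasets, pipe its JSON lines into dataformat, return the TSV text."""
    if not (shutil.which("datasets") and shutil.which("dataformat")):
        raise RuntimeError("datasets and dataformat must both be on PATH")
    summary = subprocess.run(
        datasets_command(symbol, taxon),
        check=True, capture_output=True, text=True,
    )
    table = subprocess.run(
        ["dataformat", "tsv", "gene", "--fields", fields],
        input=summary.stdout, check=True, capture_output=True, text=True,
    )
    return table.stdout
```

Calling `gene_tsv("tp53")` would then return the same tab-separated text that the shell pipeline prints, and `gene_tsv("tp53", taxon="dog")` would query dog instead.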

The dataformat command will print the output as tsv (options are tsv or excel). Since this is a gene summary, you need to add gene, and here I selected the fields that I thought were relevant for your query. For more information about the available fields and other dataformat options, please check out the documentation in this link.

This command will print the following output (which can be redirected to a file). To make it easier to read on the terminal, I added the column command, but that is not required.

datasets summary gene symbol tp53 --as-json-lines | \
dataformat tsv gene \
 --fields symbol,gene-id,ensembl-geneids,name-id,omim-ids,swissprot-accessions,synonyms,replaced-gene-id | \
column -t -s$'\t'

Symbol  NCBI GeneID  Ensembl GeneIDs  Nomenclature ID  OMIM IDs  SwissProt Accessions  Synonyms                   Replaced NCBI GeneID
TP53    7157         ENSG00000141510  HGNC:11998       191170    P04637                P53,BCC7,LFS1,BMFS5,TRP53
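
To turn that table into a name-to-symbol lookup in Python, something like the following might work. It is a small sketch, assuming the raw tab-separated output (without the column step) and the `Symbol` and `Synonyms` headers shown above; the function name is hypothetical.

```python
import csv
import io


def alias_map_from_tsv(tsv_text):
    """Map each synonym (and the symbol itself) to the official symbol."""
    mapping = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        symbol = row["Symbol"]
        mapping[symbol.upper()] = symbol
        # Synonyms arrive as one comma-separated cell, possibly empty
        for alias in (row.get("Synonyms") or "").split(","):
            alias = alias.strip()
            if alias:
                mapping.setdefault(alias.upper(), symbol)
    return mapping


# Sample row trimmed to three of the columns from the output above
sample = "Symbol\tNCBI GeneID\tSynonyms\nTP53\t7157\tP53,BCC7,LFS1,BMFS5,TRP53\n"
print(alias_map_from_tsv(sample)["BCC7"])  # TP53
```

Keys are upper-cased so that lookups such as `mapping[name.upper()]` do not depend on how the names in your list are capitalised.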

Please feel free to reach out if you have any additional questions. I hope this helps!
