Hi, I hope you're well. I have a list of (1000+) Protein RefSeq IDs returned by BLAST and I need to convert them to Entrez gene IDs. Is there a way to do so? Thank you for the help!
This would work:
$ esearch -db protein -query "NP_001026105" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Id
420087
For a whole list (one accession per line in file.txt) you would do:
$ cat file.txt | epost -db protein -format acc | elink -target gene | esummary | xtract -pattern DocumentSummary -element Id
You should stay away from the gi numbers since they are now deprecated for end-user use.
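BLAST hit lists often repeat the same accession, and a file that has passed through Windows may carry carriage returns. A minimal sketch for tidying the list before piping it to epost; the name clean_accessions is made up for illustration and is not part of EDirect.

```shell
# clean_accessions: hypothetical helper, not an EDirect command.
# Reads accessions on stdin (one per line), strips carriage returns and
# surrounding whitespace, drops blank lines, and removes duplicates while
# keeping the first-seen order.
clean_accessions() {
    tr -d '\r' |
        awk '{ gsub(/^[ \t]+|[ \t]+$/, ""); if ($0 != "" && !seen[$0]++) print }'
}
```

It would then sit in front of the pipeline above, for example: clean_accessions < file.txt | epost -db protein -format acc | elink -target gene | esummary | xtract -pattern DocumentSummary -element Id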
Thank you for the help! A follow-up question in this thread: is there a way to get the gene description (the functional role of the gene) for each entry? For example, for RefSeq ID "NP_001026015.1", which corresponds to gene symbol "AAR2", the description returned would be "AAR2 splicing factor homolog [Source:NCBI gene;Acc:419118]".
I am also worried that this command may take too long to run and that the terminal may time out with too many entries (currently about 1500). Is this a legitimate concern?
$ esearch -db protein -query "NP_001026015" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description
AAR2 AAR2 splicing factor homolog
Sign up for and use an NCBI API key as described here. That should allow you to get through your list.
Thanks! How would I run this for a file with multiple ref sequences? I feel that it will be similar to your reply above using epost, but am not certain how to implement this.
I just created the API key. However, how do I use the key with these calls? The documentation is a little confusing.
You can export a variable in the terminal you are doing these searches in with export NCBI_API_KEY='your_key_string'. You can also add this to your shell initialization file so it is exported when you log in.
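To confirm the key is actually visible to the shell before starting a long run, something like the following could be used. This is only a sketch; check_ncbi_key is a hypothetical helper name, and it tests nothing beyond whether the variable is non-empty.

```shell
# check_ncbi_key: hypothetical helper, not an EDirect command.
# Succeeds if NCBI_API_KEY is set to a non-empty value in this shell,
# otherwise prints a warning to stderr and returns 1.
# It does not check that the key itself is valid.
check_ncbi_key() {
    if [ -n "${NCBI_API_KEY:-}" ]; then
        echo "NCBI_API_KEY is set"
    else
        echo "NCBI_API_KEY is not set" >&2
        return 1
    fi
}
```

Running check_ncbi_key in a fresh terminal after editing the initialization file shows whether the export line was picked up.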
$ cat file.txt | epost -db protein -format acc | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description
for line in $(cat accession_file); do printf '%s\t' "$line"; esearch -db protein -query "$line" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,Description; done
NP_001026015 AAR2 AAR2 splicing factor homolog
NP_001026025 TTPAL alpha tocopherol transfer protein like
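On the timeout concern, one option is to split the list into smaller batches so that a failed request only affects one batch. The following is a sketch under assumptions: the helper names split_accessions and convert_batches are made up, and the batch size and one-second pause are arbitrary choices, not NCBI recommendations.

```shell
# split_accessions: hypothetical helper.
# Writes batches of at most $2 lines from file $1 into directory $3,
# named batch_aa, batch_ab, ...
split_accessions() {
    infile=$1
    size=$2
    outdir=$3
    mkdir -p "$outdir"
    split -l "$size" "$infile" "$outdir/batch_"
}

# convert_batches: hypothetical helper.
# Runs the epost pipeline from this thread on every batch in directory $1,
# pausing briefly between batches (the pause length is an arbitrary choice).
convert_batches() {
    for f in "$1"/batch_*; do
        epost -db protein -format acc < "$f" |
            elink -target gene |
            esummary |
            xtract -pattern DocumentSummary -element Id,Name,Description
        sleep 1
    done
}
```

For example: split_accessions accession_file 200 batches, then convert_batches batches > genes.tsv. Unlike the loop above, the batch output does not include the input accession on each line, so use the loop if you need that pairing.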
Hello tom5!
You have already received answers for this in your last question :
We have closed your question to allow us to keep similar content in the same thread.
If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.
Cheers!
Hi, I apologize for the similar question. However, my previous question dealt with converting protein refSeq IDs to Ensembl or Entrez gene accessions. I am now trying to convert from protein refSeq ID to Entrez gene ID. I know these are very similar tasks but I am not familiar enough with Entrez Direct to generalize the previous reply to this task.
The answer in "C: Bioinformatics: Converting Protein Refseq ID to Entrez Gene Accession" will work. If it does not, can you post a couple of examples?
Yes, as an example, I want to convert the refseq ID 'NP_001026105.1' to the corresponding entrez gene ID: 420087. Is there a way to do so? My file has 1000+ refseq IDs (one per line) and I want to convert them to corresponding gene IDs. I'm sorry if you have already explained it.