Hi,
Is there a way to batch-retrieve gene sequences using protein IDs? I tried Batch Entrez, but it won't return the gene sequence when I use protein IDs. Thank you,
Venura
With Entrez Direct (EDirect), you can get the coding sequence like this:
$ esearch -db protein -query "NP_571131.1" | efetch -format fasta_cds_na
>lcl|NM_131056.1_cds_NP_571131.1_1 [gene=ins] [db_xref=GeneID:30262,ZFIN:ZDB-GENE-980526-110] [protein=insulin preproprotein] [protein_id=NP_571131.1] [location=45..371] [gbkey=CDS]
ATGGCAGTGTGGCTTCAGGCTGGTGCTCTGTTGGTCCTGTTGGTCGTGTCCAGTGTAAGCACTAACCCAG
GCACACCGCAGCACCTGTGTGGATCTCATCTGGTCGATGCCCTTTATCTGGTCTGTGGCCCAACAGGCTT
CTTCTACAACCCCAAGAGAGACGTTGAGCCCCTTCTGGGTTTCCTTCCTCCTAAATCTGCCCAGGAAACT
GAGGTGGCTGACTTTGCATTTAAAGATCATGCCGAGCTGATAAGGAAGAGAGGCATTGTAGAGCAGTGCT
GCCACAAACCCTGCAGCATCTTTGAGCTGCAGAACTACTGTAACTGA
In case you want the protein sequence instead:
$ efetch -db protein -id "NP_571131.1" -format fasta
>NP_571131.1 insulin preproprotein [Danio rerio]
MAVWLQAGALLVLLVVSSVSTNPGTPQHLCGSHLVDALYLVCGPTGFFYNPKRDVEPLLGFLPPKSAQET
EVADFAFKDHAELIRKRGIVEQCCHKPCSIFELQNYCN
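Since the question was about batch retrieval, here is a rough sketch of how the single-ID command above could be extended to a file of protein accessions. The function names (chunk_ids, fetch_cds), the file names, and the chunk size of 100 are my own placeholders, and I am assuming efetch accepts a comma-separated list for -id and is on your PATH. Treat it as a starting point rather than a tested pipeline.

```shell
#!/usr/bin/env bash
# Sketch: fetch coding sequences for many protein accessions in chunks.
# chunk_ids and fetch_cds are hypothetical helper names, not part of EDirect.

# chunk_ids FILE [SIZE]: print the accessions in FILE as comma-separated
# chunks, one chunk per line (SIZE accessions per chunk, default 100).
chunk_ids() {
    local file=$1 size=${2:-100}
    tr -d '\r' < "$file" | awk -v n="$size" '
        NF { buf = (buf == "" ? $1 : buf "," $1); c++ }  # blank lines are skipped
        c == n { print buf; buf = ""; c = 0 }            # emit a full chunk
        END { if (buf != "") print buf }                 # emit the remainder
    '
}

# fetch_cds FILE OUT: append the CDS FASTA records for every chunk to OUT.
fetch_cds() {
    local file=$1 out=$2
    chunk_ids "$file" 100 | while read -r ids; do
        # </dev/null keeps efetch from reading the loop's input
        efetch -db protein -id "$ids" -format fasta_cds_na < /dev/null >> "$out"
        sleep 1  # short pause between requests
    done
}
```

You would call it as something like: fetch_cds IDs.txt cds.fasta. Chunking keeps the number of requests low, which also makes it less likely that you need long pauses between them.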
Thank you so much for the quick reply! However, I am getting the following error:
400 Bad Request No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/e&rettype=fasta&retmode=text&edirect=7.40&tool=edirect&email=venura@felis' Result of do_post http request is $VAR1 = bless( {
'_protocol' => 'HTTP/1.1',
'_rc' => 400,
'_content' => 'WebEnv parameter is required ',
'_msg' => 'Bad Request',
'_headers' => bless( {
'ncbi-phid' => '939B6CAE78716D2500003EF8B6BDDF4E.1.1.m_1',
'connection' => 'close',
'client-ssl-warning' => 'Peer certificate not verified',
'content-security-policy' => 'upgrade-insecure-requests',
'client-date' => 'Sat, 08 Feb 2020 23:31:07 GMT',
'x-ua-compatible' => 'IE=Edge',
'access-control-allow-origin' => '*',
'date' => 'Sat, 08 Feb 2020 23:31:07 GMT',
'server' => 'Finatra',
'client-ssl-cert-issuer' => '/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert SHA2 High
Above is the first half of the error I am getting. Since the full error log exceeds 5000 characters, I posted only the first half.
Do you get the same error using both commands?
I added those 4 IDs to IDs.txt, used the second command, and got sequences in resultfile.
Could you please make sure the first line is not empty and that there are no spaces or carriage returns after the IDs in IDs.txt? You can use this command to strip any trailing spaces, tabs, or carriage returns:
perl -pe 's/[ \t\r]+$//' IDs.txt > IDs_clean.txt
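If you would rather handle leading spaces and empty lines in the same pass, something along these lines might work. clean_ids is just a name I made up for the helper, and the file names are the ones used above.

```shell
# clean_ids FILE: print one accession per line, with carriage returns,
# leading/trailing spaces and tabs, and empty lines removed.
# (clean_ids is a hypothetical helper name, not an EDirect command.)
clean_ids() {
    tr -d '\r' < "$1" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}
```

For example: clean_ids IDs.txt > IDs_clean.txt, and then check the result with cat -et IDs_clean.txt, where every line should end in a bare $.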
Adding sleep 5, which pauses for 5 seconds between retrievals, might also help:
cat IDs.txt | while read -r line; do efetch -db protein -id "${line}" -format fasta >> resultfile; sleep 5 ; done
If 5 seconds isn't enough, you can try other numbers like sleep 10, sleep 15, sleep 20 to see if they work.
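Instead of guessing a fixed pause, you could retry only the failed requests and wait a bit longer each time. The sketch below assumes that efetch exits with a non-zero status when a request fails, which I have not verified for every EDirect version. retry_with_pause is a hypothetical helper name, and the numbers are arbitrary.

```shell
# retry_with_pause MAX_TRIES PAUSE CMD [ARGS...]
# Run CMD until it exits 0. After each failure, sleep PAUSE seconds and then
# double PAUSE. Gives up with status 1 once CMD has failed MAX_TRIES times.
# (Hypothetical helper; assumes CMD signals failure via its exit status.)
retry_with_pause() {
    local tries=$1 pause=$2 attempt=1
    shift 2
    while ! "$@"; do
        [ "$attempt" -ge "$tries" ] && return 1
        sleep "$pause"
        pause=$((pause * 2))
        attempt=$((attempt + 1))
    done
}

# Possible usage (not run here):
# while read -r id; do
#     retry_with_pause 3 5 efetch -db protein -id "$id" -format fasta < /dev/null >> resultfile
# done < IDs.txt
```

One caveat: if a failed attempt writes partial output before exiting, the appended resultfile may contain fragments, so it is worth checking the number of FASTA headers against the number of IDs afterwards.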
You can remove the >>resultfile for now and see if you get any results:
[fsharifi@fsharifi ~]$ cat -et IDs.txt
YP_009507925$
AGE13900$
YP_009505327$
YP_006468898$
[fsharifi@fsharifi ~]$ cat IDs.txt | while read line; do efetch -db protein -id ${line} -format fasta ; sleep 5 ; done
>YP_009507925.1 RNA-dependent RNA polymerase [Actinidia chlorotic ringspot-associated virus]
MSESIERIKKAECDKVAQDVKDGKVFDNDVLSRFLSLVGKPRNRYTISSKPKEVEAIYKQCISSEGFMSE
This might help:
Question: download protein sequences from NCBI