Bulk Retrieval of Gene Sequences using Protein IDs (NCBI)
4.8 years ago
venura ▴ 70

Hi,

Is there a way to batch-retrieve gene sequences using protein IDs? I tried Batch Entrez, but it won't return the gene sequence when I give it protein IDs. Thank you,

Venura

NCBI Data retrieval • 1.8k views

This might help:

Question: download protein sequences from NCBI


4.8 years ago
GenoMax 147k

With Entrez Direct (EDirect), you can get the coding sequence:

$ esearch -db protein -query "NP_571131.1" | efetch -format fasta_cds_na
>lcl|NM_131056.1_cds_NP_571131.1_1 [gene=ins] [db_xref=GeneID:30262,ZFIN:ZDB-GENE-980526-110] [protein=insulin preproprotein] [protein_id=NP_571131.1] [location=45..371] [gbkey=CDS]
ATGGCAGTGTGGCTTCAGGCTGGTGCTCTGTTGGTCCTGTTGGTCGTGTCCAGTGTAAGCACTAACCCAG
GCACACCGCAGCACCTGTGTGGATCTCATCTGGTCGATGCCCTTTATCTGGTCTGTGGCCCAACAGGCTT
CTTCTACAACCCCAAGAGAGACGTTGAGCCCCTTCTGGGTTTCCTTCCTCCTAAATCTGCCCAGGAAACT
GAGGTGGCTGACTTTGCATTTAAAGATCATGCCGAGCTGATAAGGAAGAGAGGCATTGTAGAGCAGTGCT
GCCACAAACCCTGCAGCATCTTTGAGCTGCAGAACTACTGTAACTGA

In case you want the protein sequence instead:

$ efetch -db protein -id "NP_571131.1" -format fasta
>NP_571131.1 insulin preproprotein [Danio rerio]
MAVWLQAGALLVLLVVSSVSTNPGTPQHLCGSHLVDALYLVCGPTGFFYNPKRDVEPLLGFLPPKSAQET
EVADFAFKDHAELIRKRGIVEQCCHKPCSIFELQNYCN

Thank you so much! 🙏

Quick question: how can I pass multiple query IDs in a single command?


You can save all the IDs in a file, say IDs.txt (one ID per line), and then run:

for l in $(cat IDs.txt); do efetch -db protein -id "$l" -format fasta >> resultfile; done

or

cat IDs.txt | while read -r line; do efetch -db protein -id "${line}" -format fasta >> resultfile; done
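For long ID lists, one request per ID can be slow. A minimal sketch of batching, assuming efetch accepts a comma-separated list in -id and that IDs.txt is already clean (one ID per line, Unix line endings). The function names chunk_ids and fetch_cds_batches and the default batch size of 100 are hypothetical, chosen only for illustration:

```shell
# Print the IDs from file $1 as comma-separated batches of size $2
# (default 100), one batch per output line. Blank lines are skipped.
chunk_ids() {
    awk -v n="${2:-100}" '
        NF {
            batch = (count % n == 0) ? $1 : batch "," $1
            count++
            if (count % n == 0) print batch
        }
        END { if (count % n != 0) print batch }
    ' "$1"
}

# Fetch the coding sequences for every batch (needs EDirect and network access).
# stdin is redirected from /dev/null as a precaution, so efetch cannot
# consume the lines the loop is still reading.
fetch_cds_batches() {
    chunk_ids "$1" "${2:-100}" | while IFS= read -r batch; do
        efetch -db protein -id "$batch" -format fasta_cds_na < /dev/null
        sleep 1
    done
}

# Usage (not run here):
#   fetch_cds_batches IDs.txt 50 > resultfile
```

Smaller batches are easier to retry if one request fails, so the batch size is worth tuning to the size of your list.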

Thank you so much for the quick reply! However, I am getting the following error:

400 Bad Request No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/e&rettype=fasta&retmode=text&edirect=7.40&tool=edirect&email=venura@felis' Result of do_post http request is $VAR1 = bless( {
                 '_protocol' => 'HTTP/1.1',
                 '_rc' => 400,
                 '_content' => 'WebEnv parameter is required ',
                 '_msg' => 'Bad Request',
                 '_headers' => bless( {
                                        'ncbi-phid' => '939B6CAE78716D2500003EF8B6BDDF4E.1.1.m_1',
                                        'connection' => 'close',
                                        'client-ssl-warning' => 'Peer certificate not verified',
                                        'content-security-policy' => 'upgrade-insecure-requests',
                                        'client-date' => 'Sat, 08 Feb 2020 23:31:07 GMT',
                                        'x-ua-compatible' => 'IE=Edge',
                                        'access-control-allow-origin' => '*',
                                        'date' => 'Sat, 08 Feb 2020 23:31:07 GMT',
                                        'server' => 'Finatra',
                                        'client-ssl-cert-issuer' => '/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert SHA2 High

The above is the first half of the error I am getting. Since the full error log exceeds 5000 characters, I posted only the first half.


Are you running the command as posted by @Fatima? Are you using your own file with IDs in it, one per line? Show the output of head -4 that_ID_file_you_are_using.


Yes, I used hers. The command you gave me works without any errors for a single ID.

Please see below:

YP_009507925
AGE13900
YP_009505327
YP_006468898

Are there any results in the resultfile? I'm guessing there might be a limit on how many protein sequences you can retrieve in one batch.


No, not a single one! I even used the tamu.edu proxy and I am still getting the same error.


Do you get the same error using both commands?

I added those 4 IDs to IDs.txt, used the second command, and got sequences in resultfile.

Could you please make sure that the first line is not empty and that there are no spaces after the IDs in IDs.txt? The file may also contain carriage returns (for example, if it was saved on Windows), which you can strip with this command:

perl -pe 's/\r//' IDs.txt > IDs_clean.txt
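The perl one-liner above only removes carriage returns. A slightly broader sketch that also handles the other two checks mentioned (empty lines and stray spaces around the IDs), using only tr and sed. The function name clean_ids is hypothetical:

```shell
# Write a cleaned copy of the ID file $1 to stdout:
# strip carriage returns, trim leading/trailing whitespace, drop empty lines.
clean_ids() {
    tr -d '\r' < "$1" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Usage (not run here):
#   clean_ids IDs.txt > IDs_clean.txt
#   cat -et IDs_clean.txt   # every line should end in a bare "$", with no "^M"
```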

Adding sleep 5, which pauses for 5 seconds between retrievals, might also help:

cat IDs.txt | while read line; do efetch -db protein -id ${line} -format fasta >> resultfile; sleep 5 ; done

If 5 seconds isn't enough, you can try other numbers like sleep 10, sleep 15, sleep 20 to see if they work.
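Instead of picking the delay by hand, the wait can grow automatically after each failed attempt. A rough sketch, assuming a failed retrieval shows up either as a non-zero exit status or as empty output (this may not hold for every EDirect version). The helper name retry is hypothetical:

```shell
# retry MAX_TRIES INITIAL_DELAY COMMAND [ARGS...]
# Runs COMMAND until it exits 0 and prints something, doubling the delay
# after each failed attempt. Returns 1 once MAX_TRIES attempts have failed.
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_try=1
    while :; do
        if retry_out=$("$@") && [ -n "$retry_out" ]; then
            printf '%s\n' "$retry_out"
            return 0
        fi
        [ "$retry_try" -ge "$retry_max" ] && return 1
        sleep "$retry_delay"
        retry_delay=$((retry_delay * 2))
        retry_try=$((retry_try + 1))
    done
}

# Usage (not run here): up to 4 attempts per ID, waiting 5, 10, then 20 seconds.
#   while IFS= read -r line; do
#       retry 4 5 efetch -db protein -id "$line" -format fasta < /dev/null >> resultfile
#   done < IDs.txt
```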

You can remove the >>resultfile for now and see if you get any results:

[fsharifi@fsharifi ~]$ cat -et IDs.txt
YP_009507925$
AGE13900$
YP_009505327$
YP_006468898$
[fsharifi@fsharifi ~]$ cat IDs.txt | while read line; do efetch -db protein -id ${line} -format fasta ; sleep 5 ;  done
>YP_009507925.1 RNA-dependent RNA polymerase [Actinidia chlorotic ringspot-associated virus]
MSESIERIKKAECDKVAQDVKDGKVFDNDVLSRFLSLVGKPRNRYTISSKPKEVEAIYKQCISSEGFMSE

It worked! Thanks to both of you! I used the following and obtained both protein and nucleotide sequences.

 cat nucleocapsid_clean.txt | while read line; do efetch -db protein -id ${line} -format fasta_cds_na >> resultfile ; done
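Since a failed request leaves nothing in resultfile, it can be worth checking afterwards that every ID produced a record. A minimal sketch, assuming the fasta_cds_na headers carry a [protein_id=...] tag as in the NP_571131.1 example earlier in this thread. The function name missing_ids is hypothetical:

```shell
# Print every ID from file $1 that has no matching [protein_id=...] tag
# in the FASTA file $2. An ID without a version (YP_009507925) matches a
# versioned header (YP_009507925.1). A "." inside an ID is treated as a
# regex wildcard, which is loose but harmless for this check.
missing_ids() {
    while IFS= read -r id || [ -n "$id" ]; do
        [ -z "$id" ] && continue
        grep -q "\[protein_id=${id}[].]" "$2" || printf '%s\n' "$id"
    done < "$1"
}

# Usage (not run here):
#   missing_ids nucleocapsid_clean.txt resultfile
```

Any IDs it prints can be saved to a new file and fetched again.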