Efetch from eutilities gives '400 Bad Request'; how to fix this?
2.9 years ago
Shraddha ▴ 90

Hi folks,

I have several lists of proteins that I'd like to look up in NCBI's protein database. Specifically, I want the title and comment for each one, and I found commands that work, at least when I run them individually:

esearch -db protein -query NP_189017.1 | esummary | xtract -pattern DocumentSummary -element Title

which gives the output:

Leucine-rich repeat protein kinase family protein [Arabidopsis thaliana]

and

efetch -db protein -id 'NP_189017.1' | sed -n '/comment/,/",/p'

which gives:

comment "Leucine-rich repeat protein kinase family protein; FUNCTIONS IN: protein serine/threonine kinase activity, protein kinase activity, ATP binding; INVOLVED IN: protein amino acid phosphorylation; LOCATED IN: plasma membrane; EXPRESSED IN: 26 plant structures; EXPRESSED DURING: 15 growth stages; CONTAINS InterPro DOMAIN/s: Protein kinase, ATP binding site (InterPro:IPR017441), Protein kinase, catalytic domain (InterPro:IPR000719), Leucine-rich repeat-containing N-terminal domain, type 2 (InterPro:IPR013210), Leucine-rich repeat (InterPro:IPR001611), Serine/threonine-protein kinase-like domain (InterPro:IPR017442), Protein kinase-like domain (InterPro:IPR011009), Serine/threonine-protein kinase, active site (InterPro:IPR008271); BEST Arabidopsis thaliana protein match is: transmembrane kinase 1 (TAIR:AT1G66150.1); Has 176104 Blast hits to 138784 proteins in 5021 species: Archae - 174; Bacteria - 16889; Metazoa - 56819; Fungi - 11325; Plants - 68733; Viruses - 454; Other Eukaryotes - 21710 (source: NCBI BLink).",

These results are exactly right. When I try to automate this with a script, though, I run into issues. My script is as follows:

while read -r line; do  
title=$(esearch -db protein -query  $line | esummary | xtract -pattern DocumentSummary -element Title)
comment=$(efetch -db protein -id $line | sed -n '/comment/,/.",/p')
echo -n $line $title $comment \n >> protein_descriptions.txt; done < proteinlist.txt

I get the following error:

400 Bad Request
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_189017.1&rettype=native&retmode=text&edirect_os=linux&edirect=13.9&tool=edirect&email=incorrect@email-new'

I haven't given an email; I guess it just grabbed that from my system. Either way, I've checked multiple protein IDs individually and they all work, so it's not an issue of the IDs being invalid.

esearch generally works, but I really need the comments, which I get from efetch. With that said, how can I fix the 400 Bad Request issue? Or is there a better tool for this than efetch?

linux while-loop eutilities • 2.9k views

The e-utils API used to limit anonymous requests to 3 queries per second and requests with a user token to 10 queries per second. If you don't want to set up authentication, the easiest hack would probably be to add a one-second delay, such as sleep 1, to the loop. Though even without a delay, at least the first request should work...
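The delay could be wrapped in a small helper, roughly like the sketch below. The names throttled_each and DELAY are hypothetical and not part of EDirect, and the one-second default is a cautious guess rather than an NCBI requirement. The ID list is read on file descriptor 3 so that the command being run cannot consume it through stdin.

```shell
#!/bin/bash
# Sketch only: throttled_each and DELAY are made-up names for illustration.
DELAY="${DELAY:-1}"   # seconds to wait between requests (assumed default)

throttled_each() {
  # usage: throttled_each ID_FILE COMMAND [ARGS...]
  # Runs COMMAND ARGS... ID once per non-empty line of ID_FILE.
  local id_file=$1; shift
  local id
  # The list is opened on fd 3, leaving stdin free for the command itself.
  while IFS= read -r id <&3; do
    [ -n "$id" ] || continue          # skip blank lines
    "$@" "$id" < /dev/null
    sleep "$DELAY"
  done 3< "$id_file"
}

# Example (needs EDirect and network access):
# throttled_each proteinlist.txt efetch -db protein -id
```

With a list of 50 IDs and the default delay, this adds a little under a minute of waiting in total.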

2.9 years ago
GenoMax 148k

This is a public resource, so any time you run multiple queries it is safer to slow them down. I hope you have signed up for an NCBI API key, since one would be needed when you are doing multiple queries.

You can also try the following one-step query:

$ esearch -db protein -query NP_189017.1 | efetch -format xml | xtract -pattern Seq-entry -element Seqdesc_title,Seq-feat_comment
Leucine-rich repeat protein kinase family protein; FUNCTIONS IN: protein serine/threonine kinase activity, protein kinase activity, ATP binding; INVOLVED IN: protein amino acid phosphorylation; LOCATED IN: plasma membrane; EXPRESSED IN: 26 plant structures; EXPRESSED DURING: 15 growth stages; CONTAINS InterPro DOMAIN/s: Protein kinase, ATP binding site (InterPro:IPR017441), Protein kinase, catalytic domain (InterPro:IPR000719), Leucine-rich repeat-containing N-terminal domain, type 2 (InterPro:IPR013210), Leucine-rich repeat (InterPro:IPR001611), Serine/threonine-protein kinase-like domain (InterPro:IPR017442), Protein kinase-like domain (InterPro:IPR011009), Serine/threonine-protein kinase, active site (InterPro:IPR008271); BEST Arabidopsis thaliana protein match is: transmembrane kinase 1 (TAIR:AT1G66150.1); Has 176104 Blast hits to 138784 proteins in 5021 species: Archae - 174; Bacteria - 16889; Metazoa - 56819; Fungi - 11325; Plants - 68733; Viruses - 454; Other Eukaryotes - 21710 (source: NCBI BLink).

Thanks! I have an API key, but I'm not sure how to use it. Should it be incorporated into the command?
Like esearch -db protein -query query api_key=key | ...? That isn't working for me, so I assume it's a syntax issue but I can't work it out.

When I tried to give the key to bash as a standing variable, that didn't work either. I just ran api_key=key and then tried running the one-liner you've provided above, but then as part of a loop. It did work, but only for the first line. So my output file (which should have 10 titles and comments), just has the same output you got above. How can I address that?


You can put

export NCBI_API_KEY=unique_api_key

in your shell initialization file. Even with the API key, slow your requests down if you are doing a ton of them.
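One likely reason that running a plain api_key=key had no effect is that an unexported shell variable is not passed on to child processes, and the tools look for a different name anyway. A minimal sketch of the difference, using sh -c as a stand-in child process and the placeholder value from the line above:

```shell
# A plain assignment stays inside the current shell...
api_key=key
sh -c 'echo "child sees api_key=${api_key:-<unset>}"'

# ...whereas an exported variable is inherited by child processes.
export NCBI_API_KEY=unique_api_key   # placeholder, not a real key
sh -c 'echo "child sees NCBI_API_KEY=${NCBI_API_KEY:-<unset>}"'
```

After editing the initialization file, the setting only takes effect in a new shell or once the file is sourced again.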


I put that in my bashrc file, still no change. Running it just gives the output for one protein ID. In this example input file, I just have 10 lines, and none of my protein lists exceeds say, 50 lines. Is it still necessary to slow it down with sleep?


"I put that in my bashrc file, still no change. Running it just gives the output for one protein ID."

What does that mean?

$ more id
NP_189017.1
NP_005106.2
NP_904116.1

$ for i in `cat id`; do printf "${i}\t"; esearch -db protein -query ${i} | efetch -format xml | xtract -pattern Seq-entry -element Seqdesc_title,Seq-feat_comment; done
NP_189017.1 Leucine-rich repeat protein kinase family protein; FUNCTIONS IN: protein serine/threonine kinase activity, protein kinase activity, ATP binding; INVOLVED IN: protein amino acid phosphorylation; LOCATED IN: plasma membrane; EXPRESSED IN: 26 plant structures; EXPRESSED DURING: 15 growth stages; CONTAINS InterPro DOMAIN/s: Protein kinase, ATP binding site (InterPro:IPR017441), Protein kinase, catalytic domain (InterPro:IPR000719), Leucine-rich repeat-containing N-terminal domain, type 2 (InterPro:IPR013210), Leucine-rich repeat (InterPro:IPR001611), Serine/threonine-protein kinase-like domain (InterPro:IPR017442), Protein kinase-like domain (InterPro:IPR011009), Serine/threonine-protein kinase, active site (InterPro:IPR008271); BEST Arabidopsis thaliana protein match is: transmembrane kinase 1 (TAIR:AT1G66150.1); Has 176104 Blast hits to 138784 proteins in 5021 species: Archae - 174; Bacteria - 16889; Metazoa - 56819; Fungi - 11325; Plants - 68733; Viruses - 454; Other Eukaryotes - 21710 (source: NCBI BLink).
NP_005106.2 isoform 1 is encoded by transcript variant 2    major polyA site    hexamer: AATACA
NP_904116.1 ORF within trnK intron  ATPase alpha subunit    ATP synthase CF0 C chain    ATP synthase CF0 A chain    one of four subunits of the minimal PEP RNA polymerase catalytic core   one of four subunits of the minimal PEP RNA polymerase catalytic core   ycf6    CP43    YCF9    PsaB    PsaA    ATP synthase CF1 beta subunit   component of cytochrome b6/f complex    ycf7    ycf7    required for the either the stability or assembly of the cytochrome b6/f complex    PART 81 387 start 383 end 1 PSII 47 kDa protein photosystem II phosphoprotein   hypothetical protein RF2    ACG initiation codon    9 kDa protein   hypothetical protein RF2

I mean, I added the export line to my shell initialization file, but the output it gave me was for just one protein ID. I replaced the variable creating parts in my original loop with the one-liner you provided, and it only gave me the first result before stopping:

NP_189017.1 Leucine-rich repeat protein kinase family protein; FUNCTIONS IN: protein serine/threonine kinase activity, protein kinase activity, ATP binding; INVOLVED IN: protein amino acid phosphorylation; LOCATED IN: plasma membrane; EXPRESSED IN: 26 plant structures; EXPRESSED DURING: 15 growth stages; CONTAINS InterPro DOMAIN/s: Protein kinase, ATP binding site (InterPro:IPR017441), Protein kinase, catalytic domain (InterPro:IPR000719), Leucine-rich repeat-containing N-terminal domain, type 2 (InterPro:IPR013210), Leucine-rich repeat (InterPro:IPR001611), Serine/threonine-protein kinase-like domain (InterPro:IPR017442), Protein kinase-like domain (InterPro:IPR011009), Serine/threonine-protein kinase, active site (InterPro:IPR008271); BEST Arabidopsis thaliana protein match is: transmembrane kinase 1 (TAIR:AT1G66150.1); Has 176104 Blast hits to 138784 proteins in 5021 species: Archae - 174; Bacteria - 16889; Metazoa - 56819; Fungi - 11325; Plants - 68733; Viruses - 454; Other Eukaryotes - 21710 (source: NCBI BLink).

I'm not sure why that doesn't work, but your method with the for loop does work perfectly. I added in a sleep 1 for safety, but otherwise that's exactly what I need. Thank you!

2.9 years ago
Michael 55k

Try this bash script first. I am not sure where your error is coming from, but it should help you debug it:

#!/bin/bash
set -x # trace each command as it runs

while IFS= read -r line
do
  echo "$line"
  ### Redirecting stdin from /dev/null prevents my esearch from accidentally slurping up
  ### the rest of proteinlist.txt, which would leave the loop with only the first line
  title=$(esearch -db protein -query "$line" </dev/null | esummary | xtract -pattern DocumentSummary -element Title)
  comment=$(efetch -db protein -id "$line" | sed -n '/comment/,/.",/p')
  # $title and $comment stay unquoted on purpose so each record collapses onto one line
  echo -n $line $title $comment >> protein_descriptions.txt
  echo >> protein_descriptions.txt # add newline
done < proteinlist.txt

Output:

   NP_189017.1 Leucine-rich repeat protein kinase family protein [Arabidopsis thaliana] comment "Leucine-rich repeat protein kinase family protein; FUNCTIONS IN: protein serine/threonine kinase activity, protein kinase activity, ATP binding; INVOLVED IN: protein amino acid phosphorylation; LOCATED IN: plasma membrane; EXPRESSED IN: 26 plant structures; EXPRESSED DURING: 15 growth stages; CONTAINS InterPro DOMAIN/s: Protein kinase, ATP binding site (InterPro:IPR017441), Protein kinase, catalytic domain (InterPro:IPR000719), Leucine-rich repeat-containing N-terminal domain, type 2 (InterPro:IPR013210), Leucine-rich repeat (InterPro:IPR001611), Serine/threonine-protein kinase-like domain (InterPro:IPR017442), Protein kinase-like domain (InterPro:IPR011009), Serine/threonine-protein kinase, active site (InterPro:IPR008271); BEST Arabidopsis thaliana protein match is: transmembrane kinase 1 (TAIR:AT1G66150.1); Has 176104 Blast hits to 138784 proteins in 5021 species: Archae - 174; Bacteria - 16889; Metazoa - 56819; Fungi - 11325; Plants - 68733; Viruses - 454; Other Eukaryotes - 21710 (source: NCBI BLink).",
   NP_005106.2 major vault protein isoform 1 [Homo sapiens] comment "GeneRIF: MVP Expression Facilitates Tumor Cell Proliferation and Migration Supporting the Metastasis of Colorectal Cancer Cells." }, pub { pub { pmid 33609362, article { title { name "Proteomic analyses identify major vault protein as a prognostic biomarker for fatal prostate cancer." }, authors { names std { { name ml "Ramberg H", comment "GeneRIF: Proteomic analyses identify major vault protein as a prognostic biomarker for fatal prostate cancer." }, pub { pub { pmid 32894437, article { title { name "Y-box binding protein 1 (YB-1) promotes gefitinib resistance in lung adenocarcinoma cells by activating AKT signaling and epithelial-mesenchymal transition through targeting major vault protein (MVP)." }, authors { names std { { name ml "Lou L", comment "GeneRIF: Y-box binding protein 1 (YB-1) promotes gefitinib resistance in lung adenocarcinoma cells by activating AKT signaling and epithelial-mesenchymal transition through targeting major vault protein (MVP)." }, pub { pub { pmid 32296183, article { title { name "A reference map of the human binary protein interactome." }, authors { names std { { name ml "Luck K", comment "GeneRIF: MVP gene expression regulated by alternative splicin
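The stdin issue mentioned in the script comment can be reproduced without any network access. In this sketch, cat stands in for a command that reads standard input inside the loop; that substitution is an assumption for illustration, and I have not checked which EDirect versions actually read stdin this way.

```shell
#!/bin/bash
ids=$(mktemp)
printf 'id1\nid2\nid3\n' > "$ids"

# Without redirection, cat swallows the remaining lines of the list,
# so the loop body runs only once.
broken=$(while IFS= read -r line; do
  echo "processing $line"
  cat > /dev/null
done < "$ids")

# With stdin taken from /dev/null, the list is left for read and
# every line is processed.
fixed=$(while IFS= read -r line; do
  echo "processing $line"
  cat > /dev/null < /dev/null
done < "$ids")

printf '%s\n' "$broken" | wc -l   # one line
printf '%s\n' "$fixed"  | wc -l   # three lines
rm -f "$ids"
```

A for loop over the IDs does not have this problem because it never uses stdin to deliver the list.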

Thanks! This is a bit silly, but do you know why all the output is printed on the same line despite the \n? Here's how it looks:

 head testout.txt

NP_189017.1 Leucine-rich repeat protein kinase family protein [Arabidopsis thaliana] nNP_001154706.1 DEAD/DEAH box RNA helicase family protein [Arabidopsis thaliana] comment "One of two genes encoding an ATP-dependent RNA helicase that localizes predominantly to euchromatic regions of Arabidopsis nuclei, and associates with genes transcribed by RNA polymerase II independently from the presence of introns. It is not detected at non-transcribed loci. It interacts with ssRNA, dsRNA and dsDNA, but not with ssDNA. Its ATPase activity is stimulated by RNA and dsDNA and its ATP-dependent RNA helicase activity unwinds dsRNA but not dsDNA.", comment "DEAD/DEAH box RNA helicase family protein; FUNCTIONS IN: helicase activity, nucleic acid binding, ATP binding, ATP-dependent helicase activity; INVOLVED IN: biological_process unknown; EXPRESSED IN: male gametophyte, guard cell, pollen tube; EXPRESSED DURING: M germinated pollen stage; CONTAINS InterPro DOMAIN/s: DNA/RNA helicase, DEAD/DEAH box type, N-terminal (InterPro:IPR011545), RNA helicase, DEAD-box type, Q motif (InterPro:IPR014014), DEAD-like helicase, N-terminal (InterPro:IPR014001), DNA/RNA helicase, C-terminal (InterPro:IPR001650), Helicase, superfamily 1/2, ATP-binding domain (InterPro:IPR014021); BEST Arabidopsis thaliana protein match is: DEAD/DEAH box RNA helicase family protein (TAIR:AT5G11170.1); Has 53691 Blast hits to 42377 proteins in 3051 species: Archae - 965; Bacteria - 28515; Metazoa - 7537; Fungi - 5428; Plants - 3340; Viruses - 56; Other Eukaryotes - 7850 (source: NCBI BLink).",

Hi, I had to tweak it a bit so that the output gets separated correctly. Normally you would use echo -ne to turn on escape parsing. However, in this script that doesn't seem to work, possibly due to some escape sequences in the output, so it is safer to simply add another echo command.
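For what it's worth, the stray n in the output is consistent with how the shell treats an unquoted backslash: it removes the backslash before echo ever sees the argument. A small sketch with made-up values for line and title, plus printf as one possible alternative:

```shell
line="NP_189017.1"
title="some title"

# Unquoted, \n reaches echo as a plain "n", and -n suppresses the newline.
echo -n $line $title \n
echo   # finish the line

# printf interprets \n in the format string only, never in the data,
# so backslashes inside the fetched text are left alone.
printf '%s %s\n' "$line" "$title"
```

Note that quoting the variables, as printf requires here, keeps any line breaks inside the values, which may or may not be what you want for multi-line comments.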


I see! Thanks, this works well for me.


Should be working now. If you still get a 400 from time to time, add a sleep 1 in the loop.


Should sleep 1 be the last line before done?


Yes. I didn't need it in my experiment, but it may still be worthwhile if you are trying to retrieve thousands of proteins.


I added it to my loop, and I have 9 results in my out file, with one 400 error despite the sleep line. The first few lines of the error look like this:

400 Bad Request
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_189017.1&rettype=native&retmode=text&edirect_os=linux&api_key=xxxx&edirect=13.9&tool=edirect&email=ubuntu@sfischer-ai-neu'
Result of do_post http request is
$VAR1 = bless( {
                 '_headers' => bless( {
                                        'client-ssl-socket-class' => 'IO::Socket::SSL',
                                        'cache-control' => 'private',
                                        'client-ssl-cert-subject' => '/C=US/ST=Maryland/L=Bethesda/O=National Library of Medicine/CN=*.ncbi.nlm.nih.gov',
                                        'client-peer' => '130.14.29.110:443',
                                        'referrer-policy' => 'origin-when-cross-origin',
                                        'client-ssl-cert-issuer' => '/C=US/O=DigiCert Inc/CN=DigiCert TLS RSA SHA256 2020 CA1',
                                        'access-control-expose-headers' => 'X-RateLimit-Limit,X-RateLimit-Remaining',

and this is in spite of the api key in my bashrc file. Is it fair to assume that this 400 indicates that the answer isn't there?

Note: I redacted your API key from the error.


"Is it fair to assume that this 400 indicates that the answer isn't there?"

That is possible. Show us the IDs that are generating this error.

NP_189017.1, the ID in the error above, does return output, so it is not the problematic one.
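If the same ID succeeds when run on its own, the failure may be transient, in which case retrying could be enough. The sketch below is one way to do that; retry_nonempty and RETRY_DELAY are hypothetical names, not EDirect features, and it assumes that a failed request leaves stdout empty while the error text goes to stderr, which I have not verified for efetch.

```shell
#!/bin/bash
# Sketch only: retry_nonempty and RETRY_DELAY are made-up names.
retry_nonempty() {
  # usage: retry_nonempty MAX_ATTEMPTS COMMAND [ARGS...]
  # Re-runs COMMAND until it prints something on stdout or attempts run out.
  local max=$1; shift
  local attempt=1 out
  while [ "$attempt" -le "$max" ]; do
    out=$("$@" < /dev/null)           # stdout only (see assumption above)
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    # Pause before the next attempt, but not after the last one.
    if [ "$attempt" -lt "$max" ]; then sleep "${RETRY_DELAY:-2}"; fi
    attempt=$((attempt + 1))
  done
  echo "no output after $max attempts: $*" >&2
  return 1
}

# Example (needs EDirect and network access):
# comment=$(retry_nonempty 3 efetch -db protein -id "$line" | sed -n '/comment/,/.",/p')
```

An ID that still returns nothing after several attempts is then a better candidate for "the answer isn't there".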


That's the only one that threw up an error. The other solution does work without a hitch though, so I think I'll just use that. Running the IDs individually, none of them had problems. (Also, thanks for catching the API key, my bad!)


Ok, so I can assume this is solved? If not, I need the full list of ids to retrieve in proteinlist.txt for testing it.


Yes, it's sorted now. Thanks!
