Hi folks,
I have several lists of proteins that I'd like to look up in NCBI's protein database. In particular, I want the title and comment for each one, and I found a way that works, at least when I run the commands individually:
esearch -db protein -query NP_189017.1 | esummary | xtract -pattern DocumentSummary -element Title
which gives the output:
Leucine-rich repeat protein kinase family protein [Arabidopsis thaliana]
and
efetch -db protein -id 'NP_189017.1' | sed -n '/comment/,/",/p'
which gives:
comment "Leucine-rich repeat protein kinase family protein; FUNCTIONS IN: protein serine/threonine kinase activity, protein kinase activity, ATP binding; INVOLVED IN: protein amino acid phosphorylation; LOCATED IN: plasma membrane; EXPRESSED IN: 26 plant structures; EXPRESSED DURING: 15 growth stages; CONTAINS InterPro DOMAIN/s: Protein kinase, ATP binding site (InterPro:IPR017441), Protein kinase, catalytic domain (InterPro:IPR000719), Leucine-rich repeat-containing N-terminal domain, type 2 (InterPro:IPR013210), Leucine-rich repeat (InterPro:IPR001611), Serine/threonine-protein kinase-like domain (InterPro:IPR017442), Protein kinase-like domain (InterPro:IPR011009), Serine/threonine-protein kinase, active site (InterPro:IPR008271); BEST Arabidopsis thaliana protein match is: transmembrane kinase 1 (TAIR:AT1G66150.1); Has 176104 Blast hits to 138784 proteins in 5021 species: Archae - 174; Bacteria - 16889; Metazoa - 56819; Fungi - 11325; Plants - 68733; Viruses - 454; Other Eukaryotes - 21710 (source: NCBI BLink).",
These results are exactly right. When I try to automate this with a script, though, I run into issues. My script is as follows:
while read -r line; do
title=$(esearch -db protein -query $line | esummary | xtract -pattern DocumentSummary -element Title)
comment=$(efetch -db protein -id $line | sed -n '/comment/,/.",/p')
echo -n $line $title $comment \n >> protein_descriptions.txt; done < proteinlist.txt
I get the following error:
400 Bad Request
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_189017.1&rettype=native&retmode=text&edirect_os=linux&edirect=13.9&tool=edirect&email=incorrect@email-new'
I haven't given an email; I guess it just grabbed that from my system. Either way, I've checked multiple protein IDs individually and they all work, so it's not an issue of the IDs being invalid.
esearch generally works, but I really need the comments, which I get from efetch. With that said, how can I fix the 400 Bad Request issue? Or is there a better tool for this than efetch?
There used to be a limit of 3 queries per second on anonymous requests and 10 queries per second on requests with a user token (API key) via the E-utilities API. If you don't want to set up authentication, the easiest hack would probably be to add a one-second delay inside the loop, like
sleep 1
Though if rate limiting were the cause, at least the first request should still work...
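For what it's worth, here is a minimal sketch of the loop with the delay added. The function name fetch_descriptions, its two arguments, and the DELAY variable are made up for illustration. Two things in it are assumptions on my part rather than a confirmed fix for the 400 error: the ID list is read on file descriptor 3, and the EDirect commands get their stdin from /dev/null, since commands inside a `while read` loop can otherwise swallow the loop's input. The variables are also quoted, and printf replaces `echo -n ... \n`, which does not print a newline in bash.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_descriptions, its arguments, and DELAY are hypothetical names.
# Usage: fetch_descriptions <id_list_file> <output_file>

fetch_descriptions() {
    local list=$1 out=$2 delay=${DELAY:-1}
    local id title comment

    # Read IDs on file descriptor 3 so the commands in the loop body
    # cannot consume the list from stdin (an assumption, see above).
    while IFS= read -r id <&3 || [ -n "$id" ]; do
        [ -n "$id" ] || continue    # skip blank lines

        title=$(esearch -db protein -query "$id" </dev/null |
                esummary |
                xtract -pattern DocumentSummary -element Title)

        # Same sed range as the single-ID command; join wrapped lines with spaces.
        comment=$(efetch -db protein -id "$id" </dev/null |
                  sed -n '/comment/,/",/p' | tr '\n' ' ')

        # One tab-separated record per ID, appended to the output file.
        printf '%s\t%s\t%s\n' "$id" "$title" "$comment" >> "$out"

        sleep "$delay"              # pause between IDs to stay under the rate limit
    done 3< "$list"
}
```

After sourcing the file, it could be called as `fetch_descriptions proteinlist.txt protein_descriptions.txt`. If you do have an API key, EDirect should pick it up from the NCBI_API_KEY environment variable, in which case a shorter delay such as `DELAY=0.2` may be enough.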