start from a list of protein accessions (i.e. WP_xxx) and get a correspondent list of assembly accessions (GCF_xxx).
If the intent here is to obtain a list of all GCF accessions that a particular WP is annotated on, then the answer above is not correct. It is fetching the entire list of GCFs for an organism irrespective of whether the WP is annotated on those assemblies or not. A couple of strategies to get the data you are looking for:
1. Use the IPG report
The IPG report for a WP accession will have the GCF accession. For example, here is the IPG table for WP_000134546. This particular WP appears to be annotated only on one assembly. For an example that is annotated on a number of assemblies, take a look at WP_003547430.1. You can simply download the entire table in CSV format by using the 'Send To' menu at the top right corner and choose 'File' as an option.
To do the same from the command line using Entrez Direct, you can repurpose the following code:
## returns tab-delimited data
efetch -db ipg -id 'WP_000134546.1' -format ipg
Id Source Nucleotide Accession Start Stop Strand Protein Protein Name Organism Strain Assembly
33217001 RefSeq NZ_AHLT01000014.1 13020 13304 - WP_000134546.1 hypothetical protein Staphylococcus aureus subsp. aureus IS-122 IS-122 GCF_000247415.1
The IPG output is available as XML too; you run the same command with -format ipg -mode xml
. IPG report will give you a row for every single protein accession out there, whether there is a WP annotated there or not. So it can be quite long for some proteins. But it’s easy enough to parse for just the WPs.
2. Use elink
from protein to nuccore to assembly
Here, you just hop through a couple of elink
steps to go from NCBI Protein to NCBI Nucleotide to NCBI Assembly. This will return information about the assemblies but miss out on the rich information in the IPG report such as the nucleotide accession, range, strand, etc.
esearch -db protein -query 'WP_000134546' \
| elink -target nuccore -name protein_nuccore_wp \
| elink -db nuccore -target assembly -name nuccore_assembly \
| esummary \
| xtract -pattern DocumentSummary -element AssemblyAccession
GCF_000247415.1
You can find more information about the elink
descriptions here. NCBI has the following information on how to find related data starting with WPs.
WP*
records represents a single, non-redundant, protein sequence which may be annotated on many different RefSeq genomes from the same, or different, species.You could use Entrezdirect but be ready to get a bunch of "GCF*" records for each
WP
accession that may be from different species. A random example below generates 16K genomes.If you want to limit to a specific organism.
Just a couple of points (in addition the answer below):
elink
with-name protein_taxonomy
parameter.In light of your comments I have decided to move my answer to a comment since indeed it is not answering the original question. It is great that we have someone from NCBI with deep knowledge about Entrezdirect participating here, since much of this information is undocumented (e.g.
-name protein_taxonomy
or use ofipg
database forefetch
, there is no mention of it in in-line help) or is documented in a form that is not easily accessible to end users. Don't get me wrong. I like Entrezdirect and think it is one thing that NCBI has that trumps Ensembl.So the only reason that protein has a
WP*
(which I just randomly chose and which points to a unique protein) is because it is a hypothetical protein since it does not satisfy "on many different RefSeq genomes from the same, or different, species" this criterion? Should it not have been aXP*
accession instead?Yes, indeed, the documentation is not always very easy to discover. As I had mentioned in my response, the entire list of Entrez Link descriptions can be found here (the bit.ly link for it to remember is http://bit.ly/elink-descriptions). As for the databases available for search using
esearch
, you can invokeeinfo -dbs
to list the databases. Additionally, you can useeinfo -db ipg
(or any other legitimate value fordb
) to see information about the fields that can be used in the search query, link names and their descriptions, etc. Personally, I find the XML output ofeinfo -db ipg
a bit cumbersome to 'read' but it came in handy more than few times in the past.As described in this page and the other pages under the 'Related documentation', NCBI's prokaryote annotation pipeline (PGAP) now creates proteins with WP accessions for ALL proteins annotated on prokaryote assemblies. If there is another WP protein with the exact same sequence, then a new WP is not created and the old one is reused. Otherwise, a new WP is created. In this instance, only a single current RefSeq assembly encodes a protein matching that particular WP. The lack of information to assign an informative protein name does not factor into WP assignment.
Thanks to for all the replies, these were very useful!