From a list of RefSeq protein IDs to assembly accessions
1
2
Entering edit mode
4.7 years ago
GZM ▴ 20

Hi All,

I am pretty new at script writing and terminal usage. I was wondering if there is an easy way to start from a list of protein accessions (i.e. WP_xxx) and get a correspondent list of assembly accessions (GCF_xxx). I do not need to download the sequences, all I need is a table or text format in which each protein is associated with its assembly, This format is required to then run a python program. I have read similar questions (although were not looking for exactly this to my knowledge), and I found that e-utils could allow this? Is it possible to do? Could someone help me with the command to write? Thanks in advance

sequence • 2.1k views
ADD COMMENT
2
Entering edit mode

WP* records represents a single, non-redundant, protein sequence which may be annotated on many different RefSeq genomes from the same, or different, species.

You could use Entrezdirect but be ready to get a bunch of "GCF*" records for each WP accession that may be from different species. A random example below generates 16K genomes.

$ esearch -db protein -query "WP_000134546" | efetch -format xml | xtract -pattern Seq-entry_seq -element Org-ref_taxname | xargs -n 1 sh -c 'esearch -db assembly -query "$0" | esummary | xtract -pattern DocumentSummary -element RefSeq' | wc -l
   16516

If you want to limit to a specific organism.

$ esearch -db protein -query "WP_000134546" | efetch -format xml | xtract -pattern Seq-entry_seq -element Org-ref_taxname | xargs -n 1 sh -c 'esearch -db assembly -query "$0" | esummary |  xtract -pattern DocumentSummary -if Organism -contains "Staphylococcus aureus" -element RefSeq' | head -5

GCF_011754615.1
GCF_011751615.1
GCF_011751495.1
GCF_011751485.1
GCF_011394825.1
ADD REPLY
1
Entering edit mode

Just a couple of points (in addition the answer below):

  1. Rather than extracting the organism and using that for a taxonomy lookup, you can get the same result using elink with -name protein_taxonomy parameter.
  2. The organism source recorded for a WP is the highest taxid in which that WP is found, but it does not mean that every RefSeq assembly under that taxid has that WP. Even for a WP with a species-level taxid, strain variation will result in a WP only being found in a subset of assemblies. So for WP_000134546.1, while it’s true that it is from Staphylococcus aureus and there are many thousands of RefSeq assemblies for that species, that particular WP is unique to a single assembly.
ADD REPLY
1
Entering edit mode

In light of your comments I have decided to move my answer to a comment since indeed it is not answering the original question. It is great that we have someone from NCBI with deep knowledge about Entrezdirect participating here, since much of this information is undocumented (e.g. -name protein_taxonomy or use of ipg database for efetch, there is no mention of it in in-line help) or is documented in a form that is not easily accessible to end users. Don't get me wrong. I like Entrezdirect and think it is one thing that NCBI has that trumps Ensembl.

So for WP_000134546.1, while it’s true that it is from Staphylococcus aureus and there are many thousands of RefSeq assemblies for that species, that particular WP is unique to a single assembly.

So the only reason that protein has a WP* (which I just randomly chose and which points to a unique protein) is because it is a hypothetical protein since it does not satisfy "on many different RefSeq genomes from the same, or different, species" this criterion? Should it not have been a XP* accession instead?

ADD REPLY
1
Entering edit mode

much of this information is undocumented

Yes, indeed, the documentation is not always very easy to discover. As I had mentioned in my response, the entire list of Entrez Link descriptions can be found here (the bit.ly link for it to remember is http://bit.ly/elink-descriptions). As for the databases available for search using esearch, you can invoke einfo -dbs to list the databases. Additionally, you can use einfo -db ipg (or any other legitimate value for db) to see information about the fields that can be used in the search query, link names and their descriptions, etc. Personally, I find the XML output of einfo -db ipg a bit cumbersome to 'read' but it came in handy more than few times in the past.

Should it not have been a XP* accession instead?

As described in this page and the other pages under the 'Related documentation', NCBI's prokaryote annotation pipeline (PGAP) now creates proteins with WP accessions for ALL proteins annotated on prokaryote assemblies. If there is another WP protein with the exact same sequence, then a new WP is not created and the old one is reused. Otherwise, a new WP is created. In this instance, only a single current RefSeq assembly encodes a protein matching that particular WP. The lack of information to assign an informative protein name does not factor into WP assignment.

ADD REPLY
0
Entering edit mode

Thanks to for all the replies, these were very useful!

ADD REPLY
2
Entering edit mode
4.7 years ago
vkkodali_ncbi ★ 3.8k

start from a list of protein accessions (i.e. WP_xxx) and get a correspondent list of assembly accessions (GCF_xxx).

If the intent here is to obtain a list of all GCF accessions that a particular WP is annotated on, then the answer above is not correct. It is fetching the entire list of GCFs for an organism irrespective of whether the WP is annotated on those assemblies or not. A couple of strategies to get the data you are looking for:

1. Use the IPG report

The IPG report for a WP accession will have the GCF accession. For example, here is the IPG table for WP_000134546. This particular WP appears to be annotated only on one assembly. For an example that is annotated on a number of assemblies, take a look at WP_003547430.1. You can simply download the entire table in CSV format by using the 'Send To' menu at the top right corner and choose 'File' as an option.

To do the same from the command line using Entrez Direct, you can repurpose the following code:

## returns tab-delimited data 
efetch -db ipg -id 'WP_000134546.1' -format ipg
Id        Source  Nucleotide Accession  Start  Stop   Strand  Protein         Protein Name          Organism                                    Strain  Assembly
33217001  RefSeq  NZ_AHLT01000014.1     13020  13304  -       WP_000134546.1  hypothetical protein  Staphylococcus aureus subsp. aureus IS-122  IS-122  GCF_000247415.1

The IPG output is available as XML too; you run the same command with -format ipg -mode xml. IPG report will give you a row for every single protein accession out there, whether there is a WP annotated there or not. So it can be quite long for some proteins. But it’s easy enough to parse for just the WPs.

2. Use elink from protein to nuccore to assembly

Here, you just hop through a couple of elink steps to go from NCBI Protein to NCBI Nucleotide to NCBI Assembly. This will return information about the assemblies but miss out on the rich information in the IPG report such as the nucleotide accession, range, strand, etc.

esearch -db protein -query 'WP_000134546' \
  | elink -target nuccore -name protein_nuccore_wp \
  | elink -db nuccore -target assembly -name nuccore_assembly \
  | esummary \
  | xtract -pattern DocumentSummary -element AssemblyAccession 
GCF_000247415.1

You can find more information about the elink descriptions here. NCBI has the following information on how to find related data starting with WPs.

ADD COMMENT

Login before adding your answer.

Traffic: 1604 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6