I was given a list of protein accessions and their associated taxa, but I need the assembly accession that matches each protein and taxon. In my case, each protein comes from a different taxon. From this post (p/429609), I gather that getting this information is difficult because:
WP* records represent a single, non-redundant protein sequence which may be annotated on many different RefSeq genomes from the same, or different, species.
I found this to be the case when using e-utils as below:
One solution based on the above post is that I could use the -name option, but how would this work for multiple different taxa?
vkkodali_ncbi do you kindly have any advice for me?
Instead of nuccore, you should target ipg (the Identical Protein Groups database).
Here is an example with efetch:
$ head id.txt
WP_133179913
WP_201696567
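$ mkdir -p out_efetch   # the output directory must exist before redirecting into it
$ # "< /dev/null" is a precaution in case efetch reads the loop's stdin and swallows the remaining accessions
$ while read -r p; do echo "$p"; efetch -db ipg -id "$p" -format ipg < /dev/null > "out_efetch/$p.tab"; done < id.txt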
For each accession in id.txt, you will get a tab-delimited file with the following information:
Id	Source	Nucleotide Accession	Start	Stop	Strand	Protein	Protein Name	Organism	Strain	Assembly
375761440	RefSeq	NZ_CAJHCQ010000006.1	249364	251391	+	WP_201696567.1	AraC family transcriptional regulator N-terminal domain-containing protein	Paraburkholderia hiiakae	LMG 27952	GCF_904848665.1
375761440	INSDC	CAJHCQ010000006.1	249364	251391	+	CAD6533940.1	HTH-type transcriptional activator RhaS	Paraburkholderia hiiakae	LMG 27952	GCA_904848665.1
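Since a WP accession can appear on several genomes, the rows still need to be filtered by the taxon you were given. Below is a minimal sketch of that step, not a tested pipeline. It assumes a hypothetical file wanted.tsv with one "protein<TAB>organism" pair per line, that the organism name matches the Organism column exactly, and that the .tab files are tab-delimited with the eleven columns shown above. The function name pick_assembly is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: for each protein, keep the RefSeq row whose Organism equals the expected taxon.
# Assumptions (not from the thread): $1 is a "protein<TAB>organism" file,
# $2 is the directory holding <protein>.tab files written by efetch -format ipg.

pick_assembly() {
    local wanted="$1" dir="$2"
    while IFS=$'\t' read -r protein organism; do
        # Skip proteins for which no ipg report was downloaded.
        [ -f "$dir/$protein.tab" ] || continue
        awk -F'\t' -v p="$protein" -v org="$organism" '
            NR == 1 { next }                  # skip the header line
            $2 == "RefSeq" && $9 == org {     # column 2 = Source, column 9 = Organism
                print p "\t" org "\t" $11     # column 11 = Assembly
            }' "$dir/$protein.tab"
    done < "$wanted"
}
```

For example, `pick_assembly wanted.tsv out_efetch > protein_assembly.tsv` would write one line per matching protein, organism, and GCF accession. If the taxon names you were given differ from NCBI's Organism strings (strain suffixes, renamed species), the exact comparison will miss them and a looser match would be needed.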
Edit: sorry, I missed the part where you were also interested in the taxonomy. In this case, I would suggest downloading the latest GTDB metadata files and using the RefSeq assembly accession (GCF_***) to add the GTDB taxonomic lineage to your dataframe.
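If you prefer to do the join on the command line before loading a dataframe, here is a rough sketch. It assumes a tab-separated input whose third column is the GCF accession (a hypothetical protein, organism, assembly layout), and a decompressed GTDB metadata TSV with header columns named accession and gtdb_taxonomy, where accessions carry an RS_ or GB_ prefix. Please check those column names and the prefix against the release you download. The function name add_gtdb_lineage is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: append the GTDB lineage as a fourth column, keyed on the assembly accession.
# Assumptions (verify against your GTDB release): the metadata header has
# "accession" and "gtdb_taxonomy" columns, and accessions look like RS_GCF_... / GB_GCA_...

add_gtdb_lineage() {
    local assemblies="$1" metadata="$2"
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                               # first file: GTDB metadata
            if (FNR == 1) {                       # locate columns by header name
                for (i = 1; i <= NF; i++) {
                    if ($i == "accession")     acc = i
                    if ($i == "gtdb_taxonomy") tax = i
                }
                if (!acc || !tax) {
                    print "expected columns not found in metadata" > "/dev/stderr"
                    exit 1
                }
                next
            }
            id = $acc
            sub(/^(RS_|GB_)/, "", id)             # strip the GTDB source prefix
            lineage[id] = $tax
            next
        }
        # second file: print the row plus its lineage, or NA when GTDB lacks the assembly
        { print $0, (($3 in lineage) ? lineage[$3] : "NA") }
    ' "$metadata" "$assemblies"
}
```

Usage would look like `add_gtdb_lineage protein_assembly.tsv metadata.tsv > with_lineage.tsv`, where both file names are placeholders. The match includes the version suffix (.1, .2), so an assembly updated at NCBI after the GTDB release will come back as NA.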