Question

Help with e-utils: Need one assembly accession from one protein accession

2.9 years ago

Morgan S. ▴ 90

Hi,

I was given a list of protein accessions and their associated taxa, and I need the assembly accession that matches each protein and its taxon. In my case, each protein comes from a different taxon. From this post (p/429609), I gather that this information is hard to get because:

WP* records represents a single, non-redundant, protein sequence which may be annotated on many different RefSeq genomes from the same, or different, species.

I found this to be the case when using e-utils as below:

myinputarg=$(tr "\n" "," < protein_accessions.txt)
elink -id "$myinputarg" -db protein -target nuccore | efetch -format acc > assemblyAccessions.txt

One solution based on the above post would be to use the -name option, but how would that work when every protein comes from a different taxon? vkkodali_ncbi, do you have any advice for me?

Thanks in advance! Morgan

ncbi protein accession genome database • 1.3k views

score 2 · Answer 1 · 2022-08-02

Instead of nuccore, you should target ipg (the Identical Protein Groups database).

Here is an example with efetch:

$ head id.txt
WP_133179913
WP_201696567

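$ mkdir -p out_efetch
$ # "< /dev/null" keeps efetch from consuming the rest of id.txt inside the loop
$ while read -r p; do echo "$p"; efetch -db ipg -id "$p" -format ipg < /dev/null > "out_efetch/$p.tab"; done < id.txt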

For each accession in id.txt, you will get a tab-delimited file with the following information:

Id      Source  Nucleotide Accession    Start   Stop    Strand  Protein Protein Name    Organism        Strain  Assembly
375761440       RefSeq  NZ_CAJHCQ010000006.1    249364  251391  +       WP_201696567.1  AraC family transcriptional regulator N-terminal domain-containing protein      Paraburkholderia hiiakae   LMG 27952       GCF_904848665.1
375761440       INSDC   CAJHCQ010000006.1       249364  251391  +       CAD6533940.1    HTH-type transcriptional activator RhaS Paraburkholderia hiiakae        LMG 27952       GCA_904848665.1
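Since the question asks for one assembly per protein and taxon, a table like this can be filtered down to the matching row. Here is a minimal sketch, assuming the file is tab-delimited with the column order shown above and that the organism name you were given matches the Organism column exactly. pick_assembly is a hypothetical helper name, not part of EDirect, and the demo simply rebuilds the two rows above in a temporary file:

```shell
# pick_assembly FILE ORGANISM  (hypothetical helper, not an EDirect command)
# Prints "protein<TAB>assembly" for RefSeq rows of an IPG table whose
# Organism column (column 9) equals ORGANISM exactly.
pick_assembly() {
  awk -F'\t' -v org="$2" '
    NR == 1 { next }                                   # skip the header line
    $2 == "RefSeq" && $9 == org { print $7 "\t" $11 }  # Protein, Assembly
  ' "$1"
}

# Demo: write the header and the two rows shown above as tab-delimited text.
tmp=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  Id Source "Nucleotide Accession" Start Stop Strand Protein "Protein Name" Organism Strain Assembly \
  375761440 RefSeq NZ_CAJHCQ010000006.1 249364 251391 + WP_201696567.1 \
  "AraC family transcriptional regulator N-terminal domain-containing protein" \
  "Paraburkholderia hiiakae" "LMG 27952" GCF_904848665.1 \
  375761440 INSDC CAJHCQ010000006.1 249364 251391 + CAD6533940.1 \
  "HTH-type transcriptional activator RhaS" \
  "Paraburkholderia hiiakae" "LMG 27952" GCA_904848665.1 \
  > "$tmp"

result=$(pick_assembly "$tmp" "Paraburkholderia hiiakae")
echo "$result"
rm -f "$tmp"
```

If the names in your list do not match NCBI's organism names exactly, you may need a looser match (for example on genus and species only), and a widely shared WP_ protein can still return several RefSeq rows for the same organism.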

Edit: sorry, I missed that you were also interested in the taxonomy. In that case, I would suggest downloading the latest GTDB metadata files and using the RefSeq assembly accession (GCF_***) to add the GTDB taxonomic lineage to your data frame.
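The GTDB lookup could be done roughly like this. It is only a sketch under a few assumptions: that the metadata TSV has header columns named accession and gtdb_taxonomy, and that its accessions carry an RS_ or GB_ prefix in front of the GCF_/GCA_ accession. Please check both against the file you actually download. add_gtdb_lineage is a hypothetical helper name, and the demo rows are made-up placeholders, not real GTDB records:

```shell
# add_gtdb_lineage METADATA ASSEMBLIES  (hypothetical helper)
# METADATA:   GTDB metadata TSV; assumed to have "accession" and
#             "gtdb_taxonomy" header columns.
# ASSEMBLIES: one assembly accession per line.
# Prints "assembly<TAB>lineage", or NA when the assembly is not found.
add_gtdb_lineage() {
  awk -F'\t' '
    NR == FNR && FNR == 1 {            # metadata header: find columns by name
      for (i = 1; i <= NF; i++) {
        if ($i == "accession")     a = i
        if ($i == "gtdb_taxonomy") t = i
      }
      next
    }
    NR == FNR {                        # metadata rows: strip the assumed prefix
      acc = $a
      sub(/^(RS|GB)_/, "", acc)
      lineage[acc] = $t
      next
    }
    { print $1 "\t" (($1 in lineage) ? lineage[$1] : "NA") }
  ' "$1" "$2"
}

# Demo with placeholder data (NOT real GTDB records).
meta=$(mktemp)
asm=$(mktemp)
printf 'accession\tother_column\tgtdb_taxonomy\n' > "$meta"
printf 'RS_GCF_000000000.1\tplaceholder\td__Bacteria;p__Placeholder\n' >> "$meta"
printf 'GCF_000000000.1\nGCF_999999999.9\n' > "$asm"

joined=$(add_gtdb_lineage "$meta" "$asm")
echo "$joined"
rm -f "$meta" "$asm"
```

Looking the columns up by name rather than by position means the sketch should not depend on the column order in the metadata file.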

score 0 · Answer 2 · 2022-08-02

You can use NCBI Datasets for this. Specifically, you can use the command-line tool as follows:

$ datasets download gene accession WP_003547430.1 --exclude-gene --exclude-protein --exclude-rna --taxon-filter 1703964
Downloading: ncbi_dataset.zip    2.23kB done
$ unzip -v ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
    1604  Defl:N      769  52% 2022-08-02 17:21 3de26d82  README.md
     307  Defl:N      230  25% 2022-08-02 17:21 e4222cba  ncbi_dataset/data/data_report.jsonl
     525  Defl:N      254  52% 2022-08-02 17:21 ac47b899  ncbi_dataset/data/annotation_report.jsonl
     275  Defl:N      153  44% 2022-08-02 17:21 524b1567  ncbi_dataset/data/dataset_catalog.json
--------          -------  ---                            -------
    2711             1406  48%                            4 files
$ unzip ncbi_dataset.zip ncbi_dataset/data/annotation_report.jsonl
Archive:  ncbi_dataset.zip
  inflating: ncbi_dataset/data/annotation_report.jsonl

The JSONL file can then be parsed using the Datasets tool dataformat to generate a table as follows:

$ dataformat tsv prok-gene-location \
  --fields protein-accession,organism-organism-name,organism-tax-id,refseq-genomic-location-assembly-accession \
  --inputfile ncbi_dataset/data/annotation_report.jsonl
Protein Accession  Organism Organism Name  Organism Taxonomic ID  RefSeq Genomic Location Assembly Accession
WP_003547430.1     Rhizobium sp. N621      1703964                GCF_001664325.1
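For a whole list of proteins from different taxa, the same command could be generated once per protein/taxid pair. Below is a minimal sketch, assuming a two-column tab-delimited input file (protein accession, NCBI taxonomy ID). build_datasets_cmds is a hypothetical helper name. The flags are copied from the command above and may differ in other versions of datasets; --filename is added so that each download gets its own zip instead of overwriting ncbi_dataset.zip. The function only prints the commands, so you can inspect them before running anything:

```shell
# build_datasets_cmds PAIRS  (hypothetical helper)
# PAIRS: tab-delimited file, column 1 = protein accession, column 2 = taxid.
# Prints one datasets command per pair; nothing is downloaded here.
build_datasets_cmds() {
  tab=$(printf '\t')
  while IFS="$tab" read -r acc taxid; do
    [ -z "$acc" ] && continue          # skip blank lines
    printf 'datasets download gene accession %s --exclude-gene --exclude-protein --exclude-rna --taxon-filter %s --filename %s.zip\n' \
      "$acc" "$taxid" "$acc"
  done < "$1"
}

# Demo with the accession and taxid used in the example above.
pairs=$(mktemp)
printf 'WP_003547430.1\t1703964\n' > "$pairs"
cmds=$(build_datasets_cmds "$pairs")
echo "$cmds"
rm -f "$pairs"
```

Once the printed commands look right, they can be piped to sh (for example `build_datasets_cmds pairs.tsv | sh`), and each resulting zip can be parsed with dataformat as shown above.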