Help with e-utils: Need one assembly accession from one protein accession
2.3 years ago
Morgan S. ▴ 90

Hi,

I was given a list of protein accessions and their associated taxa, but I need the assembly accession that matches each protein and its taxonomy. In my case, each protein comes from a different taxon. From this post (p/429609), I gather that getting this information is difficult because:

WP* records represents a single, non-redundant, protein sequence which may be annotated on many different RefSeq genomes from the same, or different, species.

I found this to be the case when using e-utils as below:

myinputarg=$(paste -sd, protein_accessions.txt); elink -id "$myinputarg" -db protein -target nuccore | efetch -format acc > assemblyAccessions.txt

One solution based on the above post is to use the -name option, but how would this work for multiple different taxa? vkkodali_ncbi, would you kindly have any advice for me?

Thanks in advance! Morgan

ncbi protein accession genome database • 937 views
2.3 years ago

Instead of nuccore, you should target ipg (the Identical Protein Groups database).

Here is an example with efetch:

$ head id.txt
WP_133179913
WP_201696567

$ mkdir -p out_efetch; while read -r p; do echo "$p"; efetch -db ipg -id "$p" -format ipg > "out_efetch/$p.tab" < /dev/null; done < id.txt

For each accession in id.txt you will get a tab-separated file with the following information:

Id      Source  Nucleotide Accession    Start   Stop    Strand  Protein Protein Name    Organism        Strain  Assembly
375761440       RefSeq  NZ_CAJHCQ010000006.1    249364  251391  +       WP_201696567.1  AraC family transcriptional regulator N-terminal domain-containing protein      Paraburkholderia hiiakae   LMG 27952       GCF_904848665.1
375761440       INSDC   CAJHCQ010000006.1       249364  251391  +       CAD6533940.1    HTH-type transcriptional activator RhaS Paraburkholderia hiiakae        LMG 27952       GCA_904848665.1
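
To narrow each table down to the taxon you were given, something like the following might work. It is only a sketch: the function name is made up, the column positions are taken from the header shown above (2 = Source, 7 = Protein, 9 = Organism, 11 = Assembly), and it assumes the organism names in your list match the Organism column exactly, which may not hold for every record.

```shell
# Hypothetical helper: print "protein<TAB>organism<TAB>assembly" for the
# RefSeq rows of an ipg table whose Organism column equals the given name.
# Column numbers follow the ipg header shown above.
ipg_assembly_for_organism() {
  ipg_file=$1
  organism=$2
  awk -F '\t' -v org="$organism" '
    # skip the header, keep RefSeq rows for the requested organism
    NR > 1 && $2 == "RefSeq" && $9 == org { print $7 "\t" $9 "\t" $11 }
  ' "$ipg_file"
}
```

For example, `ipg_assembly_for_organism out_efetch/WP_201696567.tab "Paraburkholderia hiiakae"` would keep only the GCF_ row from the table above.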

Edit: sorry, I missed the part where you were also interested in the taxonomy. In that case I would suggest downloading the latest GTDB metadata files and using the RefSeq assembly accession (GCF_***) to add the GTDB taxonomic lineage to your dataframe.
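
If you would rather do that lookup on the command line than in a dataframe, here is a rough sketch. It rests on assumptions you should check against the file you actually download: that the GTDB metadata file is tab-separated, that its header has columns named "accession" and "gtdb_taxonomy", and that accessions carry an "RS_" or "GB_" prefix. The function name is hypothetical.

```shell
# Hypothetical helper: look up GTDB lineages for assembly accessions.
# $1 = file with one assembly accession per line (e.g. GCF_904848665.1)
# $2 = GTDB metadata TSV (assumed columns: "accession", "gtdb_taxonomy")
gtdb_lineage_lookup() {
  accessions_file=$1
  metadata_tsv=$2
  awk -F '\t' '
    NR == FNR { if ($1 != "") want[$1] = 1; next }  # first file: accessions to keep
    FNR == 1 {                                      # metadata header: locate columns by name
      for (i = 1; i <= NF; i++) {
        if ($i == "accession") a = i
        if ($i == "gtdb_taxonomy") t = i
      }
      if (!a || !t) exit 1                          # expected columns not found
      next
    }
    {
      acc = $a
      sub(/^(RS|GB)_/, "", acc)                     # strip the assumed GTDB prefix
      if (acc in want) print acc "\t" $t
    }
  ' "$accessions_file" "$metadata_tsv"
}
```

Note that the match is on the full versioned accession, so an assembly whose version differs between RefSeq and the GTDB release would be missed.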

2.3 years ago
vkkodali_ncbi ★ 3.8k

You can use NCBI Datasets for this. Specifically, you can use the command-line tool as follows:

$ datasets download gene accession WP_003547430.1 --exclude-gene --exclude-protein --exclude-rna --taxon-filter 1703964
Downloading: ncbi_dataset.zip    2.23kB done
$ unzip -v ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
    1604  Defl:N      769  52% 2022-08-02 17:21 3de26d82  README.md
     307  Defl:N      230  25% 2022-08-02 17:21 e4222cba  ncbi_dataset/data/data_report.jsonl
     525  Defl:N      254  52% 2022-08-02 17:21 ac47b899  ncbi_dataset/data/annotation_report.jsonl
     275  Defl:N      153  44% 2022-08-02 17:21 524b1567  ncbi_dataset/data/dataset_catalog.json
--------          -------  ---                            -------
    2711             1406  48%                            4 files
$ unzip ncbi_dataset.zip ncbi_dataset/data/annotation_report.jsonl
Archive:  ncbi_dataset.zip
  inflating: ncbi_dataset/data/annotation_report.jsonl

The JSONL file can then be parsed using the Datasets tool dataformat to generate a table as follows:

$ dataformat tsv prok-gene-location \
  --fields protein-accession,organism-organism-name,organism-tax-id,refseq-genomic-location-assembly-accession \
  --inputfile ncbi_dataset/data/annotation_report.jsonl
Protein Accession  Organism Organism Name  Organism Taxonomic ID  RefSeq Genomic Location Assembly Accession
WP_003547430.1     Rhizobium sp. N621      1703964                GCF_001664325.1
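
For a list of proteins that each belong to a different taxon, one possible approach is to loop over protein/taxid pairs and run the same download once per pair. The sketch below assumes a two-column, tab-separated input file (protein accession, NCBI taxid); that layout and the function name are made up for illustration. The flags are the ones used above, plus --filename to keep the packages apart, and they may differ in other releases of the datasets tool.

```shell
# Hypothetical helper: for each "protein_accession<TAB>taxid" line in $1,
# download the matching annotation package into directory $2.
download_per_taxon() {
  pairs_file=$1
  outdir=${2:-datasets_out}
  mkdir -p "$outdir"
  while IFS="$(printf '\t')" read -r acc taxid; do
    [ -n "$acc" ] || continue   # skip blank lines
    # stdin is redirected so the command cannot consume the loop's input
    datasets download gene accession "$acc" \
      --exclude-gene --exclude-protein --exclude-rna \
      --taxon-filter "$taxid" \
      --filename "$outdir/$acc.zip" < /dev/null
  done < "$pairs_file"
}
```

Each resulting zip can then be unpacked and passed to dataformat as shown above.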