Trying to match Accession numbers to Taxonomy IDs
3
0
Entering edit mode
7.1 years ago

Currently writing a sequence alignment programme in javascript and i want to be able to match the genbank accession numbers to species names and for a list of 500+ accessions. Is there any easy way to do this rather than changing over to python?

alignment gene genbank conversion taxonomy • 5.4k views
ADD COMMENT
0
Entering edit mode

in javascript

context needed: is it a cmd-line program or is it a web-page ?

ADD REPLY
0
Entering edit mode

its on a webpage so i'm work in java then convert to html format

ADD REPLY
0
Entering edit mode

in java ????? I hope you mean javascript ?!

ADD REPLY
0
Entering edit mode

Java is to javascript as ham is to hamster, right? :-p

ADD REPLY
0
Entering edit mode

yup javascript! as the name suggests this is all new territory for me!

ADD REPLY
1
Entering edit mode
7.1 years ago
Sej Modha 5.3k

A simplified version of the above example with the edirect unix eutils would be:

esearch -db nuccore -query FJ070571|elink -target taxonomy |efetch -format uid
ADD COMMENT
0
Entering edit mode

Dear community,

I don't think the following question deserve a new post, given that it's largely inspired from Sej Modha answer, thus I am commenting on his answer :

Question title : "Retrieving blast staxids from saccver with ncbi E-utilities"

I am parsing blast+ output (custom outfmt 6 with saccver column).
I am interested in retrieving the staxids info for each hit.
The blast was done against a custom database, thus I cannot request the staxids column directly in the output.
But I can map saccver with staxids.

My issue is that I did not manage to do it efficiently yet, let me explain :

The output contains a large number of hits, representing many different saccver.
Ideally, I need to be able to query thousands of saccver at once with efetch / esearch and get a result where each saccver is associated with the corresponding staxids (a Python dictionary for example).

If I loop on my list of saccver, the result is fine but slow : Example for a single query :

time esearch -db protein \
             -query XP_001778611.1 \
             | elink -target taxonomy \
             | efetch -format uid;

3218
real 0m2.332s

Trying the same as above with 5 queries, hoping that the results are in the same order as the input :
Truth :
saccver, staxids
XP_001778611.1, 3218
XP_006409976.1, 72664
NP_001149654.2, 4577
XP_014499952.1, 3916
XP_016732373.1, 3635

time esearch -db protein \
             -query XP_001778611.1,NP_001149654.2,XP_014499952.1,XP_006409976.1,XP_016732373.1 \
             | elink -target taxonomy \
             | efetch -format uid;

72664
4577
3916
3635
3218
real 0m2.290s

Sadly, staxids results are sorted decreasingly, thus I don't see a way to map them to input saccver list here.

Last thing I tried was querying in the xml protein_db output, but I still fail to map saccver to staxids here :

time efetch -db protein \
            -id XP_001778611.1,XP_006409976.1,NP_001149654.2 \
            -format xml \
            | xtract -pattern Bioseq-set_seq-set \
                     -group Seqdesc_source \
                     -sep "@" \
                     -element Object-id_id \
                     -group Textseq-id \
                     -sep "@" \
                     -element Textseq-id_accession;

3218 72664 4577 XM_001778559 XP_001778611 XM_006409913 XP_006409976 NM_001156182 NP_001149654
real 0m1.204s

Any suggestions ?
Best regards

ADD REPLY
1
Entering edit mode

If you have a very large set of IDs, maybe you could run a local search against the accession to taxonomy data using the files available on ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/

ADD REPLY
0
Entering edit mode

That seems clever, I'll try it this afternoon

ADD REPLY
1
Entering edit mode

Ok, I should be able to handle it from there, thanks for the tips :

time zgrep -f list_5.txt \
           prot.accession2taxid.gz \
           | awk '{print $2"\t"$3}' $_

XP_001778611.1 3218
NP_001149654.2 4577
XP_006409976.1 72664
XP_014499952.1 3916
XP_016732373.1 3635

real 1m33.959s

ADD REPLY
0
Entering edit mode

Glad to know! An independent awk matching should be quicker than grep.

ADD REPLY
0
Entering edit mode

The last hurdle :

My protein database was downloaded from ncbi ftp :

$ wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/*.protein.faa.gz &>> wget.verbose;
$ cat *faa.gz > plant_refseq.faa.gz;

Then, I launch a blastx against this plant_refseq database.

When i query all saccver in my blastx output vs prot.accession2taxid.gz, some are not found.
Here are some examples :
XP_004986996.1
XP_012703670.2
XP_013748334.1

The first saccver problem (XP_004986996.1) is easy to solve :
In my blast, the subject was "XP_004986996.1".
But in the file prot.accession2taxid.gz, it was updated to the latest version = XP_004986996.2.

 $ zgrep 'XP_004986996' prot.accession2taxid.gz;
 accession  accession.version   taxid   gi
 XP_004986996   XP_004986996.2  4555    1256755570

Thus my search should be done on accession (XP_004986996.) and not accession+version (XP_004986996.1).

What's worrying is the last 2 saccver examples :
They cannot be found with the above correction.

I can query them in my initial file :

$ zgrep 'XP_013748334.1' ~/Desktop/RTDG/BISCEm/Data/plant_refseq.faa.gz;
>XP_013748334.1 PREDICTED: uncharacterized protein LOC106451027 isoform X1 [Brassica napus]

If you run an ncbi search with the term "XP_013748334", you find nothing.
If you try with "LOC106451027", you get : https://www.ncbi.nlm.nih.gov/protein/92362375.
On this page we find :
Accession : XP_013748335
REFSEQ: accession XM_013892881.2

Let's try to query with this newly found accession :

$ zgrep 'XP_013748335' prot.accession2taxid.gz 
XP_013748335    XP_013748335.1  3708    923623754

$ esearch -db protein \
          -query XP_013748335 \
          | elink -target taxonomy \
          | efetch -format uid;
3708

But it seems pretty hard to traceback / implement from my initial blastx result.

I did read ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/README, but I still do not figure how prot.accession2taxid was constructed : Is it logical that protein with "uncharacterized", "probable", etc... status are not included (which is the case in my example) ?

To rephrase, it seems that entries in prot.accession2taxid and plant_refseq.faa.gz have not been updated at the same time (things in plant_refseq.faa.gz being older it seems). Is there a way to be on the same page for both?

ADD REPLY
1
Entering edit mode

It looks like that record has been removed from https://www.ncbi.nlm.nih.gov/protein/XP_013748334.1?report=genpept. I would download dead_prot.accession2taxid files and combine it with the ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz to ensure that you capture all accession numbers.

ADD REPLY
0
Entering edit mode

Ok, it seems to be a clever suggestion once again. Thanks ! (I lack a bit of knowledge on how ncbi data are organized / how to query them efficiently)

ADD REPLY
0
Entering edit mode

and if you think you have pdb accession numbers in the customised dataset then download the ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz and combine them with the other files.

ADD REPLY
1
Entering edit mode
7.1 years ago

Here is a javascript code using CORS / XML / XPATH / NCBI-ESummary . Tested with firefox.

ADD COMMENT
0
Entering edit mode
7.1 years ago
erwan.scaon ▴ 950

This may be a starting point for you : Using efetch + xtract (I guess you can call it inside your javascript code ?)

An example with this Pinus taeda nucleotide sequence : https://www.ncbi.nlm.nih.gov/nuccore/FJ070571

an='FJ070571';
efetch -db nuccore \
       -id $an \
       -format xml \
       | xtract -pattern Bioseq-set_seq-set \
                -division Seq-entry \
                -group Bioseq \
                -if Textseq-id_accession -equals $an \
                -element Textseq-id_accession \
                -branch Bioseq_descr \
                -block Org-ref_db \
                -element Object-id_id;

Output = 3352, which is the taxonomy ID for Pinus taeda (https://www.ncbi.nlm.nih.gov/taxonomy/?term=3352)

Ps : This command was not thoroughly tested, thus the efetch parsing might need some polishing to be successful for all possibles cases.

ADD COMMENT

Login before adding your answer.

Traffic: 1503 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6