Question

From accession ID to phylum taxid

0

Entering edit mode

4.1 years ago

doinelpierrot ▴ 50

Hello guys,

I have a list of 10 000 accessions id from blast in a txt file (XP_002184977.1 GBG35237.1) and I would like to have the taxonomy associated (in a tab file that correpond to my accessions line per line). So I'm using Entrez direct like this :

esearch -db protein -query "XP_002184977.1" \
  | elink -target taxonomy \
  | efetch -format native -mode xml \
  | xtract -pattern Taxon -block "*/Taxon" -unless Rank -equals "no rank" -tab "\n" -element Rank,TaxId,ScientificName

As a result I have different lines for each taxonomy level

superkingdom    2759    Eukaryota
clade   2698737 Sar
clade   33634   Stramenopiles
clade   2696291 Ochrophyta
phylum  2836    Bacillariophyta
class   33849   Bacillariophyceae
clade   33850   Bacillariophycidae
order   38748   Naviculales
family  38749   Phaeodactylaceae
genus   2849    Phaeodactylum
species 2850    Phaeodactylum tricornutum

However the output is not always consistant according to the hit accession (sometimes there is no line phylum which is the line that interest me). So if I had a "grep phylum" I can't concatenaate my hit accession with the results...

Any ideas on how to deal with that? It would be great if i could have the whole taxonomy with "N.A" in the column phylum if the information is not present in the database. I have tried some other tools like taxonomisr, unsuccessfully ...

RNA-Seq • 2.2k views

ADD COMMENT • link updated 4.1 years ago by shenwei356 8.7k • written 4.1 years ago by doinelpierrot ▴ 50

1

Entering edit mode

There is going to be no good solution for Entrezdirect. If that information is not there then it is not going to show up. I would have said get the tax dump file from NCBI but since EntrezDirect is looking at that same database the result will not be different.

ADD REPLY • link 4.1 years ago by GenoMax 147k

0

Entering edit mode

Thanks for your advice, you are right. Then the best for me would be to have the full information like for each lvl, the result (for instance "human" fo species or "Non assigned" if not found) but I don't know how to do that.

Or a less ambitious way would be to get the phylum with "grep phylum", and if it does not find anything, add a "0" for instance so it stills add a line in my file (how to do that though..), so I can match my results with my accession file.

By the way is there a way to do this search with dowloading the database to gain some time ? Because 10000 accessions seem too much for edirect and I have to split my files.

ADD REPLY • link 4.1 years ago by doinelpierrot ▴ 50

1

Entering edit mode

The fastest way would be to download the base files from NCBI and parse the information you need from them. This is the approach taken by several tools that need to gather taxonomic information from NCBI accessions, such as BlobTools or KronaTools.

ADD REPLY • link 4.1 years ago by h.mon 35k

0

Entering edit mode

Ok thanks ! I'm open to new tools too. i have checked Blobtools doc (https://blobtools.readme.io/docs/taxify). But I haven't use it before and I don't really get exactly how to use my data with it

ADD REPLY • link 4.1 years ago by doinelpierrot ▴ 50

0

Entering edit mode

You can find accession2taxid files in this folder at NCBI's FTP site. You could get that file and then parse locally. I don't recall if that file will give you all the intermediate levels.

ADD REPLY • link 4.1 years ago by GenoMax 147k

score 4 · Answer 1 · 2020-10-20

Some species indeed have no phylum taxon level, so you can not retrieve the phylum taxid. e.g.,

echo 314101 \
    | taxonkit lineage \
    | taxonkit reformat \
    | csvtk -H -t cut -f 1,3 \
    | csvtk -H -t sep -f 2 -s ';' -R \
    | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \
    | csvtk pretty -t 
taxid    kindom     phylum   class   order   family   genus   species
314101   Bacteria                                             uncultured murine large bowel bacterium BAC 54B

But you can assign it a pseudo taxon name in downstream analysis.

Here's a solution with my taxonkit (v0.6.1rc1 or later versions). prot.accession2taxid.gz is needed. csvtk and taxonkit are available on bioconda.

cat acc.txt
XP_002184977.1
AAX16399.1

# 1) get taxid for all accession
# 7min
time pigz -dc prot.accession2taxid.gz \
    | csvtk grep -t -f accession.version -P acc.txt \
    | csvtk cut -t -f accession.version,taxid \
    | sed 1d \
    > acc2taxid.txt

cat acc2taxid.txt
XP_002184977.1  556484
AAX16399.1  314101


# 2) get taxonomy
# 6s
time cat acc2taxid.txt \
    | taxonkit lineage -i 2 \
    | taxonkit reformat -f '{p}' -t --fill-miss-rank -i 3 \
    | csvtk cut -t -f 1,2,4,5 \
    | csvtk add-header -t -n accession,taxid,phylum,phylum-taxid \
    | csvtk pretty -t 
accession        taxid    phylum                         phylum-taxid
XP_002184977.1   556484   Bacillariophyta                2836
AAX16399.1       314101   unclassified Bacteria phylum

There's a bug here, it supposed to be unclassified Bacteria phylum (update: fixed, use v0.6.1 or later versions).

--fill-miss-rank in taxonkit reformat is for fillling missing rank with higher taxon.