Hello guys,
I have a list of 10,000 accession IDs from BLAST in a text file (e.g. XP_002184977.1, GBG35237.1) and I would like to get the associated taxonomy (in a tab-delimited file that corresponds to my accessions line by line). So I'm using Entrez Direct like this:
esearch -db protein -query "XP_002184977.1" \
| elink -target taxonomy \
| efetch -format native -mode xml \
| xtract -pattern Taxon -block "*/Taxon" -unless Rank -equals "no rank" -tab "\n" -element Rank,TaxId,ScientificName
As a result, I get a separate line for each taxonomic level:
superkingdom 2759 Eukaryota
clade 2698737 Sar
clade 33634 Stramenopiles
clade 2696291 Ochrophyta
phylum 2836 Bacillariophyta
class 33849 Bacillariophyceae
clade 33850 Bacillariophycidae
order 38748 Naviculales
family 38749 Phaeodactylaceae
genus 2849 Phaeodactylum
species 2850 Phaeodactylum tricornutum
However, the output is not always consistent across hit accessions (sometimes there is no phylum line, which is the one that interests me). So if I add a "grep phylum", I can't concatenate my hit accessions with the results...
Any ideas on how to deal with that? It would be great if I could have the whole taxonomy with "N.A" in the phylum column if the information is not present in the database. I have tried some other tools like taxonomizr, unsuccessfully...
There is going to be no good solution with Entrez Direct. If that information is not there, then it is not going to show up. I would have said to get the taxdump file from NCBI, but since Entrez Direct is looking at that same database, the result will not be different.
Thanks for your advice, you are right. Then the best for me would be to have the full information for each level (for instance "human" for species, or "Not assigned" if not found), but I don't know how to do that.
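For what it's worth, one way this could be sketched is to pivot the per-rank lines into a single row with a fixed set of columns, filling in "N.A" wherever a rank is missing. The function name `lineage_row` and the list of ranks are made up for illustration, and the sketch assumes the xtract output is tab-separated (rank, TaxId, name); adjust if yours differs.

```shell
#!/bin/sh
# Hypothetical helper (not part of EDirect): reads "rank<TAB>taxid<TAB>name"
# lines on stdin and prints ONE tab-separated row:
#   accession, then one column per wanted rank, "N.A" where the rank is absent.
# Assumption: the input is tab-separated, as produced by the xtract call above.
lineage_row() {
  awk -F'\t' -v acc="$1" '
    BEGIN {
      # Fixed column order; edit this list to taste.
      n = split("superkingdom phylum class order family genus species", want, " ")
    }
    { name[$1] = $3 }                      # remember the name seen for each rank
    END {
      row = acc
      for (i = 1; i <= n; i++)
        row = row "\t" ((want[i] in name) ? name[want[i]] : "N.A")
      print row                            # printed even if stdin was empty
    }'
}

# Possible usage (left commented out because it needs network access):
# while read -r acc; do
#   esearch -db protein -query "$acc" < /dev/null \
#     | elink -target taxonomy \
#     | efetch -format native -mode xml \
#     | xtract -pattern Taxon -block "*/Taxon" -unless Rank -equals "no rank" \
#         -tab "\n" -element Rank,TaxId,ScientificName \
#     | lineage_row "$acc"
# done < accessions.txt > taxonomy.tsv
```

Because the row is printed in the END block, every accession yields exactly one line, so the result stays aligned with the accession file. The `< /dev/null` on esearch is there on the assumption that it would otherwise read from the loop's stdin and swallow the remaining accessions.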
Or a less ambitious way would be to get the phylum with "grep phylum" and, if it does not find anything, add a "0" for instance, so that it still adds a line to my file (I don't know how to do that, though) and I can match my results with my accession file.
By the way, is there a way to do this search by downloading the database to save some time? 10,000 accessions seem to be too many for EDirect, and I have to split my files.
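A minimal sketch of that fallback idea: grep prints nothing when there is no match, which is why the line goes missing, so a small awk filter that always prints exactly one line can stand in for it. The name `phylum_or_default` is hypothetical, and the input is again assumed to be tab-separated rank, TaxId, name lines.

```shell
#!/bin/sh
# Hypothetical replacement for "grep phylum": prints the phylum name if a
# phylum line is present on stdin, otherwise prints a placeholder.
# $1 = placeholder to use when no phylum is found (defaults to "0").
# Assumption: stdin is tab-separated "rank<TAB>taxid<TAB>name".
phylum_or_default() {
  awk -F'\t' -v fallback="${1:-0}" '
    $1 == "phylum" && !found { print $3; found = 1 }   # first phylum line only
    END { if (!found) print fallback }                 # always emit one line
  '
}

# Possible usage inside a per-accession loop (commented out, needs network):
# printf '%s\t%s\n' "$acc" "$(... | xtract ... | phylum_or_default N.A)"
```

Since exactly one line comes out per accession, the results can be pasted side by side with the accession file without drifting out of step.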
The fastest way would be to download the base files from NCBI and parse the information you need from them. This is the approach taken by several tools that need to gather taxonomic information from NCBI accessions, such as BlobTools or KronaTools.
OK, thanks! I'm open to new tools too. I have checked the BlobTools docs (https://blobtools.readme.io/docs/taxify), but I haven't used it before and I don't really understand how to use it with my data.
You can find accession2taxid files in this folder at NCBI's FTP site. You could get that file and then parse it locally. I don't recall if that file will give you all the intermediate levels.
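As far as I know, the accession2taxid files only map each accession to a TaxId, so the intermediate levels have to come from the taxdump files (nodes.dmp and names.dmp). Here is a rough local-parsing sketch under those assumptions: the function name `acc_to_phylum` is made up, all four files are assumed to be already downloaded and uncompressed, and the accession list is assumed to hold one versioned accession per line.

```shell
#!/bin/sh
# Hypothetical local lookup: accession -> taxid -> walk up the tree to "phylum".
# $1 = accession list (one accession.version per line)
# $2 = accession2taxid file (columns: accession, accession.version, taxid, gi)
# $3 = nodes.dmp   (tax_id | parent tax_id | rank | ...)
# $4 = names.dmp   (tax_id | name_txt | unique name | name class |)
# The .dmp files use "<TAB>|<TAB>" separators, so with FS=TAB the useful
# values land in the odd-numbered fields.
acc_to_phylum() {
  awk -F'\t' '
    FNR == 1 { f++ }                                   # count which file we are in
    f == 1 { if ($1 != "") { order[++n] = $1; want[$1] = 1 }; next }
    f == 2 { if ($2 in want) taxid[$2] = $3; next }    # keep only wanted accessions
    f == 3 { parent[$1] = $3; rank[$1] = $5; next }
    f == 4 { if ($7 == "scientific name") sci[$1] = $3; next }
    END {
      for (i = 1; i <= n; i++) {
        acc = order[i]; t = taxid[acc]; phylum = "N.A"
        while (t != "" && (t in parent)) {
          if (rank[t] == "phylum") {
            if (t in sci) phylum = sci[t]
            break
          }
          if (parent[t] == t) break                    # reached the root
          t = parent[t]
        }
        print acc "\t" phylum                          # one line per accession, input order
      }
    }' "$1" "$2" "$3" "$4"
}

# Possible usage (file names are placeholders):
# acc_to_phylum accessions.txt prot.accession2taxid nodes.dmp names.dmp > phylum.tsv
```

The accession2taxid file is only streamed, so memory use should be driven mainly by nodes.dmp and names.dmp. Accessions that are missing from the mapping, or whose lineage has no phylum rank, come out as "N.A".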