From scientific name to taxonomy information entrez
2.7 years ago
LaFra ▴ 10

Hi all,

I have a txt file with a list of scientific names of plants, and I would like to produce a file with the taxonomy information for each one. For example, if one of my organisms is Acalypha hispida, I would like to obtain this output:

Order: Malpighiales; Family: Euphorbiaceae; Genus: Acalypha; Species: A. hispida

I have tried several approaches and I know how to do it for a single organism, but I don't know how to loop over the names in a txt file.

One of these is:

while read line; do
esearch -db protein -query "$line[orgn]"| elink -target taxonomy |efetch -format xml >> prova.xml |xtract -element Lineage
done < org.txt

But I get an error...

Any ideas?

Thanks,

efetch -format xml >> prova.xml |xtract -element Lineage

You cannot redirect output to a file and pipe it into a command at the same time: with >> prova.xml the XML goes to the file, and xtract receives nothing.
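If you want both the saved XML and the extracted field, one option is tee, which writes a copy of its input to a file and also passes it down the pipe. A minimal sketch, where printf and sed are stand-ins (my assumption, so it runs without network access) for efetch -format xml and xtract:

```shell
# tee -a appends a copy of its stdin to the file AND forwards it to the next command.
# printf stands in for `efetch -format xml`; sed stands in for `xtract -element Lineage`.
xml_copy=$(mktemp)   # hypothetical stand-in for prova.xml
printf '<Lineage>Eukaryota; Viridiplantae</Lineage>\n' \
    | tee -a "$xml_copy" \
    | sed -E 's/<[^>]+>//g'
```

In the real pipeline the middle of the command would read: efetch -format xml | tee -a prova.xml | xtract ...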


Thanks a lot! So then if org.txt is my input file, the final code should be:

while read line; do efetch -format xml >> prova.xml |xtract -element Lineage done < org.txt

Is it correct?


You can either use a for loop, as in this example (Retrieving gene ID using transcripts ID from Entrez database using CLI or Batch Entrez), or add < /dev/null to each esearch call in your while loop, as shown here: NCBI E-eutilitis not working properly inside a while loop. Without the redirect, esearch reads from the loop's standard input and consumes the remaining lines of org.txt, so only the first name is processed.
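A while loop with the < /dev/null fix could look roughly like this. It is a sketch, not a tested recipe: it assumes EDirect (esearch, efetch, xtract) is on your PATH, and it queries the taxonomy database directly instead of going through protein and elink as in the question. The function name lineages_from_file is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print "name<TAB>lineage" for every scientific name in a file.
# Assumes EDirect (esearch, efetch, xtract) is installed and on PATH.
lineages_from_file() {
    local names_file=$1 name lineage
    # `|| [ -n "$name" ]` keeps a last line that has no trailing newline
    while IFS= read -r name || [ -n "$name" ]; do
        [ -z "$name" ] && continue    # skip blank lines
        # < /dev/null stops esearch from swallowing the rest of the name list
        lineage=$(esearch -db taxonomy -query "$name" < /dev/null \
            | efetch -format xml \
            | xtract -pattern Taxon -element Lineage)
        printf '%s\t%s\n' "$name" "$lineage"
    done < "$names_file"
}

# Usage (org.txt holds one name per line):
# lineages_from_file org.txt > lineages.tsv
```

The lineage is captured in a variable first and written out afterwards, so the redirect-versus-pipe problem from the original command does not come up.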

2.7 years ago

You can also use taxonkit name2taxid and reformat, easy and fast.

$ head names.txt
Acalypha hispida 
Akkermansia muciniphila

$ taxonkit name2taxid names.txt \
    | taxonkit reformat -I 2 -f 'Kingdom: {k}; Phylum: {p}; Class: {c}; Order: {o}; Family: {f}; Genus: {g}; Species: {s}'
Acalypha hispida        197604  Kingdom: Eukaryota; Phylum: Streptophyta; Class: Magnoliopsida; Order: Malpighiales; Family: Euphorbiaceae; Genus: Acalypha; Species: Acalypha hispida
Akkermansia muciniphila 239935  Kingdom: Bacteria; Phylum: Verrucomicrobia; Class: Verrucomicrobiae; Order: Verrucomicrobiales; Family: Akkermansiaceae; Genus: Akkermansia; Species: Akkermansia muciniphila

This does not work. I got this error:

[ERRO] invalid TaxId: alba

My names.txt has the same format as yours.


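I guess your names.txt contains tabs between the words of each name, so the second word (here "alba") is probably being read as the TaxId column. You may need to collapse all whitespace to single spaces and strip trailing spaces:

sed -E 's/[[:space:]]+/ /g; s/ $//' names.txt > names2.txt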
2.7 years ago
Michael 55k

Here is a script that does what you want and more, and demonstrates how to process lineage information using Bio::Perl. You need to install Perl and Bio::Perl. Downloading the NCBI taxonomy dump files speeds it up. When you give it more than one taxon on the command line, it also computes the Last Common Ancestor (LCA) of all of them.

Usage:

./getLCA.pl [-d directory] [-f taxon-file] [-g gi-file] [-tGR] <taxonlist>

Examples:

./getLCA.pl Salmo_salar Lepeophtheirus_salmonis Homo
./getLCA.pl -f taxonlist.txt

Output:

nodesfile or namesfile not found, using entrez online data

######################### Salmo_salar ##############################
ID: 8030 Sci.name: Salmo salar
 [Atlantic salmon authoritySalmo salar Linnaeus, 1758 ]
 Phylum: Chordata
 cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria:Deuterostomia:Chordata:Craniata:Vertebrata:Gnathostomata:Teleostomi:Euteleostomi:Actinopterygii:Actinopteri:Neopterygii:Teleostei:Osteoglossocephalai:Clupeocephala:Euteleosteomorpha:Protacanthopterygii:Salmoniformes:Salmonidae:Salmoninae:Salmo

######################### Lepeophtheirus_salmonis ##############################
ID: 72036 Sci.name: Lepeophtheirus salmonis
 [salmon louse authorityLepeophtheirus salmonis (Kroyer, 1837) misspellingLepeoptheirus salmonis ]
 Phylum: Arthropoda
 cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria:Protostomia:Ecdysozoa:Panarthropoda:Arthropoda:Mandibulata:Pancrustacea:Crustacea:Multicrustacea:Hexanauplia:Copepoda:Neocopepoda:Podoplea:Siphonostomatoida:Caligidae:Lepeophtheirus

######################### Homo ##############################
ID: 9605 Sci.name: Homo
 [humans authorityHomo Linnaeus, 1758 ]
 Phylum: Chordata
 cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria:Deuterostomia:Chordata:Craniata:Vertebrata:Gnathostomata:Teleostomi:Euteleostomi:Sarcopterygii:Dipnotetrapodomorpha:Tetrapoda:Amniota:Mammalia:Theria:Eutheria:Boreoeutheria:Euarchontoglires:Primates:Haplorrhini:Simiiformes:Catarrhini:Hominoidea:Hominidae:Homininae

############################ LCA #################################
LCA: 33213 Bilateria
Lineage of LCA: cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria

Options:

-d: directory containing nodes.dmp and names.dmp files from the NCBI taxonomy, otherwise current directory
-t: convert taxon names to numeric taxids, print one per line
-f: [file] path to text file containing taxa, one taxon per line, scientific name or numeric tax-id
-g: [file] path to gi taxid mapping file for blast
-G: generate a gi list of all gis provided in -g matching taxonlist
-R: requires -G, generate a gi list including all subtaxa, too


This seems a bit complicated, and I still don't understand how to give it a list of organism names as input, such as a txt file.

Thank you!


This is just meant as a starting point. You can modify the code to read taxa from a file.


Check out the script. I have added an option -f file to read taxa from a text file:


taxlist.txt:

Lepeophtheirus salmonis
Salmo salar
Homo