Question

From scientific name to taxonomy information entrez

0

2.7 years ago

LaFra ▴ 10

Hi all,

I have a txt file with a list of scientific names of plants, and I would like to obtain a final file with taxonomy information. For example, if one of my organisms is Acalypha hispida, I would like to obtain this output:

Order: Malpighiales; Family: Euphorbiaceae; Genus: Acalypha; Species: A. hispida

I have tried several approaches and I know how to do it for a single organism, but I don't know how to do it in a loop over the names in a txt file.

One of my attempts is:

while read line; do
esearch -db protein -query "$line[orgn]"| elink -target taxonomy |efetch -format xml >> prova.xml |xtract -element Lineage
done < org.txt

But I get an error...

Any ideas?

Thanks,

Entrez scientific taxonomy name • 1.9k views

ADD COMMENT • link updated 2.6 years ago by shenwei356 8.7k • written 2.7 years ago by LaFra ▴ 10

0

efetch -format xml >> prova.xml |xtract -element Lineage

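You cannot redirect to a file and pipe into a command at the same time: the >> prova.xml redirect sends all of efetch's output to the file, so xtract receives empty input.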

ADD REPLY • link 2.7 years ago by Pierre Lindenbaum 164k
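To illustrate the point above, here is a minimal bash sketch. The function fake_efetch is a hypothetical stand-in for the real esearch | elink | efetch pipeline (so the example runs without network access), and using tee to keep a copy of the XML is one possible fix, not necessarily what the original poster intended.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for efetch: prints one line of XML-like text.
fake_efetch() { printf '<Lineage>cellular organisms; Eukaryota</Lineage>\n'; }

out=$(mktemp)

# Redirect plus pipe: the redirect takes the output, the pipe gets nothing.
piped_bytes=$(fake_efetch >> "$out" | wc -c)

# tee -a appends a copy to the file AND passes the data down the pipe.
teed_lines=$(fake_efetch | tee -a "$out" | wc -l)

printf 'bytes seen after ">> file |": %d\n' "$piped_bytes"
printf 'lines seen after "| tee -a file |": %d\n' "$teed_lines"
```

In the real loop, the same pattern would be efetch -format xml | tee -a prova.xml | xtract ..., or simply drop the redirect if the intermediate XML is not needed.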

0

Thanks a lot! So then if org.txt is my input file, the final code should be:

while read line; do efetch -format xml >> prova.xml |xtract -element Lineage done < org.txt

Is it correct?

ADD REPLY • link 2.7 years ago by LaFra ▴ 10

0

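You can either use a for loop, as in this example (Retrieving gene ID using transcripts ID from Entrez database using CLI or Batch Entrez), or redirect stdin from /dev/null (< /dev/null) for each esearch query in your while loop, as shown here: NCBI E-eutilitis not working properly inside a while loop. The redirect is needed because esearch reads from stdin and would otherwise consume the remaining lines of the file the loop is reading.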

ADD REPLY • link 2.7 years ago by GenoMax 147k
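A small bash sketch of why the < /dev/null redirect matters. Here fake_esearch is a hypothetical stand-in that, like the real tool, reads its stdin; the commented pipeline at the end is an untested assumption about how the real commands could be combined (it needs EDirect and network access).

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for esearch: drains stdin, then reports its query.
fake_esearch() { cat > /dev/null; printf 'queried: %s\n' "$1"; }

names=$(mktemp)
printf 'Acalypha hispida\nAkkermansia muciniphila\n' > "$names"

# Without the redirect, the first call swallows the rest of the file,
# so the loop body runs only once.
broken=$(while IFS= read -r line; do
             fake_esearch "$line"
         done < "$names" | wc -l)

# With < /dev/null each call sees empty stdin and the loop reads every line.
fixed=$(while IFS= read -r line; do
            fake_esearch "$line" < /dev/null
        done < "$names" | wc -l)

printf 'queries without redirect: %d\n' "$broken"
printf 'queries with redirect:    %d\n' "$fixed"

# Assumed shape of the real loop body (not run here):
#   esearch -db taxonomy -query "$line" < /dev/null \
#       | efetch -format xml \
#       | xtract -pattern Taxon -element ScientificName Lineage
```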

score 1 · Answer 1 · 2022-03-09

1

2.7 years ago

shenwei356 8.7k

You can also use taxonkit name2taxid and reformat, easy and fast.

$ head names.txt
Acalypha hispida 
Akkermansia muciniphila

$ taxonkit name2taxid names.txt \
    | taxonkit reformat -I 2 -f 'Kingdom: {k}; Phylum: {p}; Class: {c}; Order: {o}; Family: {f}; Genus: {g}; Species: {s}'
Acalypha hispida        197604  Kingdom: Eukaryota; Phylum: Streptophyta; Class: Magnoliopsida; Order: Malpighiales; Family: Euphorbiaceae; Genus: Acalypha; Species: Acalypha hispida
Akkermansia muciniphila 239935  Kingdom: Bacteria; Phylum: Verrucomicrobia; Class: Verrucomicrobiae; Order: Verrucomicrobiales; Family: Akkermansiaceae; Genus: Akkermansia; Species: Akkermansia muciniphila

ADD COMMENT • link 2.7 years ago by shenwei356 8.7k
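To get exactly the format asked for in the question, the simplest route is probably to trim the -f string to 'Order: {o}; Family: {f}; Genus: {g}; Species: {s}'. If the abbreviated species name (A. hispida) is also wanted, a post-processing step is one option. The sketch below is an assumption about how that could be done with awk; it runs on one output line copied from the answer above, so it does not need taxonkit installed.

```shell
#!/usr/bin/env bash
# One line of taxonkit output copied from the answer (name, taxid, lineage; tab-separated).
sample=$(printf 'Acalypha hispida\t197604\tKingdom: Eukaryota; Phylum: Streptophyta; Class: Magnoliopsida; Order: Malpighiales; Family: Euphorbiaceae; Genus: Acalypha; Species: Acalypha hispida')

result=$(printf '%s\n' "$sample" | awk -F'\t' '{
    lineage = $3
    sub(/^.*Order: /, "Order: ", lineage)        # drop Kingdom, Phylum and Class
    split($1, words, " ")                        # words[1] = genus, words[2] = epithet
    abbrev = substr(words[1], 1, 1) ". " words[2]
    sub(/Species: .*$/, "Species: " abbrev, lineage)
    print lineage
}')

printf '%s\n' "$result"
```

The abbreviation step assumes a two-word name in the first column; names with subspecies or authorities would need extra handling.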

0

This does not work, I got this error:

[ERRO] invalid TaxId: alba

My names.txt has the same format as yours.

ADD REPLY • link 2.7 years ago by LaFra ▴ 10

0

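I guess your names.txt contains tabs between the genus and the species. If so, taxonkit treats only the first column as the name, and reformat then reads the second word as a TaxId. You may need to replace the whitespace with a single space: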

sed -E 's/\s+/ /' names.txt > names2.txt

ADD REPLY • link 2.7 years ago by shenwei356 8.7k
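A quick way to check for tabs and clean them up, sketched in bash. The file contents are made up for illustration ("Morus alba" is only a guess at a name that could produce the "alba" error), and tr is used here as a portable alternative to the sed command above.

```shell
#!/usr/bin/env bash
names=$(mktemp)
# Hypothetical input: first name is tab-separated, second is space-separated.
printf 'Morus\talba\nAcalypha hispida\n' > "$names"

# Count the lines that contain a literal tab character.
tab_lines=$(grep -c "$(printf '\t')" "$names")

# Replace every tab with a space.
cleaned=$(tr '\t' ' ' < "$names")

printf 'lines with tabs: %d\n' "$tab_lines"
printf '%s\n' "$cleaned"
```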

score 0 · Answer 2 · 2022-03-09

Here is a script that does what you want and more, and demonstrates how to process lineage information using Bio::Perl. You need to install Perl and Bio::Perl. You can download the NCBI taxonomy dump files to speed it up. When you give it more than one taxon on the command line, it also computes the Last Common Ancestor (LCA) of all of them.

Usage:

./getLCA.pl [-d directory] [-f taxon-file] [-g gi-file] [-tGR] <taxonlist>

Examples:

./getLCA.pl Salmo_salar Lepeophtheirus_salmonis Homo
./getLCA.pl -f taxonlist.txt

Output:

nodesfile or namesfile not found, using entrez online data

######################### Salmo_salar ##############################
ID: 8030 Sci.name: Salmo salar
 [Atlantic salmon authoritySalmo salar Linnaeus, 1758 ]
 Phylum: Chordata
 cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria:Deuterostomia:Chordata:Craniata:Vertebrata:Gnathostomata:Teleostomi:Euteleostomi:Actinopterygii:Actinopteri:Neopterygii:Teleostei:Osteoglossocephalai:Clupeocephala:Euteleosteomorpha:Protacanthopterygii:Salmoniformes:Salmonidae:Salmoninae:Salmo

######################### Lepeophtheirus_salmonis ##############################
ID: 72036 Sci.name: Lepeophtheirus salmonis
 [salmon louse authorityLepeophtheirus salmonis (Kroyer, 1837) misspellingLepeoptheirus salmonis ]
 Phylum: Arthropoda
 cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria:Protostomia:Ecdysozoa:Panarthropoda:Arthropoda:Mandibulata:Pancrustacea:Crustacea:Multicrustacea:Hexanauplia:Copepoda:Neocopepoda:Podoplea:Siphonostomatoida:Caligidae:Lepeophtheirus

######################### Homo ##############################
ID: 9605 Sci.name: Homo
 [humans authorityHomo Linnaeus, 1758 ]
 Phylum: Chordata
 cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria:Deuterostomia:Chordata:Craniata:Vertebrata:Gnathostomata:Teleostomi:Euteleostomi:Sarcopterygii:Dipnotetrapodomorpha:Tetrapoda:Amniota:Mammalia:Theria:Eutheria:Boreoeutheria:Euarchontoglires:Primates:Haplorrhini:Simiiformes:Catarrhini:Hominoidea:Hominidae:Homininae

############################ LCA #################################
LCA: 33213 Bilateria
Lineage of LCA: cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria
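The LCA step can be thought of as finding the longest shared prefix of the lineages. The sketch below shows that idea in bash and awk; it is not the code of getLCA.pl, only an illustration. The three lineages are copied from the output above and truncated for brevity, and comparing rank names as plain strings is a simplifying assumption.

```shell
#!/usr/bin/env bash
# Lineages for Salmo salar, Lepeophtheirus salmonis and Homo (truncated).
lineages=$(printf '%s\n' \
  'cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria:Deuterostomia:Chordata' \
  'cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria:Protostomia:Ecdysozoa' \
  'cellular organisms:Eukaryota:Opisthokonta:Metazoa:Eumetazoa:Bilateria:Deuterostomia:Chordata:Craniata')

lca_lineage=$(printf '%s\n' "$lineages" | awk -F: '
    # The first lineage is the initial candidate for the shared prefix.
    NR == 1 { depth = NF; for (i = 1; i <= NF; i++) common[i] = $i; next }
    {
        # Shrink the shared prefix to the first rank where this lineage differs.
        for (i = 1; i <= depth && i <= NF && $i == common[i]; i++) ;
        depth = i - 1
    }
    END {
        out = ""
        for (i = 1; i <= depth; i++) out = out (i > 1 ? ":" : "") common[i]
        print out
    }')

printf 'Lineage of LCA: %s\n' "$lca_lineage"
```

The last element of the shared prefix is the LCA itself, which here is Bilateria, in agreement with the script output.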

Options:

-d: directory containing nodes.dmp and names.dmp files from the NCBI taxonomy, otherwise current directory
-t: convert taxon names to numeric taxids, print one per line
-f: [file] path to text file containing taxa, one taxon per line, scientific name or numeric tax-id
-g: [file] path to gi taxid mapping file for blast
-G: generate a gi list of all gis provided in -g matching taxonlist
-R: requires -G, generate a gi list including all subtaxa, too