I have a txt file with a list of scientific names of plants and I would like to obtain a final file with taxonomy information. For example, if one of my organisms is Acalypha hispida, I would like to obtain this output:
Order: Malpighiales; Family: Euphorbiaceae; Genus: Acalypha; Species: A. hispida
I have tried several commands and I know how to do it for a single organism, but I don't know how to run it in a loop over the names in a txt file.
One of these is:
while read line; do
esearch -db protein -query "$line[orgn]"| elink -target taxonomy |efetch -format xml >> prova.xml |xtract -element Lineage
done < org.txt
Here is a script that does what you want and more, and demonstrates how to process lineage information using Bio::Perl.
You need to install Perl and Bio::Perl. You can download the NCBI taxonomy dump files to speed it up. When you give it more than one taxon on the command line, it also computes the Last Common Ancestor of all of them.
-d: directory containing nodes.dmp and names.dmp files from the NCBI taxonomy, otherwise current directory
-t: convert taxon names to numeric taxids, print one per line
-f: [file] path to text file containing taxa, one taxon per line, scientific name or numeric tax-id
-g: [file] path to gi taxid mapping file for blast
-G: generate a gi list of all gis provided in -g matching taxonlist
-R: requires -G, generate a gi list including all subtaxa, too
You cannot redirect to a file and pipe into a command at the same time: once stdout has been sent to prova.xml with >>, nothing is left in the pipe for xtract to read.
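One common way to get both a saved copy and a working pipe is tee. The snippet below is only a minimal sketch of that idea: fake_fetch is a hypothetical stand-in for efetch and sed is a stand-in for xtract, so it runs without EDirect installed.

```shell
# Hypothetical stand-in for "efetch -format xml": prints one line of fake XML.
fake_fetch() {
  printf '<Lineage>Eukaryota; Viridiplantae</Lineage>\n'
}

# Temporary file playing the role of prova.xml.
out_file=$(mktemp)

# tee -a appends a copy of the stream to the file AND passes the same
# data down the pipe, so the next command still has something to read.
# sed (stand-in for xtract) strips the XML tags.
piped=$(fake_fetch | tee -a "$out_file" | sed -e 's/<[^>]*>//g')

echo "from the pipe: $piped"
echo "from the file: $(cat "$out_file")"
```

In the loop from the question, the same pattern would mean replacing ">> prova.xml" with "tee -a prova.xml" in the pipeline.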
Thanks a lot! So then if org.txt is my input file, the final code should be:
while read line; do efetch -format xml >> prova.xml |xtract -element Lineage done < org.txt
Is it correct?
You can either use a for loop like this example (Retrieving gene ID using transcripts ID from Entrez database using CLI or Batch Entrez), or you need to use < /dev/null for each esearch query in your while loop, as shown here: NCBI E-eutilitis not working properly inside a while loop
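To see why the < /dev/null redirect matters, here is a small sketch that reproduces the effect without calling NCBI. The function greedy is a hypothetical stand-in for esearch (any command that reads standard input behaves the same way), and the temporary file plays the role of org.txt.

```shell
# Hypothetical input: one taxon per line, like org.txt in the question.
org_file=$(mktemp)
printf 'Acalypha hispida\nZea mays\nOryza sativa\n' > "$org_file"

# Stand-in for esearch: a command that reads everything on standard input.
greedy() { cat > /dev/null; }

# Without the redirect, greedy inherits the loop's stdin and swallows
# the remaining lines of the file, so the loop stops after one pass.
broken=0
while read -r line; do
  greedy
  broken=$((broken + 1))
done < "$org_file"

# With < /dev/null, greedy gets an empty stdin and the loop sees every line.
fixed=0
while read -r line; do
  greedy < /dev/null
  fixed=$((fixed + 1))
done < "$org_file"

echo "without redirect: $broken iteration(s); with redirect: $fixed"
```

Applied to the loop in the question, the redirect would go directly on the esearch call, for example esearch -db protein -query "$line[orgn]" < /dev/null, with the rest of the pipeline following as before.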