Entering edit mode
11 months ago
Mani
•
0
Hi,
I have a file containing scientific names for a number of species (one per line) and want to extract major taxonomic information for these species ( class, order, family) using command line and write it in an output tab separated file. For example my input file is:
cat input.txt
Danio rerio
Carassius gibelio
Mugil cephalus
and I want the output to be :
cat output.txt
Danio rerio Actinopterygii Cypriniformes Cyprinidae
Carassius gibelio Actinopterygii Cypriniformes Cyprinidae
Mugil cephalus Actinopterygii Mugiliformes Mugilidae
I have a python script for this, but it does not extract the correct taxonomy for all queries. Here is my script:
cat myscript.py
import sys
from Bio import Entrez
def get_taxonomic_info(species_name):
try:
Entrez.email = "xxxx" # Add your email here
# Search for the species in the Taxonomy database
search_handle = Entrez.esearch(db="taxonomy", term=species_name, retmode="xml")
search_results = Entrez.read(search_handle)
# Check if any results are found
if search_results["Count"] == "0":
print(f"Error: {species_name} not found in the NCBI Taxonomy database")
return "N/A", "N/A", "N/A"
# Retrieve taxonomic information using the first result
tax_id = search_results["IdList"][0]
tax_handle = Entrez.efetch(db="taxonomy", id=tax_id, retmode="xml")
tax_record = Entrez.read(tax_handle)
# Extract class, order, and family from the taxonomic information
lineage = tax_record[0].get("LineageEx", [])
class_info = next((entry["ScientificName"] for entry in lineage if entry["Rank"] == "class"), "N/A")
order_info = next((entry["ScientificName"] for entry in lineage if entry["Rank"] == "order"), "N/A")
family_info = next((entry["ScientificName"] for entry in lineage if entry["Rank"] == "family"), "N/A")
return class_info, order_info, family_info
except Exception as e:
print(f"Error: An unexpected error occurred for {species_name}")
print(f"Details: {e}")
return "N/A", "N/A", "N/A"
def process_input(input_file, output_file):
with open(input_file, "r") as infile, open(output_file, "w") as outfile:
for line in infile:
species_name = line.strip()
class_info, order_info, family_info = get_taxonomic_info(species_name)
outfile.write(f"{species_name}\t{class_info}\t{order_info}\t{family_info}\n")
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: python3 add_taxonomic_info.py input.txt output.txt")
sys.exit(1)
input_file = sys.argv[1]
output_file = sys.argv[2]
process_input(input_file, output_file)
when I run it:
python3 input.txt output.txt
and here is the output:
cat output.txt
Danio rerio Actinopteri Cypriniformes Danionidae
Carassius gibelio Actinopteri Cypriniformes Cyprinidae
Mugil cephalus Actinopteri Mugiliformes Mugilidae
Any help would be appreciated!
Cheers,
Mani
Posting as an additional command line option. Using Entrezdirect one can get:
That may be par for course. Not every entrez query returns complete information from taxonomy database. You are going to get information where it exists in the database.
BTW the output you are showing above seems to match what you want.
As noted by GenoMax , the outcome depends on the available information in the database, so some queries may yield missing data. Having said this, I have a small piece of code in
R
that can work here.For your example :
Output :