I am trying to run a metagenome analysis for plant species. Since QIIME 2 uses the SILVA database, and that database is mainly used for bacteria, I customized all of my code. Right now I have approximately 11k rows of taxon IDs that I got from the NCBI database, but I am having trouble matching those taxon IDs to taxonomy names. I need to match the taxonomy, filter for the plant species, and plot a pie chart of those plant species. I was told that NCBI does not have an API for retrieving taxonomy names.
How can I solve my problem? My code is below:
import pandas as pd
from Bio import Entrez, SeqIO

# Set the email address that NCBI asks for with Entrez requests
Entrez.email = "email_address"

def get_taxonomic_info(accession_number):
    """
    Queries the NCBI database for the taxon ID of a given accession number.

    Parameters:
    - accession_number (str): The NCBI accession number.

    Returns:
    - str: The NCBI taxon ID as a string (empty if none is found).
    """
    handle = Entrez.efetch(db="nuccore", id=accession_number, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    # Extract the taxon ID from the "source" feature's db_xref qualifier (e.g. "taxon:3702")
    taxon_id = ""
    for feature in record.features:
        if feature.type == "source":
            for xref in feature.qualifiers.get("db_xref", []):
                if xref.startswith("taxon:"):
                    taxon_id = xref.split(":")[1]
                    break
            break
    return taxon_id

def main():
    # Load the Excel file
    df = pd.read_excel(r"file_path")
    # Extract the accession numbers (assumed to be in the second column)
    accession_numbers = df.iloc[:, 1].tolist()
    # Write one tab-separated line per accession number to the output file
    with open(r"output_path", "w") as outfile:
        for accession_number in accession_numbers:
            taxonomic_info = get_taxonomic_info(accession_number)
            outfile.write(f"{accession_number}\t{taxonomic_info}\n")

if __name__ == "__main__":
    main()
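
For the matching and filtering step described above, here is a minimal sketch of one possible direction, not a definitive solution. It assumes that Biopython's Entrez.efetch can be pointed at db="taxonomy" and that Entrez.read returns one dictionary per taxon with "TaxId", "ScientificName" and "Lineage" keys. The helper names (fetch_taxonomy_records, is_plant, plant_names_by_taxid, count_plants), the batch size of 200, and the example records are hypothetical and only for illustration. The network call is defined but not run here.

```python
from collections import Counter


def fetch_taxonomy_records(taxon_ids, batch_size=200):
    """Fetch taxonomy records from NCBI in batches (needs network access)."""
    # Imported here so the helpers below can be tried without Biopython installed
    from Bio import Entrez

    Entrez.email = "email_address"  # placeholder, as in the code above
    records = []
    for start in range(0, len(taxon_ids), batch_size):
        batch = [str(t) for t in taxon_ids[start:start + batch_size]]
        handle = Entrez.efetch(db="taxonomy", id=",".join(batch), retmode="xml")
        records.extend(Entrez.read(handle))
        handle.close()
    return records


def is_plant(record):
    """True if the record's lineage passes through Viridiplantae (green plants)."""
    lineage = [name.strip() for name in str(record["Lineage"]).split(";")]
    return "Viridiplantae" in lineage


def plant_names_by_taxid(records):
    """Map taxon ID -> scientific name, keeping only plant records."""
    return {str(r["TaxId"]): str(r["ScientificName"]) for r in records if is_plant(r)}


def count_plants(taxon_ids, names_by_taxid):
    """Count occurrences of each plant name over the original list of taxon IDs."""
    return Counter(
        names_by_taxid[str(t)] for t in taxon_ids if str(t) in names_by_taxid
    )


# Made-up records, shaped like the dictionaries assumed above
example_records = [
    {"TaxId": "3702", "ScientificName": "Arabidopsis thaliana",
     "Lineage": "cellular organisms; Eukaryota; Viridiplantae; Streptophyta"},
    {"TaxId": "562", "ScientificName": "Escherichia coli",
     "Lineage": "cellular organisms; Bacteria; Pseudomonadota"},
]
example_taxon_ids = ["3702", "562", "3702"]

plant_counts = count_plants(example_taxon_ids, plant_names_by_taxid(example_records))
```

With real data, the records would come from fetch_taxonomy_records(taxon_ids) instead of the made-up list. The labels and values in plant_counts could then be passed to a plotting call such as matplotlib's pyplot.pie to draw the pie chart.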