Hello Biostars community,
I'm currently using Biopython to download sequences from NCBI with the Entrez.efetch function, and I'm looking for guidance on how to include the NCBI taxonomic ID (txid) in each downloaded sequence's header. Below is a simplified version of the code I am using:
from Bio import Entrez, SeqIO
from tqdm import tqdm
import urllib.error

# Configure Entrez email (NCBI asks for a contact address)
Entrez.email = 'my_email@example.com'

# Input and output files
seq_ids_file = "sequence_ids.txt"  # Input file with one sequence ID per line
seq_file = "sequences.fasta"  # Output FASTA file
records = []

# Read IDs from the input file, skipping blank lines
with open(seq_ids_file) as f:
    lines = [line for line in f if line.strip()]

# Download sequences with a progress bar
for line in tqdm(lines, desc="Downloading sequences"):
    seq_id = line.strip()
    try:
        # Fetch the GenBank flat file as plain text
        handle = Entrez.efetch(db="sequences", id=seq_id, rettype="gb", retmode="text")
        seq_record = SeqIO.read(handle, "gb")
        handle.close()
        # Add logic here to include the taxonomic ID in the sequence header
        records.append(seq_record)
    except (urllib.error.HTTPError, ValueError) as err:
        print(f"Error downloading sequence {seq_id}: {err}")
        continue

# Write the downloaded sequences to a FASTA file
if records:
    SeqIO.write(records, seq_file, "fasta")
    print(f"{len(records)} sequences written to {seq_file}.")
else:
    print("No sequences downloaded.")
Currently, the header format is "ID name [organism]", but I need to modify it to include the NCBI txid. How can I extract the taxonomic ID from the sequence data or a related query and include it in the header (e.g., "ID name [organism] [txid]")?
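One direction that might work, offered as a minimal sketch and not a tested solution: GenBank records usually carry the taxonomy ID in the "source" feature as a db_xref qualifier of the form "taxon:9606". Assuming that qualifier is present, it could be read from the parsed record and appended to the description before writing the FASTA file. The helper names extract_taxid and add_taxid_to_description, and the "[txid9606]" suffix format, are hypothetical choices for illustration. The demo uses a stand-in object instead of a real download, so it only assumes the record exposes .features and .description the way a Biopython SeqRecord does.

```python
from types import SimpleNamespace


def extract_taxid(record):
    """Return the taxonomy ID found in the record's source feature, or None.

    Assumes the record has a .features list whose items expose .type and a
    .qualifiers mapping of lists, as a parsed GenBank SeqRecord does.
    """
    for feature in record.features:
        if feature.type != "source":
            continue
        for xref in feature.qualifiers.get("db_xref", []):
            if xref.startswith("taxon:"):
                return xref.split(":", 1)[1]
    return None


def add_taxid_to_description(record):
    """Append "[txid<ID>]" to the description when a taxonomy ID is found."""
    taxid = extract_taxid(record)
    if taxid is not None:
        record.description = f"{record.description} [txid{taxid}]"
    return record


# Stand-in for a parsed GenBank record (hypothetical data, no network access)
source = SimpleNamespace(type="source", qualifiers={"db_xref": ["taxon:9606"]})
demo = SimpleNamespace(description="Example sequence [Homo sapiens]", features=[source])
add_taxid_to_description(demo)
print(demo.description)
```

In the loop above, add_taxid_to_description(seq_record) would go at the placeholder comment, just before records.append(seq_record). Records without a taxon cross-reference are left unchanged.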
Any tips, code snippets, or guidance would be greatly appreciated. Thanks in advance!