Question

How to query NCBI to extract Virus fasta files using BioPython?

0

Entering edit mode

12 months ago

Bioinformatics_begginner ▴ 20

Hi !

I want to extract the genome fasta files of 30 samples automatically using python script from here https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=10239&host=bacteria.

I want the virusus that have has host bacteria and I am using BioPython Package.

Entrez.email = "mail"
taxid= 10239
host= "bacteria"
num_records = 20
query = f'taxid:{taxid}[Organism] AND host:{host}[All Fields]'

# Use Entrez.esearch to search for genomes based on the query
handle = Entrez.esearch(db="genome", term=query, retmax=100000)  # Increase retmax if needed

# Use Entrez.read to parse the search results
record = Entrez.read(handle)
print(record)

# Get a random subset of genome IDs
genome_ids = random.sample(record["IdList"], min(num_records, len(record["IdList"])))
print(genome_ids[:3])

This is my current code, it is not retrieving any id. What could am I doing wrong? And what can i add to get the fasta and download them?

Thank you.

Python BioPython NCBI Virus • 696 views

ADD COMMENT • link updated 12 months ago by josev.die ▴ 70 • written 12 months ago by Bioinformatics_begginner ▴ 20

score 2 · Answer 1 · 2023-12-13

Not Biopython but you can use EntrezDirect.

Download the list of accessions using the Download button at the link you posted above (assuming that is the search you wanted, there are 4000+ genomes not 30 as you mention above). Put them in a file, one accession per line.

$ cat id
NC_001341
NC_028834
NC_023556

Following will give you individual fasta files for the genomes.

$ for i in `cat id`; do echo ${i}; efetch -db nuccore -id ${i} -format fasta > ${i}.fasta; done

If you want to get a single file with all data

$ for i in `cat id`; do echo ${i}; efetch -db nuccore -id ${i} -format fasta >> virii.fasta; done

score 1 · Answer 2 · 2023-12-13

For R fans, here the translation with rentrez, which uses the NCBI API Eutils:

# Dependencies
library(Biostrings)
library(rentrez)

# Write a function 
get_sequence <- function(id) {
 target = rentrez::entrez_fetch(db = "nuccore", id = id, rettype = "fasta")
 target_tidy = strsplit(target, "\n")
 my_seq <- as.character(paste0(target_tidy[[1]][2:length(target_tidy[[1]])], collapse = ""))
 my_seq }

# Accession ids 
ids <- c("NC_001341", "NC_028834", "NC_023556")

# Run the function over your ids 
my_set <- sapply(ids, function(i) get_sequence(i))

# Result 
res <- DNAStringSet(my_set)

# Write a multi-fasta file with the sequences
writeXStringSet(res, "my_seqs.fa")