How to query NCBI to extract Virus fasta files using BioPython?
2
0
Entering edit mode
11 months ago

Hi !

I want to extract the genome fasta files of 30 samples automatically using python script from here https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=10239&host=bacteria.

I want the virusus that have has host bacteria and I am using BioPython Package.

Entrez.email = "mail"
taxid= 10239
host= "bacteria"
num_records = 20
query = f'taxid:{taxid}[Organism] AND host:{host}[All Fields]'

# Use Entrez.esearch to search for genomes based on the query
handle = Entrez.esearch(db="genome", term=query, retmax=100000)  # Increase retmax if needed

# Use Entrez.read to parse the search results
record = Entrez.read(handle)
print(record)

# Get a random subset of genome IDs
genome_ids = random.sample(record["IdList"], min(num_records, len(record["IdList"])))
print(genome_ids[:3])

This is my current code, it is not retrieving any id. What could am I doing wrong? And what can i add to get the fasta and download them?

Thank you.

Python BioPython NCBI Virus • 655 views
ADD COMMENT
2
Entering edit mode
11 months ago
GenoMax 147k

Not Biopython but you can use EntrezDirect.

Download the list of accessions using the Download button at the link you posted above (assuming that is the search you wanted, there are 4000+ genomes not 30 as you mention above). Put them in a file, one accession per line.

$ cat id
NC_001341
NC_028834
NC_023556

Following will give you individual fasta files for the genomes.

$ for i in `cat id`; do echo ${i}; efetch -db nuccore -id ${i} -format fasta > ${i}.fasta; done

If you want to get a single file with all data

$ for i in `cat id`; do echo ${i}; efetch -db nuccore -id ${i} -format fasta >> virii.fasta; done
ADD COMMENT
1
Entering edit mode
11 months ago
josev.die ▴ 70

For R fans, here the translation with rentrez, which uses the NCBI API Eutils:

# Dependencies
library(Biostrings)
library(rentrez)

# Write a function 
get_sequence <- function(id) {
 target = rentrez::entrez_fetch(db = "nuccore", id = id, rettype = "fasta")
 target_tidy = strsplit(target, "\n")
 my_seq <- as.character(paste0(target_tidy[[1]][2:length(target_tidy[[1]])], collapse = ""))
 my_seq }

# Accession ids 
ids <- c("NC_001341", "NC_028834", "NC_023556")

# Run the function over your ids 
my_set <- sapply(ids, function(i) get_sequence(i))

# Result 
res <- DNAStringSet(my_set)

# Write a multi-fasta file with the sequences
writeXStringSet(res, "my_seqs.fa") 
ADD COMMENT

Login before adding your answer.

Traffic: 2768 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6