Question

Issue with making local BLAST database

0

Entering edit mode

4 weeks ago

Begonia_pavonina ▴ 150

I wish to make a local database of diatoms mitochondrial large sub-units with the R script below. My criteria of research are:

large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields]

On the NCBI database, this search gives 2,285,123 sequences matching.

https://www.ncbi.nlm.nih.gov/nuccore/?term=large%20subunit%20ribosomal

But in the output of my script there are only 20 entries in the database. Would anyone know why this discrepancy? Did I miss something obvious?

# Define search term and filters
search_term <- "large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields] AND diatoms[All Fields]"
db <- "nucleotide"

# Perform the search
search_results <- entrez_search(db, term = search_term)

# Fetch sequences
sequences <- entrez_fetch(db, id = search_results$ids, rettype = "fasta")

# Write sequences to file
writeLines(sequences, "../data/diatom_sequences_R.fasta")

# Make a database from this fasta file:
fasta_file <- "../data/diatom_sequences_R.fasta"
db_name <- "../data/db_diatoms_seq"

# Create BLAST database
system2("makeblastdb", args = c("-in",
                                fasta_file, 
                                "-dbtype", 
                                "nucl", 
                                "-parse_seqids", 
                                "-out", 
                                db_name))

R mitochodrial diatoms blast ncbi • 234 views

ADD COMMENT • link updated 29 days ago by Ram 43k • written 4 weeks ago by Begonia_pavonina ▴ 150

Ram · Answer 1 · 2024-03-28

1

Entering edit mode

4 weeks ago

GenoMax 141k

By default only 20 entries are returned if you don't set the retmax parameter. You will not want to do this via R since you will be limited to 10000 entries max (see: https://www.ncbi.nlm.nih.gov/books/NBK25499/ ) Use command line EntrezDirect or extract the sequences using taxID from pre-formatted blast database.

ADD COMMENT • link 4 weeks ago by GenoMax 141k

0

Entering edit mode

Thank you GenoMax, if I have well understood the retmax parameter and 10000 entries limitation are linked to the R package, and not to the command line EntrezDirect.

I have re-written the script to accommodate it:

# Define search term and database
search_term <- "large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields] AND diatoms[All Fields]"
db <- "nucleotide"

# Use Esearch
system2("esearch", args = c("-db", db, "-query", search_term, "|", "efetch", "-format", "uid", "|", "awk", "'{printf \"%s,\", $0}'", "|", "sed", "'s/,$//' > ids.txt"), wait = TRUE)

# Use Efetch
system2("efetch", args = c("-db", db, "-id", "$(cat ids.txt)", "-format", "fasta", ">", "sequences.fasta"), wait = TRUE)

# Cleanup temporary files
file.remove("ids.txt")

ADD REPLY • link updated 29 days ago by Ram 43k • written 4 weeks ago by Begonia_pavonina ▴ 150