I want to build a local database of diatom mitochondrial large-subunit sequences with the R script below. My search criteria are:
large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields]
On the NCBI database, this search returns 2,285,123 matching sequences.
https://www.ncbi.nlm.nih.gov/nuccore/?term=large%20subunit%20ribosomal
But the database produced by my script contains only 20 entries. Does anyone know the reason for this discrepancy? Did I miss something obvious?
# Load the rentrez package, which provides entrez_search() and entrez_fetch()
library(rentrez)
# Define search term and filters
search_term <- "large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields] AND diatoms[All Fields]"
db <- "nucleotide"
# Perform the search
search_results <- entrez_search(db, term = search_term)
# Fetch sequences
sequences <- entrez_fetch(db, id = search_results$ids, rettype = "fasta")
# Write sequences to file
writeLines(sequences, "../data/diatom_sequences_R.fasta")
# Make a database from this fasta file:
fasta_file <- "../data/diatom_sequences_R.fasta"
db_name <- "../data/db_diatoms_seq"
# Create BLAST database
system2("makeblastdb",
        args = c("-in", fasta_file,
                 "-dbtype", "nucl",
                 "-parse_seqids",
                 "-out", db_name))
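For reference, here is a minimal sketch of one way to retrieve more than the first page of hits with rentrez. As far as I know, entrez_search() returns only 20 IDs unless retmax is raised, which would explain the count above. The sketch assumes the NCBI web-history mechanism (use_history = TRUE) and fetches the records in batches. The helper names batch_starts and fetch_all_fasta and the batch size of 500 are my own illustrative choices, not part of rentrez or of the original script.

```r
# Offsets for fetching `total` records in batches of `batch_size`.
# The E-utilities retstart parameter is 0-based, so the first batch starts at 0.
batch_starts <- function(total, batch_size) {
  if (total <= 0) return(integer(0))
  seq(0, total - 1, by = batch_size)
}

# Hypothetical helper: search once with web history, then download every hit
# as FASTA in batches and append each batch to `out_file`.
fetch_all_fasta <- function(term, out_file, db = "nucleotide", batch_size = 500) {
  search <- rentrez::entrez_search(db = db, term = term, use_history = TRUE)
  total <- as.integer(search$count)  # total number of hits, not just the IDs returned

  # Start from an empty file so that repeated runs do not duplicate records
  if (file.exists(out_file)) file.remove(out_file)

  for (start in batch_starts(total, batch_size)) {
    recs <- rentrez::entrez_fetch(db = db,
                                  web_history = search$web_history,
                                  rettype = "fasta",
                                  retmax = batch_size,
                                  retstart = start)
    cat(recs, file = out_file, append = TRUE)
  }
  invisible(total)
}

# Example call (requires network access and the rentrez package):
# fetch_all_fasta(search_term, "../data/diatom_sequences_R.fasta")
```

Batching keeps each request small, which should be gentler on the NCBI servers than asking for everything at once. The total comes from the count field of the search result rather than from the length of the ids vector.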
Thank you, GenoMax. If I have understood correctly, the retmax parameter and the 10,000-entry limit are tied to the R package, not to the command-line EntrezDirect. I have rewritten the script to accommodate this: