Issue with making local BLAST database
1
0
Entering edit mode
4 weeks ago

I wish to make a local database of diatoms mitochondrial large sub-units with the R script below. My criteria of research are:

large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields]

On the NCBI database, this search gives 2,285,123 sequences matching.

https://www.ncbi.nlm.nih.gov/nuccore/?term=large%20subunit%20ribosomal

But in the output of my script there are only 20 entries in the database. Would anyone know why this discrepancy? Did I miss something obvious?

# Define search term and filters
search_term <- "large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields] AND diatoms[All Fields]"
db <- "nucleotide"

# Perform the search
search_results <- entrez_search(db, term = search_term)

# Fetch sequences
sequences <- entrez_fetch(db, id = search_results$ids, rettype = "fasta")

# Write sequences to file
writeLines(sequences, "../data/diatom_sequences_R.fasta")

# Make a database from this fasta file:
fasta_file <- "../data/diatom_sequences_R.fasta"
db_name <- "../data/db_diatoms_seq"

# Create BLAST database
system2("makeblastdb", args = c("-in",
                                fasta_file, 
                                "-dbtype", 
                                "nucl", 
                                "-parse_seqids", 
                                "-out", 
                                db_name))
R mitochodrial diatoms blast ncbi • 234 views
ADD COMMENT
1
Entering edit mode
4 weeks ago
GenoMax 141k

By default only 20 entries are returned if you don't set the retmax parameter. You will not want to do this via R since you will be limited to 10000 entries max (see: https://www.ncbi.nlm.nih.gov/books/NBK25499/ ) Use command line EntrezDirect or extract the sequences using taxID from pre-formatted blast database.

ADD COMMENT
0
Entering edit mode

Thank you GenoMax, if I have well understood the retmax parameter and 10000 entries limitation are linked to the R package, and not to the command line EntrezDirect.

I have re-written the script to accommodate it:

# Define search term and database
search_term <- "large[All Fields] AND subunit[All Fields] AND ribosomal[All Fields] AND diatoms[All Fields]"
db <- "nucleotide"

# Use Esearch
system2("esearch", args = c("-db", db, "-query", search_term, "|", "efetch", "-format", "uid", "|", "awk", "'{printf \"%s,\", $0}'", "|", "sed", "'s/,$//' > ids.txt"), wait = TRUE)

# Use Efetch
system2("efetch", args = c("-db", db, "-id", "$(cat ids.txt)", "-format", "fasta", ">", "sequences.fasta"), wait = TRUE)

# Cleanup temporary files
file.remove("ids.txt")
ADD REPLY

Login before adding your answer.

Traffic: 1760 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6