Build NCBI blast database with a specific genus or species
16 months ago
ThePresident ▴ 180

I would like to set up a local BLAST workflow using NCBI's BLAST+ suite.

I downloaded the latest (v2.14) BLAST+ suite and would like to set up a database. I am interested in a specific bacterial genus, and I am unsure how to set up a database that contains only sequences from that genus.

From this post, I know I could download the nt_prok database and then run BLAST with a list of GI accession numbers that relate to my organism of interest. However, the nt_prok database is around 50 GB and I don't have that space (and can't use a removable hard drive).

Is there a way to build a custom blast database that contains only sequences of interest?

ncbi blast • 2.1k views

contains only sequences from that genus

Are you interested in whole-genome sequences, or in any sequences from that genus that are in GenBank? Mensur has covered the whole-genome aspect, if that is what you wanted.


Whole genome, scaffolds and contigs - Mensur's solution seems to work.

16 months ago
Mensur Dlakic ★ 28k

There is a script for massive downloads from NCBI:

https://github.com/pirovc/genome_updater

Let's say that you are interested in the genus Staphylococcus, which has taxonomy ID 1279. This command will download all genomic DNA files (complete and incomplete) for this taxID and store them in the directory Staph. Running the script by itself prints information on how to customize the options.

genome_updater.sh -d "refseq" -g "taxids:1279" -c "all" -l "all" -f "genomic.fna.gz" -o Staph -t 10 -u -m -a

It appears the script arguments have changed slightly, but overall I was able to download what I needed. Pretty neat tool - I would have thought that NCBI's BLAST+ or Entrez should have that as well. Edit: They do - see GenoMax's answer.

This is the command line with the new arguments:

genome_updater.sh -d "refseq" -g "bacteria" -T 1313 -f "genomic.fna.gz" -o Staph -t 10 -u -m -a

Now, does anyone have experience building a custom BLAST database from these sequences? I will refer to this post, but NCBI is notoriously bad at explanations.
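For the database-building step, here is a minimal sketch rather than a definitive recipe. It assumes the downloaded genomes are gzipped FASTA files (*.fna.gz) somewhere below the Staph directory, and that makeblastdb from BLAST+ is on your PATH. The database name my_genus_db is a made-up placeholder. The -parse_seqids option is optional and expects every sequence ID to be unique.

```shell
set -eu

# Assumptions: "Staph" is the download directory used above and "my_genus_db"
# is a hypothetical database name; both can be overridden via the environment.
genome_dir="${GENOME_DIR:-Staph}"
db_name="${DB_NAME:-my_genus_db}"

# Print every gzipped FASTA file below a directory as one plain FASTA stream.
# -L follows symlinks, in case the download tool links files between versions.
stream_fasta() {
    find -L "$1" -type f -name '*.fna.gz' -print0 | sort -z | xargs -0 -r gzip -dc
}

if command -v makeblastdb >/dev/null 2>&1 && [ -d "$genome_dir" ]; then
    # "-in -" makes makeblastdb read FASTA from stdin, so -title and -out
    # have to be given explicitly.
    stream_fasta "$genome_dir" |
        makeblastdb -in - -dbtype nucl -parse_seqids \
            -title "$db_name" -out "$db_name"
else
    echo "makeblastdb or directory '$genome_dir' not found; skipping build" >&2
fi
```

Streaming through a pipe avoids writing one huge uncompressed FASTA file to disk, which matters when space is tight. The resulting database can then be searched with blastn -db my_genus_db.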


What are you planning to use as a query? BLAST may not be an appropriate choice for some queries. I don't know how many genomes you downloaded, but you will need good infrastructure (especially memory) to build and search a BLAST database made of genomes.


I'd like to assess the presence (and homology) of a list of genes of interest in the species population. For example, I'd like to know how well gene A is conserved within the species (e.g., gene A found in 83% of sequenced isolates at 95% identity cutoff, and 96% conserved at 75% identity cutoff etc.)

My idea was to download the database (my species of interest has about 8,000 sequenced genomes), and then do local blastn or tblastx.
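One way the blastn route could be turned into "gene found in X% of genomes" numbers is sketched below. Several things here are assumptions: the file names genes.fna, hits.tsv and contig2assembly.tsv are placeholders, the database name my_genus_db is hypothetical, and the 80% query-coverage requirement is an arbitrary default. The mapping file (sequence ID, then the assembly it belongs to) is needed because a genome database is searched contig by contig, not genome by genome. Each hit is judged on its own, so a gene split across two contigs would be missed.

```shell
set -eu

# summarize_hits HITS MAP MIN_IDENTITY TOTAL_GENOMES [MIN_COVERAGE]
#   HITS: output of blastn -outfmt "6 qseqid saccver pident length qlen"
#   MAP:  hypothetical two-column file: sequence ID <tab> assembly
# Prints: gene, number of genomes with a qualifying hit, percent of total.
summarize_hits() {
    awk -v min_id="$3" -v total="$4" -v min_cov="${5:-0.8}" '
        NR == FNR { asm[$1] = $2; next }      # first file: contig -> assembly
        $3 >= min_id && $4 >= min_cov * $5 && ($2 in asm) {
            key = $1 SUBSEP asm[$2]           # count each genome once per gene
            if (!(key in seen)) { seen[key] = 1; n[$1]++ }
        }
        END { for (q in n) printf "%s\t%d\t%.1f\n", q, n[q], 100 * n[q] / total }
    ' "$2" "$1" | sort
}

# Guarded example run; it only executes if blastn and the input files exist.
if command -v blastn >/dev/null 2>&1 && [ -f genes.fna ] && [ -f contig2assembly.tsv ]; then
    # Raise -max_target_seqs so hits are not capped well below the genome count.
    blastn -query genes.fna -db my_genus_db -num_threads 4 \
        -max_target_seqs 100000 \
        -outfmt "6 qseqid saccver pident length qlen" > hits.tsv
    summarize_hits hits.tsv contig2assembly.tsv 95 8000
fi
```

Running summarize_hits again on the same hits.tsv with a different identity cutoff (for example 75) gives the second conservation figure without repeating the search.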

Is there a better solution?


You could potentially try minimap2 (or BLAT) to do the initial alignments to discern if the particular gene is present in the genome. This would likely be fast and less resource intensive. Then you could follow that up with BLAST to see finer local alignments.
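A rough sketch of the minimap2 screening idea, for nucleotide gene sequences only. The file names genes.fna and genome.fna.gz are placeholders, and the 80% coverage and 75% identity defaults are arbitrary assumptions, not recommended values. The helper reads minimap2's PAF output and lists the genes that align over most of their length.

```shell
set -eu

# paf_present PAF [MIN_COVERAGE] [MIN_IDENTITY]
# PAF columns used: 1 query name, 2 query length, 3 query start, 4 query end,
# 10 number of matching bases, 11 alignment block length.
paf_present() {
    awk -v cov="${2:-0.8}" -v id="${3:-0.75}" '
        ($4 - $3) / $2 >= cov && $10 / $11 >= id { print $1 }
    ' "$1" | sort -u
}

# Guarded example; it only runs if minimap2 and both input files exist.
if command -v minimap2 >/dev/null 2>&1 && [ -f genes.fna ] && [ -f genome.fna.gz ]; then
    # -c performs base-level alignment, so columns 10 and 11 reflect real matches.
    minimap2 -c genome.fna.gz genes.fna > genes_vs_genome.paf
    paf_present genes_vs_genome.paf
fi
```

Looping this over each genome file gives a quick presence/absence table, and only the genes that pass would need the finer BLAST follow-up.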

16 months ago
GenoMax 147k

You could also use the NCBI datasets tool (LINK).

$ datasets download genome taxon "1279"

Looks like there are 120K genomes.

So if you want to keep the number down then you may want to get RefSeq genomes only.

$ datasets download genome taxon "1279" --assembly-source RefSeq

This cuts the number down to 20.5K genomes.
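After the download, the package has to be unzipped before the FASTA files can be used. The sketch below assumes a recent datasets release (one that accepts --include and --filename) and assumes the usual package layout of ncbi_dataset/data/<assembly accession>/*.fna. The names staph_refseq.zip, staph_refseq and contig2assembly.tsv are placeholders. The download is opt-in via a RUN_DOWNLOAD variable made up for this example, because the package is large.

```shell
set -eu

# contig_map DIR: print "sequence ID <tab> assembly accession" for every FASTA
# file in an unzipped package (DIR/ncbi_dataset/data/<accession>/*.fna).
contig_map() {
    find "$1/ncbi_dataset/data" -type f -name '*.fna' | sort |
    while IFS= read -r fasta; do
        accession=$(basename "$(dirname "$fasta")")
        # Keep the first word of each header line, without the leading ">".
        awk -v acc="$accession" '/^>/ { print substr($1, 2) "\t" acc }' "$fasta"
    done
}

# Opt-in download (RUN_DOWNLOAD=1), skipped if datasets is not installed.
if [ "${RUN_DOWNLOAD:-0}" = 1 ] && command -v datasets >/dev/null 2>&1; then
    datasets download genome taxon 1279 --assembly-source RefSeq \
        --include genome --filename staph_refseq.zip
    unzip -q staph_refseq.zip -d staph_refseq
    contig_map staph_refseq > contig2assembly.tsv
fi
```

The contig-to-assembly table is handy later, since BLAST reports hits per sequence and you usually want to count them per genome.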


I downloaded the tools using curl and gave them execute permission, but I get this error when trying to run datasets: "datasets: command not found". I am running it from the directory where the files are stored.


You should ideally put the programs into a directory that is included in $PATH. In a pinch, use ./datasets instead of just datasets.
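A small sketch of the $PATH approach, assuming you keep personal executables in ~/bin (a hypothetical choice; any directory works). The helper only prepends the directory if it is not already listed, so it is safe to put in a shell startup file.

```shell
# add_to_path DIR: put DIR at the front of PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                          # already present, do nothing
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}

# One-time setup, run by hand from the directory holding the download:
#   mkdir -p "$HOME/bin" && mv datasets "$HOME/bin/" && chmod +x "$HOME/bin/datasets"
add_to_path "$HOME/bin"
```

The shell does not search the current directory for commands by default, which is why plain datasets fails while ./datasets works.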
