I'm hoping this doesn't come off as a lazy question, but I am wondering if anyone is aware of a tool or pipeline that takes a reference genome (supercontigs, contigs, scaffolds, what have you) and an annotation file (GFF, etc.) and goes "directly" to a formatted BLAST database? I've searched around and haven't been able to find anything that does this.
I need to take a LOT of microbial genomes and create a BLAST database from all of their coding sequences. For this in silico experiment I'm benchmarking against NCBI databases, so I want the contents to be different and can't just use other databases. Specifically, I'm looking for a pipeline that lets me re-run the database creation step over and over again.
Basically, what I am looking for would be something like: script -input *.fasta -annotation *.gff -output blast.database
I know I can do this using numerous tools (formatdb, etc.) by stitching them together in my own pipeline, but I wanted to see if someone had already developed a tool before I invested any time -- any input is appreciated!
Note that formatdb has been replaced by makeblastdb in BLAST+, and with either tool it is trivial to make a nucleotide BLAST database from your assembly FASTA file(s).
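For what it's worth, here is a minimal Python sketch of wrapping makeblastdb so the database step can be re-run from a script. The function names (makeblastdb_command, build_db) and the file names in the usage note are made up for illustration; the only makeblastdb options used are -in, -dbtype and -out, and it assumes BLAST+ is installed and on your PATH.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_prefix, dbtype="nucl"):
    """Return the argument list for a BLAST+ makeblastdb call.

    dbtype is "nucl" for nucleotide or "prot" for protein databases.
    """
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_prefix]


def build_db(fasta_path, db_prefix, dbtype="nucl"):
    """Run makeblastdb; fails loudly if the tool is missing or exits non-zero."""
    if shutil.which("makeblastdb") is None:
        raise RuntimeError("makeblastdb not found on PATH (is BLAST+ installed?)")
    subprocess.run(makeblastdb_command(fasta_path, db_prefix, dbtype), check=True)
```

Something like build_db("assembly.fasta", "assembly_db") would then be the whole nucleotide step; for many genomes, concatenating the FASTA files first is probably the simplest route.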
Could you clarify whether you want a gene nucleotide BLAST database (which means first processing the annotation to make a FASTA file of the gene coding sequences), or a gene protein BLAST database (which also means translating them into amino acids)?
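To make the two options concrete, here is a rough Python sketch of the annotation-processing step: pull the CDS features out of a GFF3 file, cut them out of the genome FASTA, and optionally translate them. It is only a sketch under some loud assumptions: every CDS line is treated as one complete gene (no multi-segment CDS joining, phase ignored), translation uses the standard codon assignments (which bacterial table 11 shares) without special handling of alternative start codons, and all function names are my own invention rather than from any existing tool.

```python
from itertools import product

# Standard codon assignments in NCBI order (bases T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(lines):
    """Parse FASTA lines into {sequence_id: sequence}; the id is the first word of the header."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def cds_features(gff_lines):
    """Yield (seqid, start, end, strand, id) for each CDS line of a GFF3 file."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        fallback = f"{cols[0]}_{cols[3]}_{cols[4]}"  # used when there is no ID attribute
        yield cols[0], int(cols[3]), int(cols[4]), cols[6], attrs.get("ID", fallback)


def extract_cds(genome, gff_lines):
    """Yield (id, nucleotide_sequence) per CDS; GFF coordinates are 1-based inclusive."""
    for seqid, start, end, strand, cds_id in cds_features(gff_lines):
        seq = genome[seqid][start - 1:end].upper()
        if strand == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        yield cds_id, seq


def translate(nt):
    """Translate up to the first stop codon; unknown codons become 'X'."""
    protein = []
    for i in range(0, len(nt) - 2, 3):
        aa = CODON_TABLE.get(nt[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

Writing the (id, sequence) pairs out as FASTA gives the input for a nucleotide database, and writing translate(sequence) instead gives the input for a protein one; each file then goes through makeblastdb with the matching -dbtype.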
Thanks for the correction on the BLAST+ commands -- you saved me some trouble!
I'm hoping to be able to run the analysis against both nucleotide and protein databases -- so I would need both. I think I might have to make my own pipeline.
Thanks for the input +1.