I have a massive genome (onion): GCA_030765085.1_ASM3076508v1
I attempted to make a BLAST database with makeblastdb, but it seems to have failed because the genome is too large, so I figured I should use a proteome instead.
However, NCBI doesn't provide a proteome, so I need to make my own.
I've asked a similar question before ("how to make estimated proteome from genome?"), but none of the suggested approaches are working:
A. I have tried Augustus (a possible fix for the library error it hits is sketched after this list):
augustus --species=arabidopsis GCA_030765085.1_ASM3076508v1_genomic.fna > GCA_030765085.1_ASM3076508v1.augustus.arabidopsis.faa
augustus: error while loading shared libraries: libboost_iostreams.so.1.85.0: cannot open shared object file: No such file or directory
B. prodigal:
prodigal -i GCA_030765085.1_ASM3076508v1_genomic.fna -a GCA_030765085.1_ASM3076508v1.prodigal.protein.faa
-------------------------------------
PRODIGAL v2.6.3 [February, 2016]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.
-------------------------------------
Request: Single Genome, Phase: Training
Reading in the sequence(s) to train...
Warning: Sequence is long (max 32000000 for training).
Training on the first 32000000 bases.
31990000 bp seq created, 41.21 pct GC
Locating all potential starts and stops...1453037 nodes
Looking for GC bias in different frames...frame bias scores: 1.60 0.44 0.96
Building initial set of genes to train from...done!
Creating coding model and scoring nodes...done!
Examining upstream regions and training starts...done!
-------------------------------------
Request: Single Genome, Phase: Gene Finding
Sequence too long (max 32000000 permitted).
C. ExPASy, but I can't use a website for this; I need a CLI tool.
D. GeneMark, but I don't know how to install it on a system where I don't have root permissions.
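The AUGUSTUS failure in A. looks like a broken library environment rather than a genome-size problem, so here is a sketch of one possible fix (an assumption on my part, and it requires conda or mamba to be available, though no root permissions):

# Install AUGUSTUS into a clean environment so a matching libboost_iostreams ships with it.
conda create -n augustus -c conda-forge -c bioconda augustus
conda activate augustus
# Re-run the prediction. Note that AUGUSTUS writes GFF; the bundled getAnnoFasta.pl script can extract the protein sequences from it afterwards.
augustus --species=arabidopsis GCA_030765085.1_ASM3076508v1_genomic.fna > GCA_030765085.1_ASM3076508v1.augustus.arabidopsis.gff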
The only thing I can think of is to run Prodigal on each of the 2099 individual contigs and then combine the outputs later.
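For completeness, that per-contig idea could look something like the sketch below (assuming a POSIX shell and awk; the contigs/ directory and file names are made up here). Any single sequence over Prodigal's 32 Mbp limit will still fail, and Prodigal has no intron model, so its predictions on a plant genome are of questionable value anyway:

mkdir -p contigs
# Split the multi-FASTA into one file per contig.
awk '/^>/ { if (out) close(out); out = sprintf("contigs/contig_%04d.fna", ++n) } { print > out }' GCA_030765085.1_ASM3076508v1_genomic.fna
# Run Prodigal on each piece; -a writes the protein translations.
for f in contigs/*.fna; do prodigal -i "$f" -a "${f%.fna}.faa" -o "${f%.fna}.genes" || echo "failed: $f"; done
# Combine the per-contig predictions into a single proteome.
cat contigs/*.faa > GCA_030765085.1_ASM3076508v1.prodigal.protein.faa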
What is the best approach to generate a proteome from an enormous genome when the proteome is not available from NCBI?
I'm personally fully with Mensur Dlakic on this topic.
Going for "let's quickly annotate the genome because the blastdb is not working" is seriously twisted reasoning ;-) (and I'm saying this with nearly 20 years of experience in genome annotation).
Moreover, apart from the technical aspects, it might not even make sense from a 'biological' point of view... There are reasons (analyses) for which you will need a genome nucleotide BLAST DB, and for those a proteome DB will be of little or no use at all.
Bottom line: keep trying to fix the BLAST DB creation issue rather than building a different kind of BLAST DB!
If you post the issues you're hitting when making the BLAST DB (or write a separate post about them), perhaps we can resolve them?
I tried increasing the max file size, but I get an error with makeblastdb 2.16.0. The genome file from NCBI is 16 GB.
Try a size smaller than 4 GiB; it will make more or fewer files as needed.
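For example, something along these lines (a sketch only; the -out name and -title are placeholders). With -max_file_sz below 4 GiB, makeblastdb writes the 16 GB genome as multiple database volumes that BLAST then treats as a single database:

makeblastdb -in GCA_030765085.1_ASM3076508v1_genomic.fna -dbtype nucl -out onion_genome_db -title "Allium cepa GCA_030765085.1" -max_file_sz 3GB
# Queries run against the volume set transparently (my_query.fna is a placeholder):
blastn -query my_query.fna -db onion_genome_db -outfmt 6 -out hits.tsv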
To be clear: max_file_sz has nothing to do with genome size, but rather with the size of the file chunks in the resulting BLAST database. Others may have better advice on this, but I think that max_file_sz must be larger than the largest individual contig, plus a bit extra. You can always try sizes up to 4 GB, as it is unlikely that there is a contig/chromosome of that size.
I think you should tell us the command you used and the resulting error during database creation.
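To pick a safe value, you can first check the longest sequence in the assembly, for example with seqkit (assuming it is installed; any FASTA length counter will do):

# 'seqkit stats -a' reports max_len, the longest contig/chromosome in the file; choose -max_file_sz comfortably above that, up to the 4 GB limit.
seqkit stats -a GCA_030765085.1_ASM3076508v1_genomic.fna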