I have a massive genome (onion): GCA_030765085.1_ASM3076508v1
I tried to build a BLAST database with makeblastdb, but it appears to have failed because the genome is too large, so I figured I need to use a proteome instead.
However, NCBI doesn't provide a proteome, so I need to make my own.
I've asked a similar question before (how to make an estimated proteome from a genome?), but none of the suggested tools are working:
A. I have tried Augustus:
augustus --species=arabidopsis GCA_030765085.1_ASM3076508v1_genomic.fna > GCA_030765085.1_ASM3076508v1.augustus.arabidopsis.faa
augustus: error while loading shared libraries: libboost_iostreams.so.1.85.0: cannot open shared object file: No such file or directory
B. prodigal:
prodigal -i GCA_030765085.1_ASM3076508v1_genomic.fna -o GCA_030765085.1_ASM3076508v1.prodigal.protein.faa
-------------------------------------
PRODIGAL v2.6.3 [February, 2016]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.
-------------------------------------
Request: Single Genome, Phase: Training
Reading in the sequence(s) to train...
Warning: Sequence is long (max 32000000 for training).
Training on the first 32000000 bases.
31990000 bp seq created, 41.21 pct GC
Locating all potential starts and stops...1453037 nodes
Looking for GC bias in different frames...frame bias scores: 1.60 0.44 0.96
Building initial set of genes to train from...done!
Creating coding model and scoring nodes...done!
Examining upstream regions and training starts...done!
-------------------------------------
Request: Single Genome, Phase: Gene Finding
Sequence too long (max 32000000 permitted).
C. ExPASy, but that is a website and I need a CLI tool.
D. GeneMark, but I don't know how to install it on a system where I don't have root permissions.
The only thing that I can think of is to use prodigal on each of the 2099 individual contigs, and then combine them later.
What is the best approach to generate a proteome from an enormous genome when the proteome is not available from NCBI?
If it works, then go for that. You can also combine more than one contig per input file, as long as the total length stays below 32000000 bases for each input.
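
Grouping the contigs into inputs that stay under the limit could be sketched in Python roughly like this. It is only a sketch under a few assumptions: the 32000000 limit is treated as the total number of bases per input file, the function names (`read_fasta`, `batch_contigs`, `write_batches`) and the `batch_NNNN.fna` file names are made up for illustration, and any contig that is itself at or above the limit is written alone with a warning, since grouping cannot help with it.

```python
import sys
from pathlib import Path

MAX_BP = 32_000_000  # limit quoted in the Prodigal message


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batch_contigs(records, max_bp=MAX_BP):
    """Greedily group consecutive contigs so each batch totals fewer than max_bp bases.

    A contig that alone has max_bp bases or more ends up in a batch by itself.
    """
    batch, total = [], 0
    for header, seq in records:
        if batch and total + len(seq) >= max_bp:
            yield batch
            batch, total = [], 0
        batch.append((header, seq))
        total += len(seq)
    if batch:
        yield batch


def write_batches(fasta_path, out_dir, max_bp=MAX_BP, width=80):
    """Write every batch to out_dir/batch_NNNN.fna and return the list of paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with open(fasta_path) as handle:
        batches = batch_contigs(read_fasta(handle), max_bp)
        for i, batch in enumerate(batches, start=1):
            path = out_dir / f"batch_{i:04d}.fna"
            with open(path, "w") as out:
                for header, seq in batch:
                    if len(seq) >= max_bp:
                        # Still too long on its own; grouping cannot fix this.
                        print(f"warning: {header} is {len(seq)} bp", file=sys.stderr)
                    out.write(f">{header}\n")
                    for start in range(0, len(seq), width):
                        out.write(seq[start:start + width] + "\n")
            paths.append(path)
    return paths


# Only runs when called as: python script.py <genome.fna> <output_dir>
if __name__ == "__main__" and len(sys.argv) == 3:
    for p in write_batches(sys.argv[1], sys.argv[2]):
        print(p)
```

Saved under some name such as split_contigs.py (hypothetical), it would be run as `python split_contigs.py GCA_030765085.1_ASM3076508v1_genomic.fna batches`, after which prodigal can be run on each batch file and the per-batch outputs concatenated. Note that prodigal writes protein translations with `-a` (for example `prodigal -i batches/batch_0001.fna -a batches/batch_0001.faa`), while `-o` holds the gene coordinates.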