making estimated proteome from genome
4 weeks ago
dec986 ▴ 380

I have a massive genome (onion): GCA_030765085.1_ASM3076508v1

I attempted to make a blast database with makeblastdb, but it seems to have failed, as the genome is too large. So I figured that using a proteome is necessary.

However, NCBI doesn't provide a proteome, so I need to make my own.

I've asked a similar question before (how to make an estimated proteome from a genome?), but none of the suggestions are working.

A. I have tried Augustus:

    augustus --species=arabidopsis GCA_030765085.1_ASM3076508v1_genomic.fna > GCA_030765085.1_ASM3076508v1.augustus.arabidopsis.faa
    augustus: error while loading shared libraries: libboost_iostreams.so.1.85.0: cannot open shared object file: No such file or directory

B. prodigal:

    prodigal -i GCA_030765085.1_ASM3076508v1_genomic.fna -o GCA_030765085.1_ASM3076508v1.prodigal.protein.faa
    -------------------------------------
    PRODIGAL v2.6.3 [February, 2016]         
    Univ of Tenn / Oak Ridge National Lab
    Doug Hyatt, Loren Hauser, et al.     
    -------------------------------------
    Request:  Single Genome, Phase:  Training
    Reading in the sequence(s) to train...

    Warning:  Sequence is long (max 32000000 for training).
    Training on the first 32000000 bases.

    31990000 bp seq created, 41.21 pct GC
    Locating all potential starts and stops...1453037 nodes
    Looking for GC bias in different frames...frame bias scores: 1.60 0.44 0.96
    Building initial set of genes to train from...done!
    Creating coding model and scoring nodes...done!
    Examining upstream regions and training starts...done!
    -------------------------------------
    Request:  Single Genome, Phase:  Gene Finding
    Sequence too long (max 32000000 permitted).

C. ExPASy, but that is a website and I need a CLI tool.

D. GeneMark, but I don't know how to install it on a system where I don't have root permission.

The only thing that I can think of is to use prodigal on each of the 2099 individual contigs, and then combine them later.
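For what it's worth, the per-contig split itself is easy to do; here is a minimal awk sketch (the tiny test FASTA and file names are hypothetical, made up for illustration):

```shell
# Create a tiny test FASTA standing in for the real genome file
cat > genome_example.fna <<'EOF'
>contig1
ACGT
>contig2
GGCC
EOF

# Split a multi-FASTA into one file per contig, named after the header.
# Note: with ~2099 contigs, gawk handles the open-file count; plain awk
# implementations may need close(f) after each record.
awk '/^>/{f=substr($1,2)".fna"} {print > f}' genome_example.fna

ls contig1.fna contig2.fna
```

Whether running a prokaryotic gene finder per contig is a good idea is a separate question, addressed in the answers.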

What is the best approach to generate a proteome from an enormous genome when the proteome is not available from NCBI?

4 weeks ago
Mensur Dlakic ★ 28k

Prodigal is for prokaryotic gene prediction.

I attempted to make a blast database with makeblastdb, but it seems to have failed, as the genome is too large. So I figured that using a proteome is necessary.

I suggest you get a more recent version of makeblastdb (try makeblastdb -version). It is warped logic, to say the least, that predicting a eukaryotic proteome would be a better choice than figuring out how to make a genome BLAST database.

What is the best approach to generate a proteome from an enormous genome when the proteome is not available from NCBI?

This is a non-trivial task, even for people who know what they are doing. It is not something that can be done at the snap of fingers: it requires knowledge, appropriate software and resources. My advice is to focus on creating a BLAST database. You may have to specify larger database size with makeblastdb, something like -max_file_sz 5GB or -max_file_sz 10GB.
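As a sketch of the suggested approach (the -out and -title values are made up for illustration; the input name is the genome file from the question):

```shell
# Confirm you are on a reasonably recent BLAST+ release
makeblastdb -version

# Build a nucleotide BLAST database, capping each database volume
# below the 4 GiB per-file limit; makeblastdb creates as many
# volumes as the 16 GB input requires
makeblastdb \
  -in GCA_030765085.1_ASM3076508v1_genomic.fna \
  -dbtype nucl \
  -max_file_sz 3GB \
  -out onion_genome \
  -title "Allium cepa GCA_030765085.1"
```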


I'm fully with Mensur Dlakic on this.

Going for "let's quickly annotate the genome because the blastdb is not working" is seriously twisted reasoning ;-) (and I'm saying this with nearly 20 years of experience in genome annotation).

Moreover, apart from the technical aspects, it might not even make sense from a biological point of view. There are analyses for which you will need a genome nucleotide BLAST DB, and for those a proteome DB will be of little or no use at all.

Bottom line: keep working on fixing the blastDB creation issue rather than building a different kind of blastDB!

If you post the exact error you get when making the blastDB, perhaps we can resolve it?


I tried increasing the max file size, but I get an error with makeblastdb 2.16.0:

BLAST options error: max_file_sz must be < 4 GiB

The genome file from NCBI is 16 GB.


Try a size smaller than 4 GiB. makeblastdb will create as many files as needed.


To be clear: max_file_sz has nothing to do with the total genome size, but rather with the size of the file chunks (volumes) in the resulting BLAST database. Others may have better advice on this, but I think that max_file_sz must be larger than the largest individual contig, plus a bit extra. You can always try sizes just under 4 GiB, as it is unlikely that there is a contig/chromosome of that size.

I think you should tell us the command you used and the resulting error during database creation.

4 weeks ago

You can install augustus through bioconda. https://bioconda.github.io/recipes/augustus/README.html#package-augustus
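A minimal sketch of the bioconda route, which sidesteps both the root-permission problem and the missing libboost_iostreams library (the environment name is arbitrary):

```shell
# Install AUGUSTUS into an isolated conda environment (no root needed);
# conda pulls in matching boost libraries, avoiding the
# libboost_iostreams.so error from the system install
conda create -n augustus -c conda-forge -c bioconda augustus
conda activate augustus

# Verify the binary loads its shared libraries correctly
augustus --version
```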

Also, please tell us about the computer or server you're using. I think you'll need at least 64GB RAM to work with a genome of that size.


This is a massive HPC with > 100GB RAM and 40 CPUs. I share it with others, of course.


As long as you are able to ask for ~64GB of RAM for your job, you should be able to run Augustus. If you are not allowed to use that much RAM, then this option may not work either.
