Question

Making an estimated proteome from a genome

5 hours ago

dec986 ▴ 380

I have a massive genome (onion): GCA_030765085.1_ASM3076508v1

I attempted to make a blast database with makeblastdb, but it seems to have failed, as the genome is too large. So I figured that using a proteome is necessary.

However, NCBI doesn't provide a proteome, so I need to make my own.

I've asked a similar question before ("how to make estimated proteome from genome?"), but none of the tools I have tried are working:

A. I have tried Augustus:

    augustus --species=arabidopsis GCA_030765085.1_ASM3076508v1_genomic.fna > GCA_030765085.1_ASM3076508v1.augustus.arabidopsis.faa
    augustus: error while loading shared libraries: libboost_iostreams.so.1.85.0: cannot open shared object file: No such file or directory

B. prodigal:

    prodigal -i GCA_030765085.1_ASM3076508v1_genomic.fna -o GCA_030765085.1_ASM3076508v1.prodigal.protein.faa
    -------------------------------------
    PRODIGAL v2.6.3 [February, 2016]         
    Univ of Tenn / Oak Ridge National Lab
    Doug Hyatt, Loren Hauser, et al.     
    -------------------------------------
    Request:  Single Genome, Phase:  Training
    Reading in the sequence(s) to train...

    Warning:  Sequence is long (max 32000000 for training).
    Training on the first 32000000 bases.

    31990000 bp seq created, 41.21 pct GC
    Locating all potential starts and stops...1453037 nodes
    Looking for GC bias in different frames...frame bias scores: 1.60 0.44 0.96
    Building initial set of genes to train from...done!
    Creating coding model and scoring nodes...done!
    Examining upstream regions and training starts...done!
    -------------------------------------
    Request:  Single Genome, Phase:  Gene Finding
    Sequence too long (max 32000000 permitted).

C. ExPASy, but I can't use a website for this; I need a CLI tool.

D. GeneMark, but I don't know how to install it on a system where I don't have root permission.

The only thing that I can think of is to use prodigal on each of the 2099 individual contigs, and then combine them later.

What is the best approach to generate a proteome from an enormous genome when the proteome is not available from NCBI?

NCBI • 72 views

Updated 1 hour ago by Mensur Dlakic ★ 28k • written 5 hours ago by dec986 ▴ 380

> The only thing that I can think of is to use prodigal on each of the 2099 individual contigs, and then combine them later.

If it works, then go for it. You can combine more than one contig per input file, as long as the length stays below 32000000 for each input.

4 hours ago by GenoMax 147k
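The batching idea in the comment above could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the function name `split_fasta_batches`, the output prefix, and the `.batchNNNN.fna` naming are hypothetical, and the 32000000 default is simply the limit printed in the Prodigal log. It makes two passes over the FASTA with awk, first measuring each record and then copying records into batch files whose summed length stays at or below the limit. A record that is longer than the limit on its own still ends up alone in a batch and triggers a warning, so batching cannot rescue it.

```shell
#!/usr/bin/env bash
# Sketch: group FASTA records into batch files whose summed sequence length
# stays at or below a limit. Function name and file naming are hypothetical.
split_fasta_batches() {
    local fasta="$1" prefix="$2" limit="${3:-32000000}"
    awk -v limit="$limit" -v prefix="$prefix" '
        # Pass 1: record the length of every sequence, in file order.
        NR == FNR {
            if (/^>/) { n++; len[n] = 0 } else { len[n] += length($0) }
            next
        }
        # Pass 2: on each header, decide whether a new batch file is needed.
        /^>/ {
            i++
            if (i == 1 || total + len[i] > limit) {
                if (out != "") close(out)
                batch++
                total = 0
                out = sprintf("%s.batch%04d.fna", prefix, batch)
            }
            total += len[i]
            # Batching cannot help a record that is too long by itself.
            if (len[i] > limit)
                printf "warning: %s alone is %d bp (limit %d)\n", substr($1, 2), len[i], limit > "/dev/stderr"
        }
        # Copy every line (header or sequence) into the current batch file.
        { print > out }
    ' "$fasta" "$fasta"
}
```

A call such as `split_fasta_batches GCA_030765085.1_ASM3076508v1_genomic.fna onion` would write `onion.batch0001.fna`, `onion.batch0002.fna`, and so on. Two caveats if these are then fed to Prodigal: `-o` writes the gene coordinates, while `-a` is the option that writes protein translations, and the "Sequence too long" message in the log suggests that at least one sequence may exceed the limit by itself.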

score 0 · Answer 1 · 2024-11-21

Prodigal is for prokaryotic gene prediction, so it is the wrong tool for a eukaryotic genome such as onion.

> I attempted to make a blast database with makeblastdb, but it seems to have failed, as the genome is too large. So I figured that using a proteome is necessary.

I suggest you get a more recent version of makeblastdb (check yours with makeblastdb -version). It is warped logic, to say the least, to think that predicting a eukaryotic proteome will be a better choice than figuring out how to make a genome BLAST database.

> What is the best approach to generate a proteome from an enormous genome when the proteome is not available from NCBI?

This is a non-trivial task, even for people who know what they are doing. It is not something that can be done with a snap of the fingers: it requires knowledge, appropriate software, and resources. My advice is to focus on creating a BLAST database. You may have to specify a larger maximum file size with makeblastdb, something like -max_file_sz 5GB or -max_file_sz 10GB.
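As a rough sketch of the advice above, the version check and the database build could be wrapped like this. The function name `build_genome_db`, the default database name `onion_genome`, and the default size are all assumptions for illustration. The answer suggests 5GB or 10GB; the default here is 3GB only because some BLAST+ releases appear to reject values of 4GB or more, so pass whatever your installed version accepts as the third argument.

```shell
#!/usr/bin/env bash
# Sketch: build a nucleotide BLAST database from a genome FASTA.
# Function name, default database name and default size are hypothetical.
build_genome_db() {
    local fasta="$1" db="${2:-onion_genome}" max_sz="${3:-3GB}"
    # Report the installed BLAST+ version first, as the answer suggests.
    makeblastdb -version
    # -dbtype nucl builds a nucleotide database;
    # -max_file_sz sets the maximum size of each database volume file.
    makeblastdb -in "$fasta" -dbtype nucl -out "$db" -max_file_sz "$max_sz"
}
```

For example, `build_genome_db GCA_030765085.1_ASM3076508v1_genomic.fna onion_genome 10GB` would try the larger size from the answer. If the queries are proteins, the resulting nucleotide database can be searched with `tblastn`, which avoids the need for a predicted proteome.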