Question

Error While Running Cegma (Geneid-Train Step)

11.2 years ago

arnstrm ★ 1.9k

I am trying to run CEGMA on a newly assembled genome (scaffolds) and I am having trouble getting past the geneid step. I ran CEGMA with default parameters, as cegma --ext -g genome.scf.fasta. The pipeline ran for about 10 hours (32 procs, 256 GB RAM) and exited with this error: CEGMA, geneid error: geneid-train did not work properly. When I investigated, I found that the failure was in the geneid-train step, so I tried to run it manually:

$ geneid-train -v local.geneid.selected.gff local.geneid.selected.dna geneid_params
DATA COLLECTED: 298 Coding sequences containing 1311 introns
Intron model
Coding model 
Use of uninitialized value in numeric eq (==) at /data004/software/GIF/packages/cegma/2.4.010312/lib/geneid.pm line 264.
some values in Markov model with zero counts, use pseudocounts at /data004/software/GIF/packages/cegma/2.4.010312/lib/geneid.pm line 270.

Does anybody have any experience with geneid? How can I get past this step? My genome's estimated size is 745 MB and the assembly has about 574K scaffolds (643 MB total length), N50=607. Any input will be greatly appreciated!

Thanks

training prediction • 4.2k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 11.2 years ago by arnstrm ★ 1.9k

Your N50 is 607 bp?!? Or did you mean 607 Kbp? If it really is 607 bp, then you should probably not be doing anything else with that genome assembly.
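For anyone wanting to double-check a reported N50 before going further, here is a minimal Python sketch of the usual definition (the length L such that sequences of length L or longer cover at least half of the total assembly). The function name n50 and the toy lengths are made up for illustration; this is not part of CEGMA or geneid.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


# Toy scaffold lengths (hypothetical values, in bp).
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))  # prints 400
```

On a real assembly, the lengths would come from the scaffold FASTA file; a value in the hundreds of bp means most scaffolds are shorter than a typical gene.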

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 10.1 years ago by keith ▴ 130

LOL, no! That was the preliminary assembly. It was improved later and the numbers are much better now.

Still a work in progress:

ADD REPLY • link 10.1 years ago by arnstrm ★ 1.9k

11.1 years ago

keith ▴ 130

This hasn't actually been resolved yet... I'm still working out what might be going on.

ADD COMMENT • link 11.1 years ago by keith ▴ 130

10.0 years ago

Francisco Camara • 0

Hi Keith and arnstrm,

I am not sure whether this issue has been sorted out, but I can try to help if it hasn't. The CEGMA pipeline was developed by a group other than ours. However, it uses geneid, which is an ab initio gene prediction tool developed in our group.

I have used CEGMA quite extensively to determine the quality of the protein-coding "gene-space" of different genomes, but I don't think I have run into this sort of problem...

It sounds like some of your input FASTA files may have no sequence content, but I am not sure. Could you give me access to the intermediate files, including the geneid parameter file (self.param), so that I could try to figure out what's wrong?
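As a quick way to test the "no sequence content" guess, a short Python sketch like the following could list records that have a header but no sequence. It assumes the intermediate file (for example local.geneid.selected.dna from the command above) is FASTA-formatted; the function name find_empty_records is hypothetical and not part of CEGMA or geneid.

```python
import io


def find_empty_records(handle):
    """Yield the header of every FASTA record that has no sequence characters."""
    header, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: report the previous one if it was empty.
            if header is not None and length == 0:
                yield header
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    # Do not forget the last record in the file.
    if header is not None and length == 0:
        yield header


# Small in-memory example; for the real check, pass an open file instead,
# e.g. open("local.geneid.selected.dna") (assumed to be FASTA).
example = io.StringIO(">seq1\nACGT\n>seq2\n\n>seq3\nGG\n>seq4\n")
empty = list(find_empty_records(example))
print(empty)  # prints ['seq2', 'seq4']
```

An empty result would not rule out other problems (for instance, sequences made up only of N), so this only covers the simplest case.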

I assume you have the latest version of geneid installed on the system (1.4.4; check here: http://genome.crg.es/software/geneid/index.html) and/or your CEGMA installation is pointing to it.

Thanks,
Francisco Camara

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.0 years ago by Francisco Camara • 0

Thanks, Francisco! The issue has been resolved in version >2.5 (see this post for details). Keith (as in Keith Bradnam) is one of the CEGMA developers!

EDIT: I didn't realize I was talking to a geneid developer as well. I am so happy now!

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.0 years ago by arnstrm ★ 1.9k

score 1 · Accepted Answer · 2014-05-02

11.1 years ago

arnstrm ★ 1.9k

This has been resolved in this post: "Trouble running CEGMA on the sample dataset"!

ADD COMMENT • link 11.1 years ago by arnstrm ★ 1.9k