Question

Which software is suitable for protein prediction from whole eukaryotic genomes?

0

5.6 years ago

carina2817 ▴ 20

Hello,

I want to try programs for ortholog prediction (OrthoMCL, for example). These programs work with proteomes, but I have some genomes of eukaryotic organisms for which I don't have proteomic data. Could you please recommend some software to predict protein sequences from whole genomes?

protein prediction eukaryotic genome • 1.8k views

ADD COMMENT • link updated 5.6 years ago by gb ★ 2.2k • written 5.6 years ago by carina2817 ▴ 20

score 1 · Answer 1 · 2019-10-29

1

5.6 years ago

Juke34 9.2k

If you have the structural annotation (GFF, GTF) of these genomes, you can extract the proteomes from it. Otherwise you will have to perform the annotation yourself, and that is a different story: depending on the species and the data available, it can be quite complex. Here is a list of gene prediction tools.

ADD COMMENT • link 5.6 years ago by Juke34 9.2k

0

Hi, thank you for your answer. I am trying to understand how to get proteomes from GFF files (I found these files for some of my species). I read a post where someone asked how to do this (https://bioinformatics.stackexchange.com/questions/6865/can-a-gff-file-be-converted-to-a-fasta-file) and someone recommended gffread (http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread_ex). That page gives this example:

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

I think it is supposed to extract the sequences of all transcripts in the GFF file, using the genome as reference. If I understand correctly, I should get the sequences of the features named "protein", "transcript" or "CDS" in the GFF file. But when I used grep to look for these words ("protein", "transcript", "CDS" and "gene") I got no results, and when I looked at my GFF files, the only feature I saw was "region". So I guess these files won't be useful, right? Then I will have to get the annotation with Augustus (or similar) ...

ADD REPLY • link 5.6 years ago by carina2817 ▴ 20

0

You extracted the transcripts by using the -w option; you need to use the -y option for proteins.
Could you show a few lines of your transcripts.gtf file and a few lines of what you get as output?

ADD REPLY • link 5.6 years ago by Juke34 9.2k

0

The gff file looks like this:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM263302v1
#!genome-build-accession NCBI_Assembly:GCA_002633025.1
##sequence-region NMRB01000001.1 1 1576180
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=416868
NMRB01000001.1	Genbank	region	1	1576180	.	+	.	ID=id0;Dbxref=taxon:416868;collected-by=Miyuki Kanda;collection-date=2013-08-01;country=Japan: Okayama%2C Ushimado;dev-stage=adult;gbkey=Src;identified-by=Tadashi Akiyama;mol_type=genomic DNA;strain=Ushimado;tissue-type=Whole animal
##sequence-region NMRB01000002.1 1 1458336
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=416868
NMRB01000002.1	Genbank	region	1	1458336	.	+	.	ID=id1;Dbxref=taxon:416868;collected-by=Miyuki Kanda;collection-date=2013-08-01;country=Japan: Okayama%2C Ushimado;dev-stage=adult;gbkey=Src;identified-by=Tadashi Akiyama;mol_type=genomic DNA;strain=Ushimado;tissue-type=Whole animal

gffread -y proteins.fa -g ./GCA_002633025.1_ASM263302v1_genomic_Notospermus_geniculatus.fna GCA_002633025.1_ASM263302v1_genomic_Notospermus_geniculatus.gff

gffread -w transcripts.fa -g ./GCA_002633025.1_ASM263302v1_genomic_Notospermus_geniculatus.fna GCA_002633025.1_ASM263302v1_genomic_Notospermus_geniculatus.gff

proteins.fa and transcripts.fa are empty, so I guess there is no annotation for proteins in the file...

ADD REPLY • link 5.6 years ago by carina2817 ▴ 20

1

I think you are in the same situation as in this post: C: IGB won't open .gff file.
The GFF file does not contain any gene prediction features (gene, mRNA, exon, CDS, UTRs, etc.), only sequence/region descriptions.

To quickly check which feature types are present in your file (column 3), you can do:
awk '{if($0 !~ /^#/) print $3}' GCA_002633025.1_ASM263302v1_genomic_Notospermus_geniculatus.gff | sort -u

If you don't find a proper GFF/GTF annotation file, I'm afraid you will have to perform the annotation yourself.
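
The column 3 check can be extended into a small sketch that counts each feature type and reports whether the file holds any CDS features, which gffread needs in order to write protein sequences with -y. The function names feature_counts and has_cds are hypothetical helpers, not part of gffread or any other tool, and the sketch assumes a tab-delimited GFF3/GTF file.

```shell
#!/bin/sh
# Hypothetical helpers (not part of gffread); assume a tab-delimited GFF3/GTF file.

# Print each feature type (column 3) with its count, skipping comment lines.
feature_counts() {
    awk -F'\t' '!/^#/ && NF >= 3 { n[$3]++ }
                END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Exit status 0 only if the file contains at least one CDS feature.
has_cds() {
    awk -F'\t' '!/^#/ && $3 == "CDS" { found = 1; exit }
                END { exit !found }' "$1"
}
```

For example, after sourcing these functions, `feature_counts annotation.gff` lists the feature types, and `has_cds annotation.gff || echo "no CDS features, annotate the genome first"` flags a file like the one above that only describes regions (annotation.gff is a placeholder file name).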

ADD REPLY • link 5.6 years ago by Juke34 9.2k

score 1 · Answer 2 · 2019-10-29

1

5.6 years ago

gb ★ 2.2k

Here is a list of options:

https://en.wikipedia.org/wiki/List_of_gene_prediction_software

I have used Augustus before, and it was easy to use, though that depends on the organism.

ADD COMMENT • link 5.6 years ago by gb ★ 2.2k

0

Yes, Augustus is a good choice if an HMM model exists for a species that is not too diverged from the one you want to annotate.

ADD REPLY • link 5.6 years ago by Juke34 9.2k