Hello,
I want to try programs for ortholog prediction (OrthoMCL, for example). These programs work with proteomes, but I have some genomes of eukaryotic organisms for which I don't have proteomic data. Could you please recommend some software to predict protein sequences from whole genomes?
Hi, thank you for your answer. I am trying to understand how to get proteomes using GFF files (I found these files for some of my species). I read a post where someone was asking how to do this (https://bioinformatics.stackexchange.com/questions/6865/can-a-gff-file-be-converted-to-a-fasta-file) and someone recommended gffread (http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread_ex). On that page there is this example:
I think it is supposed to get the sequences of all transcripts in the GFF file using the genome as reference. If I understand correctly, I should get the sequences of the features named "protein", "transcript" or "CDS" in the GFF file. However, I used grep to look for these words ("protein", "transcript", "CDS" and "gene") and got no results. I looked at my GFF files and all I see is "region" as the feature type. So I guess these files won't be useful, right? Then I have to produce the annotation with Augustus (or similar) ...
You have extracted the transcripts using the -w option; you need to use the -y option for proteins. Could you show a few lines of your transcripts.gtf file and a few lines of what you get as output?
The gff file looks like this:
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM263302v1
#!genome-build-accession NCBI_Assembly:GCA_002633025.1
##sequence-region NMRB01000001.1 1 1576180
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=416868
NMRB01000001.1	Genbank	region	1	1576180	.	+	.	ID=id0;Dbxref=taxon:416868;collected-by=Miyuki Kanda;collection-date=2013-08-01;country=Japan: Okayama%2C Ushimado;dev-stage=adult;gbkey=Src;identified-by=Tadashi Akiyama;mol_type=genomic DNA;strain=Ushimado;tissue-type=Whole animal
##sequence-region NMRB01000002.1 1 1458336
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=416868
NMRB01000002.1	Genbank	region	1	1458336	.	+	.	ID=id1;Dbxref=taxon:416868;collected-by=Miyuki Kanda;collection-date=2013-08-01;country=Japan: Okayama%2C Ushimado;dev-stage=adult;gbkey=Src;identified-by=Tadashi Akiyama;mol_type=genomic DNA;strain=Ushimado;tissue-type=Whole animal
proteins.fa and transcripts.fa are empty, so I guess there is no annotation for proteins in the file...
I think you are in the same situation as in this post: "IGB won't open .gff file".
The GFF file does not contain any prediction features (gene, mRNA, exon, CDS, UTRs, etc.), only sequence/region descriptions.
To quickly check which feature types are present in your file (column 3), you can run:
awk '{if($0 !~ /^#/) print $3}' GCA_002633025.1_ASM263302v1_genomic_Notospermus_geniculatus.gff | sort -u
If you don't find a proper GFF/GTF annotation file, I'm afraid you will have to perform the annotation yourself.
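As a rough sketch of how the check and the gffread step discussed in this thread could be combined, the shell functions below count feature types on tab-separated columns and only call gffread -y when at least one CDS feature exists. The function names (count_features, has_cds, extract_proteins) and any file names passed to them are placeholders of mine, not part of gffread or of the original posts; only the -y and -g options come from gffread itself.

```shell
#!/bin/sh
# Sketch only: the function names and the file names passed in are placeholders.

# Print every feature type (column 3) with its count, splitting on tabs so
# that spaces inside the attributes column cannot shift the fields.
count_features() {
    awk -F'\t' '!/^#/ && NF >= 3 { n[$3]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Succeed (exit status 0) only if at least one CDS feature is present.
has_cds() {
    awk -F'\t' '!/^#/ && $3 == "CDS" { found = 1; exit } END { exit !found }' "$1"
}

# Run gffread -y only when there is something to translate.
# Arguments: annotation file, genome FASTA, output protein FASTA.
extract_proteins() {
    if has_cds "$1"; then
        gffread -y "$3" -g "$2" "$1"
    else
        echo "no CDS features in $1: annotate the genome first" >&2
        return 1
    fi
}
```

On a file like the one shown above, count_features should list only the "region" type and extract_proteins should stop with the message instead of writing an empty proteins.fa.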