Hi everyone,
I'm a complete beginner and I'm wondering how to get predicted protein sequences from a genome.
For example,
Here is a GenBank assembly of a genome (quite small), and there are no protein sequences available to download (e.g. XXX.pep.faa). So I tried to predict the protein sequences myself using software like EVM. But when I downloaded the gff3 file for this genome, it didn't look like a standard gff3 file. Instead of seeing CDS or exon information, I see a lot of 'Genbank: URL....'.
Some people have suggested that I predict the ORFs first and then translate them using the standard codon table.
Do you have any idea about this? Thanks for your time!
I've been looking at those GFF files and it seems they do not have information about proteins. For bacteria, which have no introns, you can predict ORFs, and that's a nice starting point. For this organism (a myxozoan), I would first determine whether it contains introns or not. A blastx of these contigs against known proteins will help you answer this, and will also help you identify proteins.
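To illustrate the ORF approach for the intron-free case, here is a minimal Python sketch that scans all six reading frames and translates with the standard codon table. The function names (`find_orfs`, `reverse_complement`) and the `min_aa` cutoff are made up for this example, and it only reports ORFs that start with ATG and end at a stop codon. It will not give correct proteins for genes that are interrupted by introns.

```python
from itertools import product

# Standard genetic code, built in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_orfs(seq, min_aa=30):
    """Hypothetical helper: list ATG-to-stop ORFs in all six frames.

    Returns tuples (strand, frame, start, end, protein). start and end are
    0-based, end-exclusive positions on the scanned strand, stop codon included.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start, protein = None, []
            for i in range(frame, len(s) - 2, 3):
                aa = CODON_TABLE.get(s[i:i + 3], "X")  # X for ambiguous codons
                if start is None:
                    if aa == "M":  # open a new ORF at the first ATG
                        start, protein = i, ["M"]
                elif aa == "*":  # stop codon closes the ORF
                    if len(protein) >= min_aa:
                        orfs.append((strand, frame, start, i + 3, "".join(protein)))
                    start = None
                else:
                    protein.append(aa)
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC", min_aa=3):
        print(orf)
```

For a real genome you would read each contig from the FASTA file and raise `min_aa` to something like 100 to cut down on spurious short ORFs. That threshold is a judgment call, not a rule.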
Thanks aba! I think I've got your idea!
To check whether introns exist in this myxozoan genome (a eukaryote), I selected some sequences from it and ran blastx.
I did get some hits, like:
Range 1: 320 to 411
Range 2: 283 to 313
.....
So I found that some of them contain several introns. (This makes sense because it is a eukaryote.)
So as the next step in predicting the proteins of the genome, I will blastx the genome against known databases (e.g. Nr, Swiss-Prot).
This way I'll get some blast-based proteins, but what should I do with the remaining unmatched sequences?
For transcriptomes, software like ESTScan or TransDecoder can solve this; what about genomes?
I would try some gene prediction software. Look for one that uses protein evidence (blastx or similar). I have no experience with this, so I cannot recommend one in particular, but that's what should be done in this case.
I've seen in NCBI's taxonomy that there are already 15,204 proteins characterised for myxozoans (look for myxozoa and click on "Protein"). These sequences would be the most valuable, but they may not cover all your protein-coding genes.
BTW, you focused on "Subject" coordinates, but I think you wanted to look at "Query" coordinates (your query DNA fragments). Yes, Myxozoans have introns.
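To make the Query vs. Subject point concrete: Subject coordinates are positions in the matched protein, while Query coordinates are positions on your DNA, so a possible intron shows up as a gap between two hits on the Query side. Below is a rough Python sketch that reads blastx results in the default 12-column tabular format (`-outfmt 6`) and lists such gaps. The function name `candidate_introns`, the `min_gap` cutoff and the example rows are invented for illustration. A gap between hits is only a hint, since it can also be a poorly conserved stretch of the same exon.

```python
from collections import defaultdict


def candidate_introns(tabular_text, min_gap=20):
    """Hypothetical helper: report gaps between blastx HSPs on the query.

    Expects the default -outfmt 6 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    Returns tuples (query, subject, strand, gap_start, gap_end) in 1-based
    query (DNA) coordinates.
    """
    hsps = defaultdict(list)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        qstart, qend, sstart, send = (int(x) for x in f[6:10])
        # In blastx output, qstart > qend means the hit is on the minus strand.
        strand = "+" if qstart <= qend else "-"
        hsps[(f[0], f[1], strand)].append(
            (min(qstart, qend), max(qstart, qend), sstart, send)
        )

    gaps = []
    for (query, subject, strand), hits in hsps.items():
        hits.sort()  # order HSPs along the query
        for a, b in zip(hits, hits[1:]):
            gap = b[0] - a[1] - 1  # unaligned query bases between the two HSPs
            if gap >= min_gap:
                gaps.append((query, subject, strand, a[1] + 1, b[0] - 1))
    return gaps


if __name__ == "__main__":
    # Made-up example rows, not real BLAST output.
    example = (
        "contig1\tprotA\t80.0\t30\t6\t0\t100\t189\t283\t312\t1e-10\t60.0\n"
        "contig1\tprotA\t75.0\t90\t20\t0\t400\t669\t320\t409\t1e-30\t150.0\n"
    )
    print(candidate_introns(example))
```

If the two hits cover neighbouring parts of the protein (Subject) but sit far apart on the contig (Query), that is the pattern you would expect from an intron.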