I have GFF3 files (PASA EST mappings, exonerate protein2genome mappings) and would like to use them as an Augustus training set. The problem is that Augustus requires GenBank format. Before I start writing my own converter: is there an existing one that anyone can recommend?
Many thanks, you saved my day. Works like a charm [Python 2.6.4 + Biopython 1.55] for producing a GenBank file. It did not work with Biopython 1.53. I will check how Augustus likes it and keep you posted.
PASA maps ESTs to a whole genome (many contigs). One can limit the size of the file by selecting as FASTA input only those contigs that have a PASA match. awk/uniq + Biopython did the work for me. A further step would be to create mini-genes containing the matches plus some flanking sequence, but that may or may not be desirable for Augustus training.
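For anyone who wants to do the contig filtering without awk, here is a minimal sketch in plain Python 3 (standard library only, so it is not the awk/uniq + Biopython pipeline I actually used). The function names contigs_with_matches and filter_fasta are made up for this example, and it assumes the contig ID is the first word of each FASTA header and matches column 1 of the GFF3 file.

```python
import io

def contigs_with_matches(gff_handle):
    """Collect sequence IDs (GFF3 column 1) that carry at least one feature."""
    ids = set()
    for line in gff_handle:
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        ids.add(line.split("\t", 1)[0])
    return ids

def filter_fasta(fasta_handle, keep_ids, out_handle):
    """Copy only the FASTA records whose ID (first header word) is in keep_ids."""
    keep = False
    for line in fasta_handle:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            out_handle.write(line)

# Tiny in-memory example; real use would pass open file handles instead.
gff = io.StringIO(
    "##gff-version 3\n"
    "contig2\tPASA\tcDNA_match\t10\t50\t.\t+\t.\tID=m1\n"
)
fasta = io.StringIO(">contig1 some description\nACGT\n>contig2\nGGCC\nTTAA\n")
out = io.StringIO()
filter_fasta(fasta, contigs_with_matches(gff), out)
filtered = out.getvalue()
print(filtered)
```

Both functions stream line by line, so memory use depends on the number of distinct contig IDs rather than on the genome size.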
Awesome, glad that worked. Yes, this will require a recent Biopython as the code passes filenames directly to SeqIO; it's a good idea to update anyway to get the latest fixes. Limiting the GFF and FASTA files is the right way to go. This is a memory-hungry implementation. With some care and GFF files organized by record names, you could build an iterated version that works one record at a time.
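To illustrate the one-record-at-a-time idea, here is a rough sketch using only the Python 3 standard library. It is not the code from the script above; iter_features_by_record is a hypothetical helper, and it assumes the GFF file is organized so that all lines for a given record name are adjacent.

```python
import io
from itertools import groupby

def iter_features_by_record(gff_handle):
    """Yield (seqid, rows) for one record at a time.

    Assumes all lines sharing a sequence ID (column 1) are contiguous.
    Each row is the list of tab-separated columns of one feature line.
    """
    rows = (
        line.rstrip("\n").split("\t")
        for line in gff_handle
        if line.strip() and not line.startswith("#")
    )
    for seqid, group in groupby(rows, key=lambda cols: cols[0]):
        yield seqid, list(group)

# Tiny in-memory example; real use would pass an open file handle.
gff = io.StringIO(
    "##gff-version 3\n"
    "contig1\tPASA\tcDNA_match\t10\t50\t.\t+\t.\tID=m1\n"
    "contig1\tPASA\tcDNA_match\t80\t120\t.\t+\t.\tID=m1\n"
    "contig2\tPASA\tcDNA_match\t5\t40\t.\t-\t.\tID=m2\n"
)
counts = [(seqid, len(rows)) for seqid, rows in iter_features_by_record(gff)]
print(counts)
```

Only the features of the current record are held in memory. If the file is not grouped by record name, the same ID will be yielded more than once, so sort the GFF by column 1 first.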
AUGUSTUS complains about missing 'source' info in the GenBank file. Fixed the numbers using Biopython. Next I had to replace SOURCE with source (Augustus bug); finally, one has to have 'CDS' or 'mRNA' features for training. Replacing 'cDNA_match' with 'mRNA' does not fix it. Apparently it requires a join(1083..1379,1503..1595,1865..1930) entry.
Darek -- yes, it will just translate names directly from the original GFF file, so you may have to adjust if they don't match what Augustus wants. It should build join entries if the GFF has nested Parent/Child features. If you post a GFF and fasta example that should help to see what is going on.
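To make the Parent/Child point concrete, here is a small standalone sketch (plain Python 3, not the converter's actual code) of how child features that share a Parent can be collapsed into a GenBank-style join location. The helper names and the mRNA1 identifier are invented for the example, the coordinates are the ones quoted above, and it simplifies by assuming each child has a single Parent value.

```python
from collections import defaultdict

def parse_attributes(column9):
    """Parse 'key=value;key=value' GFF3 attributes into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def genbank_location(segments, strand):
    """Format 1-based inclusive (start, end) pairs as a GenBank location."""
    parts = ["%d..%d" % (start, end) for start, end in sorted(segments)]
    loc = parts[0] if len(parts) == 1 else "join(%s)" % ",".join(parts)
    return "complement(%s)" % loc if strand == "-" else loc

def locations_by_parent(gff_lines, child_type="CDS"):
    """Group child features of one type by Parent and build their locations."""
    segments = defaultdict(list)
    strands = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != child_type:
            continue
        parent = parse_attributes(cols[8]).get("Parent")
        if parent is None:
            continue  # flat feature with no parent: nothing to join
        segments[parent].append((int(cols[3]), int(cols[4])))
        strands[parent] = cols[6]
    return {p: genbank_location(s, strands[p]) for p, s in segments.items()}

example = [
    "contig1\texample\tmRNA\t1083\t1930\t.\t+\t.\tID=mRNA1",
    "contig1\texample\tCDS\t1083\t1379\t.\t+\t0\tParent=mRNA1",
    "contig1\texample\tCDS\t1503\t1595\t.\t+\t0\tParent=mRNA1",
    "contig1\texample\tCDS\t1865\t1930\t.\t+\t0\tParent=mRNA1",
]
locations = locations_by_parent(example)
print(locations)
```

A flat list of cDNA_match lines without Parent attributes gives this grouping nothing to work with, which would explain why simply renaming the feature type does not produce a join entry.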
Sorry for not answering sooner: viruses got me. The problem is that at this stage all sequence data is still confidential, so I will rerun PASA with some known species next week. While there is a working script in AUGUSTUS (see gvj's answer), there is still a place for a Python solution.
I think this would be a good script to put on the example GFF parsing page for Biopython: http://biopython.org/wiki/GFF_Parsing
Zach, a generalized version of this script is included with the GFF library (https://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/gff_to_genbank.py). If you'd like to add details to the wiki documentation, that would be very welcome. Thanks so much for the feedback.
AUGUSTUS needs a non-protein-coding flanking region around the gene in the GenBank file. I don't see this being produced in the Python script.
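One possible way to add flanks is to cut a window around each gene and shift the feature coordinates into it. The sketch below (plain Python 3) only does the coordinate arithmetic; flanked_window and shift_into_window are hypothetical helpers, the 1000 bp default is an arbitrary assumption rather than an AUGUSTUS recommendation, and nothing here checks that the flank is free of neighboring genes.

```python
def flanked_window(gene_start, gene_end, contig_length, flank=1000):
    """Return the 1-based inclusive (start, end) of a gene plus flanks.

    The window is clipped to the contig, so a gene near a contig end
    gets a shorter flank on that side. The flank size is an assumption.
    """
    if not 1 <= gene_start <= gene_end <= contig_length:
        raise ValueError("gene coordinates must lie within the contig")
    start = max(1, gene_start - flank)
    end = min(contig_length, gene_end + flank)
    return start, end

def shift_into_window(position, window_start):
    """Convert a contig coordinate to a coordinate inside the window."""
    return position - window_start + 1

# Gene spanning 1083..1930 on a hypothetical 2500 bp contig.
window = flanked_window(1083, 1930, 2500, flank=1000)
new_start = shift_into_window(1083, window[0])
new_end = shift_into_window(1930, window[0])
print(window, new_start, new_end)
```

With Python slicing, the window sequence would be contig_seq[window[0] - 1:window[1]], and every exon coordinate in the record needs the same shift before it is written out.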