Question

Gff3 + Fasta To Genbank (Augustus Training Set)

12

Entering edit mode

14.6 years ago

Darked89 4.7k

I got gff3 files (PASA ESTs mapping, exonerate protein2genome mapping) and would like to use it as an Augustus training set. Problem is, Augustus requires Genbank format. Before starting to write my own converter: are there any available already one can recommend?

gff genbank conversion • 34k views

ADD COMMENT • link updated 14.6 years ago by Gvj ▴ 470 • written 14.6 years ago by Darked89 4.7k

Ram · Answer 1 · 2010-09-16

9

Entering edit mode

14.6 years ago

Brad Chapman 9.7k

This example script uses Biopython and the in-development python GFF parser. This is essentially the same approach as BioPerl: GFF + Fasta -> generic SeqFeatures -> GenBank. Depending on how nicely the GFF is formatted, this should at least give you a close representation that can be post-processed to exactly what you need:

"""Convert a GFF and associated FASTA file into GenBank format.

Usage:
gff_to_genbank.py <GFF annotation file> <FASTA sequence file>
"""
import sys
import os

from Bio import SeqIO
from Bio.Alphabet import generic_dna
from BCBio import GFF

def main(gff_file, fasta_file):
    out_file = "%s.gb" % os.path.splitext(gff_file)[0]
    fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna))
    gff_iter = GFF.parse(gff_file, fasta_input)
    SeqIO.write(gff_iter, out_file, "genbank")

if __name__ == "__main__":
    main(*sys.argv[1:])

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.6 years ago by Brad Chapman 9.7k

0

Entering edit mode

Many thanks, you saved my day. Works like a charm [Python 2.6.4 + Biopython 1.55] for producing Genbank file. Did not work with Biopython 1.53. I will check how Augustus likes it and keep you posted.

ADD REPLY • link 14.6 years ago by Darked89 4.7k

0

Entering edit mode

PASA maps ESTs to a whole genome (many contigs). One can limit the size of the file by selecting as fasta input only these contigs to which there is a PASA match. awk/uniq + biopython did the work for me. Going further would be to create some mini-genes containing the matches + some flanks, but that may or may not be desired for Augustus training.

ADD REPLY • link 14.6 years ago by Darked89 4.7k

0

Entering edit mode

Awesome, glad that worked. Yes, this will require a recent Biopython as the code passes filenames directly to SeqIO; it's a good idea to update anyways to get the latest fixes. Limiting the GFF and fasta files is the right way to go. This is a memory hungry implementation. With some care and GFF files organized by record names, you could build an iterated version that works one record at a time.

ADD REPLY • link 14.6 years ago by Brad Chapman 9.7k

0

Entering edit mode

AUGUSTUS complains about no 'source' info in Genbank file. Fixed the numbers using biopython. Next had to replace SOURCE with source (Augustus bug), finally one has to have 'CDS' or 'mRNA' for training. Replacing 'cDNA_match' with 'mRNA' does not fix it. Apparently it requires join( (1083..1379, 1503..1595,1865..1930) entry.

ADD REPLY • link 14.6 years ago by Darked89 4.7k

0

Entering edit mode

Darek -- yes, it will just translate names directly from the original GFF file, so you may have to adjust if they don't match what Augustus wants. It should build join entries if the GFF has nested Parent/Child features. If you post a GFF and fasta example that should help to see what is going on.

ADD REPLY • link 14.6 years ago by Brad Chapman 9.7k

0

Entering edit mode

Sorry for not answering sooner: viruses got me. Problem is, at this stage all sequence data is still confidential, so I will rerun PASA with some known species next week. While there is a working script in AUGUSTUS (see gvj answer), there is still a place for python solution.

ADD REPLY • link 14.6 years ago by Darked89 4.7k

0

Entering edit mode

i think this would be a good script to put on the example gff parse page for biopython: http://biopython.org/wiki/GFF_Parsing

ADD REPLY • link 13.1 years ago by Zach Powers ▴ 340

0

Entering edit mode

Zach, a generalized version of this script is included with the GFF library: https://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/gff_to_genbank.py If you'd like to add details to the wiki documentation that would be very welcome. Thanks so much for the feedback.

ADD REPLY • link 13.1 years ago by Brad Chapman 9.7k

0

Entering edit mode

AUGUSTUS needs a non-protein-coding flanking region around the gene in the genbank file. I don't see this being produced in the python script.

ADD REPLY • link 12.6 years ago by katharina.hoff ▴ 70

score 4 · Answer 2 · 2010-09-23

4

Entering edit mode

14.6 years ago

Gvj ▴ 470

Augustus comes with a script for this gff2gbSmallDNA.pl). you can find it under ~Augustus/script/

ADD COMMENT • link 14.6 years ago by Gvj ▴ 470

Ram · Answer 3 · 2010-09-16

I can't suggest a finished solution, but Bioperl has some components that may help.

There is a GFF3 parser in Bio::Tools::GFF that will allow creation of Bio::SeqFeatureI from GFF3. You'd still need to map the SO terms in the GFF to Genbank.

Bio::SeqFeature::Tools::TypeMapper provides a mapper the other way (Genbank to GFF3), but it can be given a custom mapping. You can probably invert its own mapping and feed it back to it, to make a mapper that goes from SO terms to Genbank feature types. Afterwards, you can use Bio::SeqIO to output Genbank.

score 3 · Answer 4 · 2010-09-16

3

Entering edit mode

14.6 years ago

Chase Miller ▴ 410

Here's an attempt at the solution from the BioPerl mailing list

http://bioperl.org/pipermail/bioperl-l/2005-February/018208.html

ADD COMMENT • link 14.6 years ago by Chase Miller ▴ 410