We are sequencing an animal genome and the produced GFF file is version 2. However, I learned that GFF2 is now deprecated and GFF3 is a better choice. So I want a GFF3 file for the genome. Since I am not in charge of GFF production, I want to ask some questions here before submitting my request to the research group.
My questions are: (1) How is a GFF3 file produced? Is it difficult to get GFF3? Does it take much effort? (2) What software is available that supports GFF3? GMOD mentions that Apollo, Chado, CMap and GBrowse support GFF3. Software based on Chado, such as Artemis and ACT, also supports it. Any more?
conversion from another format using an existing software library (e.g. Bioperl's bp_genbank2gff3.pl utility)
writing your own code to parse suitable input data and write out GFF3
Is it difficult to convert GFF2 to GFF3? GMOD describes it as problematic. It should not be too difficult if you use appropriate input data and can write scripts to parse and rewrite text files.
Does it take much effort? Some of the fields are relatively easy to generate from other files (chromosome names, start/end positions); others are a little more difficult - for example, GFF3 should use accepted Sequence Ontology terms. A good starting point for input data is something like the UCSC genome browser MySQL tables.
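To give a feel for the "write your own code" route, here is a minimal Python sketch of writing GFF3 lines from already-parsed data. The function names (`escape_attr`, `gff3_line`), the source name `example` and the IDs are made up for illustration; the only things taken from the GFF3 specification are the nine tab-separated columns, the `##gff-version 3` header and the percent-encoding of reserved characters in column 9.

```python
# Characters with a reserved meaning in GFF3 column 9, plus tab/newline/percent.
_RESERVED = {
    ";": "%3B", "=": "%3D", "&": "%26", ",": "%2C",
    "%": "%25", "\t": "%09", "\n": "%0A", "\r": "%0D",
}


def escape_attr(value):
    """Percent-encode reserved characters in a single attribute value."""
    return "".join(_RESERVED.get(ch, ch) for ch in str(value))


def gff3_line(seqid, source, ftype, start, end,
              score=".", strand=".", phase=".", attributes=None):
    """Format one feature as a 9-column GFF3 line (1-based, inclusive coordinates)."""
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    attr_text = ";".join(
        f"{tag}={escape_attr(val)}" for tag, val in (attributes or {}).items()
    ) or "."
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      str(score), strand, str(phase), attr_text])


# Hypothetical gene with one transcript; "gene" and "mRNA" are Sequence Ontology terms.
lines = [
    "##gff-version 3",
    gff3_line("chr1", "example", "gene", 1000, 2000, strand="+",
              attributes={"ID": "gene0001", "Name": "demo; gene"}),
    gff3_line("chr1", "example", "mRNA", 1000, 2000, strand="+",
              attributes={"ID": "mrna0001", "Parent": "gene0001"}),
]
print("\n".join(lines))
```

The parent/child links (`ID` and `Parent`) are usually the part that needs the most thought, since they have to be reconstructed from whatever the input data provides.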
Since the genome is newly sequenced, the first method won't work. Probably the most urgent work is to annotate the genomic sequences through querying different databases and then parse the annotation files to extract suitable fields to build the GFF3 file. Refer to this post.
Written 13.8 years ago by Dejian; updated 5.3 years ago by Ram.
That's great! But the GenBank flat file format does not seem easy to produce locally. Are there any tools that make it easier to produce GenBank-format files?
Well, I'm not suggesting that you generate GenBank first, just that GenBank -> GFF is one way to make GFF3. Since you state that you already have GFF2, I'd suggest that is the sensible starting point.
Typically creating GFF3 is not that hard; several gene prediction programs will create it automatically, and MAKER, an easy to set up gene annotation pipeline (http://gmod.org/wiki/MAKER) will produce GFF3 for all of its outputs. Since you're starting a new genome, that is definitely something I would suggest investigating.
Two more items that might be of use:
The GFF3 specification, with several examples of what proper GFF3 should look like:
In terms of converting GFF2 to GFF3, it is problematic to solve in a general sense: it is hard to make a tool that will take any GFF2 and reliably convert it to GFF3, because of the crazy variability in what people call GFF2 (the specification was very loose). However, for a given GFF2 file, converting it to GFF3 can be fairly easy if you have even a relatively small amount of programming ability. For some common formats, it's even fairly easy to find converters. For example, I know of a converter that works quite well for JGI GFF2.
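As an illustration of the "for a given GFF2 file" case, here is a rough Python sketch. It assumes one particular dialect: nine tab-separated columns, with column 9 written as `tag "value";` pairs. The function names and the `id_tag`/`parent_tag` parameters are hypothetical, and which GFF2 tag should become `ID` or `Parent` has to be decided by looking at your own file. It is not a general converter.

```python
import re

# One 'tag "value";' or 'tag value;' pair from a GFF2 column 9 (assumed dialect).
_ATTR_RE = re.compile(r'\s*(\w+)\s+(?:"([^"]*)"|([^;"]+?))\s*(?:;|$)')

# Reserved characters in GFF3 column 9 are percent-encoded.
_ESCAPES = str.maketrans({";": "%3B", "=": "%3D", "&": "%26",
                          ",": "%2C", "%": "%25", "\t": "%09"})

# Tags defined by the GFF3 specification; other tags get a lowercase first
# letter because tags starting with an uppercase letter are reserved.
_GFF3_TAGS = {"ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
              "Note", "Dbxref", "Ontology_term", "Is_circular"}


def gff2_to_gff3_line(line, id_tag=None, parent_tag=None):
    """Rewrite column 9 of one GFF2 feature line in GFF3 tag=value style."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    attrs = {}
    for m in _ATTR_RE.finditer(cols[8] if len(cols) > 8 else ""):
        tag = m.group(1)
        value = m.group(2) if m.group(2) is not None else m.group(3)
        if tag == id_tag:
            tag = "ID"
        elif tag == parent_tag:
            tag = "Parent"
        elif tag not in _GFF3_TAGS:
            tag = tag[0].lower() + tag[1:]
        # Repeated tags become one comma-separated GFF3 attribute.
        attrs.setdefault(tag, []).append(value.translate(_ESCAPES))
    col9 = ";".join(f"{t}={','.join(v)}" for t, v in attrs.items()) or "."
    return "\t".join(cols[:8] + [col9])


def convert(lines, **kwargs):
    """Yield GFF3 lines; comment and blank lines from the input are dropped."""
    yield "##gff-version 3"
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield gff2_to_gff3_line(line, **kwargs)


gff2 = ['chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";']
print("\n".join(convert(gff2, parent_tag="transcript_id")))
```

Feature types in column 3 are passed through unchanged here; mapping them to Sequence Ontology terms, and creating parent features that the GFF2 file only implies, would be extra steps specific to each file.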
MAKER is really a good pipeline. But I'm not sure whether it can meet our needs, since it is described as ideal for smaller projects[1] while our genome is really large, around 7 Gb. Still, I'd like to look into it. I've downloaded the package.
I think the authors of MAKER were making the point there that it will work nicely for small projects as well. For larger projects, MAKER is still good, you'll probably just want to have a cluster to do the analysis rather than running it on a laptop.
You can find the information you need (just the how-to, not a software tool to do it) on GMOD's wiki too, at the end of the GFF2 format description.
Yes, I saw it. But "Converting a file from GFF2 to GFF3 format is problematic" and "GMOD does not endorse (or disparage) any particular converter." I just want to get a GFF3 version, not necessarily one converted from GFF2. Possibly the title is a bit misleading. I will change it. Thank you all the same.
The answer also depends on just how much and what type of information is contained in your files.
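A quick way to see what is actually in an existing file is to tally the feature types and attribute tags. The sketch below is only a rough inventory: the function name `summarize_gff` is made up, and splitting column 9 on `;` will miscount if quoted values contain semicolons.

```python
import re
from collections import Counter


def summarize_gff(lines):
    """Count feature types (column 3) and attribute tags (column 9)."""
    types, tags = Counter(), Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a feature line
        types[cols[2]] += 1
        if len(cols) > 8:
            for chunk in cols[8].split(";"):
                m = re.match(r"\s*(\w+)", chunk)
                if m:
                    tags[m.group(1)] += 1
    return types, tags


# Made-up example lines; in practice pass an open file handle instead.
example = [
    "##gff-version 2",
    'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "g1";',
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]
types, tags = summarize_gff(example)
print(types.most_common())
print(tags.most_common())
```

If the type column already holds terms such as gene, mRNA, exon and CDS, and the tags identify genes and transcripts consistently, the conversion is mostly a matter of rewriting column 9.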
Perhaps edit the question to make it clearer; "How is GFF3 file produced?" is not the same as "how do I make GFF3 from the data that I have."