Entering edit mode
6.5 years ago
rororo
▴
10
I want to predict genes from my organism of interest using augustus
. I therefore have my assembly, as well as a gff3 file.
I want to train augustus using my dataset using etraining
. I therefore need a genbank file.
I found the following script here to convert gff+fasta to genbank, since the provided gff2Smallgb.pl
from augustus does not work:
#!/usr/bin/env python
"""Convert a GFF and associated FASTA file into GenBank format.
Usage:
gff_to_genbank.py <GFF annotation file> <FASTA sequence file>
"""
from __future__ import print_function
import sys
import os
from Bio import SeqIO
from Bio.Alphabet import generic_dna
from Bio import Seq
from BCBio import GFF
def main(gff_file, fasta_file):
out_file = "%s.gb" % os.path.splitext(gff_file)[0]
fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna))
gff_iter = GFF.parse(gff_file, fasta_input)
SeqIO.write(_check_gff(_fix_ncbi_id(gff_iter)), out_file, "genbank")
def _fix_ncbi_id(fasta_iter):
"""GenBank identifiers can only be 16 characters; try to shorten NCBI.
"""
for rec in fasta_iter:
if len(rec.name) > 16 and rec.name.find("|") > 0:
new_id = [x for x in rec.name.split("|") if x][-1]
print("Warning: shortening NCBI name %s to %s" % (rec.id, new_id))
rec.id = new_id
rec.name = new_id
yield rec
def _check_gff(gff_iterator):
"""Check GFF files before feeding to SeqIO to be sure they have sequences.
"""
for rec in gff_iterator:
if isinstance(rec.seq, Seq.UnknownSeq):
print("Warning: FASTA sequence not found for '%s' in GFF file" % (
rec.id))
rec.seq.alphabet = generic_dna
yield _flatten_features(rec)
def _flatten_features(rec):
"""Make sub_features in an input rec flat for output.
GenBank does not handle nested features, so we want to make
everything top level.
"""
out = []
for f in rec.features:
cur = [f]
while len(cur) > 0:
nextf = []
for curf in cur:
out.append(curf)
if len(curf.sub_features) > 0:
nextf.extend(curf.sub_features)
cur = nextf
rec.features = out
return rec
if __name__ == "__main__":
main(*sys.argv[1:])
Unfortunately, the feature source
is not printed, so that augustus etraining
logs the following error:
GBProcessor::getGeneList(): Sequence has 0 length. Maybe 'source' Feature missing?
A model genbank file is available here: NCBI genbank format
Isn't the
source
field originating in GFF file? GFF format says this:So try something you like as source name in your GFF.
source
in theFEATURES
tableThen you will need to create that yourself. If you look in the model genbank page you linked above they provide some indication of what can go in the SOURCE section.
I have 25,000 entries. So I need an automatic way!
If you wrote the script above then you should be able to do that, correct?
sorry, I clearified the source!
seqret can be used to convert gff+fasta to Genbank?
Another alternative would be to open fasta file in Artemis -> Read GFF3 as an entry within it and save the whole thing as genbank file.
thanks for your help. Unfortunately, seqret does not stop converting - the file just gets bigger and bigger. Artemis freezes when opening the gff-file
Gff3 + Fasta To Genbank (Augustus Training Set)
Script OP has referenced to came from a link in an answer from this thread.
I see! Thanks genomax. @rororo have you tried the conversion file provided with Augustus as mentioned in the thread above? https://github.com/nextgenusfs/augustus/blob/master/scripts/gff2gbSmallDNA.pl
Not sure if that is the same script but this is in the original thread.
I think the original script is: https://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/gff_to_genbank.py , not sure if OP has tried the script provided with Augustus.
this is the script I posted above
If you have 25,000 entries then isn't that expected? You need to be patient at times. It may take some time to complete the conversion.
as soon as the 25,000 entries are read and returned, seqret starts from the beginning and returns the same sequences again and again