Question

Converting assembled genomes (.fna) and predicted genes (.faa) to genbank or gff file?

6.0 years ago

bioguy ▴ 50

Does anyone know of a simple command-line tool that can take 1) assembled bacterial genomes (FASTA contigs) and 2) predicted gene sequences (FASTA ORFs) from those genomes, and generate either GFF or GenBank formatted files?

genome annotation genbank gff microbiology • 4.6k views

ADD COMMENT • link updated 5.8 years ago by Biostar 20 • written 6.0 years ago by bioguy ▴ 50

What you're asking for isn't really a 'conversion'. What you've described is annotation, which is an analysis task all of its own, unless you know definitively all your genes have specific matches to the genome of interest.

One of the best options (IMO) is prokka. If you have a list of proteins you already trust, you can pass that file to prokka and it will begin annotation from there.
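As a rough illustration of passing trusted proteins to prokka, here is a minimal Python sketch that only builds the command line. `--proteins`, `--outdir`, `--prefix` and `--cpus` are prokka options; the helper name `prokka_command` and all file names are hypothetical placeholders, not anything from this thread.

```python
import shlex


def prokka_command(contigs, trusted_proteins, outdir, prefix, cpus=1):
    """Build (but do not run) a prokka call that tries the trusted proteins first."""
    return [
        "prokka",
        "--outdir", outdir,              # directory prokka writes its outputs to
        "--prefix", prefix,              # basename for the output files
        "--proteins", trusted_proteins,  # FASTA of proteins to use as first priority
        "--cpus", str(cpus),
        contigs,                         # assembled contigs (FASTA) go last
    ]


# Hypothetical file names, for illustration only.
cmd = prokka_command("genome1.fna", "genome1.faa", "genome1_annot", "genome1", cpus=4)
print(shlex.join(cmd))
```

To actually run it, the list could be handed to `subprocess.run(cmd, check=True)`, assuming prokka is installed and on the PATH.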

ADD REPLY • link 6.0 years ago by Joe 22k

Right so that's kind of the issue – this output came from my own fork of prokka where I stripped out some of the extra stuff that, at the time, I didn't need. On top of that, for space, I had to delete gff files (I know that was a mistake, it's a long story).

Anyway, I now have a little side project that requires gff or gbk inputs, and because I'm a bit too lazy to rerun prokka a few hundred thousand times, I'm looking for a way to turn what I have into what I need.

Sounds like I'll probably just need to redo it, though, if the predicted genes and assembled contigs can't be turned into a gff (which would make sense, as I guess getting gene coordinates would be nigh impossible).

ADD REPLY • link 6.0 years ago by bioguy ▴ 50

If you can provide protein sequences that cover the majority of the CDSs prokka detects, it should run quite quickly. It normally iterates over the CDSs, applying progressively looser matches until every CDS has a match (or is otherwise labelled a "hypothetical protein"). That is what I would expect, at least.

I'm not aware of a simple 'lift over' tool myself. You could roll one with BioPython or something, but as you say, you either need coordinates or need to go through an alignment process, and if you have to realign hundreds of thousands of proteins to hundreds of thousands of genomes, you're probably just as well re-running prokka.
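For what it's worth, a home-rolled 'lift over' might look roughly like the sketch below. It uses only the Python standard library rather than BioPython, and it swaps the alignment step for exact six-frame matching, which assumes each protein was predicted from exactly these contigs. All function names and the `sketch` source column are hypothetical, the start codon set is an assumption, only the first match is reported, and partial genes at contig edges will be missed.

```python
from itertools import product

# Standard codon assignments (shared by bacterial table 11), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STARTS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons only


def parse_fasta(text):
    """Return {record_id: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = []
        elif line and name is not None:
            records[name].append(line.upper())
    return {k: "".join(v) for k, v in records.items()}


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODONS.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def locate(protein, contig):
    """Find the first exact match; return (start, end, strand), 1-based inclusive, or None."""
    protein = protein.rstrip("*")
    body = protein[1:]  # skip residue 1: GTG/TTG starts are written as M in the protein
    for strand, seq in (("+", contig), ("-", revcomp(contig))):
        for frame in range(3):
            aa = translate(seq[frame:])
            hit = aa.find(body, 1)
            while hit != -1:
                start = frame + 3 * (hit - 1)   # 0-based start of the first codon
                end = start + 3 * len(protein)  # 0-based, exclusive
                if seq[start:start + 3] in STARTS:
                    if translate(seq[end:end + 3]) == "*":
                        end += 3                # include the stop codon in the feature
                    if strand == "-":           # map back to forward-strand coordinates
                        start, end = len(seq) - end, len(seq) - start
                    return start + 1, end, strand
                hit = aa.find(body, hit + 1)
    return None


def gff_lines(contigs, proteins):
    """Yield a GFF3 header, then one CDS line per protein found on some contig."""
    yield "##gff-version 3"
    for gene_id, protein in proteins.items():
        for contig_id, contig in contigs.items():
            found = locate(protein, contig)
            if found:
                start, end, strand = found
                yield "\t".join([contig_id, "sketch", "CDS", str(start), str(end),
                                 ".", strand, "0", f"ID={gene_id}"])
                break


# Toy input, made up for illustration.
contigs = parse_fasta(">contig1\nCCGTGAAACCCGGGTTTTAAGG\n")
proteins = parse_fasta(">gene1\nMKPGF\n")
for line in gff_lines(contigs, proteins):
    print(line)
```

This scans every protein against every contig, so at the scale of hundreds of thousands of genomes it would need to be restricted to each genome's own protein file, and even then it recovers only coordinates, not product names or other annotation.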

ADD REPLY • link 6.0 years ago by Joe 22k

Got it, makes sense. Thanks for the advice – I like that first idea, but honestly it'll probably be less of a headache to just rerun it.

ADD REPLY • link 6.0 years ago by bioguy ▴ 50

prokka does this using tbl2asn.

ADD REPLY • link 5.9 years ago by Mensur Dlakic ★ 29k

A word of warning: tbl2asn is a major pain, as it requires frequent updating, and according to Torsten it will likely be removed in subsequent versions, so I wouldn't advise coming to rely on it.

ADD REPLY • link 5.9 years ago by Joe 22k