Converting assembled genomes (.fna) and predicted genes (.faa) to genbank or gff file?
5.3 years ago
bioguy ▴ 50

Does anyone know of a simple command-line tool that can take 1) assembled bacterial genomes (FASTA contigs) and 2) predicted gene sequences (FASTA ORFs) from those genomes, and generate either GFF or GenBank formatted files?

genome annotation genbank gff microbiology • 3.9k views

What you're asking for isn't really a 'conversion'. What you've described is annotation, which is an analysis task all of its own, unless you know definitively all your genes have specific matches to the genome of interest.

One of the best options (IMO) is prokka. If you have a list of proteins you already trust, you can pass that file to prokka and it will begin annotation from there.
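As a rough illustration of passing trusted proteins to prokka, here is a minimal sketch that only builds the command line. The `--proteins`, `--outdir` and `--prefix` options are prokka's own; the helper name `prokka_command` and all file names are hypothetical placeholders.

```python
import shlex


def prokka_command(contigs_fasta, trusted_proteins, outdir, prefix):
    """Build the argv list for a prokka run seeded with trusted proteins."""
    return [
        "prokka",
        "--proteins", trusted_proteins,  # searched first, before the default databases
        "--outdir", outdir,
        "--prefix", prefix,
        contigs_fasta,
    ]


# Hypothetical file names, purely for illustration.
cmd = prokka_command("genome.fna", "trusted.faa", "annot_out", "genome")
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to loop over many genomes and inspect the commands first.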


Right so that's kind of the issue – this output came from my own fork of prokka where I stripped out some of the extra stuff that, at the time, I didn't need. On top of that, for space, I had to delete gff files (I know that was a mistake, it's a long story).

Anyway, I now have a little side project that requires gff or gbk inputs, and because I'm a bit too lazy to rerun prokka a few hundred thousand times, I'm looking for a way to turn what I have into what I need.

Sounds like I'll probably just need to redo it, though, if the predicted genes and assembled contigs can't be turned into a gff (which would make sense, as I guess getting gene coordinates would be nigh impossible).


If you can provide protein sequences that cover the majority of the CDSs prokka detects, it should be able to run quite quickly. It normally iterates over the CDSs, applying progressively looser matches until every CDS has a match (or is otherwise labelled a "hypothetical protein"). That's what I would expect, at least.

I'm not aware of a simple 'lift over' tool myself. You could roll one with BioPython or something, but as you say, you either need coordinates or need to go through an alignment process, and if you have to realign hundreds of thousands of proteins to hundreds of thousands of genomes, you're probably just as well re-running prokka.
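For what it's worth, the "roll your own" idea could look something like the sketch below. It is plain Python rather than BioPython, and it sidesteps alignment by assuming every protein is an exact translation of some stretch of its contig (which should hold if both files came from the same prokka run). The names `read_fasta`, `lift_over` and the `liftover` source label are made up for illustration. It reports only the first hit per frame, leaves the stop codon out of the coordinates, and very short proteins may match by chance.

```python
from itertools import product

# Codon table in NCBI order (TCAG). Table 11 uses the same codon-to-residue
# mapping as the standard code; it differs only in the permitted start codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def read_fasta(text):
    """Parse FASTA text into a {record_id: sequence} dict (upper-cased)."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif line and name is not None:
            records[name].append(line.upper())
    return {k: "".join(v) for k, v in records.items()}


def revcomp(seq):
    """Reverse-complement a DNA string (non-ACGT characters are kept as-is)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def lift_over(contigs, proteins):
    """Yield GFF3 CDS lines for proteins found verbatim in a six-frame translation."""
    for contig_id, dna in contigs.items():
        for strand, template in (("+", dna), ("-", revcomp(dna))):
            for frame in range(3):
                aa = translate(template[frame:])
                for prot_id, prot in proteins.items():
                    # Drop the first residue before searching: alternative start
                    # codons (GTG, TTG) appear as M in the protein but translate
                    # as V or L here.
                    body = prot.rstrip("*")[1:]
                    hit = aa.find(body)
                    if hit < 1:  # not found, or no room for a start codon
                        continue
                    begin = frame + 3 * (hit - 1)      # 0-based, on the template
                    end = begin + 3 * (len(body) + 1)  # exclusive, stop codon excluded
                    if strand == "-":
                        # Map template coordinates back onto the forward strand.
                        begin, end = len(dna) - end, len(dna) - begin
                    yield "\t".join([contig_id, "liftover", "CDS", str(begin + 1),
                                     str(end), ".", strand, "0", f"ID={prot_id}"])


if __name__ == "__main__":
    # Tiny inline example; real use would read the .fna and .faa files instead.
    contigs = read_fasta(">contig1\nCCATGAAACCCGGGTAATT\n")
    proteins = read_fasta(">gene1\nMKPG\n")
    print("##gff-version 3")
    for line in lift_over(contigs, proteins):
        print(line)
```

This scans every protein against every frame of every contig, so across hundreds of thousands of genomes it would need at least a per-genome pairing of .fna and .faa files to be practical, and rerunning prokka may still be the simpler route.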


Got it, makes sense. Thanks for the advice – I like that first idea, but honestly it'll probably be less of a headache to just rerun it.


prokka does this using tbl2asn.


A word of warning: tbl2asn is a major pain that requires frequent updating, and according to Torsten it will likely be removed from subsequent versions of prokka, so I wouldn't advise coming to rely on it.
