5.3 years ago
bioguy
Anyone have a simple command line tool that can take 1) assembled bacterial genomes (fasta contigs) and 2) predicted gene (fasta ORFs) sequences from said genomes to generate either gff or genbank formatted files?
What you're asking for isn't really a 'conversion'. What you've described is annotation, which is an analysis task all of its own, unless you know definitively all your genes have specific matches to the genome of interest.
One of the best options (IMO) is prokka. If you have a list of proteins you already trust, you can pass that file to prokka and it will begin annotation from there.
Right, so that's kind of the issue: this output came from my own fork of prokka where I stripped out some of the extra stuff that, at the time, I didn't need. On top of that, to save space, I had to delete the gff files (I know that was a mistake, it's a long story).
Anyway, I now have a little side project that requires gff or gbk inputs, and because I'm a bit too lazy to rerun prokka a few hundred thousand times, I'm looking for a way to turn what I have into what I need.
Sounds like I'll probably just need to redo it, though, if the predicted genes and assembled contigs can't be turned into a gff (which would make sense, as I guess getting gene coordinates would be nigh impossible).
If you can provide protein sequences which cover the majority of the CDSs that prokka detects, it should run quite quickly. It normally iterates over the CDSs, applying progressively looser matches until every CDS either has a match or is labelled a "hypothetical protein". That's what I would expect, at least.
I'm not aware of a simple 'lift over' tool myself. You could roll one with BioPython or something, but as you say, you either need coordinates or need to go through an alignment process, and if you have to realign hundreds of thousands of proteins to hundreds of thousands of genomes, you're probably just as well re-running prokka.
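For what it's worth, here is a rough sketch of what a hand-rolled 'lift over' could look like in the easy case. It assumes the predicted genes are nucleotide sequences that occur as exact substrings of the contigs (on either strand), so coordinates can be recovered by plain string search with no alignment step. The function names (`read_fasta`, `revcomp`, `orfs_to_gff3`) and the `exact_match` source label are made up for illustration, it uses only the standard library rather than BioPython, and it will not work for protein ORFs or for sequences that differ from the assembly in any way.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {record_id: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_to_gff3(contigs, orfs):
    """Locate each ORF in the contigs by exact match and emit GFF3 text.

    Assumption: ORFs are nucleotide sequences copied verbatim from the
    contigs. An ORF found in several places yields several rows that
    share one ID, which a stricter GFF3 consumer may reject.
    """
    lines = ["##gff-version 3"]
    for orf_id, orf in orfs.items():
        for strand, query in (("+", orf), ("-", revcomp(orf))):
            for contig_id, contig in contigs.items():
                pos = contig.find(query)
                while pos != -1:
                    # GFF3 coordinates are 1-based and inclusive.
                    start, end = pos + 1, pos + len(query)
                    lines.append("\t".join([
                        contig_id, "exact_match", "CDS",
                        str(start), str(end), ".", strand, "0",
                        f"ID={orf_id}",
                    ]))
                    pos = contig.find(query, pos + 1)
    return "\n".join(lines) + "\n"
```

You would read the contig and gene files into strings, pass each through `read_fasta`, and write the result of `orfs_to_gff3` to a `.gff` file. Across hundreds of thousands of genomes the naive search may be slow, and it recovers coordinates only, not product names or other annotation.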
Got it, makes sense. Thanks for the advice – I like that first idea, but honestly it'll probably be less of a headache to just rerun it.
prokka does this using tbl2asn.
Word of warning: tbl2asn is a major pain, requiring frequent updating, and will likely be removed in subsequent versions according to Torsten, so I wouldn't advise coming to rely on it.