Hello everybody,
I've got a couple of questions about using Prokka.
1) Has anybody come across the problem of annotating small ORFs? Lots of operon leader peptides etc. remain unannotated. I understand that's to reduce false positives, but I would still like to annotate these genes.
2) How does one compile a good genus-specific database? E.g. if I need a reference protein set for Salmonella, what is a good strategy?
Thank you in advance.
Thank you. I see now that some of the small peptides get annotated when I use the --rfam option, which generates candidate ncRNAs; that is also useful. What is the option to include sig_peptide features? There's nothing in the manual, and they are not generated by default.
About the genus-specific reference: how would you pick the gbk files you want to use? And is there any way to generate the gene name (I mean common name, like trpA) in any reliable fashion?
Yeah, Prokka invokes a number of optional third-party applications, and SignalP is one of them. I can't see the specific flag in the docs, but the GitHub page mentions it. Either way, you'll no doubt need SignalP installed and on your PATH.
I would just use whichever genomes you trust as a reference and download the GBK files from NCBI. I can't really tell you which reference to use. The option is mainly there so that people who have their own custom-annotated genomes can include features they have added by hand relative to the NCBI reference. You don't need to do this at all, though, if you don't have a genome you particularly care about.
Prokka calls CDSs with Prodigal and then BLASTs/searches them all against its bundled databases, so you should get the common Salmonella annotations anyway. If you don't have custom proteins, I wouldn't worry about it. If it's a gene with a common gene name in NCBI, Prokka will pick it up, assuming the variant in your sequence is similar enough to it. Anything Prokka can't identify it will call a "hypothetical protein".
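For anyone who wants to try this, here is a rough Python sketch, not a definitive recipe. It assumes the option being discussed is Prokka's `--proteins` (which takes a FASTA or GBK file as first-priority annotation source), and that the `.tsv` file Prokka writes has `ftype` and `product` columns. The file names and the tiny example table are made up for illustration. The command is only printed, not executed.

```python
import csv
import io


def build_prokka_command(contigs, trusted_gbk, outdir="prokka_out"):
    """Return a Prokka argument list that passes a trusted GenBank file."""
    return [
        "prokka",
        "--outdir", outdir,
        "--proteins", trusted_gbk,  # assumption: this is the option meant above
        contigs,
    ]


def summarize_products(tsv_text):
    """Count named versus hypothetical CDS rows in Prokka-style TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {"named": 0, "hypothetical": 0}
    for row in reader:
        if row["ftype"] != "CDS":
            continue  # skip tRNA, rRNA and other non-CDS features
        if row["product"] == "hypothetical protein":
            counts["hypothetical"] += 1
        else:
            counts["named"] += 1
    return counts


# Hypothetical file names; pass the list to subprocess.run() to execute it.
print(" ".join(build_prokka_command("my_contigs.fasta", "trusted_reference.gbk")))

# Made-up rows in the assumed column layout, just to exercise the counter.
example_tsv = (
    "locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n"
    "X_00001\tCDS\t807\ttrpA\t4.2.1.20\t\tTryptophan synthase alpha chain\n"
    "X_00002\tCDS\t150\t\t\t\thypothetical protein\n"
    "X_00003\ttRNA\t76\t\t\t\ttRNA-Ala\n"
)
print(summarize_products(example_tsv))
```

Comparing the hypothetical count with and without the trusted GBK file is a quick way to see whether the extra reference actually changes anything for your genome.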
OK, I had to grep through the source code to understand it. SignalP is activated when you use the --gram option. I don't think that's documented anywhere. Anyhow, it seems to be working nicely.
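In case it helps someone else, here is a minimal Python sketch of that setup. It assumes `--gram` accepts `neg` or `pos`, that the SignalP executable is called `signalp`, and that a hypothetical `my_contigs.fasta` is the input; the function names are mine, not part of Prokka. It only builds the command and checks the PATH, it does not run anything.

```python
import shutil


def build_signalp_run(contigs, gram="neg", rfam=False, outdir="prokka_signalp"):
    """Build a Prokka argument list with --gram set, which triggers SignalP."""
    if gram not in ("neg", "pos"):
        raise ValueError("gram must be 'neg' or 'pos'")
    cmd = ["prokka", "--outdir", outdir, "--gram", gram]
    if rfam:
        cmd.append("--rfam")  # also search for candidate ncRNAs
    cmd.append(contigs)
    return cmd


def missing_tools(tools=("prokka", "signalp")):
    """Return the executables from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Report anything missing before attempting a run.
print("Missing tools:", missing_tools())
print(" ".join(build_signalp_run("my_contigs.fasta", gram="neg", rfam=True)))
```

If `signalp` shows up as missing, I would not expect sig_peptide features in the output even with `--gram` set.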
Thank you for all the tips again.
Ah good, you found the same thing. I was just about to post the same point!
P.S. be sure to accept one or more answers if you got the answer you needed.