Question

Annotation with Prokka - small ORFs and genus-specific DB?

0

6.9 years ago

predeus ★ 2.1k

Hello everybody,

I have a couple of questions about using Prokka.

1) Has anybody come across the problem of annotating small ORFs? Lots of operon leader peptides etc. remain unannotated. I understand that this is done to reduce false positives, but I would still like to annotate these genes.

2) How does one compile a good genus-specific database? E.g., if I need a reference protein set for Salmonella, what is a good strategy?

Thank you in advance.

prokka bacteria annotation • 3.5k views

ADD COMMENT • link updated 6.9 years ago by Asaf 10k • written 6.9 years ago by predeus ★ 2.1k

1

6.9 years ago

Asaf 10k

Some thoughts:

1) You can try predicting all ORFs, with EMBOSS transeq for instance, and look for domains using InterProScan; you might find some putative short ORFs that way.
2) I guess you could download assemblies of many Salmonella genomes and extract the genes, but since most of them were annotated with Prokka or similar tools, that won't help you much. You can instead download some well-studied Salmonella genomes from NCBI or the UCSC Genome Browser. E. coli and Salmonella are very similar, and the E. coli genome is well annotated, so it might also be useful.
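
As a rough sketch of the first suggestion: the snippet below extracts ORFs and keeps only the short ones as candidates for a domain search. It uses EMBOSS getorf rather than transeq, because getorf emits ORFs directly; that is a substitution, not what the answer names. The file names (contigs.fa, orfs.faa, small_orfs.faa), the 50-residue cutoff, and the helper filter_short_peptides are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print FASTA records whose sequence has at most
# max_len residues. Handles sequences wrapped over several lines.
filter_short_peptides() {
    local max_len="$1" fasta="$2"
    awk -v max="$max_len" '
        function emit() {
            if (hdr != "" && length(seq) > 0 && length(seq) <= max)
                print hdr "\n" seq
        }
        /^>/ { emit(); hdr = $0; seq = ""; next }
        { gsub(/[ \t\r*]/, ""); seq = seq $0 }   # drop whitespace and stop marks
        END { emit() }
    ' "$fasta"
}

# Only run when EMBOSS and the placeholder input are actually present.
if command -v getorf >/dev/null 2>&1 && [ -f contigs.fa ]; then
    # -find 1: translate regions between START and STOP codons;
    # -table 11: bacterial genetic code; -minsize is given in nucleotides.
    getorf -sequence contigs.fa -outseq orfs.faa -table 11 -find 1 -minsize 30
    filter_short_peptides 50 orfs.faa > small_orfs.faa
fi
```

The resulting small_orfs.faa could then be passed to InterProScan. Expect many spurious ORFs at this size, so the domain hits matter more than the ORF calls themselves.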

ADD COMMENT • link 6.9 years ago by Asaf 10k

0

Thank you. Well-annotated references turn out to contain quite a lot of mistakes, which makes this strategy harder to use.

ADD REPLY • link 6.9 years ago by predeus ★ 2.1k

score 2 · Accepted Answer · 2018-01-30

2

6.9 years ago

Joe 21k

If leader peptides etc. aren't commonly treated as separate ORFs, I doubt they'd be annotated separately from their 'parent' ORF, though I see Prokka supports a --sig_peptide option these days.

As for the databases, Prokka supports a custom protein database, and for that you can follow the instructions here (https://github.com/tseemann/prokka/blob/master/README.md#databases).

Give the --sig_peptide flag a try, then curate a selection of sequences of interest (from genomes you trust) and follow these steps:

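 # Extract the annotated proteins from the trusted GenBank files
 prokka-genbank_to_fasta_db Coccus1.gbk Coccus2.gbk Coccus3.gbk Coccus4.gbk > Coccus.faa
 # Cluster at 90% identity (-c 0.9) to remove redundant proteins
 cd-hit -i Coccus.faa -o Coccus -T 0 -M 0 -g 1 -s 0.8 -c 0.9
 # Remove intermediate files
 rm -fv Coccus.faa Coccus.bak.clstr Coccus.clstr
 # Build a protein BLAST database from the clustered set
 makeblastdb -dbtype prot -in Coccus
 # Install it where Prokka looks for genus databases
 mv Coccus.p* /path/to/prokka/db/genus/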
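
Once the database is in place, a run that uses it might look like the sketch below. The input file contigs.fa and the output names annot and mygenome are placeholders, and the assumption here is that the value given to --genus must match the database name (Coccus) moved into db/genus/. When Prokka or the input file is missing, the script only prints the command.

```shell
#!/usr/bin/env bash
genus="Coccus"   # assumed to match the BLAST database name in db/genus/
cmd=(prokka --outdir annot --prefix mygenome --genus "$genus" --usegenus contigs.fa)

if command -v prokka >/dev/null 2>&1 && [ -f contigs.fa ]; then
    prokka --listdb      # check that the new genus database is listed
    "${cmd[@]}"
else
    # Dry run: show the command that would be executed.
    printf '%s\n' "${cmd[*]}"
fi
```

Running prokka --listdb first is a cheap way to confirm the database was picked up before starting a full annotation.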

ADD COMMENT • link 6.9 years ago by Joe 21k

0

Thank you. I see now that some of the small peptides are annotated with the --rfam option, which generates candidate ncRNAs; that is also useful. What is the option to include sig_peptide features? There's nothing in the manual, and they are not generated by default.

About the genus-specific reference: how would you pick the gbk files you want to use? And is there any way to generate the gene name (I mean common name, like trpA) in any reliable fashion?

ADD REPLY • link 6.9 years ago by predeus ★ 2.1k

1

Yeah, Prokka invokes a number of optional third-party applications, and SignalP is one of them. I can't see the specific flag in the docs, but the GitHub page mentions it. Either way, you'll no doubt need SignalP installed and on your PATH.

I would just use whichever genomes you trust as a reference and download the GBK files from NCBI. I can't really tell you which reference to use. The option mainly exists so that people with their own custom-annotated genomes can include additional features they may have added by hand relative to the NCBI reference etc. You don't need to do this at all, though, if you don't have a genome like that. Prokka calls CDSs with Prodigal and then searches them all against the databases it already has, so you should get the common Salmonella annotations. If you don't have custom proteins etc., I wouldn't worry about it. If it's a gene with a common gene name in NCBI, Prokka will pick it up, assuming the variant in your sequence is similar enough to it. Anything Prokka can't identify is called a "hypothetical protein".

ADD REPLY • link 6.9 years ago by Joe 21k

1

OK, I had to grep through the source code to understand it. SignalP is activated when you use the --gram option. I don't think it's documented anywhere. Anyhow, it seems to be working nicely.
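
For anyone landing here later, a minimal sketch of what that looks like in practice. It assumes SignalP is installed and on the PATH; contigs.fa, annot, and mygenome are placeholder names, and count_sig_peptides is a hypothetical helper, not part of Prokka. Salmonella is Gram-negative, hence --gram neg.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count sig_peptide features in a GFF3 file
# (column 3 of each tab-separated feature line holds the feature type).
count_sig_peptides() {
    awk -F'\t' '$3 == "sig_peptide" { n++ } END { print n + 0 }' "$1"
}

# Only run the annotation when Prokka and the placeholder input exist.
if command -v prokka >/dev/null 2>&1 && [ -f contigs.fa ]; then
    # --gram neg switches on the SignalP step for a Gram-negative genome.
    prokka --outdir annot --prefix mygenome --gram neg contigs.fa
    count_sig_peptides annot/mygenome.gff
fi
```

A count of zero after a --gram run would suggest SignalP was not found rather than that the genome has no signal peptides.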

Thank you for all the tips again.

ADD REPLY • link 6.9 years ago by predeus ★ 2.1k

0

Ah good, you found the same thing; I was just about to post the same point!

P.S. Be sure to accept one or more answers if you got the answer you needed.

ADD REPLY • link 6.9 years ago by Joe 21k