I recently ran Prokka on a .fasta file of ordered contig DNA sequences. The file looks like this:
>NODE_6_length_103505_cov_20.694918
agcgattagcccccagcgattagcgaaaagcgattttttattagcgattagcgattagcg
attagcgattagcgattttagaactatttgcctaaatttctgcttaaatatctacagaaa
The annotation went (fairly) well, but I would like to rerun the program on the corresponding amino acid sequences of the ordered contigs. The Prokka vignette resource online states:
"If you have Genbank or Protein FASTA file(s) that you want to annotate genes from as the first priority, use the --proteins myfile.gbk. Please make sure it has a recognisable file extension like .gb or .gbk or auto-detect will fail."
This may be a basic question, but how can I confidently translate my ordered DNA sequence .fasta file into an ordered protein sequence .gb or .gbk file? I am hoping to find a straightforward solution (I have basic Linux and R skills).
It seems to me that all you need is a reference genome in gbk format (ideally a high-quality one, particularly if it has /gene and /EC_number qualifiers) so that the naming is consistent.
Thanks @SishuoWang. I am hoping to input the unknown genome itself as protein sequence. Do you think this is possible in Prokka? Or is it only possible to supply a reference genome as protein sequence (along with the unknown genome as DNA sequence)? I may be misunderstanding the Prokka vignette.
What does this mean?
It sounds to me like you want to re-run the analysis, using the annotation you just created as some kind of reference?
Contig ordering is irrelevant as far as prokka is concerned.
Thanks @Joe. I ran the Prokka genome annotation using, as input, the genome I want annotated in .fasta format as DNA sequences of contigs (an example of the format is shown in my original post). However, I wanted to convert that .fasta input into its corresponding amino acid sequence (so an amino acid sequence of the contigs) and then have Prokka annotate that input instead. Is this possible? It seems so from the vignette:
"If you have Genbank or Protein FASTA file(s) that you want to annotate genes from as the first priority, use the --proteins myfile.gbk. Please make sure it has a recognisable file extension like .gb or .gbk or auto-detect will fail."
I have been having a hard time converting my new input file from the original DNA sequence contig fasta file into its corresponding Protein Genbank/FASTA file that the vignette calls for. Here is what I tried so far:
1) Converted the DNA sequence to amino acid sequence using emboss_transeq. This gave me a protein sequence file in .fasta format.
2) Converted the protein sequence file in .fasta format to .gbk format using fasta_to_genbank. This gave me a protein sequence file in .genbank format that looked as follows:
I noticed that, for some reason, I lost the first contig somewhere in this conversion process. Nonetheless, when I input this file into Prokka using the command:
I receive the error:
So, I am unsure how to create the appropriate input file type needed for Prokka if I want to annotate genes as a first priority from a "Protein FASTA file(s)".
No, you're misunderstanding.
You don't need fully translated contigs. You just need a selection of protein sequences to act as a 'trusted database' from which Prokka will start. The intended use is, for example, if you study a weird bacterium without many genomes, but there already exists one reference sequence (say) which has many hand-curated annotations etc.
You're receiving your error because you aren't supplying the 'blank' contigs.fasta which prokka acts on. This is usually the final positional argument, if memory serves.
Basically what you need is:
or
Where referenceproteins is either a multifasta (of amino acid sequences with correctly formatted headers) or a genbank file of an existing annotated set of sequences. Your contigs.fasta file should still be a DNA sequence file of contigs, which forms the basis of what gets annotated.
It sounds like you've significantly munged your data, so I would start over if I were you.
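As a rough illustration of the invocation described above, here is a minimal Python sketch that assembles the prokka argument list with the DNA contigs as the final positional argument. The file names (contigs.fasta, reference.gbk, annotation) and both helper functions are hypothetical; --proteins and --outdir are Prokka options, but the DNA check is only a crude heuristic of my own and not something Prokka itself does.

```python
import shlex
from typing import List, Optional


def looks_like_dna(fasta_text: str, threshold: float = 0.9) -> bool:
    """Crude heuristic: True if most sequence characters are A/C/G/T/N."""
    residues = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")  # skip blank lines and headers
    ).upper()
    if not residues:
        return False
    dna_count = sum(residues.count(base) for base in "ACGTN")
    return dna_count / len(residues) >= threshold


def build_prokka_command(
    contigs: str,
    proteins: Optional[str] = None,
    outdir: Optional[str] = None,
) -> List[str]:
    """Build a prokka argument list; the DNA contigs file always goes last."""
    cmd = ["prokka"]
    if proteins is not None:
        # Optional trusted protein set (protein FASTA or GenBank file).
        cmd += ["--proteins", proteins]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    cmd.append(contigs)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    example = ">NODE_6\nagcgattagcccccagcgattagcg\n"
    print(looks_like_dna(example))
    print(shlex.join(build_prokka_command(
        "contigs.fasta", proteins="reference.gbk", outdir="annotation")))
```

The resulting list could be handed to subprocess.run on a machine where prokka is installed. Running the heuristic on the contigs file first is one way to catch the situation in this thread, where a translated (protein) file ends up where the nucleotide contigs are expected.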
NB - there's no requirement to use --proteins; you only need it if you have a set of proteins you trust over the annotations that prokka would otherwise pick up via HMMER/BLAST etc.