Hi all!
I was annotating a bacterial genome with Prokka. At the end it gave me results that I do not fully understand. Maybe somebody more familiar with this program can help?
I have multiple contigs assigned to the same annotation. I ran this command:
./prokka --outdir contigs_prokka --kingdom Bacteria --genus X --proteins uniprot_bacteria.fasta --usegenus --evalue 0.01 --rfam --cpu 8 --norrna contigs.fasta &
As a result I have a TSV file listing the contigs and their annotations. For some of the results, I see that multiple contigs are assigned to the same annotation. For example:
contig1 CDS 1965 Zinc-transporting ATPase OX=224308 GN=zosA PE=1 SV=1
contig2 CDS 918 Zinc-transporting ATPase OX=224308 GN=zosA PE=1 SV=1
I am not sure how to interpret this:
- Are these unconnected contigs?
- Does one sequence represent the gene, while the rest are pseudogenes?
- Can I take one (the longest) for the final annotation and ignore the rest, or should I annotate the rest as potential pseudogenes?
Many thanks for any suggestions. Agata
Both could be real and just happen to be zinc-transporting ATPases. Did you check for sequence redundancy in your contigs before running Prokka? For example, contig2 could be identical to a region of contig1 (and contained within it).
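To check the containment idea for a pair of contigs that share an annotation, something like the following Python sketch might help. It only detects exact containment on either strand, so near-identical copies with a few mismatches would be missed; an aligner or a clustering tool is better suited to those. The function names and the toy sequences are made up for illustration and are not part of Prokka.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(table)[::-1]


def is_contained(shorter: str, longer: str) -> bool:
    """True if `shorter` occurs exactly in `longer` on either strand."""
    shorter, longer = shorter.upper(), longer.upper()
    return shorter in longer or reverse_complement(shorter) in longer


# Toy sequences standing in for contig1 and contig2 (not real data).
contig1 = "TTGACCATGGCTAAGCTTGA"
contig2 = "ATGGCTAAG"
print(is_contained(contig2, contig1))
```

If this prints True for the real sequences, the shorter contig is redundant with the longer one and the duplicated annotation is expected.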
Yes, I used CD-HIT; it produced 10905 clusters from 10942 contigs.
This is not an isolated case; most annotations are assigned to more than one contig.
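To see how widespread this is, one option is to group the CDS rows by annotation and list every annotation that occurs on more than one contig, longest hit first. The sketch below assumes four fields per row (contig, feature type, length, annotation), as in the example rows in the question; a real Prokka TSV may have different columns, so the unpacking would need adjusting. The function name and the contig3 row are hypothetical.

```python
from collections import defaultdict


def shared_annotations(rows):
    """Map each annotation seen on more than one CDS to its (contig, length) hits.

    `rows` is an iterable of (contig, feature_type, length, annotation) tuples;
    this layout is an assumption based on the example in the question.
    """
    groups = defaultdict(list)
    for contig, feature_type, length, annotation in rows:
        if feature_type != "CDS":
            continue
        groups[annotation].append((contig, int(length)))
    # Keep only annotations with several hits, longest first.
    return {
        annotation: sorted(hits, key=lambda hit: hit[1], reverse=True)
        for annotation, hits in groups.items()
        if len(hits) > 1
    }


rows = [
    ("contig1", "CDS", "1965", "Zinc-transporting ATPase OX=224308 GN=zosA PE=1 SV=1"),
    ("contig2", "CDS", "918", "Zinc-transporting ATPase OX=224308 GN=zosA PE=1 SV=1"),
    ("contig3", "CDS", "300", "Hypothetical protein"),  # made-up row for illustration
]
for annotation, hits in shared_annotations(rows).items():
    print(annotation, hits)
```

Comparing the lengths within each group gives a first hint of whether the shorter hits look like fragments of the longest one or like separate full-length paralogs.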