Hello. Briefly, I work on the pangenome of a bacterium that secretes two types of toxins, Toxin A and Toxin B. I plan to run a bioinformatics analysis of 145 strains to identify the mutations in each of them and to find the most severe strains based on the SNPs in the genes coding for the two toxins. I started from raw data (reads). First I assembled them with SPAdes, and then I annotated all 145 strains with Prokka. After extracting and analysing the gene sequences from the .ffn files, I noticed that in some genomes the toxin sequences are incomplete or fragmented (judging by comparing the sequence length with the same sequence on NCBI or in other strains). I don't know whether this is due to a problem in my raw data, a problem in the assembly step, or bad annotation. If anyone has any idea how to improve each step, or any other idea or step I could try to achieve my objective, it would be a great help. Thank you, Idea Committee.
You need to validate your assemblies, at least for the region you are interested in. You cannot trust that the assembler produced the full genome in one shot without errors.
Agreed^.
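One quick way to start validating the toxin region is to check whether the truncated genes sit right at the end of a contig, which would point to a fragmented assembly rather than a real frameshift or stop codon. The sketch below is only an illustration, not part of Prokka: the function name and the 100 bp margin are made up, and it assumes the Prokka .gff output carries `##sequence-region` lines with the contig lengths and contig names without spaces.

```python
def contig_edge_features(gff_lines, margin=100):
    """Return CDS features that start or end within `margin` bp of a contig end.

    Hypothetical helper: assumes GFF3 lines with '##sequence-region <name> 1 <length>'
    headers and an optional trailing '##FASTA' section.
    """
    contig_len = {}
    hits = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence section begins; no more features follow
        if line.startswith("##sequence-region"):
            _, name, _start, end = line.split()
            contig_len[name] = int(end)
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        contig, start, end = cols[0], int(cols[3]), int(cols[4])
        length = contig_len.get(contig)
        if length is None:
            continue  # contig length unknown, cannot judge
        if start <= margin or end > length - margin:
            # Parse 'key=value' attributes to report the product name
            attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
            hits.append((contig, start, end, attrs.get("product", "unknown")))
    return hits


# Usage (the path is a placeholder):
# with open("strain001.gff") as handle:
#     for hit in contig_edge_features(handle):
#         print(hit)
```

A toxin gene reported here is a candidate for a contig break. A gene that is short but sits in the middle of a long contig is more likely an annotation issue or a genuine mutation, and is worth checking in the read alignment.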
You may need to play with your data and alter assembly parameters (see shovill by Torsten Seemann, the prokka author). You can also provide a database of 'trusted proteins' to prokka, so in case prodigal (or another part of the prokka pipeline) is failing to correctly call the CDS even if the assembly is OK, you might be able to improve things by using a starting set of proteins from a reference genome. Lastly, be open-minded that these might also be legitimate CDS breaks (introduction of frameshifts/stop codons etc.). To really scrutinise how much you trust the bases called in the toxin region, visualise the bam/pileup files with something like Tablet.
Thanks for the exchange, but how can I provide a database of 'trusted proteins' to prokka?
It's a command-line option. Take a look at the documentation (hint: --proteins).
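For 145 strains it may be easier to script the call. Here is a minimal sketch of building the prokka command with --proteins from Python. The function names and file names are placeholders, and it assumes prokka is on your PATH and that the trusted proteins file is a FASTA or GenBank file from a reference genome.

```python
import subprocess


def build_prokka_command(contigs, trusted_proteins, outdir, prefix):
    """Build the argument list for one prokka run with a trusted protein set."""
    return [
        "prokka",
        "--proteins", trusted_proteins,  # searched first, before prokka's default databases
        "--outdir", outdir,
        "--prefix", prefix,
        contigs,
    ]


def run_prokka(contigs, trusted_proteins, outdir, prefix):
    """Run prokka and raise an error if it exits with a non-zero status."""
    cmd = build_prokka_command(contigs, trusted_proteins, outdir, prefix)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder file names; this only prints the command, it does not run prokka.
    cmd = build_prokka_command(
        "strain001_contigs.fasta", "reference.gbk", "annot/strain001", "strain001"
    )
    print(" ".join(cmd))
```

Looping `run_prokka` over the assemblies gives every strain the same reference protein set, so differences in the toxin calls are less likely to come from the annotation step.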