Dear all, please suggest which tool I should use for optimal assembly of bacterial FASTQ reads. Which approach is better if a reference genome is available: de novo assembly or reference-guided de novo assembly? Also, does SPAdes support reference-guided assembly?
I also read threads about the --trusted-contigs parameter of SPAdes but could not understand it clearly, because people say that it merges assemblies rather than guiding the assembly. Doesn't the statement from the SPAdes assembler that "trusted contigs will be used for graph construction, gap closure and repeat resolution" tell us that it guides the assembly?
Many thanks!
Thank you so much, Brian Bushnell, for such a detailed answer! I have to perform annotation and construct a phylogenetic tree for a total of 19 samples. For variant calling I used Snippy on the FASTQ reads. Do you mean I have to use the reconstructed genome, instead of the reference genome, for variant calling of bacterial WGS reads? Please give your valuable suggestions.
And yes, I experimented with SPAdes using the --trusted-contigs flag too :) I used the reference bacterial genome (FASTA file). There are other tools too, like Unicycler and ABySS. Which tool do you prefer for assembling bacterial reads?
Here is my QUAST report with the --trusted-contigs option:
SPAdes assembly without the reference option:
De novo assembly through ABySS:
Assembly through Unicycler:
Please suggest which is the better assembly approach here.
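When comparing several assemblies of the same sample, contiguity is one of the headline numbers, and N50 is among the metrics QUAST reports. As a minimal sketch of what that number means (the function name `n50` is just an illustrative choice, not part of any tool), it can be computed from a list of contig lengths like this:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly: no contigs at all
```

A higher N50 only says the assembly is more contiguous. It says nothing about correctness, so misassembly counts and total length should be read alongside it.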
For bacterial genome assembly, JGI (where I work) strictly uses SPAdes in pure de novo mode. Of course we assemble all kinds of bacteria that are closely related to others, and we have whole plates of bacteria that are 99% ANI to each other, but we still just use SPAdes in pure de novo mode because 1) it would take a huge amount of manual effort to figure out which ones have the same structure and thus could use each other's contigs for scaffolding, and 2) it's just not safe to use organisms to scaffold other organisms unless you are absolutely confident that they have no structural variations, and we never know. If you compare two organisms and they are 99.9% ANI, then it's pretty likely that they have no large-scale SVs. There's no guarantee, but in that case I'd definitely try a reference-guided assembly instead of de novo. Then you can map reads and call variants, and if you get lots of closely spaced variants in certain regions, that indicates a structural variation and you need to abandon the reference-guided assembly approach.
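The "lots of closely spaced variants" check can be roughed out in a few lines. The sketch below takes variant positions on one contig and reports regions where at least `min_variants` calls fall within `window` bp of each other. The function name and both default thresholds are assumptions chosen for illustration, not values from JGI or any tool, and would need tuning per genome.

```python
def dense_variant_regions(positions, window=1000, min_variants=10):
    """Return merged (start, end) regions where at least `min_variants`
    variant positions fall within a span shorter than `window` bp.

    `window` and `min_variants` are illustrative defaults (assumptions).
    """
    pos = sorted(positions)
    regions = []
    right = 0
    for left in range(len(pos)):
        # Advance `right` so pos[left:right] all lie within `window` bp
        # of pos[left]; `right` never moves backwards.
        while right < len(pos) and pos[right] - pos[left] < window:
            right += 1
        if right - left >= min_variants:
            start, end = pos[left], pos[right - 1]
            if regions and start <= regions[-1][1]:
                # Overlaps the previous dense region: extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions
```

An empty result is consistent with scattered SNPs. Any region that does come back is a candidate structural difference worth inspecting before trusting the reference-guided assembly.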
JGI often has plates of bacteria that have 99% ANI to each other, but who knows if the 1% difference is random SNPs or some big structural variation. So we don't do reference-guided assembly since it doesn't work on a large scale. But in individual cases, it can give you a much better assembly if you have a single-contig assembly that you just want to modify to reflect some SNPs or short indels.
For annotation, unless you have a specific pipeline you plan to use, I'd suggest:
Snippy seems like a neat tool and I am going to look into it, but as far as I can tell, it's basically a wrapper for FreeBayes, which is a subpar variant caller. If you want to properly call variations, I would recommend aligning your reads to the reference, and then calling variants from that using a traditional variant caller. For example:
Then you get all the advantages of paired reads for properly mapping in repetitive areas, the ability to detect long indels, and accurate variant calling. Of course you can use Snippy too, but I'd advise you to compare its output to other programs. My experience with FreeBayes was that it generated vast quantities of false positives.
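One simple way to compare Snippy's output to another caller is to count how many calls the two VCF files share. This is a minimal sketch assuming plain tab-separated VCF text; the helper names `load_variants` and `compare_callsets` are hypothetical. It matches variants on exact (chromosome, position, REF, ALT), so indels that two callers represent differently will look like disagreements unless the files are normalized first.

```python
def load_variants(vcf_lines):
    """Parse VCF text lines into a set of (chrom, pos, ref, alt) keys."""
    variants = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        # The first five VCF columns are CHROM, POS, ID, REF, ALT.
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites: one key per ALT
            variants.add((chrom, int(pos), ref, allele))
    return variants


def compare_callsets(calls_a, calls_b):
    """Count variants shared by both call sets and unique to each."""
    return {
        "shared": len(calls_a & calls_b),
        "only_a": len(calls_a - calls_b),
        "only_b": len(calls_b - calls_a),
    }
```

Typical use would be `compare_callsets(load_variants(open("a.vcf")), load_variants(open("b.vcf")))`, where the two file names are placeholders. A large "only" count for one caller is a hint to look at those calls for false positives.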
Thank you so much, Brian Bushnell, for such a detailed answer. I am new to this field. Can you share any helpful material (papers, tutorials, or pipelines) that you think will help me perform bacterial genomics analysis appropriately, for example when to use which assembly approach, as you discussed above?
Many thanks!
All of the samples passed quality checks. Do I still need to perform filtering/cleaning? Which tool is best for filtering bacterial WGS reads?