Using de novo assembly, I generated an assembly of my strain with a total size of about 147 Mb. However, the genome of the reference strain is only around 55 Mb, so my assembly is almost three times larger than the reference.
Is there any method or software you would suggest for checking whether my contig assembly contains contaminants or clonal variants?
Thanks!
P.S. I also tried a reference-guided assembly and ended up with a genome size of 51 Mb (60% of reads mapped). I checked the taxonomy of the unmapped reads, but only 24% of them were classified and 76% were unclassified, so the identity of most of the unmapped reads is still unknown.
It would be much easier to help you if you provided more details. A strain of what? What reads did you have and how were they assembled? What is the depth of coverage?
What is the assembly size when you throw out contigs smaller than 2 kb, 3 kb, or 5 kb? You may already have a proper genome size once you throw out the small stuff.
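If it helps, here is a minimal Python sketch of that check. It reads a FASTA file and reports how many contigs, and how many total bases, remain at each minimum-length cutoff. The function names and the default cutoffs are just illustrative choices of mine, and it assumes an ordinary uncompressed FASTA file.

```python
import sys
from typing import Dict, Iterable, Iterator, Sequence, Tuple


def read_fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (contig_name, length) for each record in FASTA-formatted lines."""
    name = None
    length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def size_by_cutoff(
    lengths: Iterable[int],
    cutoffs: Sequence[int] = (0, 2000, 3000, 5000),
) -> Dict[int, Tuple[int, int]]:
    """Map each cutoff to (contigs kept, total bases kept) for contigs >= cutoff."""
    lengths = list(lengths)
    result = {}
    for cutoff in cutoffs:
        kept = [n for n in lengths if n >= cutoff]
        result[cutoff] = (len(kept), sum(kept))
    return result


def main(path: str) -> None:
    with open(path) as handle:
        lengths = [n for _, n in read_fasta_lengths(handle)]
    print("min_length\tcontigs\ttotal_bp")
    for cutoff, (count, total) in size_by_cutoff(lengths).items():
        print(f"{cutoff}\t{count}\t{total}")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Run it as, for example, `python size_by_cutoff.py contigs.fa`, where both file names are placeholders for your own script and assembly. If the total at one of the cutoffs lands near the expected genome size, that suggests most of the excess is in the short contigs.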
An Eimeria tenella strain. The data are Illumina paired-end DNA reads at 80x depth of coverage. For the de novo assembly, I used MEGAHIT and SSPACE, both with default parameters. For the reference-guided assembly, I used BWA-MEM.
I'm not sure about throwing out contigs smaller than 2 kb, 3 kb, and 5 kb, because the average contig size is only 1,402 bp.