Hello,
I'm working with two Illumina whole-genome libraries for the same species, let's call them A (TruSeq Nano) and B (TruSeq PCR-free), both with 150 bp reads and 350 bp inserts. My pipeline to obtain draft assemblies:
- QC with BBTools
- SPAdes assembly
- Contaminant contig removal with BlobTools
- SPAdes assembly from reads mapping to target contigs
- Second round of contaminant contig removal with BlobTools
- Redundans on the remaining contigs to obtain haploid genome representation
Assembly A had ~3000 contigs and an N50 of ~100 kb; assembly B had ~8000 contigs and an N50 of ~40 kb. Both had a similar total length (~100 Mb). However, when I mapped the QC'd reads back against them, A had a read depth of 24x and B of 120x (after excluding repeat regions).
How is it possible that a library with much lower sequencing depth of the target genome gives a more contiguous assembly? Would you be able to suggest strategies to investigate?
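For context, here is the back-of-the-envelope depth arithmetic I've been doing (the read counts below are hypothetical placeholders, not my actual numbers):

```python
# Back-of-the-envelope mapped depth: total mapped bases / assembly length.
# All numbers below are hypothetical placeholders for illustration.

def expected_depth(read_pairs, read_len, genome_size):
    """Mean depth if every base of every read mapped exactly once."""
    return read_pairs * 2 * read_len / genome_size

GENOME = 100e6   # ~100 Mb assembly
READ_LEN = 150   # 2 x 150 bp libraries

# e.g. 40 M pairs per library would give 120x if everything mapped:
print(expected_depth(40e6, READ_LEN, GENOME))        # 120.0

# Observed 24x from the same yield would mean only ~20% of bases mapped:
print(24 / expected_depth(40e6, READ_LEN, GENOME))   # 0.2
```

So if the two libraries had similar yields, the 24x figure would imply that most of library A's bases aren't mapping back to its own assembly, which is part of what puzzles me.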
Interesting. How were those coverage depths (24x and 120x) calculated (mean, median, ...)? And how were the repeat regions determined? I wonder if you have more of an apples-to-oranges comparison than you expect, and it just isn't obvious from a couple of summary values like these.
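To illustrate why the mean-vs-median question matters - a toy example (purely synthetic depth values) of how a collapsed repeat can drag the mean well away from the median:

```python
# Toy per-base depth track: mostly ~25x single-copy depth, with one
# collapsed-repeat region at ~500x. Purely synthetic numbers.
from statistics import mean, median

depths = [25] * 950 + [500] * 50   # 5% of positions in a collapsed repeat

print(mean(depths))    # 48.75 -> mean inflated ~2x by the repeat
print(median(depths))  # 25    -> median still reflects single-copy depth
```

If one pipeline reported the mean and the other the median (or excluded "repeats" defined differently), the 24 vs 120 gap could be partly an artifact of the summary statistic.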
What I think I'd look into to investigate, roughly in order:
Thanks for your suggestions. Is there any tool for comparing the assemblies at scale? It seems a difficult task with graphical viewers, as both assemblies are highly fragmented - I could maybe focus on just a few contigs.
Nothing I'm aware of, but I only know the basics in this area, and I wouldn't be surprised if a tool exists for just this sort of thing. Personally, I'd probably cobble something together with Biopython or even just BLAST... but if there is an existing genome reference you can use here, it's probably easier to compare both contig sets (or just map the reads themselves) to that reference for both A and B. (Are you sure the reads you have in each set really look like the species you're expecting?)
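If you go the BLAST route, a sketch of the kind of thing I'd cobble together - summarising, for each contig of A, what fraction is covered by hits to B. The column layout assumes default `blastn -outfmt 6` tabular output; the sample hit lines and contig length are made up:

```python
# Summarise how much of each query contig is covered by BLAST hits.
# Assumes blastn tabular output (-outfmt 6), whose default columns are:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

def covered_fraction(hits, contig_lengths):
    """hits: list of outfmt-6 lines; returns {contig: fraction of bases covered}."""
    intervals = {}
    for line in hits:
        f = line.split("\t")
        q, qstart, qend = f[0], int(f[6]), int(f[7])
        lo, hi = min(qstart, qend), max(qstart, qend)
        intervals.setdefault(q, []).append((lo, hi))
    out = {}
    for q, ivs in intervals.items():
        ivs.sort()
        covered, end = 0, 0
        for lo, hi in ivs:              # merge overlapping hit intervals
            lo = max(lo, end + 1)
            if hi >= lo:
                covered += hi - lo + 1
                end = max(end, hi)
        out[q] = covered / contig_lengths[q]
    return out

# Made-up example: two partially overlapping hits on a 1 kb contig of A.
sample = [
    "contig_A1\tcontig_B7\t99.2\t600\t3\t1\t1\t600\t1\t600\t0.0\t1100",
    "contig_A1\tcontig_B9\t98.5\t500\t6\t2\t401\t900\t1\t500\t0.0\t900",
]
print(covered_fraction(sample, {"contig_A1": 1000}))  # {'contig_A1': 0.9}
```

Contigs of A with low covered fraction would be the interesting ones to look at - either missing from B or contamination in A.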