Question

How to check a genome assembly for contaminants

0

2.8 years ago

Ak ▴ 60

Using de novo assembly, I generated an assembly of my strain with a total size of about 147 Mb. However, the genome of the reference strain is only around 55 Mb, so my assembly is nearly three times larger than the reference.

Is there any method or software I could use to check whether my contigs contain contaminants or clonal variants?

Thanks!

P.S. I also tried a reference-guided assembly and ended up with a genome size of 51 Mb (60% of reads mapped). I checked the taxonomy of the unmapped reads, but only 24% of them were classified and 76% were unclassified, so the identity of most unmapped reads is still unknown.

assembly genome contaminants • 1.4k views

ADD COMMENT • link 2.8 years ago by Ak ▴ 60

1

It would be much easier to help you if you provided more details. A strain of what? What reads did you have and how were they assembled? What is the depth of coverage?

What is the assembly size when you throw out contigs smaller than 2K, 3K, 5K? You may already have a proper genome size once you throw out the small stuff.
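The size-by-cutoff check suggested above can be sketched in a few lines of Python. This is only an illustration: the function names and the tiny in-memory FASTA are made up for the example, and a real run would read the assembly file instead.

```python
from typing import Dict, Iterable, List, Sequence


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def size_by_cutoff(lengths: Sequence[int],
                   cutoffs: Sequence[int] = (0, 2000, 3000, 5000)) -> Dict[int, int]:
    """Total assembly size when only contigs >= each cutoff are kept."""
    return {c: sum(n for n in lengths if n >= c) for c in cutoffs}


# Toy input (hypothetical contigs of 6000, 2400 and 900 bp).
demo = [">c1", "A" * 6000, ">c2", "ACGT" * 600, ">c3", "A" * 900]
print(size_by_cutoff(contig_lengths(demo)))
```

For a real assembly, the same functions could be called on an open file handle, e.g. `contig_lengths(open("assembly.fasta"))`, where the file name is a placeholder.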

ADD REPLY • link 2.8 years ago by Mensur Dlakic ★ 28k

0

An Eimeria tenella strain. The data are Illumina paired-end DNA reads at 80x depth of coverage. For the de novo assembly, MEGAHIT and SSPACE were used, both with default parameters. For the reference-guided assembly, BWA-MEM was used.

I'm not sure about throwing out contigs smaller than 2k, 3k and 5k, because the average contig size is only 1402 bp.

ADD REPLY • link 2.8 years ago by Ak ▴ 60

score 0 · Answer 1 · 2022-02-25

0

2.8 years ago

colindaven 7.0k

If your average contig size is only 1400 bp, then how long is the average gene? An assembly is not useful if the average gene is fragmented, i.e. spread across multiple contigs. It would be very helpful to add long-read data. Even one nanopore flowcell of data would be a game changer here.

On your assembly, you can detect contamination using:

blastn -local
blastx -local
detect ORFs then use blastp etc
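As a rough sketch of how BLAST results on contigs could be summarized, the Python below keeps the best-scoring hit per contig and counts contigs per taxon. It assumes the search was run with a tabular format such as `-outfmt "6 qseqid sseqid pident length evalue bitscore staxids"` against a database that carries taxonomy IDs. The column choice, the function names, and the taxon IDs in the demo are placeholders, not output from this dataset.

```python
from collections import Counter
from typing import Dict, Iterable

# Column order assumed to match the -outfmt string used for the search.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "staxids"]


def best_hits(tabular_lines: Iterable[str]) -> Dict[str, str]:
    """Map each query contig to the taxon ID of its highest-bitscore hit."""
    best = {}  # qseqid -> (bitscore, staxids)
    for line in tabular_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        score = float(row["bitscore"])
        query = row["qseqid"]
        if query not in best or score > best[query][0]:
            best[query] = (score, row["staxids"])
    return {query: taxon for query, (_, taxon) in best.items()}


def tally_taxa(hits: Dict[str, str]) -> Counter:
    """Count how many contigs have their best hit in each taxon."""
    return Counter(hits.values())


# Made-up rows; "111" and "222" are placeholder taxon IDs.
demo_rows = [
    "contig1\tsubjA\t98.5\t1200\t0.0\t2100\t111",
    "contig1\tsubjB\t90.0\t800\t1e-50\t900\t222",
    "contig2\tsubjC\t95.0\t1000\t0.0\t1700\t222",
    "contig3\tsubjD\t99.0\t1500\t0.0\t2600\t111",
]
print(tally_taxa(best_hits(demo_rows)))
```

Contigs whose best hit falls outside the expected lineage would be the first candidates to inspect. Contigs with no hit at all do not appear in the tally and need to be counted separately.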

My feeling is that you do have contamination here; the assembly should not be that much bigger than expected.

You can also run BUSCO to assess assembly quality and check for multi-copy (duplicated) genes.

ADD COMMENT • link 2.8 years ago by colindaven 7.0k

0

I'm not sure about the average gene length across the whole assembly, because I'm only interested in 87 particular genes. For those 87 genes, the average length is around 600 bp. And yes, I did encounter some genes that were split across two contigs, so only partial sequences could be obtained for them.

Can you elaborate on how I should proceed with blastn or blastx? As I mentioned, I tried a reference-guided assembly and extracted the unmapped reads to find out what they were, but most of them were unclassified. I ran blastn and blastx on these unclassified reads, but no significant similarity was found.

The BUSCO result for the de novo assembly was actually quite good:

C:90.1%[S:87.4%,D:2.7%],F:5.6%,M:4.3%,n:446
402 Complete BUSCOs (C)
390 Complete and single-copy BUSCOs (S)
12 Complete and duplicated BUSCOs (D)
25 Fragmented BUSCOs (F)
19 Missing BUSCOs (M)
446 Total BUSCO groups searched
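As a quick arithmetic check, the percentages in the summary line can be recomputed from the counts listed above. The function name is made up for this sketch; the numbers are the ones reported in this post.

```python
from typing import Dict


def busco_percentages(counts: Dict[str, int], total: int) -> Dict[str, float]:
    """Convert BUSCO category counts into percentages of all groups searched."""
    pct = {key: round(100 * value / total, 1) for key, value in counts.items()}
    # "Complete" is the sum of single-copy and duplicated BUSCOs.
    pct["C"] = round(100 * (counts["S"] + counts["D"]) / total, 1)
    return pct


# Counts copied from the BUSCO summary above.
counts = {"S": 390, "D": 12, "F": 25, "M": 19}
total = 446

assert sum(counts.values()) == total  # the four categories cover all groups
print(busco_percentages(counts, total))
```

The duplicated fraction (D) is the figure relevant to the multi-copy check suggested earlier, with the caveat that BUSCO only samples a set of conserved genes.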

ADD REPLY • link 2.8 years ago by Ak ▴ 60