Question

High-quality genome assembly from poor data with a reference genome

0

6 weeks ago

Jason • 0

I have Illumina and Nanopore sequencing data for a mutant strain of Corynebacterium glutamicum, as well as a high-quality whole-genome reference for the wild type.

Because the DNA quality of the mutant strain is poor, the N50 of the Nanopore data is very low.

How can I obtain a circle plot (a high-quality whole genome of the mutant) from these data? Which software or workflow do you recommend?

Thanks

Description of ONT data:

General summary:
Mean read length:                 566.4
Mean read quality:                  9.3
Median read length:               435.0
Median read quality:                9.2
Number of reads:              169,528.0
Read length N50:                  635.0
STDEV read length:              1,933.0
Total bases:               96,018,294.0
Number, percentage and megabases of reads above quality cutoffs
>Q5:    169523 (100.0%) 96.0Mb
>Q7:    168539 (99.4%) 95.9Mb
>Q10:   39622 (23.4%) 30.7Mb
>Q12:   1153 (0.7%) 1.2Mb
>Q15:   6 (0.0%) 0.2Mb
Top 5 highest mean basecall quality scores and their read lengths
1:      21.9 (38688)
2:      19.7 (73373)
3:      19.5 (16502)
4:      17.4 (21180)
5:      16.0 (1654)
Top 5 longest reads and their mean basecall quality score
1:      527400 (8.8)
2:      469123 (8.5)
3:      107070 (13.1)
4:      98992 (8.9)
5:      84108 (8.4)
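
As a rough cross-check of the summary above, the mean read length and an approximate sequencing depth can be derived from the reported totals. This is only a sketch: the genome size of about 3.3 Mb is an assumption for C. glutamicum and is not stated in this post, so substitute the actual length of the wild-type reference.

```python
# Values copied from the read summary above.
TOTAL_BASES = 96_018_294
NUM_READS = 169_528

# ASSUMPTION: approximate C. glutamicum genome size (~3.3 Mb); not given in the post.
ASSUMED_GENOME_SIZE = 3_300_000


def mean_read_length(total_bases, num_reads):
    """Average read length in bases."""
    return total_bases / num_reads


def estimated_depth(total_bases, genome_size):
    """Approximate fold coverage, ignoring unmapped or contaminant reads."""
    return total_bases / genome_size


if __name__ == "__main__":
    print(f"mean read length: {mean_read_length(TOTAL_BASES, NUM_READS):.1f} bp")
    print(f"estimated depth:  {estimated_depth(TOTAL_BASES, ASSUMED_GENOME_SIZE):.1f}x")
```

Under that assumed genome size, this suggests the raw depth is moderate and that read length and quality, rather than the amount of data, are the limiting factors.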

Genome assembly • 508 views

ADD COMMENT • link updated 6 weeks ago by Michael 55k • written 6 weeks ago by Jason • 0

1

So, in essence, you want software that turns bad-quality data into good-quality data?

ADD REPLY • link 6 weeks ago by Michael 55k

0

Hi Michael, thanks for your reply. I want to use the poor ONT data plus the Illumina data to assemble a whole-genome sequence of the mutant strain. Luckily, I have a complete reference genome of the wild-type strain. So I was wondering whether I can combine the sequencing data with the reference genome to obtain a whole genome of the mutant. Thanks :)

ADD REPLY • link 6 weeks ago by Jason • 0

0

I think I am possibly missing the point here, but to me the obvious answer is to try to get high-quality data in the first place.
Btw, I have worked in the same group that made (one of) the first C. glutamicum reference genomes (in 2003). C. glutamicum is easy to culture, which is why it's used in biotechnological processes. With current ONT sequencing, you can create a genome assembly that is on par with or better than the reference:

take a live sample of the strain
culture it in the lab (using standard LB medium); this step is optional if you can get enough biomaterial
anyone who can culture E. coli can grow this (you need the medium, an old shaker, and an Erlenmeyer flask)
extract high-quality HMW DNA
optimize the protocol
sequence
filter at Q20 (Q30 is possible with duplex reads)
get a good assembly
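
The filtering step above could look roughly like the following. This is a minimal sketch, not a replacement for a dedicated read-filtering tool: it assumes Phred+33 quality strings and computes the mean read quality by averaging error probabilities (rather than averaging Q scores directly). The function names and the tuple layout of the records are hypothetical.

```python
import math


def mean_read_quality(quality_string, offset=33):
    """Mean Phred quality of a read, averaged in error-probability space.

    Assumes Phred+33 encoded quality characters.
    """
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_fastq_records(records, min_quality=20.0):
    """Keep (name, sequence, quality_string) records at or above min_quality."""
    return [
        rec for rec in records
        if rec[2] and mean_read_quality(rec[2]) >= min_quality
    ]


if __name__ == "__main__":
    reads = [
        ("read_a", "ACGT", "IIII"),  # 'I' encodes Q40
        ("read_b", "ACGT", "++++"),  # '+' encodes Q10
    ]
    for name, _seq, _qual in filter_fastq_records(reads, min_quality=20.0):
        print(name)
```

Averaging in probability space means a few very poor bases pull the read's score down more than a plain mean of Q values would.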

Possibly, the data is the result of a first sequencing attempt, but remember that Q10 equals a 10% error rate. Combined with the short read lengths, you are not getting much out of the data anyway. You won't get reliable SNVs or structural variation out of this, and the reference won't help you much.
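
To make the Q10 point concrete, a Phred score Q corresponds to an error probability of 10^(-Q/10). The short sketch below applies that standard relation to the median read length and median quality from the question (435 bases at Q9.2); the function names are only illustrative.

```python
def phred_to_error_probability(q):
    """Per-base error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def expected_errors(read_length, q):
    """Expected number of erroneous bases in a read of uniform quality q."""
    return read_length * phred_to_error_probability(q)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q}: error probability {phred_to_error_probability(q):.3f}")
    # Median read from the question: 435 bases at median quality 9.2
    print(f"expected errors in median read: {expected_errors(435, 9.2):.1f}")
```

At that quality, a typical read in this dataset would be expected to carry dozens of erroneous bases.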

If there is no option to generate good-quality HMW DNA, use only the Illumina sequences and run variant detection.

If that cannot be done either, refuse.
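
One possible shape for such an Illumina-only variant detection run is sketched below. The choice of bwa, samtools and bcftools is an assumption, not something prescribed in this thread, and all file names are hypothetical. The helper only assembles the command lines so they can be inspected before being run in a shell.

```python
import shlex


def build_variant_calling_commands(reference, reads_1, reads_2, prefix, threads=4):
    """Return shell commands for mapping paired reads and calling variants.

    ASSUMPTION: bwa, samtools and bcftools are installed; inputs are a
    wild-type reference FASTA and paired-end Illumina FASTQ files.
    """
    ref = shlex.quote(reference)
    r1, r2 = shlex.quote(reads_1), shlex.quote(reads_2)
    bam = shlex.quote(f"{prefix}.sorted.bam")
    vcf = shlex.quote(f"{prefix}.vcf.gz")
    return [
        # Index the wild-type reference for bwa
        f"bwa index {ref}",
        # Map the mutant reads and sort the alignments
        f"bwa mem -t {threads} {ref} {r1} {r2} | samtools sort -o {bam} -",
        # Index the sorted BAM
        f"samtools index {bam}",
        # Call variants, treating the bacterial genome as haploid
        f"bcftools mpileup -f {ref} {bam} | bcftools call -mv --ploidy 1 -Oz -o {vcf}",
    ]


if __name__ == "__main__":
    # Hypothetical file names for illustration only
    for cmd in build_variant_calling_commands(
        "wildtype.fasta", "mutant_R1.fastq.gz", "mutant_R2.fastq.gz", "mutant"
    ):
        print(cmd)
```

The resulting VCF lists SNVs and small indels of the mutant relative to the wild-type reference; larger rearrangements would not be reliably detected this way.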

ADD REPLY • link 6 weeks ago by Michael 55k

1

If you just want to compare your genomes, you can use any of a number of alignment tools. Mauve or MUMmer are great for this, but if you want to visualise with something like Circos, I have previously used Satsuma2, the output of which can easily be adapted into Circos input.

But @michael's point is valid. You're not going to improve your assembly if you have poor input data. Reference-guided approaches come with a whole series of issues; there are plenty of discussions about that on this site if you search.

ADD REPLY • link 6 weeks ago by dthorbur ★ 2.6k

0

Thanks! This is quite helpful. The big problem for me is the DNA quality for sequencing. I only have access to DNA that has undergone industrial treatment, so the N50 is normally around 600-700 bp. Is it therefore impossible to get a high-quality, circular whole genome based on these data? Could I improve it by accumulating more data (multiple sequencing runs of different DNA preparations of the mutant)?

ADD REPLY • link 6 weeks ago by Jason • 0