Question

De novo assembly misassembly correction

5.6 years ago

vicloren • 0

Dear all,

I have been reading this forum for a while now, since I started working with WGS data a few months ago. Most of my knowledge is self-taught, thanks to sites like this one, or acquired through a course or internship. I have seen several posts about assembly but haven't found any that really help with my problem. I will try to give as much context as I can.

I am interested in detecting a specific insertion sequence (or something similar to it) in a set of Illumina reads (2x300 bp run on a MiSeq) obtained from a Mycobacterium species. I have already processed the reads and run a de novo assembly with SPAdes. I created a BLAST database from the resulting contigs and "fished" out my sequence using BLAST+. I then extracted my sequence of interest from the contigs file for further analysis.
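For readers following the same workflow, the extraction step can be sketched roughly as below. This is only an illustration, not the code used in the post: `read_fasta` and `extract_hit` are hypothetical helper names, and the sketch assumes the hit coordinates come from BLAST+ tabular output, where subject positions are 1-based and inclusive and `sstart > send` indicates a minus-strand hit.

```python
# Hypothetical helpers for illustration; they are not part of BLAST+ or SPAdes.

def read_fasta(path):
    """Return {contig_id: sequence} from a FASTA file (ID = first word of the header)."""
    chunks = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)
    return {contig: "".join(parts) for contig, parts in chunks.items()}


# Translation table used to complement a DNA string.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_hit(contigs, contig_id, sstart, send):
    """Cut a hit out of a contig using 1-based, inclusive subject coordinates.

    Assumption: sstart > send means the hit is on the minus strand, so the
    reverse complement is returned in that case.
    """
    seq = contigs[contig_id]
    lo, hi = min(sstart, send), max(sstart, send)
    fragment = seq[lo - 1:hi]
    if sstart > send:
        fragment = fragment.translate(_COMPLEMENT)[::-1]
    return fragment
```

Typical use would be `contigs = read_fasta("contigs.fasta")` followed by `extract_hit(contigs, contig_id, sstart, send)` with the values from one BLAST hit line; the file name here is a placeholder.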

My issue comes when I assess the quality of my assembly. I ran QUAST with a reference sequence (approx. 5.2 Mb) from the same subspecies of mycobacteria, which also seems to carry this sequence. I think the report looks good until I get to the misassembly section:

# misassemblies               48
# misassembled contigs        29
Misassembled contigs length   2421296
# local misassemblies         23
# scaffold gap ext. mis.      0
# scaffold gap loc. mis.      0
# unaligned mis. contigs      2
# unaligned contigs           18 + 22 part
Unaligned length              492243

Genome fraction (%)           94.955
Duplication ratio             1.002
# N's per 100 kbp             0.00
# mismatches per 100 kbp      575.74
# indels per 100 kbp          15.81
Largest alignment             200998
Total aligned length          4930311

One of these misassemblies appears in my contig of interest. Half of the contig aligns to one side of the reference, whereas the other half aligns to the other side. My sequence of interest falls within this second half and, luckily, is not cut by the relocation. Since mycobacteria don't usually recombine, I think this contig is an artifact. I am therefore concerned about the other 28 misassembled contigs and about how I can describe my assembly in a future publication (I don't want to upload a poor-quality assembly; I want it to be the best possible version).
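The check described above, whether the sequence of interest is cut by the breakpoint, can be sketched as a simple interval test. This is a minimal sketch under assumptions: `region_is_intact` is a hypothetical helper, the alignment blocks are the contig coordinates (1-based, inclusive) of each piece of the contig that aligns to the reference, and the numbers in the example are made up for illustration rather than taken from this assembly.

```python
def region_is_intact(blocks, roi_start, roi_end):
    """Return True if the region [roi_start, roi_end] lies entirely inside
    a single aligned block of the contig.

    blocks: list of (contig_start, contig_end) tuples, 1-based and inclusive,
    one per alignment of the contig to the reference.
    """
    return any(start <= roi_start and roi_end <= end for start, end in blocks)


# Made-up example: a contig split by a relocation into two aligned halves.
example_blocks = [(1, 50000), (50001, 100000)]

# A region fully inside the second half is not cut by the breakpoint.
inside_second_half = region_is_intact(example_blocks, 62000, 63400)

# A region straddling position 50000/50001 is cut by the breakpoint.
spans_breakpoint = region_is_intact(example_blocks, 49500, 50800)
```

A region that passes this test is only known to sit within one aligned piece; it says nothing about whether the join between the two pieces is real.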

I have tried increasing the k-mer sizes beyond the SPAdes defaults but got similar results. Is this simply a limitation of de novo assembly from short reads, or is there a way to reduce the misassemblies without resequencing with long-read technology?

Thank you very much for your help!

Kind regards,

Assembly next-gen genome • 1.7k views

updated 5.6 years ago by Ram 44k • written 5.6 years ago by vicloren • 0

I'm having this problem myself. Did you figure out what the problem was?

4.1 years ago by pcvgt • 0