Hi all, I am using OpGen optical mapping data, which I find to be a fantastic aid for genome assembly. Below are a few data points (the full map has 334 sites). Using these data points, I was able to discover misassemblies produced by my automated assembly tools (e.g. Newbler). My overall question is: how can I automate my assembly using these high-quality data points?
My immediate idea is to artificially convert these sites into 6-mer paired-end reads. For example, the first data point below describes a restriction fragment of 14867 bp. In other words, there are two NheI sites 14867 bp apart. So my immediate questions are: how can I convert these sites into paired-end reads, and what paired-end read file format would Newbler accept? The restriction site is G^CTAGC.
Thank you for your help.
<RESTRICTION_MAP ID="XYZ" ENZYME="NheI" INSILICO="false">
<MAP_DISPLAY DBID="13" EDITABLE="false" STICK="false" X="10000" Y="149" TRANS="255" ORDER="1320" ORIENTATION="1" CIRCULAR="true" GROUPID="-1" />
<FRAGMENTS SHIFT="0" OFFSET="1">
<F I="0" S="14867" STDDEV="0.000" HIGHLIGHT="false" HIDE="false" GAP="false" />
<F I="1" S="7731" STDDEV="0.000" HIGHLIGHT="false" HIDE="false" GAP="false" />
<F I="2" S="9070" STDDEV="0.000" HIGHLIGHT="false" HIDE="false" GAP="false" />
<F I="3" S="2016" STDDEV="0.000" HIGHLIGHT="false" HIDE="false" GAP="false" />
<F I="4" S="3175" STDDEV="0.000" HIGHLIGHT="false" HIDE="false" GAP="false" />
<F I="5" S="5418" STDDEV="0.000" HIGHLIGHT="false" HIDE="false" GAP="false" />
</FRAGMENTS>
<MAP_METRICS STRETCH="0" RECT_AVE="0.00" RECT_ALL="0.00" MID_STDDEV="0.00" R="0.00" WIGGLE="0.00" GAP_STDDEV="0.00" GAP_MAX="0" />
<FEATURES>
</FEATURES>
</RESTRICTION_MAP>
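As a starting point for any conversion, here is a minimal Python sketch that reads the fragment sizes out of a map like the one above and turns them into cumulative site positions. It assumes the XML layout shown in this post; the `MAP_XML` string is a trimmed copy keeping only the `I` and `S` attributes, and the function names are just illustrative. It does not address what Newbler will accept as input.

```python
import xml.etree.ElementTree as ET
from itertools import accumulate

# Trimmed copy of the map above (only the I and S attributes are kept).
MAP_XML = """<RESTRICTION_MAP ID="XYZ" ENZYME="NheI" INSILICO="false">
<FRAGMENTS SHIFT="0" OFFSET="1">
<F I="0" S="14867" />
<F I="1" S="7731" />
<F I="2" S="9070" />
<F I="3" S="2016" />
<F I="4" S="3175" />
<F I="5" S="5418" />
</FRAGMENTS>
</RESTRICTION_MAP>"""


def fragment_sizes(xml_text):
    """Return fragment sizes (bp) ordered by the fragment index I."""
    root = ET.fromstring(xml_text)
    frags = sorted(root.iter("F"), key=lambda f: int(f.get("I")))
    return [int(f.get("S")) for f in frags]


def site_positions(sizes):
    """Running sum of fragment sizes: the end coordinate of each fragment,
    measured from the start of the first fragment."""
    return list(accumulate(sizes))


sizes = fragment_sizes(MAP_XML)
positions = site_positions(sizes)
for size, pos in zip(sizes, positions):
    print(size, pos)
```

Each pair of neighbouring positions is then one "site pair" with a known distance, which is the information a paired-end style constraint would need to carry.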
I wish I had mate pairs to correct! My assembly is based on single-end reads. Your solution is a really good one for a misassembly that already involves paired-end reads. Bambus looks good too (one more tool to add to my toolbox!), but I would not know where to break my misassembled contig so that I could use it.
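One rough way to narrow down where a contig disagrees with the map might be to digest the contig in silico and compare the fragment sizes against the map by eye. The sketch below is only that: it assumes a linear contig held as a plain string and the G^CTAGC site from my first post, and the function name and toy sequence are made up for illustration.

```python
def nhei_fragments(seq, site="GCTAGC", cut_offset=1):
    """Fragment sizes (bp) from cutting a linear sequence at every NheI site.

    cut_offset=1 places the cut after the first base of the site (G^CTAGC).
    """
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        # Step by one base so overlapping occurrences are not missed.
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


# Toy contig with two NheI sites; real contigs would come from a FASTA file.
toy_contig = "AAAA" + "GCTAGC" + "TTTTT" + "GCTAGC" + "CC"
print(nhei_fragments(toy_contig))
```

A run of in silico fragments that matches the map followed by a sudden mismatch would point to a candidate breakpoint, though measurement error in the map sizes would have to be allowed for.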
So you are saying that you have chimeric contigs? You can try mapping your reads back to your contigs and looking for regions with low read coverage. Remove the reads in those regions, then reassemble. Bear in mind that optical maps (OM) can also contain errors.
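To make the coverage check concrete, here is a small Python sketch that reports low-coverage stretches from a list of per-base depths. It assumes you already have one depth value per contig position (for example, the depth column of `samtools depth` output); the function name and the threshold are illustrative, not recommended values.

```python
def low_coverage_regions(depths, min_depth):
    """Return half-open (start, end) intervals, 0-based, where depth < min_depth."""
    regions = []
    start = None
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = pos  # a low-coverage run begins here
        elif start is not None:
            regions.append((start, pos))  # the run ended just before pos
            start = None
    if start is not None:
        regions.append((start, len(depths)))  # run extends to the contig end
    return regions


# Toy depths for an 8 bp contig; real input would have one value per base.
example_depths = [10, 10, 1, 0, 2, 9, 9, 0]
print(low_coverage_regions(example_depths, min_depth=3))
```

Low coverage at the very ends of a contig is expected, so only the internal intervals are likely to be interesting as possible chimeric joins.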
That's a very good point that OpGen maps can contain errors. In a recent seminar at my institution, they discussed how they may eventually bring in confidence scores (or something approximating that), but for now they do not, so I am treating the maps as high confidence. I have chimeric contigs, but I do not have an assembly file (ace, afg, etc.) because of my multi-step assembly process. However, I may choose to just use Newbler so that I have an ace file, and then use your method. That is a good idea.