Hello,
I have a small haploid genome (85 Mb) that was assembled with Canu from ~100x of PacBio Sequel reads. In addition, a batch of 40 Gbp of Hi-C Illumina reads was sequenced for scaffolding. The assembly has been polished with Arrow, but there is no third dataset of Illumina reads to polish with Pilon. I was wondering if I could instead use the Hi-C reads for the Illumina polishing step by mapping one or both ends of the reads individually to the assembly. However, given the nature of Hi-C reads, I am a little concerned that the uneven coverage and chimeric reads could have a negative impact. Does anyone have previous experience with this approach? Is it a good idea to use Hi-C reads to polish an assembly?
Thanks
The uneven coverage means polishing will be uneven, with some regions unpolished. As for the chimeric reads, you could use only reads mapping end-to-end to the reference, e.g., using samclip.
Thanks h.mon for the suggestion. Like you pointed out, using only end-to-end mapped reads could still be useful to polish regions of the genome. I will give it a shot and see how it looks.
How did it go? I was thinking of doing the same.
I gave it a shot, but did not move forward with it. Based on the info I gathered, it can be done, but there is no guarantee about the quality of the results. In the end, we decided to sequence more Illumina data for the polishing step to avoid downstream problems. But I can still describe what I did:
To polish the assembly with Hi-C reads, I mapped both ends individually with bwa mem. After removing unmapped reads, supplementary and secondary alignments with samtools, I removed PCR-duplicated reads with Picardtools. Clipped reads were also removed with samclip, since they are likely chimeric reads.
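For anyone who wants to see the filtering logic spelled out, here is a minimal Python sketch of the steps above. It is only an illustration, not the commands I actually ran (those were samtools, Picard and samclip). The function names `keep_for_polishing` and `filter_sam` are made up for this example. It assumes duplicates have already been marked with the 0x400 flag, and it drops every alignment with any soft or hard clipping, which may be stricter than what samclip does with its own settings.

```python
import re

# SAM flag bits, as defined in the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400       # assumes duplicates were marked beforehand
SUPPLEMENTARY = 0x800

# Soft (S) or hard (H) clipping in the CIGAR string hints at a chimeric read
CLIP_OPS = re.compile(r"[SH]")


def keep_for_polishing(sam_line):
    """Return True if one SAM alignment line passes all filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = fields[5]
    # Drop unmapped, secondary, supplementary and duplicate alignments
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY | DUPLICATE):
        return False
    # Drop alignments without a CIGAR or with any clipped bases
    if cigar == "*" or CLIP_OPS.search(cigar):
        return False
    return True


def filter_sam(lines):
    """Yield header lines unchanged and only the alignments that pass."""
    for line in lines:
        if line.startswith("@") or keep_for_polishing(line):
            yield line
```

One way to use it would be to stream uncompressed SAM text through `filter_sam(sys.stdin)` and write the surviving lines to standard output before converting back to BAM for Pilon.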
Using the dataset described above, Pilon confirmed 99% of the bases in the assembly (previously polished with Arrow) and made 726 changes, of which 88% were corrections of single-base indels. To me, these numbers suggest that the polishing was successful. Again, we did not move forward with it, to avoid downstream problems: this is not a common approach, and I have not seen any in-depth analyses of possible complications.