Hello, I am new to this and ran into a problem while scaffolding with Juicer using Hi-C data. The genome size is approximately 700 Mb, with ten chromosomes. Using the Hi-C data, I could scaffold only 400 Mb onto the ten chromosomes. Many contigs look by eye as though they belong on the chromosomes, but Juicer failed to place them in the right positions. https://ibb.co/3zCHT8R
Next, I scaffolded these contigs manually, but I got a weird structure! https://ibb.co/R2QS0bx https://ibb.co/G2Pn0ZZ
I removed duplicated sequences after the Canu assembly, and the remaining duplicates are at 2%.
Please help me and thank you so much!
I think the PacBio + Hi-C approach is still quite new, and I haven't seen much raw data released (that could be my ignorance, of course), even though it is said to produce chromosome-scale assemblies (I have mostly focused on copepod genomes). It would be easiest for someone with direct access to the data to help, but of course sharing it might give you some headaches. My best bet would be to contact the developers of Juicer. Otherwise, if you are willing to share your data, someone may be willing to help in return for attribution (I would be, but I cannot promise anything).
One last idea: I hear bacterial contamination is a common problem, which might explain your unintegrated sequences. Have you checked for bacterial contamination?
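If it helps, here is a rough first-pass screen I might try before running a proper taxonomic check: compute the GC content of each contig and flag those that sit far from the assembly median, since contaminant contigs often (not always) differ in GC from the host. This is only a sketch under my own assumptions: the function names and the 0.10 deviation threshold are made up for illustration, and a GC outlier is a hint, not proof of contamination.

```python
from statistics import median


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Return G+C as a fraction of unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def flag_gc_outliers(records, max_deviation=0.10):
    """Return sorted names of contigs whose GC fraction differs from the
    median by more than max_deviation (an arbitrary, assumed threshold)."""
    gc = {name: gc_fraction(seq) for name, seq in records}
    if not gc:
        return []
    centre = median(gc.values())
    return sorted(
        name for name, value in gc.items() if abs(value - centre) > max_deviation
    )
```

You would call it on an open file handle, for example `flag_gc_outliers(read_fasta(handle))`, where the handle points at a FASTA of your unplaced contigs (whatever that file is called on your side). Any flagged contigs would then be worth searching against a sequence database to see what they actually are.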
Many thanks for your kind help. I would like to share my data with you, but I need permission from my adviser. I will also check for bacterial contamination as you suggested, and I will let you know of any results. Thanks again!
By the way, are your heatmaps "chromatin contact" plots? And could these extended stretches be linked to the centromeres? How are these plots expected to look?
Yes, they are meant to be chromatin contact maps. I have thought about centromeres too. But the ends of the chromosomes interact strongly with each other. Is that normal, or is it a mistake? I am quite confused.
I think these are artifacts caused by telomeric and possibly centromeric repeats.