Why does my Hi-C contact map show large regions making little to no contact?
6 months ago
Winter • 0

Hello everyone, bioinformatician in training here.

I'm currently working on Hi-C scaffolding. I have a set of scaffolds produced following the VGP genome assembly pipeline (HiFi data from a single individual, pooled Hi-C data, no Bionano data) (https://doi.org/10.1038/s41587-023-02100-3). The Hi-C reads have now been mapped to them using BWA-MEM2, the scaffolds refined with YaHS, and the contacts visualized with PretextMap.

However, the contact maps look like nothing I've seen before. Only about half of each putative chromosome/bin appears to have a high contact frequency, and the other half appears to make no contact at all. Moreover, based on previous publications, we should have about 20 chromosomes, but we're seeing at least 50.

I've attached images of the full map + a representative bin. Any idea what might be causing this? I feel I must have made an error somewhere upstream in the assembly, but it's possible there are additional steps I need to take to refine the Hi-C data. Any suggestions?

So far, I've tried correcting all the potential mistakes I could think of, but none of these changes made any noticeable difference.

  1. First, I'm working with two lanes of Hi-C reads (2 forward, 2 reverse). Initially, I was merging the raw FASTQ files before mapping, but I've since been mapping each set of forward and reverse reads independently with BWA-MEM2 and then merging the BAM files. I've also tried using only a single lane of forward and reverse reads.
  2. Initially, I deviated from the VGP pipeline and decontaminated the contigs before scaffolding, but I've also tried scaffolding without decontamination as well.
  3. I also considered whether the Hi-C reads have some sort of adapter contamination, but the company has said they already removed the adapters, and FastQC seems to support this.
  4. I have also tried increasing the minimum mapping quality threshold for PretextMap (--mapq) to 30.
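For anyone who wants to check what a mapping quality threshold like the one in item 4 does to the data, here is a minimal sketch. It assumes plain SAM text (for example from `samtools view`), that both mates share a read name, and that a pair is kept only if both mates reach the threshold. The function name `summarize_pairs` and the toy records are made up for illustration; this is not PretextMap's actual filtering code.

```python
from collections import defaultdict

# SAM flag bits for records that should not count as a primary mapped mate
SECONDARY, SUPPLEMENTARY, UNMAPPED = 0x100, 0x800, 0x4

def summarize_pairs(sam_lines, min_mapq=30):
    """Tally Hi-C pairs by MAPQ and by same vs. different scaffold (hypothetical helper)."""
    mates = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:               # not a valid alignment record
            continue
        flag = int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY | UNMAPPED):
            continue
        # keep (scaffold name, MAPQ) for each primary mapped mate
        mates[fields[0]].append((fields[2], int(fields[4])))

    summary = {"pairs": 0, "pass_mapq": 0, "same_scaffold": 0, "different_scaffold": 0}
    for alignments in mates.values():
        if len(alignments) != 2:          # need both mates mapped
            continue
        summary["pairs"] += 1
        (ref1, q1), (ref2, q2) = alignments
        if min(q1, q2) < min_mapq:
            continue
        summary["pass_mapq"] += 1
        key = "same_scaffold" if ref1 == ref2 else "different_scaffold"
        summary[key] += 1
    return summary

# Toy records, invented for illustration only
toy = [
    "@SQ\tSN:scaffold_1\tLN:1000",
    "r1\t65\tscaffold_1\t100\t60\t50M\t=\t900\t0\t*\t*",
    "r1\t129\tscaffold_1\t900\t60\t50M\t=\t100\t0\t*\t*",
    "r2\t65\tscaffold_1\t200\t60\t50M\tscaffold_2\t300\t0\t*\t*",
    "r2\t129\tscaffold_2\t300\t40\t50M\tscaffold_1\t200\t0\t*\t*",
    "r3\t65\tscaffold_1\t10\t0\t50M\t=\t500\t0\t*\t*",
    "r3\t129\tscaffold_1\t500\t60\t50M\t=\t10\t0\t*\t*",
    "r4\t73\tscaffold_2\t50\t60\t50M\t=\t50\t0\t*\t*",
    "r4\t133\tscaffold_2\t50\t0\t*\t=\t50\t0\t*\t*",
]
result = summarize_pairs(toy, min_mapq=30)
print(result)
```

If only a small fraction of pairs survives the threshold, or nearly all surviving pairs land on different scaffolds, that would point to a problem in the mapping step rather than in the map rendering.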

For what it's worth, the quality control measures for our contigs look great. The length matches closely with our estimated genome size and the BUSCO score is 99.6% with almost no duplicated genes.

Thanks everyone!

[Image: full contact map]

[Image: single representative bin]

BWA-MEM2 Hi-C PretextMap • 663 views
4 months ago

Hi, what percentage of your Hi-C reads mapped? You can get it with samtools flagstat. You could also try adding the coverage to your Pretext map... However, I agree that it looks strange; you could maybe try using only one of the Hi-C lanes.
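To turn the `samtools flagstat` text into percentages, something like the sketch below could work. It assumes the default plain-text flagstat layout (`N + M label`), and the helper names `parse_flagstat` and `percent` as well as the example numbers are invented for illustration, not taken from this dataset.

```python
import re

def parse_flagstat(text):
    """Return {label: QC-passed count} from plain-text `samtools flagstat` output."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"(\d+) \+ (\d+) (.+)", line.strip())
        if not m:
            continue
        # Drop trailing notes such as "(95.00% : N/A)" or "(QC-passed reads + ...)",
        # but keep qualifiers like "(mapQ>=5)" that are part of the label.
        label = re.sub(r"\s*\((?:[^)]*[%:][^)]*|QC-passed[^)]*)\)\s*$", "", m.group(3))
        counts[label] = int(m.group(1))
    return counts

def percent(part, whole):
    """Percentage of `part` in `whole`, or 0.0 when `whole` is zero."""
    return 100.0 * part / whole if whole else 0.0

# Made-up flagstat output, for illustration only
example = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
1000 + 0 paired in sequencing
500 + 0 read1
500 + 0 read2
900 + 0 properly paired (90.00% : N/A)
940 + 0 with itself and mate mapped
10 + 0 singletons (1.00% : N/A)
40 + 0 with mate mapped to a different chr
20 + 0 with mate mapped to a different chr (mapQ>=5)
"""

counts = parse_flagstat(example)
mapped_pct = percent(counts["mapped"], counts["in total"])
inter_pct = percent(counts["with mate mapped to a different chr"],
                    counts["with itself and mate mapped"])
print(f"mapped: {mapped_pct:.2f}%, mates on different scaffolds: {inter_pct:.2f}%")
```

For Hi-C data the "properly paired" figure is usually less informative than the overall mapped percentage, since Hi-C mates are expected to map far apart or on different scaffolds.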

It seems like somehow YAHS hasn't scaffolded properly and has been splitting scaffolds that should be together...

Greetings,


Thanks for the reply! I had used just a single Hi-C lane and had the same strange results. Moreover, the results look broadly indistinguishable before and after YAHS. We ended up asking the sequencing company to do the mapping for us, and they used a different pipeline and were able to do so successfully. Perhaps some user error on my end or some strange issues with Galaxy or the packages being used. In the end, however, I still have no clear idea about what exactly went wrong.
