Hi everyone, in a Hi-C contact matrix I see non-zero values on the diagonal. What is the meaning of self-contact? Should I ignore the diagonal in further calculations?
Thanks
The true diagonal contains contacts between loci separated by distances below the size of a bin. The 2nd diagonal has such contacts too, i.e. contacts between pairs of loci located close to a bin boundary but on either side of it.
Our lab's usual recommendation is to discard the first two diagonals, because they are contaminated by non-informative artifacts of the Hi-C procedure: unligated and self-ligated molecules. The former are just pieces of undigested and unligated DNA; the latter are formed when the two ends of the same molecule get ligated and the resulting circle then gets cleaved elsewhere. Both types of molecules look like short-distance contacts. Unligated DNA pieces appear as contacts with a separation of a few hundred bp and with the sequencing directions pointing toward each other along the genome. Self-ligated molecules usually have a separation of a few kb to 10 kb and have sequencing directions pointing away from each other. Neither contains information about spatial organization. Because unligated DNA and self-circles cannot be distinguished from "true" contacts formed by two distinct ligated molecules, their presence essentially invalidates all statistics on short-distance contacts. For this reason, we usually discard the first two diagonals of Hi-C matrices at high resolutions (bins up to a few tens of kb), or only the first diagonal for low-resolution datasets (100 kb+).
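Discarding the first diagonals can be sketched in a few lines. This is a minimal, dependency-free illustration, assuming the matrix is a square list of lists; the function name `mask_diagonals`, its parameters, and the toy counts are made up for this example and are not from any Hi-C package.

```python
def mask_diagonals(matrix, n_diags=2, fill=float("nan")):
    """Return a copy of a square contact matrix with the first n_diags diagonals masked.

    n_diags=1 masks only the main diagonal (|i - j| == 0);
    n_diags=2 also masks the neighbouring diagonals (|i - j| == 1).
    """
    n = len(matrix)
    masked = [list(row) for row in matrix]  # copy, so the input stays untouched
    for i in range(n):
        # Columns j with |i - j| < n_diags, clipped to the matrix bounds.
        for j in range(max(0, i - n_diags + 1), min(n, i + n_diags)):
            masked[i][j] = fill
    return masked


# Hypothetical 4-bin symmetric contact matrix (invented counts).
toy = [
    [90, 40, 10, 2],
    [40, 85, 35, 8],
    [10, 35, 80, 30],
    [2, 8, 30, 95],
]

# High-resolution case: drop the first two diagonals.
masked = mask_diagonals(toy, n_diags=2, fill=0)
for row in masked:
    print(row)
```

With `n_diags=1` only the main diagonal is masked, which matches the low-resolution case above. Using NaN (the default `fill`) rather than 0 makes it harder to mistake masked bins for bins with no contacts in later statistics.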
What's the size of the bins in your matrix? Let's say the bins are 5 kb long. The values on the diagonal mean that you see interactions at very short distances (less than 5 kb). To me, these should be the highest values.
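To make the bin arithmetic concrete, here is a small sketch assuming fixed-size 5 kb bins and 0-based genomic coordinates on the same chromosome. The helper name `diagonal_offset` is hypothetical.

```python
def diagonal_offset(pos_a, pos_b, bin_size=5000):
    """Return which diagonal a pair of positions lands on (0 = main diagonal)."""
    return abs(pos_a // bin_size - pos_b // bin_size)


# 3 kb apart, both inside bin 0: main diagonal.
same_bin = diagonal_offset(1000, 4000)

# Only 200 bp apart, but on either side of the 5 kb boundary: 2nd diagonal.
straddling = diagonal_offset(4900, 5100)

# 9.8 kb apart in adjacent bins: also the 2nd diagonal.
adjacent = diagonal_offset(100, 9900)

print(same_bin, straddling, adjacent)
```

So the main diagonal only holds separations below one bin size, while the next diagonal mixes anything from a few bp up to almost two bin sizes.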
I am calculating the contact probability of each pair using the method of Zhang and Wolynes (2015, PNAS):
p(i,j) = min(1, C(i,j) / min(n_i, n_j))
where n_i = max[ C(i-4,i), C(i-3,i), ..., C(i,i+2), C(i,i+3) ] and C(i,j) is the contact frequency of the pair (i,j). So my question is: what value of C(i,i) should I use when calculating p(i,j)?
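For reference, the formula as written in this post can be sketched as below. This is only an illustration of the expression quoted here, not the authors' implementation. The window offsets (-4 to +3) are taken from the post; whether offset 0, i.e. C(i,i), belongs in the maximum is exactly the open question, so it is exposed as a hypothetical `include_self` flag. The function names and toy counts are invented.

```python
def neighbor_max(C, i, lo=-4, hi=3, include_self=False):
    """Largest contact count between bin i and its neighbours i+lo .. i+hi."""
    n = len(C)
    vals = []
    for d in range(lo, hi + 1):
        k = i + d
        if k < 0 or k >= n:
            continue  # neighbour falls off the end of the matrix
        if d == 0 and not include_self:
            continue  # skip the diagonal entry C(i,i)
        vals.append(C[i][k])  # C is symmetric, so C[i][k] == C[k][i]
    return max(vals, default=0)


def contact_probability(C, i, j, **window):
    """p(i,j) = min(1, C(i,j) / min(n_i, n_j)), as quoted in the post."""
    denom = min(neighbor_max(C, i, **window), neighbor_max(C, j, **window))
    if denom == 0:
        return 0.0  # assumption: bins without neighbour contacts give p = 0
    return min(1.0, C[i][j] / denom)


# Hypothetical 4-bin symmetric contact matrix (invented counts).
toy = [
    [90, 40, 10, 2],
    [40, 85, 35, 8],
    [10, 35, 80, 30],
    [2, 8, 30, 95],
]

p_without_diag = contact_probability(toy, 0, 1)
p_with_diag = contact_probability(toy, 0, 1, include_self=True)
print(p_without_diag, p_with_diag)
```

On this toy matrix, including the diagonal lowers p(0,1) because the large C(i,i) values dominate the normalisation. That shows how sensitive the result is to how the diagonal is treated.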
So you are estimating contacts between fragments. But obviously, fragments that are already very close together (i.e. in the same bin) are highly likely to be in close spatial proximity. That is probably not what you are looking for.
You probably want to check for secondary-structure interactions such as looping, not for interactions that arise simply from proximity along the primary sequence.
Is this recommendation on diagonal bias (i.e., data processing workflow) published?
uuuggghh, not really! :) The 4DN consortium is currently working on Hi-C data analysis standards, which will address this issue among others, though it will take some time to produce a document. Meanwhile, I'd recommend my all-time favorite guide to Hi-C data analysis, by Noam Kaplan and Bryan Lajoie from Dekker's group. It discusses the non-informative artifacts of Hi-C, though it doesn't suggest any particular threshold.
Thank you for your ideas!
I remembered your suggestion to remove the 1st and 2nd diagonals, which sounds very logical to me. But still, shouldn't we, in an ideal world, remove diagonal bias while normalising interaction matrices (i.e., O/E should remove super- and sub-diagonal bias)? This is not a question, just a random observation that should be tested :)
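For anyone who wants to test this, here is a minimal sketch of observed/expected normalisation. It assumes the simplest possible "expected": the mean of each diagonal of a single-chromosome matrix. Real pipelines estimate the expected signal more carefully, and the function name and toy counts here are invented.

```python
def observed_over_expected(matrix):
    """Divide each entry by the mean of its diagonal (distance-decay expected)."""
    n = len(matrix)
    expected = []
    for d in range(n):
        # All entries at genomic separation d bins (upper triangle).
        vals = [matrix[i][i + d] for i in range(n - d)]
        expected.append(sum(vals) / len(vals))
    oe = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            e = expected[abs(i - j)]
            oe[i][j] = matrix[i][j] / e if e > 0 else 0.0
    return oe, expected


# Hypothetical 4-bin symmetric contact matrix (invented counts).
toy = [
    [90, 40, 10, 2],
    [40, 85, 35, 8],
    [10, 35, 80, 30],
    [2, 8, 30, 95],
]

oe, expected = observed_over_expected(toy)
print(expected)
```

One thing to keep in mind when testing: this O/E is a per-diagonal rescaling, so every diagonal ends up with mean 1. Within a diagonal, counts from unligated or self-ligated molecules are still mixed in with real contacts, and the division alone does not separate them.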