Can anyone provide a formal definition of TAD boundaries in bioinformatic language?
For example, I got a BED file from a TAD caller (this example is from hiTADs) using 10 kb bins. I subset the BED file by removing all the subTADs and got the output below.
chr1 2120000 2340000
chr1 2340000 3790000
chr1 3790000 3840000
chr1 4020000 6050000
chr1 6050000 6750000
I noticed that the TADs are not always contiguous, e.g. there is a 180 kb gap between the TADs chr1:3790000-3840000 and chr1:4020000-6050000, which corresponds to a low-interaction region in the Hi-C matrix.
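To make the gap concrete, here is a minimal sketch that lists the uncovered regions between consecutive TADs. The function name `find_gaps` and the tuple layout are illustrative assumptions, not part of any TAD caller's output format beyond the three BED columns shown above.

```python
# TAD intervals from the example above as (chrom, start, end).
tads = [
    ("chr1", 2120000, 2340000),
    ("chr1", 2340000, 3790000),
    ("chr1", 3790000, 3840000),
    ("chr1", 4020000, 6050000),
    ("chr1", 6050000, 6750000),
]

def find_gaps(tads):
    """Return (chrom, gap_start, gap_end) for consecutive TADs on the
    same chromosome that do not touch (hypothetical helper)."""
    ordered = sorted(tads)  # sort by chromosome name, then start
    gaps = []
    for (c1, _, e1), (c2, s2, _) in zip(ordered, ordered[1:]):
        if c1 == c2 and s2 > e1:
            gaps.append((c1, e1, s2))
    return gaps

gaps = find_gaps(tads)
print(gaps)  # [('chr1', 3840000, 4020000)]
```

With the five intervals above, the only gap is chr1:3840000-4020000, which is 180 kb wide.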
Question: should I consider the "gap" chr1:3840000-4020000, the "internal bin" next to the start/end coordinates, or the "external bin" next to the start/end coordinates as the TAD boundary?
"Internal bins" for the example above (start ~ start + 10kbp, or end - 10kbp ~ end):
chr1 2120000 2130000
chr1 2330000 2340000
chr1 2340000 2350000
chr1 3780000 3790000
chr1 3790000 3800000
chr1 3830000 3840000
chr1 4020000 4030000
chr1 6040000 6050000
chr1 6050000 6060000
chr1 6740000 6750000
"External bins" for the example above (start - 10kbp ~ start, or end ~ end + 10kbp):
chr1 2110000 2120000
chr1 2330000 2340000
chr1 2340000 2350000
chr1 3780000 3790000
chr1 3790000 3800000
chr1 3840000 3850000
chr1 4010000 4020000
chr1 6040000 6050000
chr1 6050000 6060000
chr1 6750000 6760000
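The two candidate definitions can be generated from the TAD list directly. The sketch below is only an illustration of the two conventions described in the question; `boundary_bins` and its `mode` argument are hypothetical names, and the 10 kb bin size is taken from the example.

```python
BIN_SIZE = 10_000  # bin size used by the TAD caller in this example

tads = [
    ("chr1", 2120000, 2340000),
    ("chr1", 2340000, 3790000),
    ("chr1", 3790000, 3840000),
    ("chr1", 4020000, 6050000),
    ("chr1", 6050000, 6750000),
]

def boundary_bins(tads, bin_size=BIN_SIZE, mode="internal"):
    """Return the sorted, de-duplicated bins flanking each TAD edge.

    mode="internal": first and last bin inside each TAD.
    mode="external": the bin just outside each TAD edge.
    """
    bins = set()
    for chrom, start, end in tads:
        if mode == "internal":
            bins.add((chrom, start, start + bin_size))
            bins.add((chrom, end - bin_size, end))
        elif mode == "external":
            bins.add((chrom, max(0, start - bin_size), start))
            bins.add((chrom, end, end + bin_size))
        else:
            raise ValueError("mode must be 'internal' or 'external'")
    return sorted(bins)

internal = boundary_bins(tads, mode="internal")
external = boundary_bins(tads, mode="external")
shared = sorted(set(internal) & set(external))
for chrom, start, end in internal:
    print(chrom, start, end)
```

Where two TADs are back-to-back, the internal bin of one is the external bin of its neighbour, so the two definitions agree there (six shared bins in this example). They differ only at the edges of the gap and at the outermost TAD edges.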
There's no consensus definition of TAD boundaries among the many programs that call them. The reasons are many and can be complicated; this short review provides an excellent introduction: https://doi.org/10.1016/j.jmb.2019.09.026
You can get the gist of the different computational methods and corresponding different definitions from these papers too:
(There's not a consensus definition in biology either.)
I'm not entirely clear on your question. I'm not sure whether this helps, but if you "pad" or add "slack" to one side of a TAD boundary, it's common practice to pad the other side too. For example, for the TAD with boundary start 3790000 and boundary end 3840000:
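One way to read the symmetric padding is as a window centred on each boundary coordinate. The sketch below assumes a pad of one 10 kb bin on each side; the pad width and the helper name `pad_boundaries` are assumptions for illustration, not a standard.

```python
def pad_boundaries(start, end, pad=10_000):
    """Return a (window_start, window_end) pair around each TAD edge,
    padded by the same amount on both sides (hypothetical helper)."""
    left = (max(0, start - pad), start + pad)
    right = (max(0, end - pad), end + pad)
    return left, right

left, right = pad_boundaries(3790000, 3840000)
print(left)   # (3780000, 3800000)
print(right)  # (3830000, 3850000)
```

Under this reading, each boundary window is the union of the "internal" and "external" bins at that edge, so the choice between the two matters less.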