I have a few .bed files from different companies offering exome sequencing kits.
I would like to have a file that summarizes all target regions for all these kits. The .bed files have a basic structure composed of three columns (chr#, Start, End). I would like to get an output table that shows which genomic regions are covered by only one of these kits, and which regions are covered by more than one (and by which ones). The best way to illustrate this is with an example:
Kit 1
chr# | Start | End |
---|---|---|
1 | 100 | 300 |
Kit 2
chr# | Start | End |
---|---|---|
1 | 150 | 350 |
Kit 3
chr# | Start | End |
---|---|---|
1 | 80 | 200 |
I would like to merge and intersect the files to get an output that divides the regions into subregions based on the overlap between the input files. It should look something like this:
chr# | Start | End | Kit 1 | Kit 2 | Kit 3 |
---|---|---|---|---|---|
1 | 80 | 100 | 0 | 0 | 1 |
1 | 100 | 150 | 1 | 0 | 1 |
1 | 150 | 200 | 1 | 1 | 1 |
1 | 200 | 300 | 1 | 1 | 0 |
1 | 300 | 350 | 0 | 1 | 0 |
I would prefer to do this in Python, but I could try in R as well. I have created a pandas dataframe containing ranges from all kits, ordered by 'chr#' and 'Start' coordinates, which looks like this:
Any help would be appreciated.
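
To make the desired subdivision concrete, here is a minimal pure-Python sketch of one possible approach: collect every start and end coordinate per chromosome as a breakpoint, then flag each stretch between consecutive breakpoints with the kits that fully cover it. The function name `split_by_overlap` and the dict-of-lists input layout are assumptions made for illustration, not part of any library, and the intervals are treated as half-open (start included, end excluded), as in BED files.

```python
from collections import defaultdict


def split_by_overlap(kits):
    """Split kit intervals into subregions flagged by kit membership.

    kits: dict mapping a kit name to a list of (chrom, start, end) tuples
          (hypothetical layout chosen for this sketch).
    Returns a list of tuples (chrom, start, end, flag_kit1, flag_kit2, ...),
    with flags in the insertion order of the dict keys.
    """
    names = list(kits)

    # Every interval boundary on a chromosome is a potential cut point.
    breakpoints = defaultdict(set)
    for intervals in kits.values():
        for chrom, start, end in intervals:
            breakpoints[chrom].update((start, end))

    rows = []
    for chrom in sorted(breakpoints):  # note: lexicographic chromosome order
        points = sorted(breakpoints[chrom])
        # Consecutive breakpoints delimit the candidate subregions.
        for sub_start, sub_end in zip(points, points[1:]):
            flags = [
                int(any(c == chrom and s <= sub_start and sub_end <= e
                        for c, s, e in kits[name]))
                for name in names
            ]
            # Skip gaps that no kit covers.
            if any(flags):
                rows.append((chrom, sub_start, sub_end, *flags))
    return rows


# The three example kits from the question.
kits = {
    "Kit 1": [("1", 100, 300)],
    "Kit 2": [("1", 150, 350)],
    "Kit 3": [("1", 80, 200)],
}

for row in split_by_overlap(kits):
    print(*row, sep="\t")
```

On the three example kits this reproduces the five rows of the table above. The scan checks every interval for every subregion, so it is quadratic and would likely be too slow for full exome kits with hundreds of thousands of targets; a sweep over sorted boundaries, or an existing tool such as bedtools `multiinter`, may be a better fit there.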