Question

How to merge several BED files to create a comprehensive overview file

0

4.2 years ago

Jordi ▴ 60

I have a few .bed files from different companies offering exome sequencing kits.

I would like to have a file that summarizes all target regions for all these kits. The .bed files have a basic structure of three columns (chr#, Start, End). I would like to get an output table that shows which genomic regions are covered by only one of these kits, and which regions are covered by more than one (and by which ones). The best way to illustrate this is with an example:

Kit 1

chr#	Start	End
1	100	300

Kit 2

chr#	Start	End
1	150	350

Kit 3

chr#	Start	End
1	80	200

I would like to merge and intersect the files to get an output that divides the regions into subregions based on the overlap between the input files. It should look something like this:

chr#	Start	End	Kit 1	Kit 2	Kit 3
1	80	100	0	0	1
1	100	150	1	0	1
1	150	200	1	1	1
1	200	300	1	1	0
1	300	350	0	1	0
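
A minimal pure-Python sketch of the splitting described above, using the three example kits. The function name `split_by_overlap` and the dictionary layout are illustrative assumptions, not part of any library. It treats every start and end coordinate as a breakpoint and checks which kits cover each stretch between neighbouring breakpoints. The inner loop compares every stretch against every interval, so it is meant to show the logic rather than to scale to full exome kits.

```python
from collections import defaultdict


def split_by_overlap(kits):
    """Split intervals into sub-regions and flag which kits cover each one.

    kits: dict mapping a kit name to a list of (chrom, start, end) tuples.
    Returns rows of the form (chrom, start, end, flag_kit1, ..., flag_kitN).
    """
    names = list(kits)

    # Group intervals by chromosome, remembering the kit each one came from.
    by_chrom = defaultdict(list)
    for idx, name in enumerate(names):
        for chrom, start, end in kits[name]:
            by_chrom[chrom].append((start, end, idx))

    rows = []
    for chrom in sorted(by_chrom):  # plain string sort of chromosome names
        intervals = by_chrom[chrom]
        # Every start and end is a potential boundary of a sub-region.
        points = sorted({p for s, e, _ in intervals for p in (s, e)})
        for left, right in zip(points, points[1:]):
            flags = [0] * len(names)
            for s, e, idx in intervals:
                # left/right are adjacent breakpoints, so an interval either
                # covers the whole stretch or none of it.
                if s <= left and right <= e:
                    flags[idx] = 1
            if any(flags):  # skip gaps that no kit covers
                rows.append((chrom, left, right, *flags))
    return rows


kits = {
    "Kit 1": [("1", 100, 300)],
    "Kit 2": [("1", 150, 350)],
    "Kit 3": [("1", 80, 200)],
}

print("chr#", "Start", "End", *kits, sep="\t")
for row in split_by_overlap(kits):
    print(*row, sep="\t")
```

For the three example kits this reproduces the five rows of the table above.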

I would prefer to do this in Python, but I could try R as well. I have created a pandas dataframe containing the ranges from all kits, ordered by 'chr#' and 'Start' coordinates, which looks like this:

[Image: pandas dataframe containing ranges from all kits, ordered by chr# and Start coordinates]

Any help would be appreciated.

pandas granges python bioconductor • 1.7k views

Updated 4.2 years ago by Jorge Amigo 14k • written 4.2 years ago by Jordi ▴ 60

1

4.2 years ago

Jorge Amigo 14k

When comparing regions in BED format, bedtools or BEDOPS are usually the best way to go.

If the kits you're comparing are exome sequencing BED files, I would suggest adding, for instance, RefSeq exons to the comparison so that the overlapping intervals have some meaning at the end.


score 2 · Accepted Answer · 2021-03-26

4.2 years ago

Pierre Lindenbaum 166k

Use bedtools multiinter:

Summary: Identifies common intervals among multiple
     BED/GFF/VCF files.

Usage:   bedtools multiinter [OPTIONS] -i FILE1 FILE2 .. FILEn
     Requires that each interval file is sorted by chrom/start.
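
Since the question asks for Python, here is a hedged sketch of calling the tool from a script and reading its result. The helper names `run_multiinter` and `parse_multiinter` are made up for illustration. The sketch assumes that bedtools is on the PATH, that each input file is already sorted by chrom/start, and that the output follows the documented layout: chrom, start, end, the number of files covering the interval, the list of those files, then one 0/1 column per input file.

```python
import csv
import io
import subprocess


def run_multiinter(bed_paths, names):
    """Run `bedtools multiinter` with a header row and return stdout as text.

    Assumes bedtools is installed and every input is sorted by chrom/start.
    """
    cmd = ["bedtools", "multiinter", "-header",
           "-names", *names, "-i", *bed_paths]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def parse_multiinter(text):
    """Parse `multiinter -header` output into kit names and flag rows.

    Returns (kit_names, rows), where each row is
    (chrom, start, end, flag_kit1, ..., flag_kitN).
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    kit_names = header[5:]  # columns after chrom, start, end, num, list
    rows = []
    for rec in reader:
        if not rec:  # ignore blank lines
            continue
        flags = [int(x) for x in rec[5:]]
        rows.append((rec[0], int(rec[1]), int(rec[2]), *flags))
    return kit_names, rows


# Example call (file names are placeholders):
# text = run_multiinter(["kit1.bed", "kit2.bed", "kit3.bed"],
#                       ["Kit1", "Kit2", "Kit3"])
# kit_names, rows = parse_multiinter(text)
```

The parsed rows have the same shape as the table requested in the question, so they can be passed straight to `pandas.DataFrame` if a dataframe is preferred.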
