Bedtools Compare Multiple Bed Files?

13.8 years ago

Bioscientist ★ 1.7k

I've been comparing two BED files with the intersectBed -a -b command. Is there a command in bedtools that can compare multiple BED files?

Say I have 3 BED files (A, B, C). I want to identify the regions where any two of the three (AB, BC, AC) overlap reciprocally by 50%.

thx

Edit: I just found this post again. Maybe I didn't express myself well a couple of months ago. I mean to find the overlaps that span at least 50% of EACH of the multiple BED files. So I don't quite understand cat AB BC AC > ABC.common. Does it find the part that overlaps in all three?

I tried to solve the problem myself as follows:

intersectBed -a 2 -b 3 > 23
intersectBed -a 1 -b 3 > 13
intersectBed -a 1 -b 2 > 12

intersectBed -a 1 -b 23 -f 0.50|sort > 23_1
intersectBed -a 2 -b 13 -f 0.50|sort > 13_2
intersectBed -a 3 -b 12 -f 0.50|sort > 12_3

comm -1 -2 23_1 13_2 > test
comm -1 -2 test 12_3 > final_result

I don't know if I'm on the right track. thx

bedtools intersect • 87k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 13.8 years ago by Bioscientist ★ 1.7k

13.8 years ago

Aaronquinlan 12k

Inspired by the limitations of the approaches I mentioned above, I just released a new tool called multiIntersectBed in bedtools version 2.14.3. I realize that this solution doesn't address your request for 50% reciprocal overlap, but I can't yet envision an efficient way to do that other than what has already been proposed.

The basic concept of this approach is that it compares the intervals found in N sorted (-k1,1 -k2,2n for BED) BED/GFF/VCF files and reports, for each resulting interval, how many of the N files (0 to N) are present and which ones.

An example is likely best to illustrate what the tool does. First, here's a graphical representation:

[Figure: graphical representation of multiIntersectBed; image not preserved]

Now, an example with real BED files and real output.

$ cat a.bed 
chr1    6    12
chr1    10    20
chr1    22    27
chr1    24    30

$ cat b.bed 
chr1    12    32
chr1    14    30

$ cat c.bed 
chr1    8    15
chr1    10    14
chr1    32    34

In the example below, the first three columns define the interval, the fourth column reports the number of files present at that interval, the fifth column reports a comma-separated list of files present at that interval, and the 6th through 8th columns report whether (1) or not (0) each file is present. The order is the same as on the command line.

$ multiIntersectBed -i a.bed b.bed c.bed 
chr1    6    8    1    1    1    0    0
chr1    8    12    2    1,3    1    0    1
chr1    12    15    3    1,2,3    1    1    1
chr1    15    20    2    1,2    1    1    0
chr1    20    22    1    2    0    1    0
chr1    22    30    2    1,2    1    1    0
chr1    30    32    1    2    0    1    0
chr1    32    34    1    3    0    0    1

The above example only reports intervals where >=1 file has coverage. We can also get a complete picture of the chrom by using the -empty parameter and by providing a genome (chrom sizes) file:

$ multiIntersectBed -i a.bed b.bed c.bed -empty -g genomes/human.hg18.genome
chr1    0    6    0    none    0    0    0
chr1    6    8    1    1    1    0    0
chr1    8    12    2    1,3    1    0    1
chr1    12    15    3    1,2,3    1    1    1
chr1    15    20    2    1,2    1    1    0
chr1    20    22    1    2    0    1    0
chr1    22    30    2    1,2    1    1    0
chr1    30    32    1    2    0    1    0
chr1    32    34    1    3    0    0    1
chr1    34    247249719    0    none    0    0    0

We can also get a header:

$ multiIntersectBed -i a.bed b.bed c.bed -empty -g genomes/human.hg18.genome -header
chrom    start    end    num    list
chr1    0    6    0    none    0    0    0
chr1    6    8    1    1    1    0    0
chr1    8    12    2    1,3    1    0    1
chr1    12    15    3    1,2,3    1    1    1
chr1    15    20    2    1,2    1    1    0
chr1    20    22    1    2    0    1    0
chr1    22    30    2    1,2    1    1    0
chr1    30    32    1    2    0    1    0
chr1    32    34    1    3    0    0    1
chr1    34    247249719    0    none    0    0    0

And a header with labels. Note that if we use labels, the fifth column reports a list of labels rather than a list of file indices:

$ multiIntersectBed -i a.bed b.bed c.bed  -header -names A B C
chrom    start    end    num    list    A    B    C
chr1    6    8    1    A    1    0    0
chr1    8    12    2    A,C    1    0    1
chr1    12    15    3    A,B,C    1    1    1
chr1    15    20    2    A,B    1    1    0
chr1    20    22    1    B    0    1    0
chr1    22    30    2    A,B    1    1    0
chr1    30    32    1    B    0    1    0
chr1    32    34    1    C    0    0    1
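
The labeled output above is easy to post-process. As a rough sketch (the helper name regions_shared_by is made up for illustration, and this does not address the 50% reciprocal overlap requirement), the rows covered by at least two of the input files could be pulled out in Python like this:

```python
def regions_shared_by(lines, min_files=2):
    """Yield (chrom, start, end, labels) for rows covered by >= min_files inputs."""
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "chrom":  # skip blank lines and the header row
            continue
        chrom, start, end, num, labels = fields[:5]
        if int(num) >= min_files:
            yield chrom, int(start), int(end), labels.split(",")


# Output of: multiIntersectBed -i a.bed b.bed c.bed -header -names A B C
output = """\
chrom    start    end    num    list    A    B    C
chr1    6    8    1    A    1    0    0
chr1    8    12    2    A,C    1    0    1
chr1    12    15    3    A,B,C    1    1    1
chr1    15    20    2    A,B    1    1    0
chr1    20    22    1    B    0    1    0
chr1    22    30    2    A,B    1    1    0
chr1    30    32    1    B    0    1    0
chr1    32    34    1    C    0    0    1
"""

for row in regions_shared_by(output.splitlines()):
    print(row)
```

Raising min_files to 3 keeps only the chr1 12-15 row, where all three files are present.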

If you are interested in an easier to follow (yet less efficient) version of the algorithm, I have posted the python prototype I developed with pybedtools.

	#!/usr/bin/env python

	from collections import namedtuple, defaultdict
	from pybedtools import BedTool
	import argparse

	Point = namedtuple('Point', ['id', 'pos', 'type'])
	Interval = namedtuple('Interval', ['chrom', 'start', 'end'])


	def report_interval(chrom, start, end, num_files, files_with_interval):
	    # columns: chrom, start, end, number of files present, then one 1/0 flag per file
	    fields = [chrom, str(start), str(end), str(len(files_with_interval))]
	    for i in range(num_files):
	        fields.append("1" if i in files_with_interval else "0")
	    print("\t".join(fields))


	def merge(file):
	    """
	    Merge features in a BED/GFF/VCF into non-overlapping intervals
	    """
	    start = -1
	    end = -1
	    chrom = None
	    for feature in BedTool(file):
	        if feature.start - end > 0 or end < 0 or feature.chrom != chrom:
	            if start >= 0:
	                yield Interval(chrom, start, end)
	            start = feature.start
	            end = feature.end
	            chrom = feature.chrom
	        elif feature.end > end:
	            end = feature.end
	    yield Interval(chrom, start, end)


	def load_and_sort_points(files):
	    """
	    Load the start and end points of every merged interval in every file,
	    keyed by chrom and sorted by position.
	    """
	    file_id = 0
	    chrom_points = defaultdict(list)
	    # for each input file, first merge the features into non-overlapping
	    # intervals using merge(). Each non-overlapping feature is then
	    # broken up into discrete "Points": one for the start and one for the end.
	    for file in files:
	        # merge the file and split features into points
	        for feature in merge(file):
	            s = Point(file_id, feature.start, "start")
	            e = Point(file_id, feature.end, "end")
	            chrom_points[feature.chrom].append(s)
	            chrom_points[feature.chrom].append(e)
	        file_id += 1

	    # sort the points for each chrom
	    for chrom in chrom_points:
	        chrom_points[chrom].sort(key=lambda i: i.pos)
	    return chrom_points


	def load_genome(genome):
	    chrom_sizes = {}
	    for line in open(genome, 'r'):
	        fields = line.strip().split("\t")
	        if len(fields) > 1:
	            # store sizes as integers so they can be compared with positions
	            chrom_sizes[fields[0]] = int(fields[1])

	    return chrom_sizes


	def nway(files, genome):
	    """
	    Assumptions: input files must contain non-overlapping intervals

	    1. Example using already-merged files:
	    $ cat a.merged
	    chr1 6 20
	    chr1 22 30

	    $ cat b.merged
	    chr1 12 32

	    $ cat c.merged
	    chr1 8 15
	    chr1 32 34


	    $ ./nway-cluster.py a.merged b.merged c.merged
	    #chr st ed num a b c
	    chr1 0 6 0 0 0 0
	    chr1 6 8 1 1 0 0
	    chr1 8 12 2 1 0 1
	    chr1 12 15 3 1 1 1
	    chr1 15 20 2 1 1 0
	    chr1 20 22 1 0 1 0
	    chr1 22 30 2 1 1 0
	    chr1 30 32 1 0 1 0
	    chr1 32 34 1 0 0 1


	    2. Example using un-merged, yet sorted files:
	    $ cat a.bed
	    chr1 6 12
	    chr1 10 20
	    chr1 22 27
	    chr1 24 30

	    $ cat b.bed
	    chr1 12 32
	    chr1 14 30

	    $ cat c.bed
	    chr1 8 15
	    chr1 10 14
	    chr1 32 34

	    $ ./nway-cluster.py a.bed b.bed c.bed
	    #chr st ed num a b c
	    chr1 0 6 0 0 0 0
	    chr1 6 8 1 1 0 0
	    chr1 8 12 2 1 0 1
	    chr1 12 15 3 1 1 1
	    chr1 15 20 2 1 1 0
	    chr1 20 22 1 0 1 0
	    chr1 22 30 2 1 1 0
	    chr1 30 32 1 0 1 0
	    chr1 32 34 1 0 0 1


	    3. Thanks to pybedtools, it works with BAM files as well.
	    But I hope you have a machine with lots of RAM.
	    ./nway-cluster.py 1.bam 2.bam 3.bam

	    """
	    num_files = len(files)

	    # 1. load each point from each interval in each file into
	    # a hash keyed by chrom.
	    # 2. sort the points in ascending order for each chrom
	    chrom_points = load_and_sort_points(files)
	    if genome is not None:
	        chrom_sizes = load_genome(genome)

	    # 3. Rip through the points and find shared intervals
	    for chrom in chrom_points:
	        files_with_interval = {}
	        prev_point = 0
	        for point in chrom_points[chrom]:
	            # report the current interval if we've moved at all along the chrom.
	            if point.pos > prev_point:
	                report_interval(chrom, prev_point, point.pos, num_files, files_with_interval)
	            # if we're at a start, we add the current file to the active list of files.
	            # otherwise, an end point means we can drop the current file.
	            if point.type == "start":
	                files_with_interval[point.id] = 1
	            else:
	                del files_with_interval[point.id]
	            prev_point = point.pos

	        # if requested, handle the interval from the last observed point to the end of the chrom
	        if genome is not None and point.pos < chrom_sizes[chrom]:
	            report_interval(chrom, point.pos, chrom_sizes[chrom], num_files, files_with_interval)


	def main():
	    parser = argparse.ArgumentParser(prog='nway-cluster')
	    parser.add_argument('files', metavar='FILE', nargs='+',
	                        help='*merged* (non-overlapping intervals) BED files to intersect')
	    parser.add_argument('-g', metavar='GENOME', dest='genome', default=None,
	                        help='The \"genome\" file: i.e., a list of chroms and their sizes.')

	    args = parser.parse_args()
	    nway(args.files, args.genome)

	if __name__ == "__main__":
	    main()
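
For readers without pybedtools installed, the same sweep can be tried on in-memory tuples. The following is a minimal Python 3 sketch, not the prototype itself: the names merge_sorted and nway_sweep are made up for illustration, file numbers are 1-based to match the multiIntersectBed output shown earlier, and stretches covered by no file are skipped rather than reported.

```python
from collections import defaultdict


def merge_sorted(intervals):
    """Merge sorted (chrom, start, end) tuples into non-overlapping intervals."""
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)  # extend the current interval
        else:
            merged.append([chrom, start, end])       # open a new interval
    return merged


def nway_sweep(files):
    """Yield (chrom, start, end, file_numbers) for every covered sub-interval."""
    points = defaultdict(list)
    for file_no, intervals in enumerate(files, start=1):
        for chrom, start, end in merge_sorted(intervals):
            points[chrom].append((start, "start", file_no))
            points[chrom].append((end, "end", file_no))

    for chrom in sorted(points):
        active = set()   # files covering the stretch that is currently open
        prev = None
        for pos, kind, file_no in sorted(points[chrom], key=lambda p: p[0]):
            if prev is not None and pos > prev and active:
                yield chrom, prev, pos, sorted(active)
            if kind == "start":
                active.add(file_no)
            else:
                active.discard(file_no)
            prev = pos


# The a.bed, b.bed and c.bed intervals from the example above.
a_bed = [("chr1", 6, 12), ("chr1", 10, 20), ("chr1", 22, 27), ("chr1", 24, 30)]
b_bed = [("chr1", 12, 32), ("chr1", 14, 30)]
c_bed = [("chr1", 8, 15), ("chr1", 10, 14), ("chr1", 32, 34)]

for chrom, start, end, present in nway_sweep([a_bed, b_bed, c_bed]):
    print(chrom, start, end, len(present), ",".join(map(str, present)), sep="\t")
```

On the three example files this prints the same first five columns as the multiIntersectBed run without -empty.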


ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 13.8 years ago by Aaronquinlan 12k

+1. Great example of how questions on biostar are helping stimulate advances in bioinformatics technology.

ADD REPLY • link 13.8 years ago by Casey Bergman 18k

agreed. it was quite fun to write.

ADD REPLY • link 13.8 years ago by Aaronquinlan 12k

Hi Aaron, is the -f parameter still functional in multiIntersectBed if I need a minimum overlap of 50% in all three files?

Thanks

ADD REPLY • link 13.3 years ago by Sukhi Singh 11k

Oh, I cannot thank you enough.

ADD REPLY • link 13.8 years ago by Bioscientist ★ 1.7k

This might be a very naive question. I have 5 peak files with 100000 peaks in total, but when I use multiIntersectBed I get 150000 peaks. Why such a big difference? Does the script break the regions into bins and then intersect them? If yes, what is the size of these bins?

ADD REPLY • link 13.1 years ago by Dataminer ★ 2.8k

Almost exactly what I'm looking for. Is there a way to also print out all the features from each of the 3 bed files?

Thanks.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.2 years ago by lethalfang ▴ 160

-empty

ADD REPLY • link 11.2 years ago by Sukhi Singh 11k

Is multiinter strand-aware? The output doesn't give any clues whether it is!

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 10.0 years ago by m.fletcher ▴ 20

13.8 years ago

Aaronquinlan 12k

One approach with bedtools is the following.

intersectBed -a A.bed -b B.bed -f 0.5 -r > AB
intersectBed -a B.bed -b C.bed -f 0.5 -r > BC
intersectBed -a A.bed -b C.bed -f 0.5 -r > AC
cat AB BC AC > ABC.common
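
To make the -f 0.5 -r criterion concrete, here is a small sketch of the test as I understand it for a single pair of intervals: the overlap must cover at least the given fraction of both features. The function name reciprocal_overlap is hypothetical and the code is pure Python, not part of bedtools.

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """Return True if a and b, given as (chrom, start, end), overlap by at least
    min_frac of the length of EACH interval (0-based, half-open coordinates)."""
    if a[0] != b[0]:
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return (overlap >= min_frac * (a[2] - a[1])
            and overlap >= min_frac * (b[2] - b[1]))


# 8 bp shared by two 10 bp features: 80% of each, so the pair passes.
print(reciprocal_overlap(("chr1", 10, 20), ("chr1", 12, 22)))

# 6 bp shared: 60% of the first feature but only 37.5% of the second, so it fails.
print(reciprocal_overlap(("chr1", 10, 20), ("chr1", 14, 30)))
```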

You could generalize this with something like:

for file1 in *.bed
do
   for file2 in *.bed
   do
       if [ "$file1" != "$file2" ]
       then
           intersectBed -a "$file1" -b "$file2" -f 0.5 -r > "$file1.$file2.common"
       fi
   done
done
cat *.common > all.common
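
The loop above runs every pair in both directions. If only the coordinates of the overlapping regions matter (an assumption on my part), one run per unordered pair may be enough. A sketch that only builds the command strings, with the hypothetical helper name pairwise_commands, could look like this:

```python
from itertools import combinations


def pairwise_commands(bed_files, frac=0.5):
    """Build one reciprocal intersectBed command per unordered pair of files."""
    commands = []
    for file1, file2 in combinations(sorted(bed_files), 2):
        out = f"{file1}.{file2}.common"
        commands.append(f"intersectBed -a {file1} -b {file2} -f {frac} -r > {out}")
    return commands


# Three files give three pairs: A-B, A-C and B-C.
for cmd in pairwise_commands(["A.bed", "B.bed", "C.bed"]):
    print(cmd)
```

The commands are printed rather than executed, so they can be inspected before being run in a shell.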

If you are a python programmer, you could also do the following with pybedtools, the new python extension of bedtools.

import pybedtools

# set up 3 different bedtools
a = pybedtools.BedTool('A.bed')
b = pybedtools.BedTool('B.bed')
c = pybedtools.BedTool('C.bed')

# make the combinations.
ab = a.intersect(b, f=0.5, r=True)
bc = b.intersect(c, f=0.5, r=True)
ac = a.intersect(c, f=0.5, r=True)

Lastly, there is a thread about this on the bedtools mailing list that may be helpful if you plan on using the pybedtools approach.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 13.8 years ago by Aaronquinlan 12k

Hi aaron, maybe you can have a look at my edit of the post.thx

ADD REPLY • link 13.6 years ago by Bioscientist ★ 1.7k

12.6 years ago

sjneph ▴ 690

You might check out the BEDOPS suite of tools, which has the capability to work with any number of BED inputs at once. Consider:

  bedops -u file1.bed file2.bed file3.bed \
    | bedmap --echo --count --fraction-both 0.5 - \
    | awk -F"|" 'int($2) > 1' \
    | cut -f1 -d'|' \
   > almost-answer.bed

This gives you every input row (from any input file) that overlaps some other input row (from any input file) by at least 50% reciprocally (via the --fraction-both flag). Now, it's possible that two overlapping elements come from the same file (in the general case, though your input files may not have that sort of thing going on), and you probably do not want that. If so, this can also be dealt with easily in BEDOPS with a small amount of additional awk code. Here is a more general solution that removes the problem case mentioned:

  bedops -u file1.bed file2.bed file3.bed \
    | bedmap --echo --echo-map-id-uniq --fraction-both 0.5 - \
    | awk -F"|" '(split($2, a, ";") > 1)' \
    | cut -f1 -d'|' \
   > answer.bed

Note that while the code is concise, this usage is also very efficient. An alternative approach that loops over all-pairwise file comparisons requires on the order of (N^2)/2 system calls (and includes N^2 entire file sweeps) and intermediate results (files) to manage on your way to a final solution. This approach quickly worsens if you change the problem in a simple way as well - what if you require that 3 or more input files overlap? The solution shown with BEDOPS scales just fine. You just change the awk statement to ask '> 2' instead of the '> 1' I show.
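
For anyone more comfortable in Python than awk, the filtering step of the second pipeline can be sketched as follows. This assumes the bedmap output has the form "row|id1;id2" as in the command above; the function name and the sample lines are made up for illustration.

```python
def rows_hit_by_multiple_files(bedmap_lines, min_files=2):
    """Keep rows whose list of unique mapped file IDs has at least min_files entries."""
    kept = []
    for line in bedmap_lines:
        # everything before the last "|" is the echoed row, the rest is the ID list
        row, _, ids = line.rstrip("\n").rpartition("|")
        if len(set(ids.split(";"))) >= min_files:
            kept.append(row)
    return kept


# Hypothetical bedmap --echo --echo-map-id-uniq output lines.
sample = [
    "chr1\t10\t20\t1|1;2",    # overlaps rows from files 1 and 2: kept
    "chr1\t200\t203\t1|1",    # overlaps only rows from its own file: dropped
    "chr1\t12\t22\t2|1;2",    # kept
]

for row in rows_hit_by_multiple_files(sample):
    print(row)
```

Passing min_files=3 corresponds to changing the awk test from '> 1' to '> 2'.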

However, not all is free with the approach I show either. Each of the files above needs to be sorted - which is fast, but also the 4th column in each of these files needs an id that identifies the file. For example, file1.bed should look something like:

chr1  10   20    1  anything-else-you-like  ...
chr1  200  203   1  ...

where the 4th column identifies that this is file 1 (IDs are easily removed later on with a cut command if you don't like them, though a side benefit is that answer.bed will show which file each row of output originated from). The programs from BEDOPS work with sorted files only. The upfront cost of sorting data pays increasing dividends the more often you use the file, as underlying algorithms can use less memory and run faster. And any files produced by BEDOPS are already sorted properly, which removes the sorting overhead entirely when using them for further analyses.
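
If your files do not carry such an ID yet, adding one is a small preprocessing step. Here is a minimal sketch in Python, assuming tab-separated BED rows; the helper name tag_with_file_id and the sample rows are hypothetical, and sorting is still left to a separate step.

```python
def tag_with_file_id(lines, file_id):
    """Insert file_id as the 4th column of each BED row; later columns shift right."""
    tagged = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:  # skip blank or malformed rows
            continue
        tagged.append("\t".join(fields[:3] + [str(file_id)] + fields[3:]))
    return tagged


rows = ["chr1\t10\t20\tpeak7", "chr1\t200\t203"]
print("\n".join(tag_with_file_id(rows, 1)))
```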

Importantly, there are really only 2 programs you need to know about in BEDOPS to do the vast majority of all queries related to BED. These are bedops and bedmap, which are both used here. Rather than write new programs to answer every form of informatic question, the suite relies upon standard unix utils to manipulate data in simple ways on the fly just as shown here - literally, a one line awk statement. If we exclude the final cut statement above, the output also includes exactly which files are part of the multi-file intersection, on a per-row basis.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 12.6 years ago by sjneph ▴ 690