Question

Bedtools Compare Multiple Bed Files?

21

Entering edit mode

13.1 years ago

Bioscientist ★ 1.7k

I've been dealing with comparison between two bed files using intersectBed -a -b command. I'm just wondering, is there any commands in Bedtools which can help us compare multiple bed files?

Say, I have 3 bed files (A,B,C). I want to identify those regions where any two of the three (AB,BC,AC)overlaps reciprocally 50%.....

thx

edit: Just find this post right now.Maybe I didn't express quite well a couple of months ago. I mean to find those overlappings which spans at least 50% of EACH of the multiple bed files. So I don't quite understand cat AB BC AC > ABC.common Means to find the overlapping part of all the three?

I myself try to solve the problem like below:

intersectBed -a 2 -b 3 > 23
intersectBed -a 1 -b 3 > 13
intersectBed -a 1 -b 2 > 12

intersectBed -a 1 -b 23 -f 0.50|sort > 23_1
intersectBed -a 2 -b 13 -f 0.50|sort > 13_2
intersectBed -a 3 -b 12 -f 0.50|sort > 12_3

comm -1 -2 23_1 13_2 > test
comm -1 -2 test 1_3 > final result

I don't know if I'm on the right track. thx

bedtools intersect • 83k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 13.1 years ago by Bioscientist ★ 1.7k

Ram · Answer 1 · 2011-11-03

Inspired by the limitations of the approaches I mentioned above, I just released a new tool called multiIntersectBed in bedtools version 2.14.3. I realize that this solution doesn't address your request for 50% reciprocal overlap, but I can't yet envision an efficient way to do that other than what has already been proposed.

The basic concept of this approach is that it compares the intervals found in N sorted (-k1,1 -k2,2n for BED) BED/GFF/VCF files and reports whether 0 to N of those files are present at each interval.

An example is likely best to illustrate what the tool does. First, here's a graphical representation:

alt text

Now, an example with real BED files and real output.

$ cat a.bed 
chr1    6    12
chr1    10    20
chr1    22    27
chr1    24    30

$ cat b.bed 
chr1    12    32
chr1    14    30

$ cat c.bed 
chr1    8    15
chr1    10    14
chr1    32    34

In the example below, the first three columns define the interval, the fourth column reports the number of files present at that interval, the fifth column reports a comma-separated list of files present at that interval, and the 6th through 8th columns report whether (1) or not (0) each file is present. The order is the same as on the command line.

$ multiIntersectBed -i a.bed b.bed c.bed 
chr1    6    8    1    1    1    0    0
chr1    8    12    2    1,3    1    0    1
chr1    12    15    3    1,2,3    1    1    1
chr1    15    20    2    1,2    1    1    0
chr1    20    22    1    2    0    1    0
chr1    22    30    2    1,2    1    1    0
chr1    30    32    1    2    0    1    0
chr1    32    34    1    3    0    0    1

The above example only reports intervals where >=1 file has coverage. We can also get a complete picture of the chrom by using the -empty parameter and by providing a genome (chrom sizes) file:

$ multiIntersectBed -i a.bed b.bed c.bed -empty -g genomes/human.hg18.genome
chr1    0    6    0    none    0    0    0
chr1    6    8    1    1    1    0    0
chr1    8    12    2    1,3    1    0    1
chr1    12    15    3    1,2,3    1    1    1
chr1    15    20    2    1,2    1    1    0
chr1    20    22    1    2    0    1    0
chr1    22    30    2    1,2    1    1    0
chr1    30    32    1    2    0    1    0
chr1    32    34    1    3    0    0    1
chr1    34    247249719    0    none    0    0    0

We can also get a header:

$ multiIntersectBed -i a.bed b.bed c.bed -empty -g genomes/human.hg18.genome -header
chrom    start    end    num    list
chr1    0    6    0    none    0    0    0
chr1    6    8    1    1    1    0    0
chr1    8    12    2    1,3    1    0    1
chr1    12    15    3    1,2,3    1    1    1
chr1    15    20    2    1,2    1    1    0
chr1    20    22    1    2    0    1    0
chr1    22    30    2    1,2    1    1    0
chr1    30    32    1    2    0    1    0
chr1    32    34    1    3    0    0    1
chr1    34    247249719    0    none    0    0    0

And a header with labels. Note that if we use labels, the fourth column reports a list of labels rather than a list of file indices:

$ multiIntersectBed -i a.bed b.bed c.bed  -header -names A B C
chrom    start    end    num    list    A    B    C
chr1    6    8    1    A    1    0    0
chr1    8    12    2    A,C    1    0    1
chr1    12    15    3    A,B,C    1    1    1
chr1    15    20    2    A,B    1    1    0
chr1    20    22    1    B    0    1    0
chr1    22    30    2    A,B    1    1    0
chr1    30    32    1    B    0    1    0
chr1    32    34    1    C    0    0    1

If you are interested in an easier to follow (yet less efficient) version of the algorithm, I have posted the python prototype I developed with pybedtools.

Ram · Answer 2 · 2011-10-27

One approach with bedtools is the following.

intersectBed -a A.bed -b B.bed -f 0.5 -r > AB
intersectBed -a B.bed -b C.bed -f 0.5 -r > BC
intersectBed -a A.bed -b C.bed -f 0.5 -r > AC
cat AB BC AC > ABC.common

You could generalize this with something like:

for file1 in `ls *.bed`
do
   for file2 in `ls *.bed`
   do 
       if [ $file1 != $file2 ]
         then
            intersectBed -a $file1 -b $file2 -f 0.5 -r > $file1.$file2.common
       fi
   done
done
cat *.common > all.common

If you are a python programmer, you could also do the following with pybedtools, the new python extension of bedtools.

import pybedtools

# set up 3 different bedtools
a = pybedtools.BedTool('A.bed')
b = pybedtools.BedTool('B.bed')
c = pybedtools.BedTool('C.bed')

# make the combinations.
ab = a.intersect(b, f=0.5, r=True)
bc = b.intersect(c, f=0.5, r=True)
aa = a.intersect(c, f=0.5, r=True)

Lastly, there is a thread about this on the bedtools mailing list that may be helpful if you plan on using the pybedtools approach.

Ram · Answer 3 · 2013-01-23

You might check out the BEDOPS suite of tools, which has the capability to work with any number of BED inputs at once. Consider:

  bedops -u file1.bed file2.bed file3.bed \
    | bedmap --echo --count --fraction-both 0.5 - \
    | awk -F"|" 'int($2) > 1' \
    | cut -f1 -d'|' \
   > almost-answer.bed

This gives to you every input row (from any input file) that overlaps some other input row (from any input file) by at least 50% reciprocally (by using the --fraction-both flag). Now, it's possible that two overlapping elements come from the same file (in the general case, though your input files may not have that sort of thing going on), and you probably do not want that. If that's true, this can also be dealt with easily in BEDOPS with a small amount of additional awk code. Here is a more general solution that deals with removing the problem case mentioned:

  bedops -u file1.bed file2.bed file3.bed \
    | bedmap --echo --echo-map-id-uniq --fraction-both 0.5 - \
    | awk -F"|" '(split($2, a, ";") > 1)' \
    | cut -f1 -d'|' \
   > answer.bed

Note that while the code is concise, this usage is also very efficient. An alternative approach that loops over all-pairwise file comparisons requires on the order of (N^2)/2 system calls (and includes N^2 entire file sweeps) and intermediate results (files) to manage on your way to a final solution. This approach quickly worsens if you change the problem in a simple way as well - what if you require that 3 or more input files overlap? The solution shown with BEDOPS scales just fine. You just change the awk statement to ask '> 2' instead of the '> 1' I show.

However, not all is free with the approach I show either. Each of the files above needs to be sorted - which is fast, but also the 4th column in each of these files needs an id that identifies the file. For example, file1.bed should look something like:

chr1  10   20    1  anything-else-you-like  ...
chr1  200  203   1  ...

where the 4th column identifies that this is file 1 (IDs are easily removed later on with a cut command if you don't like them, though a side benefit is that answer.bed will show from which file each row of output originated from). The programs from BEDOPS work with sorted files only. The upfront cost of sorting data pays increasing dividends the more often you use the file, as underlying algorithms can use less memory and run faster. And, any files produced by BEDOPS are already sorted properly, so that removes the sorting overhead entirely when using them for further analyses.

Importantly, there are really only 2 programs you need to know about in BEDOPS to do the vast majority of all queries related to BED. These are bedops and bedmap, which are both used here. Rather than write new programs to answer every form of informatic question, the suite relies upon standard unix utils to manipulate data in simple ways on the fly just as shown here - literally, a one line awk statement. If we exclude the final cut statement above, the output also includes exactly which files are part of the multi-file intersection, on a per-row basis.