Does multiIntersectBed works with -u and -f parameter?
2
1
Entering edit mode
9.8 years ago
ivivek_ngs ★ 5.2k

Does the multiIntersectBed function of the bedtools work with the parameter of fraction (-f). I am trying to intersect 4 bed files together to see if they have overlaps together but minimum bases should be around 50% , am not interested in single base match. I am interested in finding if 50% or above of bases in each file match among all bed files giving me the pattern showing the listing of the column where I can see all the labels together. I am trying to do something like this but it does not seem to work. if anyone can give me some suggestions how to get this done? It seems that -u and -f does not work here. Is there a way I can do that? I tried BEDOPS as well, not that much of a help. I would appreciate some input.

multiIntersectBed -i file1.bed file2.bed file3.bed file4.bed -header -names A B C D -header -u -f .50| head

Regards

bed sequencing bedtools intersect • 6.0k views
ADD COMMENT
0
Entering edit mode

Although it is probable that someone in here will know the answer, you are more likely to get answers for your bedtools questions in the google group for bedtools. The developers are usually very quick to reply to queries, and often questions have been asked (and answered) before in there.

ADD REPLY
8
Entering edit mode
9.8 years ago

You can't do this directly with bedtools multiinter, but you can answer this by combing a few tools. It is a smidge complicated, but it should work and it demonstrates how combining multiple bedtools can usually get you the answer you need.

➜ cat a.bed
chr1    15    20    a1
chr1    18    25    a2   
➜ cat b.bed
chr1    16    20    b
➜ cat c.bed
chr1    17    21    c

Step 1: Find intervals common to all three files. The fourth column of the output is the number of datasets present at the interval, so it can be used as a filtering criterion.

➜ bedtools multiinter -i a.bed b.bed c.bed | awk '$4 == 3' > common.bed
➜ cat common.bed
chr1    17    21    3    1,2,3    1    1    1

Step 2. The issue now is that you don't know the degree of overlap these intervals have with the original intervals in each file. So, we will now just use the intersect tool to intersect the common intervals with the original 3 files while requiring 50% overlap.

➜ bedtools intersect -a common.bed \
                      -b a.bed b.bed c.bed \
                      -f 0.50 -r \
                      -wa -wb \
    > common.with.originals.bed
➜ cat common.with.originals.bed
chr1  17  21  3   1,2,3  1  1  1  1  chr1  15  20  a1
chr1  17  21  3   1,2,3  1  1  1  1  chr1  18  25  a2
chr1  17  21  3   1,2,3  1  1  1  2  chr1  16  25  b
chr1  17  21  3   1,2,3  1  1  1  3  chr1  17  21  c

Step 3. Now we have to filter the output from step 2 to ensure that the "common" intervals indeed overlapped with all 3 datasets. Conveniently, when using the -wb option with multiple "B" files, the intersect tool reports the dataset from which the hit came. In this case, that will be the 9th column. So, we just need to test that each interval from step 2 had hits from all three input files. We can do this with the groupby tool.

➜ bedtools groupby -g 1-8 -c 9 -o count_distinct -i common.with.originals.bed
chr1  17  21  3   1,2,3   1   1   1   3

Since the last column now reflect the count of distinct datasets that overlapped the common interval at 50%, you can use awk to filter for only those intervals where the count is 3.

➜ bedtools groupby -g 1-8 -c 9 -o count_distinct -i common.with.originals.bed \
  | awk '$9==3'
chr1  17  21  3   1,2,3   1   1   1   3
ADD COMMENT
0
Entering edit mode

Aaronquinlan thank you for this,

I seem to be having problems with one of the bedtools command

I have version bedtools v2.26.0

I ran the same commands and the line below is giving me problems:

bedtools groupby -g 1-8 -c 9 -o count_distinct -i common.with.originals.bed

The output was just 3

Would greatly appreciate your help,

Thanks,
Linda

ADD REPLY
0
Entering edit mode
9.8 years ago

I tried BEDOPS as well, not that much of a help. I would appreciate some input.

There is documentation to explain how to find reciprocal overlap between multiple inputs. You could do:

$ bedops -u file1.bed file2.bed file3.bed file4.bed \
    | bedmap --echo --echo-map-id-uniq --fraction-both 0.5 - \
    | awk -F"|" '(split($2, a, ";") > 1)' \
    > answer.bed

If this is not what you are trying to do, please provide more detail.

ADD COMMENT
0
Entering edit mode

Alex Reynolds, I tried the above command. The thing is I have found CNV regions of my data and among this regions I have found promoter regions, methylation sites, and CpG sites. Now I have extracted the genomic coordinates of all these sites in form of bed files. The promoter, methylation and CpG sites are hits from UCSC. Now I want to take all the 4 files and do an overlap to see if at any region there is a common overlap among all the four files but it should not be at 1 base region since my file1.bed is a CNV bed file of my sample. So I want to see at any point in the file1.bed are there few bases that will correspond to similar bases in file.bed, file3.bed and file4.bed. But it should span a few bases not just a single bases overlap. I hope things make sense now. So I want overlap should be among all the 4 bed files together spanning at least more than 1 bases and by that it would even work if 30% of my bases in CNV which is file1.bed gets overlap at the same time with all the other 3 files for same bases. I hope it is a bit clear now.

ADD REPLY
1
Entering edit mode

The --fraction-both option designates, at minimum, how much the two elements (from any set) should overlap. If this option is left out, then the minimum overlap criteria is one base. But by including the --fraction-both option, you override the one base overlap criterion with the custom fractional preference.

If you are instead measuring cardinality -- or how many elements overlap between sets -- and not the extent of interval overlap, then you could do something like bedmap --indicator and awk between pairs of sets, to get the fraction of cardinal-overlap:

$ bedmap --indicator A B | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_AB.txt 
$ bedmap --indicator B A | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_BA.txt
$ bedmap --indicator A C | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_AC.txt
$ bedmap --indicator C A | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_CA.txt
$ bedmap --indicator A D | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_AD.txt
$ bedmap --indicator D A | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_DA.txt
$ bedmap --indicator B C | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_BC.txt
$ bedmap --indicator C B | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_CB.txt
$ bedmap --indicator B D | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_BD.txt
$ bedmap --indicator D B | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_DB.txt
$ bedmap --indicator C D | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_CD.txt
$ bedmap --indicator D C | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_DC.txt

Basically, for N sets, you are doing N(N-1) comparisons (a set will have 100% concordance with its own elements).

Once you have these results, you can put them into an N x N matrix and then perhaps plot a heatmap. You can plot numbers in cells to quickly see fraction of cardinal-overlap and determine if you get over 50% concordance between all sets. Or you can just programmatically read out each of the rows or columns and see if the fractional threshold is met.

ADD REPLY
0
Entering edit mode

I am confused with (split($2, a, ";") > 1) part. I will for sure give it a try but I used bedops for the first time today. Just to make sure with the command of bedops and bedmap can I achieve what am looking? like an overlap of genomic interval from all 4 files together?

ADD REPLY
0
Entering edit mode

It might help to put in tee statements so that you see what gets piped from one step to the next.

ADD REPLY
1
Entering edit mode

A simpler option may also be to take the multiset union of the inputs to make a "master element list", and do the bedmap --indicator and awk step against this master list for each of the four sets:

$ bedops --everything A B C D > Master
$ bedmap --indicator A Master | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_A_Master.txt
$ bedmap --indicator B Master | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_B_Master.txt
$ bedmap --indicator C Master | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_C_Master.txt
$ bedmap --indicator D Master | awk 'BEGIN{s=0;n=0;}{s+=$1;n++;}END{print s/n;}' > fraction_D_Master.txt

You could then check if the fraction_* files are all greater than 50%.

ADD REPLY

Login before adding your answer.

Traffic: 1526 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6