Question

Finding Chip Seq Overlaps with Bed files

9.0 years ago

morovatunc ▴ 560

Hello,

I have written here before about finding overlaps, and I have reached a point where I am very confused. I have tried several methods for finding overlaps, but none of them seems logical to me. I have tried bedtools multiinter, bedops and bedmap. Please help me find a way to compute these overlaps.

My data consists of 20 files (13 tumour, 7 normal), all of them BED files. What I want to know:

1. Overlapping peaks of both datasets.
2. Overlaps from unique (n=1) to n=13 for tumour or n=7 for normal.
3. bedtools multiinter does this pretty well. However, I realised that it creates false positive overlaps (a 2 bp region of overlap, which makes no sense).
4. With bedtools intersectBed, I have to build every combination of the samples, which produces an enormous number of combinations and confuses me a lot.

Can somebody who has done this before help me out? It should not be that hard.

Thank you very much

Tunc

bedops ChIP-Seq bedmap bedtools • 7.3k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.0 years ago by morovatunc ▴ 560

"2bp region of overlap which makes no sense" --> why does it make no sense? 1bp overlap is still an overlap if you do not set a minimum number of bp
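If very short overlaps are not meaningful for the analysis, one option is to filter the multiinter output by region width afterwards. A minimal sketch, assuming the first three columns are chrom/start/end and the fourth is the number of files covering the region; the file names, the sample rows and the 50 bp cutoff are made up for illustration:

```shell
# Hypothetical multiinter-style rows: chrom, start, end, number of files, file list
printf 'chr1\t100\t102\t2\t1,2\nchr1\t500\t650\t2\t1,2\nchr1\t900\t1000\t1\t2\n' > multiinter.out

# Keep regions covered by at least 2 files that are at least MIN bp wide
MIN=50
awk -v min="$MIN" 'BEGIN { FS = OFS = "\t" } ($3 - $2) >= min && $4 >= 2' \
  multiinter.out > multiinter.filtered.out
cat multiinter.filtered.out
```

With these sample rows, only the 150 bp region survives; the 2 bp sliver and the single-file region are dropped.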

ADD REPLY • link 9.0 years ago by TriS ★ 4.7k

Tool: bedtools multiinter (aka multiIntersectBed)

Version: v2.24.0
Summary: Identifies common intervals among multiple
     BED/GFF/VCF files.
Usage:   bedtools multiinter [OPTIONS] -i FILE1 FILE2 .. FILEn
     Requires that each interval file is sorted by chrom/start. 
Options: 
    -cluster    Invoke Ryan Layers's clustering algorithm.
    -header        Print a header line.
            (chrom/start/end + names of each file).
    -names        A list of names (one/file) to describe each file in -i.
            These names will be printed in the header line.
    -g        Use genome file to calculate empty regions.
            - STRING.
    -empty        Report empty regions (i.e., start/end intervals w/o
            values in all files).
            - Requires the '-g FILE' parameter.
    -filler TEXT    Use TEXT when representing intervals having no value.
            - Default is '0', but you can use 'N/A' or any text.
    -examples    Show detailed usage examples.
Error: missing file names (-i) to combine.

This is the help output of multiinter. Could you please tell me how to specify a minimum overlap there? Thank you.

ADD REPLY • link updated 2.3 years ago by Ram 44k • written 9.0 years ago by morovatunc ▴ 560

9.0 years ago

GouthamAtla 12k

Bedtools Compare Multiple Bed Files?

ADD COMMENT • link 9.0 years ago by GouthamAtla 12k

I honestly read that thread 20 times. As I mentioned in my 3rd point, the multiinter approach causes problems such as false positive occurrences. And as I mentioned in my 4th point, since I have too many files, I asked about alternative methods. I did not start this thread without reading the existing threads. I am aware that intersectBed, bedops and bedmap are possible ways to solve this.

ADD REPLY • link 9.0 years ago by morovatunc ▴ 560

5.9 years ago

morovatunc ▴ 560

For those who have not found an answer: HOMER's mergePeaks program does exactly what I want.

The link itself is pretty self-explanatory.

http://homer.ucsd.edu/homer/ngs/mergePeaks.html

However, the author does not seem to respond to problems related to the software, so heads up.

ADD COMMENT • link 5.9 years ago by morovatunc ▴ 560

"based on my experience"

ADD REPLY • link 5.9 years ago by morovatunc ▴ 560

Ram · Accepted Answer · 2015-11-29

9.0 years ago

Alex Reynolds 36k

Overlapping peaks of both datasets.

First, if not sorted, make sure that your peak, tumour and normal BED files are sorted, e.g.:
```
$ sort-bed tumour01.unknown_sort_state.bed > tumour01.bed
```
Repeat sorting for the remaining peak, tumour and normal BED files, as needed. You only have to sort once, at the beginning.

Take the multiset union of your tumour BED files with bedops, and pipe that union to a second bedops command to find peaks that overlap, by at least one base, any element of the pooled tumour sets:
```
$ bedops --everything tumour01.bed tumour02.bed ... tumour13.bed | bedops --element-of 1 peaks.bed - > peaks_overlapping_tumour_sets.bed
```
Or all normal elements:
```
$ bedops --everything normal01.bed normal02.bed ... normal07.bed | bedops --element-of 1 peaks.bed - > peaks_overlapping_normal_sets.bed
```
Or elements from both categories:
```
$ bedops --everything tumour01.bed tumour02.bed ... tumour13.bed normal01.bed normal02.bed ... normal07.bed | bedops --element-of 1 peaks.bed - > peaks_overlapping_tumour_and_normal_sets.bed
```
If you're trying to do something else, please clarify the kind of set operation or association that you want to do.

For example, do you need to know which tumour or normal element's subset overlaps with a particular peak? The bedmap tool can help you here, but you need to preprocess your tumour and normal element subsets first. Feel free to follow up.

Overlaps from unique (n=1) to n=13 for tumour or n=7 for normal.

You can use a generalization of this approach for finding elements common to all N subsets. For example, for N=13, where A.bed through N.bed stand for your 13 tumour element sets (N.bed being the last of them):
```
$ N=13
$ bedops --everything A.bed B.bed C.bed ... N.bed \
 | bedmap --count --echo --delim '\t' - \
 | uniq \
 | awk -vN=${N} '$1==N' \
 | cut -f2- \
 > common_to_all_N_tumour_subsets.bed
```
You can modify this approach for N-1 (12) subsets, N-2 (11) subsets, and so on, by modifying the awk test:
```
$ N=13
$ bedops --everything A.bed B.bed C.bed ... N.bed \
 | bedmap --count --echo --delim '\t' - \
 | uniq \
 | awk -vN=${N} '$1==(N-1)' \
 | cut -f2- \
 > common_to_N_minus_1_tumour_subsets.bed
```
You would repeat this for N=7 for your seven normal set files.
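Rather than rerunning the pipeline once per count, the counted elements can also be split in a single pass. A sketch, assuming counts.bed holds the tab-delimited output of the bedmap --count --echo step above after uniq (count in column 1, followed by the element); the file names and sample rows are hypothetical:

```shell
# Hypothetical stand-in for:
#   bedops --everything ... | bedmap --count --echo --delim '\t' - | uniq
printf '3\tchr1\t100\t200\n1\tchr1\t500\t600\n3\tchr1\t120\t220\n2\tchr2\t10\t90\n' > counts.bed

# Write each element to a file named after its overlap count, dropping the count column
awk 'BEGIN { FS = OFS = "\t" }
     { out = "overlaps_n" $1 ".bed"; sub(/^[^\t]+\t/, ""); print > out }' counts.bed

# Summarise how many elements fall into each count class
cut -f1 counts.bed | sort -n | uniq -c \
  | awk '{ print "n=" $2 ": " $1 " elements" }' > count_summary.txt
cat count_summary.txt
```

Note that the count is the number of overlapping elements, which equals the number of samples only if no sample contributes two elements to the same region.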

Once you have files common_to_*.bed that you need, you can use bedops or bedmap with each of them to do overlap or association tests with peaks, e.g.:
```
$ bedmap --echo --echo-map peaks.bed common_to_all_N_tumour_subsets.bed > common_tumour_elements_that_overlap_each_peak.bed
```

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by Alex Reynolds 36k

Dear Alex, thank you for your detailed answer. I followed the protocol at http://bedops.readthedocs.org/en/latest/content/usage-examples/multiple-inputs.html#multiple-inputs

That gave me the peaks within groups, and I guess it will give me the same results. However, I did not understand the part where we compare both groups.

Should I merge all peak files in a same bed and do the line below?

$ bedmap --count --echo --delim '\t' all_bed_files.bed

Also, you used bedops --element-of 1 to find overlaps, but I used bedmap. Would there be a significant difference?

Thank you very much for your patience in helping me.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by morovatunc ▴ 560

Can you explain what you mean by "compare both groups"? Do you want to compare the peak-overlaps-with-tumour set against the peak-overlaps-with-normal set?

To answer your second question, bedops --element-of 1 just reports an overlap. It won't tell you the associated element that overlaps. To report that association (or "map") you would use bedmap.

ADD REPLY • link 9.0 years ago by Alex Reynolds 36k

Alex, exactly as you said: I want to compare normal vs tumour. I plan to achieve this by putting all of them into the same BED file and then running bedmap --count --echo. Would you recommend another way?

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by morovatunc ▴ 560

Perhaps you want the following:

$ bedmap --echo --count --fraction-both 0.5 peaks.bed tumours.bed > peaks_with_counts_of_overlapping_tumours.bed
$ bedmap --echo --count --fraction-both 0.5 peaks.bed normals.bed > peaks_with_counts_of_overlapping_normals.bed

You might also count the number of overlaps in common:

$ bedmap --echo --count --fraction-both 0.5 peaks.bed <(bedops --everything tumours.bed normals.bed) > peaks_with_counts_of_overlapping_tumours_and_normals.bed

From these three count numbers, you can build a two-set Venn or Euler diagram of overlap events: The number of overlaps unique to tumours, the number of overlaps unique to normals, and the number of overlaps common to both tumours and normals.
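To make the bookkeeping concrete, here is a rough sketch that turns the two per-peak count files into Venn categories. It assumes bedmap's default output, where the echoed peak and its count are separated by '|', and that both files list the same peaks in the same order; the sample rows are made up:

```shell
# Hypothetical stand-ins for the two bedmap --echo --count outputs above
printf 'chr1\t100\t200|3\nchr1\t500\t600|0\nchr2\t10\t90|2\nchr2\t300\t400|0\n' \
  > peaks_with_counts_of_overlapping_tumours.bed
printf 'chr1\t100\t200|1\nchr1\t500\t600|4\nchr2\t10\t90|0\nchr2\t300\t400|0\n' \
  > peaks_with_counts_of_overlapping_normals.bed

# Pair the files line by line: fields become peak|tumour_count|peak|normal_count
paste -d '|' peaks_with_counts_of_overlapping_tumours.bed \
             peaks_with_counts_of_overlapping_normals.bed \
  | awk -F '|' '
      $2 > 0 && $4 > 0 { both++; next }
      $2 > 0           { tumour_only++; next }
      $4 > 0           { normal_only++; next }
                       { neither++ }
      END { printf "tumour_only=%d normal_only=%d both=%d neither=%d\n",
                   tumour_only, normal_only, both, neither }
    ' > venn_counts.txt
cat venn_counts.txt
```

The three non-empty categories map directly onto the two circles and their intersection.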

This first pass is a fairly naive approach. You may want to think about normalization with this approach, since a 13-tissue set will likely have more elements than a 7-tissue set, and, by chance, the number of overlap events you get with tumours could be overrepresented by virtue of simply having more elements to start with. You might use bedops to count how many elements are common within the 13 tumour sets, and separately with the 7 normal sets, to determine how to normalize counts of both tumour and normal together.

In any case, please note the use of --fraction-both 0.5 with bedmap, which requires the overlap to cover at least half of the peak element and at least half of the tumour or normal element. This avoids counting an event as "common" where a tumour element only overlaps on one side of the peak, and a normal element only overlaps on the other side. Because each counted element must cover 50% or more of the peak, elements counted as common also overlap (or at least abut) one another.

If this isn't clear, draw out three generic intervals on a line and enumerate the different ways overlap events can occur between the three intervals.
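The arithmetic behind a reciprocal 50% test can be sketched as follows. This only illustrates the idea and is not bedmap's implementation; the coordinates are made up and intervals are treated as half-open [start, end):

```shell
# Columns: peak_start peak_end element_start element_end (hypothetical values)
printf '100\t200\t150\t260\n100\t200\t190\t400\n100\t200\t90\t180\n' > pairs.txt

awk 'BEGIN { FS = OFS = "\t"; frac = 0.5 }
     {
       s = ($1 > $3) ? $1 : $3          # overlap start = larger of the two starts
       e = ($2 < $4) ? $2 : $4          # overlap end = smaller of the two ends
       ov = (e > s) ? e - s : 0         # overlap length in bp
       # pass only if the overlap covers >= frac of BOTH intervals
       ok = (ov >= frac * ($2 - $1) && ov >= frac * ($4 - $3)) ? "pass" : "fail"
       print $0, ov, ok
     }' pairs.txt > pairs_checked.txt
cat pairs_checked.txt
```

In the first row the 50 bp overlap covers half of the 100 bp peak but less than half of the 110 bp element, so it fails the reciprocal test; only the third row passes.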

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by Alex Reynolds 36k

Alex,

Thank you for your answer. It solved my problem, and this is actually what I wanted. But I have one last question. When I do this:

$bedmap --count --echo --echo-map-id-uniq --mean --fraction-both .95 --delim "\t" bedops_merge_normalall.bed > answer1.bed.txt

I will explain it by example:

Say we have 5 regions that all overlap one another. bedmap maps each of them against the others, which creates duplicates, and these duplicates may distort the calculations. So my question is: how can I get rid of these duplicates? What I did was keep only the unique values. Since they are numbers with 4 decimal places, I think keeping only the unique ones won't cause a big problem?

regions,overlapping regions,ave
A -> B,C,D,E -> 15
B -> A,C,D,E -> 15
C -> A,B,D,E -> 15
D -> A,B,C,E -> 15
E -> A,B,C,D -> 15

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by morovatunc ▴ 560

Merge (-m, --merge)

When should I use the option above? What I understand is that if I have biological replicates, I need the Everything (-u, --everything) option to make one combined peak list.

ADD REPLY • link 2.6 years ago by 1769mkc ★ 1.2k

BEDOPS defines a merge as an operation that flattens overlapping and adjoining intervals into contiguous regions, as opposed to keeping every input interval as-is.

The everything operator could also be called a "multiset union", which is why the short option is -u.
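The difference is easiest to see on a toy example. The sketch below imitates the two operations with standard tools rather than calling bedops itself, so it only illustrates the idea (plain sort stands in for sort-bed, and the two replicate files are made up):

```shell
# Two hypothetical replicate peak files
printf 'chr1\t100\t200\nchr1\t400\t500\n' > rep1.bed
printf 'chr1\t150\t250\nchr1\t500\t600\n' > rep2.bed

# "Everything" (multiset union): every input element is kept, in sorted order
sort -k1,1 -k2,2n -k3,3n rep1.bed rep2.bed > union.bed

# "Merge": overlapping or adjoining elements are flattened into one region
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { c = $1; s = $2 + 0; e = $3 + 0; next }
     $1 == c && $2 + 0 <= e { if ($3 + 0 > e) e = $3 + 0; next }
     { print c, s, e; c = $1; s = $2 + 0; e = $3 + 0 }
     END { if (NR) print c, s, e }' union.bed > merged.bed

wc -l < union.bed    # 4 elements survive the union
cat merged.bed       # two flattened regions: 100-250 and 400-600
```

For replicates, the union keeps each replicate's peak as its own element (useful for counting support), while the merge gives one combined, non-overlapping peak list.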