Question

Software To Find Overlaps Of Chip-Seq Peaks In Multiple Samples.

12.1 years ago

pmoney1 ▴ 80

Hi,

I've calculated peaks in 18 samples using the Bioconductor package Ringo. So I basically have ranged data in each sample that show where the peaks are, e.g. chr1:250-630.

Is there an easy way for me to find peaks which overlap in at least 14 out of the 18 samples?

I know this would be fairly easy to code myself, but I'm sure there must be some software available (preferably in R) for exactly this application?

chip-seq seq • 19k views

updated 12.1 years ago by Mikael Huss 4.8k • written 12.1 years ago by pmoney1 ▴ 80

score 6 · Answer 1 · 2012-10-31

For R, use ChIPpeakAnno, especially the function findOverlappingPeaks. You can loop or use lapply/sapply, depending on what you want to do.

You can also use intersectBed from the Bedtools suite. Using the system command in R, you can call the tool and read the results back from the terminal. Also, check the thread "Bedtools Compare Multiple Bed Files?" for intersecting multiple files.

Cheers

score 2 · Answer 2 · 2012-10-31

The bedops tool in the BEDOPS suite will find overlaps between multiple (two or more) BED files. You can specify custom overlap criteria or use the default, which is one base of overlap. Please see documentation for the --intersect and --element-of operators for more detail. You could invoke a set operation from R with a system call.

1) To find overlaps with your criteria, first add your sample number to the ID field of the BED file. For example, for sample 5 of 18, the BED file Sample5.bed might look like:

chrN    1234    5678    peak-8724-sample-5
...

You could do this with awk, Perl, whatever.

Note that if your elements already have IDs that are unique to the sample, then you don't need to do this step. This step just ensures that you can retrieve the element information later on.
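As a rough sketch of this tagging step with awk: the file names, the two example peaks, and the "-sample-N" suffix format below are assumptions for illustration, and the input is assumed to be tab-delimited BED with the peak ID in column 4.

```shell
# Hypothetical example data: two small BED files in a scratch directory.
workdir=$(mktemp -d)
printf 'chrN\t1234\t5678\tpeak-8724\n' > "$workdir/Sample5.bed"
printf 'chrN\t2563\t3125\tpeak-6258\n' > "$workdir/Sample12.bed"

# Append "-sample-N" to the ID (column 4) of every peak in each file.
for n in 5 12; do
    awk -v sample="$n" 'BEGIN { FS = OFS = "\t" } { $4 = $4 "-sample-" sample; print }' \
        "$workdir/Sample$n.bed" > "$workdir/Sample$n.tagged.bed"
done

tagged=$(cat "$workdir/Sample5.tagged.bed" "$workdir/Sample12.tagged.bed")
printf '%s\n' "$tagged"
```

For the real data you would loop over all 18 sample numbers instead of just 5 and 12.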

Then collate the samples to build a "master list":

$ bedops --everything Sample1.bed Sample2.bed ... Sample18.bed > MasterList.bed

2) Next, map the master list against itself with bedmap. Use the --count operator to give the number of peaks that the BED element overlaps, and the --echo-map-id operator to show the IDs of the peaks that overlap:

$ bedmap --echo --count --echo-map-id MasterList.bed MasterList.bed

The output will show you the number of "hits" or overlaps for each element, along with the IDs associated with those hits, e.g.:

chrN    1234    5678    peak-8724-sample-5  | 2 | peak-8724-sample-5;peak-6258-sample-12
chrN    2563    3125    peak-6258-sample-12 | 2 | peak-8724-sample-5;peak-6258-sample-12
chrN    7746    8913    peak-5923-sample-1  | 1 | peak-5923-sample-1
...

In other words, element chrN:2563-3125 from sample 12 overlaps with itself and one other element in your master list, which comes from sample 5.

3) To speed up step 4, filter this result with a Perl or awk script for elements whose hit count is greater than or equal to 14. Because you pre-processed the ID field, you know which sample each overlapping element comes from. Note that the hit count includes peaks that overlap within the same sample, so a count of 14 or more is not a guarantee that you have an overlap with 14 or more samples.
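A minimal awk sketch of this pre-filter, assuming the bedmap output fields are separated by a bare "|" (no surrounding spaces) and using made-up example lines. The threshold is set to 2 here only so the toy data shows an effect; for the 18-sample case you would use 14.

```shell
# Hypothetical threshold for the toy data; use 14 for the real analysis.
min_hits=2

# Made-up bedmap-style output: BED element | hit count | overlapping IDs.
mapped=$(
    printf 'chrN\t1234\t5678\tpeak-8724-sample-5|2|peak-8724-sample-5;peak-6258-sample-12\n'
    printf 'chrN\t2563\t3125\tpeak-6258-sample-12|2|peak-8724-sample-5;peak-6258-sample-12\n'
    printf 'chrN\t7746\t8913\tpeak-5923-sample-1|1|peak-5923-sample-1\n'
)

# Keep only lines whose second "|"-delimited field (the hit count) reaches the threshold.
candidates=$(printf '%s\n' "$mapped" | awk -F'[|]' -v min="$min_hits" '($2 + 0) >= min')
printf '%s\n' "$candidates"
```

This only narrows the list; the surviving lines are candidates, not the final answer.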

4) Therefore, finally, you would want to parse the list of overlapping elements with, for example, a Perl or awk script to count the number of unique samples for each element. For example, parsing the mapped ID results for the chrN:2563-3125 element shows itself (of course) and a peak from sample 5, so that peak gets a "true" count of 2 samples, and so on. Filter this final list of sample counts for elements with a count of 14 or greater, and there's your answer.
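One way this unique-sample count might look in awk, again as a sketch: it assumes "|"-delimited bedmap output without surrounding spaces, IDs ending in "-sample-N" as in step 1, and made-up example lines. The threshold of 2 is for the toy data only; use 14 for the real case.

```shell
# Hypothetical threshold for the toy data; use 14 for the real analysis.
min_samples=2

# Made-up input: the first element has 3 hits, but two of them come from sample 5.
mapped=$(
    printf 'chrN\t1234\t5678\tpeak-8724-sample-5|3|peak-8724-sample-5;peak-8725-sample-5;peak-6258-sample-12\n'
    printf 'chrN\t7746\t8913\tpeak-5923-sample-1|1|peak-5923-sample-1\n'
)

per_sample=$(printf '%s\n' "$mapped" | awk -F'[|]' -v min="$min_samples" '
{
    split("", seen)                        # portable way to empty the array
    unique = 0
    n = split($3, ids, ";")                # overlapping peak IDs
    for (i = 1; i <= n; i++) {
        sample = ids[i]
        sub(/^.*-sample-/, "", sample)     # keep only the sample number
        if (!(sample in seen)) { seen[sample] = 1; unique++ }
    }
    # Print the BED element followed by its number of distinct samples.
    if (unique >= min) print $1 "\t" unique
}')
printf '%s\n' "$per_sample"
```

The first element is reported with a count of 2 rather than 3, because the two sample-5 peaks are counted once.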

This is definitely more work than other solutions, but I think this may give you more useful information, like the IDs of specific peaks as well as their overlap characteristics within a sample and across samples. Using bedops --merge discards ID information (not sure what mergeBed does here). I'd probably want to first make sure that overlaps between peaks in the same sample are taken into account when counting. Hopefully this is useful to you, in any case.

score 1 · Answer 3 · 2012-10-31

12.1 years ago

Irsan ★ 7.8k

Definitely something that can be solved with the bedtools or bedops suite (the latter is especially useful for multiple BED files). Just Google for them.

score 1 · Answer 4 · 2012-10-31

12.1 years ago

Matt Shirley 10k

If you are already working in R, I would suggest you take a look at GRanges and IRanges (interval ranges). Here is a short presentation from someone in our computational genomics department about why he likes them both, and I generally agree with him.

score 0 · Answer 5 · 2012-11-03

12.1 years ago

Mikael Huss 4.8k

The DiffBind Bioconductor package is a pretty convenient wrapper for this kind of thing. It lets you calculate peak-wise overlap statistics or merge your peaks into a consensus set and then assess the tag enrichment for each sample, if you prefer that. It also interfaces to DESeq to calculate statistically significant differential binding.

Hello Mikael,

I came across this post while working with DiffBind, and I am running into problems with it. Have you used DiffBind before?

11.1 years ago by Ram ▴ 190

Yes, not very much though. It has worked well for me.

11.1 years ago by Mikael Huss 4.8k

Hi, thanks for replying. I am working on my own dataset and first made a CSV file for it, but I get errors in the later steps. It would be very helpful if you could shed some light on this. For instance, how can we generate a table with intervals for our own data, like the one in the Tamoxifen example?

11.1 years ago by Ram ▴ 190

I suggest you post this as a new question. It's too inconvenient to discuss in a comment thread. Please make sure to describe the problem in as much detail as possible.

11.1 years ago by Mikael Huss 4.8k