Software To Find Overlaps Of Chip-Seq Peaks In Multiple Samples.
5
3
Entering edit mode
12.1 years ago
pmoney1 ▴ 80

Hi,

I've calculated peaks in 18 samples using the Bioconductor package Ringo. So I basically have ranged data in each sample that show where the peaks are, e.g. chr1:250-630.

Is there an easy way for me to find peaks which overlap in at least 14 out of the 18 samples?

I know this would be fairly easy to code myself, but I'm sure there must be some software available (peferrably in R) for exactly this application?

chip-seq seq • 19k views
ADD COMMENT
6
Entering edit mode
12.1 years ago

For R, use ChIPpeakAnno, specially the function findOverlappingPeaks. You can loop or l/sapply, depending on what you want to do.

You can also use intersectBed from Bedtools suite. Using system command in R, you can call the tool and read the results from the terminal. Also, check the Bedtools Compare Multiple Bed Files? for intersecting multiple files.

Cheers

ADD COMMENT
2
Entering edit mode
12.1 years ago

The bedops tool in the BEDOPS suite will find overlaps between multiple (two or more) BED files. You can specify custom overlap criteria or use the default, which is one base of overlap. Please see documentation for the --intersect and --element-of operators for more detail. You could invoke a set operation from R with a system call.

1) To find overlaps with your criteria, first add your sample number to the ID field of the BED file. For example, for sample 5 of 18, the BED file Sample5.bed might look like:

chrN    1234    5678    peak-8724-sample-5
...

You could do this with awk, Perl, whatever.

Note that if your elements already have IDs that are unique to the sample, then you don't need to do this step. This step just ensures that you can retrieve the element information later on.

Then collate the samples to build a "master list":

$ bedops --everything Sample1.bed Sample2.bed ... Sample18.bed > MasterList.bed

2) Next, map the master list against itself with bedmap. Use the --count operator to give the number of peaks that the BED element overlaps, and the --echo-map-id operator to show the IDs of the peaks that overlap:

$ bedmap --echo --count --echo-map-id MasterList.bed MasterList.bed

The output will show you the number of "hits" or overlaps for each element, along with the IDs associated with those hits, e.g.:

chrN    1234    5678    peak-8724-sample-5  | 2 | peak-8724-sample-5;peak-8724-sample-12
chrN    2563    3125    peak-6258-sample-12 | 2 | peak-8724-sample-5;peak-8724-sample-12
chrN    7746    8913    peak-5923-sample-1  | 1 | peak-5923-sample-1
...

In other words, element chrN:2563-3125 from sample 12 overlaps with itself and one other element in your master list, which comes from sample 5.

3) To speed up step 4, filter this result for elements which have hit counts greater than or equal to 14 with a Perl or awk script. Because you pre-processed the ID field, you know which element comes from which sample, which satisfies your criterion. Note that this gives you peaks that may overlap within the same sample, so this isn't a guarantee that you have an overlap with 14 or more samples.

4) Therefore, finally, you would want to parse the list of overlapping elements with, for example, a Perl or awk script to get a unique number of samples for that element. For example, parsing the mapped ID results for the chrN:2563-3125 element shows itself (of course) and a peak from sample 5, so that peak gets a "true" count of 2 samples, and so on. Filter this final list of sample counts for elements with a score of 14 or greater, and there's your answer.

This is definitely more work than other solutions, but I think this may give you more useful information, like the IDs of specific peaks as well as their overlap characteristics within a sample and across samples. Using bedops --merge discards ID information (not sure what mergeBed does here). I'd probably want to first make sure that overlaps between peaks in the same sample are taken into account when counting. Hopefully this is useful to you, in any case.

ADD COMMENT
0
Entering edit mode

Hi Alex, thanks for your reply. I'm not sure though if BEDOPS tools allows me to identify regiions that overlap in "at least 14 out of 18" sample? Maybe I'm missiong something? Looks like a great tool but not sure if there is any way to specify that? Thanks again.

ADD REPLY
0
Entering edit mode

If you want to do this outside of R (see my answer), I think your best tool would be BEDTools, as two people above have suggested.

mergeBed -i peaks.bed -n

This will give you a count of how many merged overlapping peaks there are.

ADD REPLY
0
Entering edit mode

There is no direct way to do this with bedops only, but I'll edit my answer to suggest a mix of bedops and bedmap that might work for you.

ADD REPLY
0
Entering edit mode

Hi, Alex:

Thanks for the post. I just followed your post to find hit counts great or equal to 2 out of 3 samples. Can you please also post the awk script for the step 3 and step 4? Thanks very much.

best

HY

ADD REPLY
1
Entering edit mode
12.1 years ago
Irsan ★ 7.8k

Definetely something that can be solved with bedtools or bedops suite (this one especially useful for multiple bedfiles). Just Google for them

ADD COMMENT
1
Entering edit mode
12.1 years ago

If you are already working in R, I would suggest you take a look at GRanges and I(interval)Ranges. Here is a short presentation from someone in our computational genomics department about why he likes them both, and I generally agree with him.

ADD COMMENT
0
Entering edit mode
12.1 years ago

The DiffBind Bioconductor package is a pretty convenient wrapper for this kind of thing. It lets you calculate peak-wise overlap statistics or merge your peaks into a consensus set and then assess the tag enrichment for each sample, if you prefer that. It also interfaces to DESeq to calculate statistically significant differential binding.

ADD COMMENT
0
Entering edit mode

Hello Mikael,

As I came across this post and In relation to it I am also working with diffbind and facing problems with it. So i need to ask have you also used DiffBind before?

ADD REPLY
0
Entering edit mode

Yes, not very much though. It has worked well for me.

ADD REPLY
0
Entering edit mode

Hi, Thanks for replying. As I am working on my dataset and at first made csv file for it, but with further steps I am getting error. It would be quite helpful if you could throw some light on this issue. Like (Tamoxifen) how we can generate table of our own data with intervals?

ADD REPLY
0
Entering edit mode

I suggest you post this as a new question. It's too inconvenient to discuss in a comment thread. Please make sure to describe the problem in as much detail as possible.

ADD REPLY

Login before adding your answer.

Traffic: 2406 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6