Question

intersect multiple bed files in R

0

Entering edit mode

8.3 years ago

tobikenobi • 0

Dear all,

I would like to match (intersect) a quite big set of chip-seq peak bed files with a single bed file containing transcriptional start sites (TSS) identified by cage (cap analysis of gene expression). By using a large cohort of transcription factors (TFs), I am hoping to generate a map of the TFs associated with my promoters.

Technically, I have my promoter bed file and currently about 80 chip-seq peak bed files. I would like to look at TF binding in close upstream proximity of my promoters (for instance within 500 bp upstream of the TSS). I read that bedtools could be used for this, but I am limited to a windows laptop and I am unsure if this would work. However, I know a little but of R and would therefore prefer to use any R package that could the job.

I would be very grateful for any hints how I could start the analysis.

Thank you very much!

Tobias

ChIP-Seq R • 8.0k views

ADD COMMENT • link updated 8.2 years ago by Sirus ▴ 820 • written 8.3 years ago by tobikenobi • 0

0

Entering edit mode

8.3 years ago

1769mkc ★ 1.3k

I think you can use the bedtools intersect function ..

ADD COMMENT • link 8.3 years ago by 1769mkc ★ 1.3k

0

Entering edit mode

8.2 years ago

Sirus ▴ 820

You can use the findOverlapsOfPeaks function from the ChIPpeakAnno package.

Just to make the code more easy to follow, I use the readBroadPeak function from the genomation package. But you don't need to use too many packages to do that.

Suppose you have 3 bed files in the boradPeak format you can do

bed1 =  readBroadPeak("bed1.broadPeak")
bed2 =  readBroadPeak("bed2.broadPeak")
bed3 =  readBroadPeak("bed3.broadPeak")

peaksList =  list(bed1 = bed1, bed2= bed2, bed3=bed3)

ovp = findOverlapsOfPeaks(peaksList)
head(ovp$peaklist[["bed1///bed2///bed3"]])

ADD COMMENT • link 8.2 years ago by Sirus ▴ 820

score 3 · Accepted Answer · 2017-03-02

3

Entering edit mode

8.3 years ago

Alex Reynolds 36k

Windows is not a good platform for this line of work. However, if you must run Windows, to do this efficiently, install the BEDOPS toolkit in Cygwin. Or use VirtualBox and install CentOS or Ubuntu Linux in a virtual machine.

Either way, you'll want to have some form of a Unix or Unix-like operating system on your computer.
Depending on the naming scheme for your files, generate the multiset union of N peak files:
```
$ bedops --everything peakA.bed peakB.bed ... peakN.bed > unioned.peaks.bed
```
Map peaks to promoters:
```
$ bedmap --echo --echo-map-id-uniq --delim '\t' promoters.bed unioned.peaks.bed > answer.bed
```
This returns a list of promoters and the IDs of overlapping peaks, where there is one or more bases of overlap between peaks and promoters. If you want the entire peak, use --echo-map in place of --echo-map-id-uniq.

Once you are familiar with how this works, steps 2 and 3 can be merged into one step:
```
$ bedops --everything peakA.bed peakB.bed ... peakN.bed | bedmap --echo --echo-map-id-uniq --delim '\t' promoters.bed - > answer.bed
```
This runs even faster and avoids generating intermediate files.

ADD COMMENT • link 8.3 years ago by Alex Reynolds 36k

0

Entering edit mode

Thank you very much for your reply.

I have two questions before I will actually try your suggestion: 1. If I understand correctly, in step 2 all chip-seq peak files are merged into one big file (unioned.peaks.bed). If I find overlap of peaks with promoters in step 3 , how will I know from which TF peak file they come from? Will the output file state it? 2. How can I set a window (lets say 500 bp ) upstream of the promoter to look for TF / promoter overlap? As my promoter file is actually a TSS file, the width of the genomic locations listed is quite narrow (1-100 bp roughly).

Again, thank you very much for your help!

ADD REPLY • link 8.3 years ago by tobikenobi • 0

0

Entering edit mode

Pre-process your peaks files and add their names to the ID column.

$ awk -vname="1" 'BEGIN { OFS="\t"; }{ print $1, $2, $3, name; }' peakFile1.bed > peakFile1.labeled.bed

Assuming your peak files are named sensibly, you could create a loop to rename them:

$ for idx in `seq 1 80`; do awk -vname="$idx" 'BEGIN { OFS="\t"; }{ print $1, $2, $3, name; }' peakFile$idx.bed > peakFile$idx.labeled.bed; done

Then use bedmap as described, using the labeled peak files instead of the original peak files.

If you want to set a 500 bp upstream window, add --range 500 to the given bedmap command. Overlaps will be reported 500 bases up- and downstream of promoter edges.

ADD REPLY • link 8.3 years ago by Alex Reynolds 36k

0

Entering edit mode

Great! Thank you very much for your help!

Best,

Tobias

ADD REPLY • link 8.3 years ago by tobikenobi • 0