Question

Bedtools - Intervals frequency occurence in multiple files

0

Entering edit mode

8.5 years ago

mujupas ▴ 80

Hi!

I have multiple BED files and I know that for each file some of the intervals are shared with some of the other files, while some are unique (for example present in one or 2 files only).

Ideally, the final table should contain all the possible intervals with 2 columns, one for the count of frequency of occurence (for example, found in 4 files) and one with the names of the files in which that interval has been found.

What's the best way to do this?

I was thinking:

Merge all the BED files, and then
Use bedtools coverage to intersect the files one by one with the merged file but I have no idea how to make the plot the output.

I imagine this will require some scripting in perl but this is entirely new to me and I could really use some help...

Thank you in advance!

bedtools • 2.2k views

ADD COMMENT • link updated 8.5 years ago by Alex Reynolds 36k • written 8.5 years ago by mujupas ▴ 80

score 5 · Accepted Answer · 2017-01-30

Say you have n files called 1.bed through n.bed.

You can pre-process your n files to make them four-column BED files, where the fourth column identifies their source filename:

$ for idx in `seq 1 n`; do cut -f1-3 $idx.bed | awk -vidx=$idx '{ print $0"\tfile-"idx; }' - > $idx.id.bed; done

This approach with BEDOPS bedmap --count generalizes to n internally-disjoint input BED4 files, in which the fourth column contains a string identifying the filename:

$ bedops --everything 1.id.bed 2.id.bed 3.id.bed ... n.id.bed \
    | bedmap --echo --count --echo-map-id --delim '\t' - \
    > answer.bed

The bedmap step uses, by default, overlap of one or more bases for inclusion. You can modify this threshold to be more stringent.

The file answer.bed has six columns. The first three columns are the genomic interval. The fourth column is the ID value indicating which file the interval comes from. The fifth column indicates the number of overlaps with all the intervals from all n files. The sixth column is a semi-colon delimited string indicating the files where overlaps are occurring.