Question

Combining multiple bedfiles

2

Entering edit mode

6.5 years ago

ccagg ▴ 60

I have many (>100) bed files in this format

file1

chr1    1   2    2
chr1    10  11    3
chr1    50  51   4

file2

chr1    1   2   10
chr1    10  11   8
chr10   2   3   8

fileN

chr1    1   2   1
chr1    50  51   2
chr10   2   3   9

where the files have some sites in common, but not all of them.

My goal is to obtain a file that looks like this

fileN

chrm start end file1 file2 fileN
chr1    1   2   2    10    1
chr1    10  11  3    8      NA
chr1    50  51  4    NA   2   
chr10   2   3   NA  2      9

I want to be able to do this efficiently for all 100 files that I have. I was thinking about using bedtools but don't want to use many many lines of intersections. The files are quite large as well. Does anyone have a suggestion on how to do this? Thanks!

bed bedtools • 4.9k views

ADD COMMENT • link updated 4.5 years ago by Biostar 20 • written 6.5 years ago by ccagg ▴ 60

0

Entering edit mode

6.5 years ago

finswimmer 16k

Hello,

you could use a slightly modified version of my python script in this answer:

import glob
import sys
import pandas


def read_filenames(names):
    for arg in names:
        if "*" in arg:
            for file in glob.glob(arg):
                yield file
        else:
            yield arg


data = []

for sample in read_filenames(sys.argv[1:]):
    frame = pandas.read_csv(
        sample,
        sep="\t",
        header=None,
        names=["chr", "start", "end", sample],
        index_col=[0, 1, 2]
    )
    data.append(frame)

data_joined = data[0].join(data[1:], how='outer')

print(data_joined.to_csv(
    sep="\t",
    na_rep="NA",
    float_format='%.0f'
    ))

Run it like this:

$ python join.py File* > result.csv

fin swimmer

ADD COMMENT • link 6.5 years ago by finswimmer 16k

score 6 · Accepted Answer · 2018-06-12

BEDOPS bedops and bedmap are efficient in memory and time on whole-genome scale files. These tools also work with standard input and output streams, which allows chaining of operations and integration of set operations with standard Unix tools. These features make work on hundreds of files tractable.

Method 1

If you can get away without the NA, given sorted single-base BED files A.bed, B.bed through N.bed (N=100 or whatever), you could take their union and their merge (these are all stored under the directory files):

$ bedops -u files/A.bed files/B.bed ... files/N.bed > union.bed
$ bedops -m files/A.bed files/B.bed ... files/N.bed > merge.bed

Or if you want to work on all the files in this directory, more simply:

$ bedops -u files/*.bed > union.bed
$ bedops -m files/*.bed > merge.bed

Then map the IDs in the union file to the intervals in the merge file:

$ bedmap --echo --echo-map-id --delim '\t' merge.bed union.bed > answer.bed

The file answer.bed would look like:

chr1    1   2   1;10;2
chr1    10  11  3;8
chr1    50  51  2;4
chr10   2   3   8;9

The --multidelim '\t' option could be added, but using the default semi-colon delimiter may make it easier to demonstrate the result. In this approach, there are no NA values and ID values are presented in lexicographical order, not the order of original input files.

Method 2

If you need an NA value positioned in order from columns 4 through N+3, where ID values are missing, then one option is to loop over all N BED files to generate N per-file maps:

$ for fn in `ls files/*.bed`; do echo $fn; bedops -n 1 merge.bed $fn | awk '{ print $0"\tNA" }' | bedops -u - $fn | cut -f4 > $fn.map; done

The pipeline of bedops commands here uses merge.bed to fill in gaps where intervals are missing, adding NA as the ID value for missing intervals before doing the final mapping step with cut. At the end, each of the per-file maps is a column in the result you ultimately want.

Finally, use paste to glue all these columns together into an N+3 column matrix:

$ paste merge.bed files/*.bed.map > answer.mtx

The file answer.mtx will contain the intervals in the first three columns, and ID values from columns 4 through N+3, for input files A.bed through N.bed. It'll look like this:

chr1    1   2   2   10  1
chr1    10  11  3   8   NA
chr1    50  51  4   NA  2
chr10   2   3   NA  8   9

Basically, this is the result you want, but without the chrm ... fileN header at the top.

Unix pipelines demo'ed in this answer will make all of this work go faster, by avoiding generating intermediate files as much as possible.

If you have access to a compute cluster and each file is very large, or you have many hundreds of files, and you need to automate this process, each iteration or step in the serial per-map loop can be parallelized as a separate job on the cluster. Then you have a final, "dependent" job that does the paste step at the end, in map-reduce fashion.