Combine .bed files and keep columns / headers
2
0
Entering edit mode
5.5 years ago
fr ▴ 220

I'm combining multiple .bed files. For instance:

#file1
chr1    1   2    2
chr1    10  11    3
chr1    50  51   4

#file2
chr1    1   2   10
chr1    10  11   8
chr10   2   3   8

#file3
chr1    1   2   1
chr1    50  51   2
chr10   2   3   9

All files have the same structure for the 4 columns: chr start end filename

I want to combine all of them (have a lot of them!) such that the column names are kept for columns 4 and higher.

chrm start end file1 file2 file3
chr1    1   2   2    10    1
chr1    10  11  3    8      NA
chr1    50  51  4    NA   2   
chr10   2   3   NA  2      9

I have tried this approach but the end result doesn't have column names and doesn't show the multiple matching columns (i.e. output only has 4 columns, with everything concatenated vertically).

I have also tried this approach which seems to work a bit better, but doesn't have column names either - and for a large number of files, it becomes very hard to assign them. It also seems that it fails for chr 10 - 22 as indicated here.

Could someone please help me get the last output? I would really appreciate it!

Thanks a lot in advance

bedtools bedops bedmap • 2.8k views
ADD COMMENT
3
Entering edit mode
5.5 years ago
ATpoint 85k

I think this two-step process should do it:

bedtools multiinter -empty -i <all*bed*files>.bed -g chromSizes.txt \
  | awk 'OFS="\t" {if($5 == "none") next} {print}' \
  | tee >(cut -f1-3 > coords.bed) \
  | cut -f6- \
  | awk 'OFS="\t" {gsub("^0$","NA");print}' > values.txt

paste coords.bed values.txt > output.txt

Essentially, it exploits bedtools multiinter's -empty function to output null for intervals that are empty in some files. As this is only possibe together with a genome file, one simply awks out all intervals empty in any file. One then separates the coordinates from the rest to be able to 1) exclude unwanted output from multiinter and 2) to be able to do a simply gsub replacing zeros with NA. Eventually, paste the two parts together.

ADD COMMENT
0
Entering edit mode

Thanks so much @ATpoint, I'm almost there. However, I get only 1 and NA in output.txt, and don't know how to intersect it with the initial input files. Also, is the order of columns 4 and so on the same as that specified in -i [file1.bed, file2.bed, file3.bed]? It doesn't seem that the output is sorted, so I just want to make sure

ADD REPLY
0
Entering edit mode

With your dummy data I get exactly the output you gave as the desired one.

ADD REPLY
0
Entering edit mode

I really don't understand why, but now I can't even get the NA. I have used \t as delimiters in my input, and now I'm getting

#cat output.txt
chr1    1   2   1   1   1
chr1    2   5   1   0   0
chr1    10  11  1   1   0
chr1    50  51  1   1   1
chr10   2   3   0   0   1
ADD REPLY
1
Entering edit mode

chr1 2 5 1 0 0 this line does not make sense as it is not in your dummy input. Maybe you have left-over files from previous attempts to run the code snippet, make sure you clean up properly.

ADD REPLY
2
Entering edit mode
5.5 years ago

You could use the second part of my answer to get NaN values, and just cat a header at the end with awk or some other scripted approach.

ADD COMMENT
0
Entering edit mode

Thanks. I was able to get it working with dummy data above, but not with my own data. I hope to get there soon! Thanks a lot

ADD REPLY
1
Entering edit mode

Perhaps compare the structure and content of the dummy data with the structure and content of a small snippet (head) of your own data. This should help guide you in preparing your own data or modifying the commands to use your data as-is.

ADD REPLY
0
Entering edit mode

Indeed. Thanks a lot, both your post and @ATpoint's below were quite useful. I got it working but for my purposes I think --partition was better than merge

ADD REPLY

Login before adding your answer.

Traffic: 1848 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6