You may also check multiIntersectBed, which is part of the bedtools package and is discussed towards the bottom (second answer from me) of this Biostar thread. I plan to release this on the Galaxy Tool Shed in the next month.
EDIT: More explicit example.
Extend your example to two individuals, A and B.
$ cat all.txt
A chr1 100 1000 alg1
A chr1 150 1200 alg2
A chr1 5000 6000 alg1
A chr1 7000 8000 alg2
B chr1 100 1000 alg1
B chr1 150 1200 alg2
B chr1 5000 6000 alg1
B chr1 7000 8000 alg2
Use awk to split the file into distinct ind/alg files.
awk '{outfile=$1"_"$5".bed"; print $2"\t"$3"\t"$4 >> outfile}' all.txt
Note the use of ">>" to create and append to files whose names (outfile=) are built from the individual ($1) and the algorithm ($5). List the resulting files as a sanity check:
ls -1 *_*.bed
A_alg1.bed
A_alg2.bed
B_alg1.bed
B_alg2.bed
Use multiIntersectBed to find intervals that are common to multiple files. The fourth column is the count of files in which the interval is present. The fifth is a list of the file labels (-names argument) in which the interval was found. The remaining columns are 1/0 indicators, one per input file, of whether the interval was found in that file.
multiIntersectBed -i A_alg1.bed A_alg2.bed B_alg1.bed B_alg2.bed -names A1 A2 B1 B2
chr1 100 150 2 A1,B1 1 0 1 0
chr1 150 1000 4 A1,A2,B1,B2 1 1 1 1
chr1 1000 1200 2 A2,B2 0 1 0 1
chr1 5000 6000 2 A1,B1 1 0 1 0
chr1 7000 8000 2 A2,B2 0 1 0 1
You can then use awk or a simple Perl script to limit the results to those intervals supported by two or more algorithms for the same individual.
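As a rough sketch of that awk filter: the function below (filter_same_individual is a name made up for this example) keeps a row only if at least two labels in column 5 belong to the same individual. It assumes every -names label is an individual ID followed by a numeric algorithm suffix, as in A1, A2, B1, B2 above. If your individual IDs themselves end in digits, you would need labels with an explicit separator and a different way of stripping the suffix.

```shell
# Hypothetical helper: reads multiIntersectBed output on stdin and prints
# only rows where one individual is supported by two or more algorithms.
# Assumption: labels look like <individual><algorithm number>, e.g. A1, B2.
filter_same_individual() {
  awk '{
    split("", seen)                  # reset the per-row counts
    keep = 0
    n = split($5, labels, ",")       # column 5 is the comma-separated label list
    for (i = 1; i <= n; i++) {
      ind = labels[i]
      sub(/[0-9]+$/, "", ind)        # strip the algorithm suffix: A1 -> A
      if (++seen[ind] >= 2) keep = 1
    }
    if (keep) print
  }'
}

# Demo on two rows copied from the four-file output above.
printf 'chr1\t100\t150\t2\tA1,B1\t1\t0\t1\t0\nchr1\t150\t1000\t4\tA1,A2,B1,B2\t1\t1\t1\t1\n' |
  filter_same_individual
```

The A1,B1 row is dropped because its two hits come from different individuals. In practice you would pipe the multiIntersectBed command itself into the function.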
Or, more simply, you could just do a separate command for each individual and just look for output where column 4 is >= 2:
multiIntersectBed -i A_alg1.bed A_alg2.bed -names A1 A2
chr1 100 150 1 A1 1 0
chr1 150 1000 2 A1,A2 1 1
chr1 1000 1200 1 A2 0 1
chr1 5000 6000 1 A1 1 0
chr1 7000 8000 1 A2 0 1
multiIntersectBed -i A_alg1.bed A_alg2.bed -names A1 A2 | awk '$4>1'
chr1 150 1000 2 A1,A2 1 1
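With many individuals, the per-individual command can be wrapped in a loop. This is only a sketch: list_individuals and intersect_per_individual are names invented here, and it assumes the files follow the <individual>_<alg>.bed naming produced by the awk split, that individual IDs contain no underscores or whitespace, and that multiIntersectBed is on your PATH.

```shell
# Hypothetical helper: derive the individual IDs from the file names,
# i.e. everything before the first underscore.
list_individuals() {
  for f in *_*.bed; do
    [ -e "$f" ] || continue          # skip the literal pattern if nothing matches
    printf '%s\n' "${f%%_*}"
  done | sort -u
}

# Hypothetical helper: one multiIntersectBed call per individual, keeping
# intervals found in two or more of that individual's files and prefixing
# each row with the individual ID.
intersect_per_individual() {
  for ind in $(list_individuals); do
    multiIntersectBed -i "${ind}"_*.bed |
      awk -v ind="$ind" '$4 > 1 {print ind "\t" $0}'
  done
}

# Usage (needs bedtools installed):
#   intersect_per_individual > shared_intervals.txt
```

Because each call only sees one individual's files, intervals are never merged across individuals, and the ID column keeps the pooled output unambiguous.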
I hope this helps.
One possible solution is the awk command that aaronQuinlan has suggested below. This would split your original file into bed files grouped by the first and fifth column. Is this what you are after?
This looks useful. But most of the examples on the bedtools page seem to be good for intersecting bed files without pre-specifying that another column (like ID) should match. I have around 7k samples and wish to make sure that merges are only done in cases where samples match. Am I missing something? Could you possibly give an example of how to carry out the function above only when a fourth column matches?
I think I can imagine a solution to this that uses multiIntersectBed and awk as below. It will make a lot of temporary files, but those can be concatenated with cat at the end.
Thanks.