You can modify the example from "4.7. Working with many input files at once with bedops and bedmap" into a general solution for N input files (where your N is 204):
$ bedops --everything file1.bed file2.bed ... fileN.bed \
| bedmap --count --exact --echo --delim '\t' - \
| awk -v nFiles=N '$1==nFiles' \
| cut -f2- - \
| awk '!seen[$0] { print } { ++seen[$0] }' \
> answer.bed
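With 204 files you probably don't want to type the names or N by hand. Here's a rough sketch of the same pipeline where the shell expands the file list and counts it for you. The temporary directory and the three toy files are made up for illustration (they stand in for your real directory of BED files), and I'm assuming every input is already sorted with sort-bed:

```shell
# Hypothetical setup: three tiny sorted BED files standing in for your 204.
workdir=$(mktemp -d)
for i in 1 2 3; do
  printf 'chr1\t100\t200\n' > "$workdir/file$i.bed"
done

# Let the shell expand the file list and count it, so N is never typed by hand.
set -- "$workdir"/*.bed
nFiles=$#
echo "N = $nFiles"

# Run the pipeline only if BEDOPS is installed; inputs are assumed sorted.
if command -v bedops >/dev/null 2>&1 && command -v bedmap >/dev/null 2>&1; then
  bedops --everything "$@" \
    | bedmap --count --exact --echo --delim '\t' - \
    | awk -v nFiles="$nFiles" '$1==nFiles' \
    | cut -f2- \
    | awk '!seen[$0] { print } { ++seen[$0] }' \
    > "$workdir/answer.bed"
fi
```

For your data, point the glob at your own directory instead of the temporary one. The pipeline itself is unchanged from the one above.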
To break this down:
The bedops --everything command generates a multiset union, which is piped to a bedmap operation that uses --count and --exact to count, for each element in the union, how many elements from all input files ("sets") match it exactly, itself included. The --echo operation prints the element after its count on standard output, separated by the --delim tab character.
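We use awk and cut to filter this result to any input element that shows up N times, then strip the count column. If an element maps N times, we know it is found in common to all N input sets, provided no single file contains the same element more than once.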
The final awk statement filters duplicate elements, leaving us a single copy of each element that exists in common to all N input sets.
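If you want to see what the filtering steps do without installing anything, here's a small sketch that feeds hand-written lines through the two awk steps and cut. The sample lines are invented; they only mimic the count, tab, element layout that bedmap prints with --count --echo --delim '\t', and N is set to 3 for the toy case:

```shell
nFiles=3

# Made-up stand-in for the bedmap output: count, chrom, start, end (tab-delimited).
result=$(printf '%s\t%s\t%s\t%s\n' \
    3 chr1 100 200 \
    3 chr1 100 200 \
    3 chr1 100 200 \
    1 chr1 150 250 \
    2 chr2 10 20 \
    2 chr2 10 20 \
  | awk -v nFiles="$nFiles" '$1==nFiles' \
  | cut -f2- \
  | awk '!seen[$0] { print } { ++seen[$0] }')

# Only the element counted 3 times survives, and it is printed once.
printf '%s\n' "$result"
```

The first awk keeps lines whose count equals N, cut drops the count column, and the last awk prints each remaining line only the first time it is seen.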
As the linked BEDOPS example page describes, this technique avoids calculating cxN-1 comparisons, which can be required with other approaches.
In addition to being efficient, this is a useful general approach, in that you can easily relax the constraints. This solution applies the most stringent requirements: all BED elements in the output set must match exactly, and they must be found in all sets.
However, your experiment might require modifying the mapping step to specify some fractional overlap between elements (instead of --exact matches, you could use bedmap --fraction-* operations), or you could modify the first awk filter step if you only need elements found in M or more input sets, where 0 < M < N. It's easy to tweak the pipeline to these requirements.
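As a sketch of the second relaxation, the only change is the comparison in the first awk step. The sample lines below are invented stand-ins for bedmap output (count, then the element), and M=2 is an arbitrary threshold picked for the toy case:

```shell
M=2

# Made-up stand-in for the bedmap output: count, chrom, start, end (tab-delimited).
relaxed=$(printf '%s\t%s\t%s\t%s\n' \
    3 chr1 100 200 \
    3 chr1 100 200 \
    3 chr1 100 200 \
    1 chr1 150 250 \
    2 chr2 10 20 \
    2 chr2 10 20 \
  | awk -v M="$M" '$1>=M' \
  | cut -f2- \
  | awk '!seen[$0] { print } { ++seen[$0] }')

# Elements counted at least M times are kept, one copy each.
printf '%s\n' "$relaxed"
```

For the fractional-overlap relaxation you would instead swap --exact for one of the bedmap overlap options, keeping in mind that the count then tallies overlapping elements rather than files, so it's worth checking the bedmap documentation before relying on the threshold.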
I don't quite understand the question. Do you want to extract the "common lines" between two files? If so, you can use this bash command: