how to find chromosome position ranges common to all files
2
0
Entering edit mode
9.9 years ago

I have a load of bed files which have been filtered according to certain criteria. The bins are the same size in all files but they contain sometimes the same bins after filtering and sometimes different bins. I would like to extract all the common bins. For example below. I have tried bedtools intersect but this will only allow comparison of one sample with all the other samples whereas I want to see what is common to all the samples

File 1:

chrom    chrStart    chrEnd
chr1     1           1001
chr1     11001       12001
chr1     12001       13001

File 2:

chrom    chrStart    chrEnd
chr1     1           1001
chr1     9001        10001
chr1     12001       13001

The output should be:

File 1:

chrom    chrStart    chrEnd
chr1     1           1001
chr1     12001       13001
sequencing genome • 2.7k views
ADD COMMENT
0
Entering edit mode

I don't understand very well.. do you want to extract the "common lines" between two files? If so, you can use this bash command:

comm -12 <( sort file1 ) <( sort file2 )
ADD REPLY
1
Entering edit mode
9.9 years ago

Modifying the example from 4.7. Working with many input files at once with bedops and bedmap to a general solution for N input files (where your N is 204):

$ bedops --everything file1.bed file2.bed ... fileN.bed  \
    | bedmap --count --exact --echo --delim '\t' - \
    | awk -v nFiles=N '$1==nFiles' \ 
    | cut -f2- - \
    | awk '!seen[$0] { print } { ++seen[$0] }' \
    > answer.bed

To break this down:

The bedops --everything command generates a multiset union, which is piped to a bedmap operation that uses --count and --exact to count how many times there is an exact match between one element in the unioned set and all other elements in all input files ("sets"). The --echo operation prints the count and the element to standard output.

We use awk and cut to filter this result to any input element that shows up N times (if it maps N times, we know it is found in common to all N input sets.

The final awk statement filters duplicate elements, leaving us a single copy of elements that exist in common to all N input sets.

As the linked BEDOPS example page describes, this technique avoids calculating cxN-1 comparisons, which can be required with other approaches.

In addition to being efficient, this is a useful general approach, in that you can easily relax the constraints. This solution applies the most stringent requirements: all BED elements in the output set must match exactly, and they must be found in all sets.

However, your experiment might require modifying the mapping step to specify some fractional overlap between elements - instead of --exact matches, you could use bedmap --fraction-* operations - or you could modify the first awk filter step, if you only need some M or greater number of common matches, where 0 < M < N input sets. It's easy to tweak the pipeline to these requirements.

ADD COMMENT
0
Entering edit mode

Would that allow me find the commonalities between 204 files?

ADD REPLY
0
Entering edit mode
9.9 years ago
ashwini ▴ 100
$ cat file1.bed file2.bed .. file204.bed | sort --parallel=30 | uniq -c | awk '{if ($1==204)print}' > common.bed

I am not sure whether this is the efficient way of doing but you can give a try.

ADD COMMENT

Login before adding your answer.

Traffic: 1841 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6