Question

Bedtools: Top N Most Similar Regions When Comparing Two Bed/Wig/Bam Files?

5

Entering edit mode

12.8 years ago

Ahdf-Lell-Kocks ★ 1.6k

Is there an easy way of finding, probably with bedtools, given a window size, the top N most correlated regions when comparing two bed/wig files? For example, in comparing two bed/wig/bam files that have PolII data for 2 conditions, to give the top N windows where the wiggle profiles are most similar?

bedtools correlation • 4.9k views

ADD COMMENT • link updated 12.8 years ago by Aaronquinlan 12k • written 12.8 years ago by Ahdf-Lell-Kocks ★ 1.6k

2

Entering edit mode

depends what you mean by "easy" you could do a binomial test for each window. it'd be simpler to answer your question if you have a few lines of example data.

ADD REPLY • link 12.8 years ago by brentp 24k

score 8 · Answer 1 · 2012-02-14

If I understand the problem correctly, here's a basic workflow that may work for you. One basic assumption I am making here is that by PolII "data", you mean alignments from a ChIP-seq experiment.

1. Convert your WIG files to BEDGRAPH with the venerable Jim Kent's wigToBigWig and bigWigToBedGraph.

  wigToBigWig polII.condA.wig chrom.sizes polII.condA.bw
  wigToBigWig polII.condB.wig chrom.sizes polII.condB.bw

  bigWigToBedGraph polII.condA.bw chrom.sizes polII.condA.bedg
  bigWigToBedGraph polII.condB.bw chrom.sizes polII.condB.bedg

2. Make a BED file of non-overlapping 100kb windows. I chose 100Kb. These will end up being the windows from which you are assessing the correlation b/w the two conditions.

  bedtools makewindows -g chrom.sizes -w 100000 > windows.bed

3. Measure the coverage in the BEDGRAPH files at each 100Kb window. This step uses a new (unreleased) tool called "map" (think functional programming), that summarizes data for intervals in one file (here, the "score/depth" column from your BEDGRAPHs) based on overlaps with another file (i.e., windows.bed). In other words, we are summing the score column for each BEDGRAPH interval that overlaps a window in windows.bed. Column 4 (-c 4) is the score/depth column, and "-o sum" is the operation that should be applied to that column. NOTE: both the BEDGRAPH and windows.bed files must be sorted by chrom, then start. If this workflow seems roughly like what you want, you can grab the "map" function from the bedtools GitHub repository.

  bedtools map -a windows.bed -b polII.condA.bedg -c 4 -o sum \
           > polII.condA.windows.bedg
  bedtools map -a windows.bed -b polII.condB.bedg -c 4 -o sum \
           > polII.condB.windows.bedg

4. Combine the two files tow create a single file with the intervals and the values from each condition.

 paste <(cut -f 1-4 polII.condA.windows.bedg) \
       <(cut -f 4 polII.condB.windows.bedg) \
       > polII.conditions.windows.txt

 # the output will look something like this, where the first three columns 
 # are the windows, and the last two columns are the scores 
 # from conditions A and B, respectively. E.g.:
 chr1    0        100000    163    172
 chr1    100000   200000    313    94
 chr1    200000   300000    1      774
 ...

5. Use the counts above (columns 4 and 5) to compute a measurement of correlation. I'm not sure of what the right measure is here, so I will name this piece your script, but in principle, assuming you have some score as the 6th column, you could just use sort and head to pick the top N values. Perhaps a more poweful alternative would be to load this file into R and use it to assess correlation, etc.

 N=10
 yourscript.[pl|.py|.R|.rb|.*] polII.conditions.windows.txt | \
    sort -k6,6nr | \
    head -n $N | \
    > table1.naturepaper.txt