Question

Count Columns For Each Window

0

Entering edit mode

10.7 years ago

roll ▴ 350

I would like to count columns for the matching windows (that is in bed format). I tried awk but too slow.

For example, I have the following two windows

chr1:100-1000
chr1:1500-3000

Fort these 2 windows I found the following matches and would like to count 6th column based on whether the last column is 1 or 0.

chr1 100 1000 chr1 200 0 1 0
chr1 100 1000 chr1 500 0 4 0
chr1 100 1000 chr1 700 0 6 1
chr1 1500 3000 chr1 2000 0 9 1
chr1 1500 3000 chr1 2000 0 1 0

My desired results will be

chr1 100 1000 6/11
chr1 1500 3000 9/10

I tried doing this with awk but since i have millions of entry it is very slow. Is there a bioinformatics tool that do such counts? Or any faster solution?

awk • 3.6k views

ADD COMMENT • link updated 10.7 years ago by dariober 15k • written 10.7 years ago by roll ▴ 350

0

Entering edit mode

So let me try to understand this from what I see in your example. Let's say for window of chr 100 1000, you sum up all the numbers in the seventh column that corresponds to this window (hence giving the final value of 11), and then for all those rows, you are looking for the rows that has value of 1 in the final column. Is that right?

ADD REPLY • link 10.7 years ago by youngcsong ▴ 100

0

Entering edit mode

Also, I assume this is tab-delimited file?

ADD REPLY • link 10.7 years ago by youngcsong ▴ 100

0

Entering edit mode

yes exactly. My attempt was to have a list of all the windows and use while loop in perl. And awk to count the counts. But it is very slow.

ADD REPLY • link 10.7 years ago by roll ▴ 350

0

Entering edit mode

Can you have more than one rows with last column value of 1? If so, do you add those values together (so, if the first line in your example had last column value of 1, would this be 7/11?

ADD REPLY • link 10.7 years ago by youngcsong ▴ 100

0

Entering edit mode

yes you can have more than 1. The example you are giving is correct.

ADD REPLY • link 10.7 years ago by roll ▴ 350

score 3 · Answer 1 · 2014-03-21

Use awk to split your BED data into two files:

The first file contains all 0/1 overlaps (where the eighth column is either 0 or 1), with the seventh column moved to the fifth (score) column
The second file contains 1-only overlaps (where the eighth column is 1), with the seventh column moved to the fifth (score) column

Let's call these two files allOverlaps.bed and oneOverlaps.bed. Make sure they are sorted per BEDOPS sort-bed:

$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$7}' overlaps.bed | sort-bed - > allOverlaps.bed
$ awk '{if ($8==1) {print $1"\t"$2"\t"$3"\t"$4"\t"$7}}' overlaps.bed | sort-bed - > oneOverlaps.bed

Set up a third, tab-delimited, sorted BED file called windows.bed, with your windows of interest:

chr1    100    1000
chr1    1500   3000
...

Here's a one-liner where you apply two BEDOPS bedmap operations in piped, serial fashion:

$ bedmap --echo --sum --delim '\t' windows.bed oneOverlaps.bed | bedmap --echo --sum --delim '/' - allOverlaps.bed > answer.bed

The file answer.bed should follow your specified output format, I think.

To understand how this works, you are applying the --sum operand on the score column of any mapped overlaps (as a result, note that this allows multiple rows with 1 in the last column). By doing this map operation on each category of overlaps, and then by piping results from the first set to the next, we get the final answer.

Should be speedy, I'd think, though awk is tough to beat in straight-up parsing. The sort-bed tool is much faster than alternatives and the bedmap application works very efficiently with BED files (and since your data are already in that format, more or less, why not use the tools to match). We've got some performance enhancements coming soon to make that even faster.

score 2 · Answer 2 · 2014-03-21

2

Entering edit mode

10.7 years ago

Michael 55k

#!/usr/bin/env perl
use strict;
use warnings;

my %store = ();

while (<>) {
  chomp;
  next if /^\s*$/;
  my @l = split;
  my $coord = $l[0].':'.$l[1].':'.$l[2]; ## your chromosome names better don't contain ':'
  $store{$coord} = [0,0] unless (exists $store{$coord});
  $store{$coord}->[0] += $l[6];
  $store{$coord}->[1] += $l[6] if $l[7] == 1;
}

while (my ($k, $v) = each %store) {
  print join "\t", ( (split (':', $k)), $v->[1].'/'.$v->[0]), "\n";
}
__END__

Fast enough?

ADD COMMENT • link 10.7 years ago by Michael 55k

0

Entering edit mode

That sure was fast...I was thinking of a same approach, except I am using python.

ADD REPLY • link 10.7 years ago by youngcsong ▴ 100

0

Entering edit mode

Fine, post yours, and op can do the speed test ;)

ADD REPLY • link 10.7 years ago by Michael 55k

0

Entering edit mode

Hmm...16 lines of Perl that only uses while loop vs. 78 lines of python codes containing for loops...I think there already is a winner before the speed test has even begun. You sir, has already put my coding skill to shame :)

ADD REPLY • link 10.7 years ago by youngcsong ▴ 100

score 1 · Answer 3 · 2014-03-22

Not sure if I completely understand the question, but I'll give it a shot...

If the starting point is a bed file of target regions already intersected with the "score" regions, like this:

cat b.bed
chr1 100 1000 chr1 200 0 1 0
chr1 100 1000 chr1 500 0 4 0
chr1 100 1000 chr1 700 0 6 1
chr1 1500 3000 chr1 2000 0 9 1
chr1 1500 3000 chr1 2000 0 1 0

Then with awk & bedtools you can do:

awk '{print $0 "\t" $7 * $8}' b.bed \
| groupBy -g 1,2,3 -c 7,9 -ops sum,sum \
| awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$5"/"$4}'

(The second awk is just to have the output formatted exactly as in the OP's example).

If instead the starting point is a bed file of target regions (w.bed) and a bed file of score intervals (a.bed), create the b.bed file above with:

## Example input files
cat w.bed
chr1 100 1000
chr1 1500 3000

cat a.bed 
chr1 199 200 0 1 0
chr1 499 500 0 4 0
chr1 699 700 0 6 1
chr1 1999 2000 0 9 1
chr1 1999 2000 0 1 0

## Create b.bed above:
intersectBed -b a.bed -a w.bed -wa -wb | cut -f1-4,6- > b.bed

score 0 · Answer 4 · 2014-03-21

Well, nonetheless, here's my code...big apologies for stealing someone's thread, but a critic here would help tons...

from optparse import OptionParser
from os import listdir
import os, getpass, sys, string, re

if __name__ == '__main__':
    parser = OptionParser(usage='%prog ...')
    parser.add_option("-t", "--table_file", help="Variation table extracted from VCF file (i.e. 4th and 5th columns of VCF file)")

    (options, args) = parser.parse_args()

    tableFile = options.table_file

    bedHandle = open(tableFile,"rb")

    # This is where our results go in
    compareDict = {}

    totalDenominator = 0
    totalNumerator = 0 

    # Let's process the first line
    bedFirstLine = bedHandle.readline()
    bedFirstLine = string.strip(bedFirstLine)

    bedFirstTab = bedFirstLine.split("\t")

    # This value gets added to denominator and also to the numerator if the switch is equal to one
    firstValue = bedFirstTab[6]
    totalDenominator = firstValue

    firstSwitch = bedFirstTab[7]

    # If the switch is equal to one, we add the corresponding value to the numerator
    if int(firstSwitch) == 1:
        totalNumerator += firstValue

    # Our key to the dictionary consists of the chromosome number, first coordinate and second coordinate of the window, all concatenated by "_"
    compareKey = bedFirstTab[0] + " \t" + bedFirstTab[1] + " \t" + bedFirstTab[2]

    compareDict[compareKey] = (totalNumerator, totalDenominator)

    # We now process the subsequent lines by using the methods similar to the one that we applied to the first line
    for bedNextLines in bedHandle.readlines():
        bedNextLines = string.strip(bedNextLines)
        bedNextTab = bedNextLines.split("\t")
        keyCompare = bedNextTab[0] + " \t" + bedNextTab[1] + " \t" + bedNextTab[2]
        tempValue = bedNextTab[6]
        tempSwitch = bedNextTab[7]

        if compareDict.has_key(keyCompare):
            currEntryList = compareDict[keyCompare]
            currNumerator = int(currEntryList[0])
            currDenominator = int(currEntryList[1])

            currDenominator += int(tempValue)

            if int(tempSwitch) == 1:
                currNumerator += int(tempValue)

            compareDict[keyCompare] = (currNumerator, currDenominator)
        else:
            keyCompare = bedNextTab[0] + " \t" + bedNextTab[1] + " \t" + bedNextTab[2]
            nextValue =  bedNextTab[6]
            totalDenominator = int(nextValue)

            nextSwitch = int(bedNextTab[7])

            if int(nextSwitch) == 1:
                totalNumerator += int(nextValue)


            compareDict[keyCompare] = (totalNumerator, totalDenominator)

  for eachKey in compareDict.keys():
      print eachKey,"\t",compareDict[eachKey][0],"/",compareDict[eachKey][1]