Question

Counting read orientation between ranges in a bed file

0

Entering edit mode

8.1 years ago

EVR ▴ 610

HI,

I have two bed file like follows:

bed_A

scaffold11296   36365   36414
scaffold11296   36471   36526

bed_B

    scaffold11296   36302   36334   -
    scaffold11296   36303   36334   +
    scaffold11296   36339   36370   +
    scaffold11296   36340   36369   +
    scaffold11296   36365   36395   -
    scaffold11296   36366   36394   -
    scaffold11296   36367   36395   -
    scaffold11296   36368   36395   -
    scaffold11296   36394   36414   -
    scaffold11296   36471   36502   +
    scaffold11296   36483   36516   +
    scaffold11296   36495   36526   +
    scaffold11296   40892   40909   +
    scaffold11296   40892   40909   +
    scaffold11296   40892   40909   +
    scaffold11296   40892   40909   +

I would like to count "+" orientated reads, if any, between the ranges specified in file "bed_A" file. I expect a output like follows:

    scaffold11296   36365   36414      0
    scaffold11296   36471   36526      3

Any guidance would be really useful. thanks in advance

BED bedtools bedops • 1.5k views

ADD COMMENT • link updated 8.1 years ago by GouthamAtla 12k • written 8.1 years ago by EVR ▴ 610

1

Entering edit mode

One way, via bedmap and bedops:

$ awk '$4=="+"' B.bed | bedmap --echo --count --delim '\t' A.bed - > count.for.bed

Another way, via a bash script that just uses bedops:

#!/bin/sh

referenceFn=A.bed
mapFn=B.bed
answerFn=count.for.bed

while read line
do
    echo -e ${line} \
        | bedops --element-of 1 ${mapFn} - \
        | awk -vline=${line} ' \
            BEGIN { plus = 0; } \
            { if ($4=="+") { plus++; } } \
            END { print line"\t"plus; }' - \
        > ${answerFn}
done < ${referenceFn}

The first approach via bedmap will likely run somewhat faster. It is also (to my eyes) a lot more readable.

On the other hand, the second approach that works on each line of the reference file (A.bed) gives you more control. With this approach, for instance, you could count overlaps of both forward- and reverse-stranded elements in one pass (as noted in the answer to your forum post):

#!/bin/sh

referenceFn=A.bed
mapFn=B.bed
answerFn=count.both.bed

while read line
do
    echo -e ${line} \
        | bedops --element-of 1 ${mapFn} - \
        | awk -vline=${line} ' \
            BEGIN { plus = 0; minus = 0; } \
            { if ($4=="+") { plus++; } else { minus++; } } \
            END { print line"\t"plus"\t"minus; }' - \
        > ${answerFn}
done < ${referenceFn}

Yet another UNIX-based way to do this would be:

$ awk '$4=="+"' B.bed | bedmap --count A.bed - > count.for.txt
$ awk '$4=="-"' B.bed | bedmap --count A.bed - > count.rev.txt
$ paste A.bed count.for.txt count.rev.txt > count.both.bed

ADD REPLY • link 8.1 years ago by Alex Reynolds 36k

0

Entering edit mode

But you example output does not seems correct:

scaffold11296 36365 36414 overlaps with two + regions from file B.

scaffold11296   36339   36370   +
scaffold11296   36340   36369   +

ADD REPLY • link 8.1 years ago by GouthamAtla 12k

score 0 · Answer 1 · 2016-10-19

0

Entering edit mode

8.1 years ago

harold.smith.tarheel ★ 5.0k

The simplest approach is probably to generate a version of bed_B that contains only the '+' intervals (e.g., using grep), then use BEDtools 'annotate' with the -count flag.

ADD COMMENT • link 8.1 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

Good idea. But it would be also great if any other method to achieve what I want.

ADD REPLY • link 8.1 years ago by EVR ▴ 610

0

Entering edit mode

Did this approach not produce the intended result? Or are you trying to do something more sophisticated/different than your original question?

ADD REPLY • link 8.1 years ago by harold.smith.tarheel ★ 5.0k

score 0 · Answer 2 · 2016-10-20

intersectBed -a B.bed -b A.bed -wa -wb |
    groupBy -g 1,4,6,7 -c 4 -o count | 
    awk -v OFS="\t" '{ print $1,$3,$4,$2,$5}'

Output:

scaffold11296   36365   36414   +   2
scaffold11296   36365   36414   -   5
scaffold11296   36471   36526   +   3

You can restrict the strand to be printed in awk command.