Counting read orientation between ranges in a bed file
2
0
Entering edit mode
8.2 years ago
EVR ▴ 610

HI,

I have two bed file like follows:

bed_A

scaffold11296   36365   36414
scaffold11296   36471   36526

bed_B

    scaffold11296   36302   36334   -
    scaffold11296   36303   36334   +
    scaffold11296   36339   36370   +
    scaffold11296   36340   36369   +
    scaffold11296   36365   36395   -
    scaffold11296   36366   36394   -
    scaffold11296   36367   36395   -
    scaffold11296   36368   36395   -
    scaffold11296   36394   36414   -
    scaffold11296   36471   36502   +
    scaffold11296   36483   36516   +
    scaffold11296   36495   36526   +
    scaffold11296   40892   40909   +
    scaffold11296   40892   40909   +
    scaffold11296   40892   40909   +
    scaffold11296   40892   40909   +

I would like to count "+" orientated reads, if any, between the ranges specified in file "bed_A" file. I expect a output like follows:

    scaffold11296   36365   36414      0
    scaffold11296   36471   36526      3

Any guidance would be really useful. thanks in advance

BED bedtools bedops • 1.5k views
ADD COMMENT
1
Entering edit mode

One way, via bedmap and bedops:

$ awk '$4=="+"' B.bed | bedmap --echo --count --delim '\t' A.bed - > count.for.bed

Another way, via a bash script that just uses bedops:

#!/bin/sh

referenceFn=A.bed
mapFn=B.bed
answerFn=count.for.bed

while read line
do
    echo -e ${line} \
        | bedops --element-of 1 ${mapFn} - \
        | awk -vline=${line} ' \
            BEGIN { plus = 0; } \
            { if ($4=="+") { plus++; } } \
            END { print line"\t"plus; }' - \
        > ${answerFn}
done < ${referenceFn}

The first approach via bedmap will likely run somewhat faster. It is also (to my eyes) a lot more readable.

On the other hand, the second approach that works on each line of the reference file (A.bed) gives you more control. With this approach, for instance, you could count overlaps of both forward- and reverse-stranded elements in one pass (as noted in the answer to your forum post):

#!/bin/sh

referenceFn=A.bed
mapFn=B.bed
answerFn=count.both.bed

while read line
do
    echo -e ${line} \
        | bedops --element-of 1 ${mapFn} - \
        | awk -vline=${line} ' \
            BEGIN { plus = 0; minus = 0; } \
            { if ($4=="+") { plus++; } else { minus++; } } \
            END { print line"\t"plus"\t"minus; }' - \
        > ${answerFn}
done < ${referenceFn}

Yet another UNIX-based way to do this would be:

$ awk '$4=="+"' B.bed | bedmap --count A.bed - > count.for.txt
$ awk '$4=="-"' B.bed | bedmap --count A.bed - > count.rev.txt
$ paste A.bed count.for.txt count.rev.txt > count.both.bed
ADD REPLY
0
Entering edit mode

But you example output does not seems correct:

scaffold11296 36365 36414 overlaps with two + regions from file B.

scaffold11296   36339   36370   +
scaffold11296   36340   36369   +
ADD REPLY
0
Entering edit mode
8.2 years ago

The simplest approach is probably to generate a version of bed_B that contains only the '+' intervals (e.g., using grep), then use BEDtools 'annotate' with the -count flag.

ADD COMMENT
0
Entering edit mode

Good idea. But it would be also great if any other method to achieve what I want.

ADD REPLY
0
Entering edit mode

Did this approach not produce the intended result? Or are you trying to do something more sophisticated/different than your original question?

ADD REPLY
0
Entering edit mode
8.2 years ago
intersectBed -a B.bed -b A.bed -wa -wb |
    groupBy -g 1,4,6,7 -c 4 -o count | 
    awk -v OFS="\t" '{ print $1,$3,$4,$2,$5}'

Output:

scaffold11296   36365   36414   +   2
scaffold11296   36365   36414   -   5
scaffold11296   36471   36526   +   3

You can restrict the strand to be printed in awk command.

ADD COMMENT

Login before adding your answer.

Traffic: 1787 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6