How to extend variable length intervals to the same final length?
1
1
Entering edit mode
7.7 years ago
Ian 6.1k

I have a large set of footprint intervals that range from 11 to 25bp For the purpose of motif discovery I would like to extend all intervals to, for example, 50bp. Intervals should be extended equally from both sides. I would usually use 'bedtools slop' for fixed length intervals, but this would not appear to work with variable length.

It would be great if anyone could advise me how to use bedtools, or something else. I have a nagging feeling I am missing something obvious, so apologies in advance!

bedtools slop intervals footprint • 2.8k views
ADD COMMENT
4
Entering edit mode
7.7 years ago

Here's a way that I think should extend both ends of BED elements to the desired target length:

$ TARGET_LENGTH=50
$ awk -vF=${TARGET_LENGTH} 'BEGIN{ OFS="\t"; }{ len=$3-$2; diff=F-len; flank=int(diff/2); upflank=downflank=flank; if (diff%2==1) { downflank++; }; print $1, $2-upflank, $3+downflank; }' in.bed | sort-bed - > out.bed

Non-even length elements or a non-even target length will require flank lengths that are unequal. Sounds like this is not a problem.

You might adjust the logic to randomly pick which of upflank or downflank to decrement or increment in this case, so that you don't impart a bias from this adjustment (esp. if original elements are stranded, like footprints that will ultimately be mapped to TF binding sites or other stranded elements), e.g.:

$ TARGET_LENGTH=50
$ awk -vF=${TARGET_LENGTH} 'BEGIN{ OFS="\t"; }{ len=$3-$2; diff=F-len; flank=int(diff/2); upflank=downflank=flank; if (diff%2==1) { if (rand() >= 0.5) { downflank++; } else { upflank--; } }; print $1, $2-upflank, $3+downflank; }' in.bed | sort-bed - > out.bed
ADD COMMENT
0
Entering edit mode

Thank you for your answer! I was going to ask how it handles odd lengths. It is OK if one side has an extra base, as long as the final length is the same.

ADD REPLY
0
Entering edit mode

Thanks for the addition. After discussing this with a colleague this morning it was pointed out that finding the mid-point of each region and then extending out works equally well. I knew I had missed something!

ADD REPLY
0
Entering edit mode

Yes, either way gets you to the same answer, but you'd still need to shift the midpoint up or down a base when dividing an even-numbered length in half.

ADD REPLY
0
Entering edit mode

Probably an obvious answer Im not thinking of, but why does this code sometimes create intervals slightly outside the target range of 50, like 49 or 51? Thanks!

ADD REPLY
0
Entering edit mode

Dividing diff by 2 when it is an odd number, probably. The int() function chops off the fractional part.

Sounds like this is not a problem for the original question, but might be an issue for you? This is tough because genomic intervals aren't usually fractional, so a decision would have to be made to figure out what to do.

As one approach, perhaps just check first if the total interval length would be one more or less than the target length, and subtract or add one to either the left or right flank, as needed (perhaps flipping a coin to decide which) to get to the target length.

ADD REPLY

Login before adding your answer.

Traffic: 1674 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6