I do see a one-base snippet:
% echo -e "ND_123.1\t74801\t76802" | bedops --chop 1000 -
ND_123.1 74801 75801
ND_123.1 75801 76801
ND_123.1 76801 76802
If we treat ND_123.1:74801-76802
as a zero-based-indexed interval, then it is 2001 bases long:
% calc '76802-74801'
2001
It won't divide into 1kb windows neatly, so something is up with how this interval is generated.
One cheap way to fix this is to use -x
with --chop <N>
. This extra option filters out intervals that are not N
bases long:
% echo -e "ND_123.1\t74801\t76802" | bedops --chop 1000 -x -
ND_123.1 74801 75801
ND_123.1 75801 76801
But that is hiding the problem. I wouldn't use this.
Another is to perhaps pre-process these intervals so that they are sized in 1k units:
% echo -e "ND_123.1\t74801\t76802" | awk -v FS="\t" -v OFS="\t" '{ print $1, $2, ($3 - 1) }' | bedops --chop 1000 -
ND_123.1 74801 75801
ND_123.1 75801 76801
But that just hides the problem in a different way.
Another way would be to look at what the awk
part of the one-liner is doing:
% gff2bed < genes.gff | awk -v FS="\t" -v OFS="\t" '($6 == "+"){ print $1, ($2 - 1), $2, $4, $5, $6; }'
The gff2bed
script changes GFF's one-based index to zero-based in the BED-formatted output. So the above should be a correct way of getting a set of single-base, forward-stranded TSS intervals.
In any case, let's say our TSS of interest is at 0-indexed position ND_123.1:76801:76802
. This TSS starts at position 76801
on contig `ND_123.1, is one base in length, and is oriented on the forward strand.
We want the 2kb-sized window upstream of this position, oriented on the forward strand.
There are two different sets of windows we can construct.
One window does not include the TSS interval:
% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -2000:-1 -
ND_123.1 74801 76801
This is the solution you post in your question:
... | bedops --range -$WINDOW:-1 --everything - | ...
The other window includes the TSS interval:
% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -1999:0 -
ND_123.1 74802 76802
Either way, both of these windows are 2kb long. Chopping either of these by 1kb increments gives exactly two 1kb windows:
% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -1999:0 - | bedops --chop 1000 -
ND_123.1 74802 75802
ND_123.1 75802 76802
% echo -e "ND_123.1\t76801\t76802" | bedops -u --range -2000:-1 - | bedops --chop 1000 -
ND_123.1 74801 75801
ND_123.1 75801 76801
So which of these two solutions you pick depends upon whether you want the TSS interval in the output, or not. That will depend on what you're doing with these 1kb windows, probably.
Note that the --range
operation will need to be flipped in the case of reverse-strand oriented TSSs. If we treat the example as reverse strand-oriented, then we might do either of the following:
% echo -e "ND_123.1\t76801\t76802" | bedops -u --range 1:2000 -
ND_123.1 76802 78802
% echo -e "ND_123.1\t76801\t76802" | bedops -u --range 0:1999 -
ND_123.1 76801 78801
Which parameters are used depends on whether you want to exclude or include the TSS, respectively.
Tl;dr: If what I posted in a previous answer about making specifically-sized windows is different from the above, then my previous answer is incorrect and should not be used. However, the answer that greyman refers to in his comment will include the TSS and genomic space both upstream and downstream of the TSS. It might be correct in that use case, such as for calculating aggregate signal over the TSS, where it can make sense to define 400 bases upstream, one base for the TSS, and 50 bases downstream. In that case, one might line up signal vectors and take the mean (or some other statistic), and it doesn't really matter in that case that there is an "extra" base. (And, in fact, keeping the TSS in helps with plotting.) What matters here — for your use case — is that you need a precise size of windows for chopping up, so your solution will work.
Is filtering out small intervals (like 1bp sequence) help?
And I think in your code, for the forward strand you need to subtract 1 from end of the peak
Hi , I would like to use bedops to make a standardize workflow.
In this post I saw the author of bedops did this before merging both files.