Looking at your conditions, it sounds like you always want the closest downstream element, regardless of what's upstream or how near or far away that upstream element is.
By default, the BEDOPS closest-features
application will report the full characteristics of both the nearest leftmost ("upstream") and rightmost ("downstream") elements.
You can therefore feed the output from this application to awk
(or another interpreted language) to just report the TSS element and whatever nearest gene is downstream from it.
As a for instance, first sort your input files:
$ sort-bed unsorted.TSS.bed > TSS.bed
$ sort-bed unsorted.genes.bed > genes.bed
Then run the two sorted BED files through closest-features
, piping the results to awk
with the correct field separator to pick the downstream gene:
$ closest-features TSS.bed genes.bed \
| awk FS="|" '{ \
tssElement = $1; \
upstreamGene = $2; \
downstreamGene = $3; \
print tssElement"|"downstreamGene; \
}' - \
> answer.bed
The file answer.bed
will be a sorted BED file that contains results in the following format:
[ tss-1 ] | [ nearest-downstream-gene-to-tss-1 ]
[ tss-2 ] | [ nearest-downstream-gene-to-tss-2 ]
...
[ tss-n ] | [ nearest-downstream-gene-to-tss-n ]
If you have a different condition for picking a nearest element, you could set up those rules in the awk
block above. (Feel free to modify your question if you had something else in mind.)
To test features on the basis of strand, you could modify the awk
block as follows:
$ closest-features TSS.bed genes.bed \
| awk FS="|" '{ \
tssRegion = $1; \
upstreamGene = $2; \
downstreamGene = $3; \
split(tssRegion, tssElements, "\t"); \
split(upstreamGene, upstreamGeneElements, "\t"); \
split(downstreamGene, downstreamGeneElements, "\t"); \
tssStart = tssElements[1]; \
tssStop = tssElements[2]; \
upstreamGeneStart = upstreamGeneElements[1]; \
upstreamGeneStop = upstreamGeneElements[2]; \
upstreamGeneStrand = upstreamGeneElements[4]; \
downstreamGeneStart = downstreamGeneElements[1]; \
downstreamGeneStop = downstreamGeneElements[2]; \
downstreamGeneStrand = downstreamGeneElements[4]; \
trueUpstreamGeneDistance = 0; \
trueDownstreamGeneDistance = 0; \
if (upstreamGeneStrand == "+") { \
trueUpstreamGeneDistance = tssStart - upstreamGeneStart; \
} \
else { \
trueUpstreamGeneDistance = tssStart - upstreamGeneStop; \
} \
if (downstreamGeneStrand == "+") { \
trueDownstreamGeneDistance = downstreamGeneStart - tssStop; \
} \
else { \
trueDownstreamGeneDistance = downstreamGeneStop - tssStop; \
} \
if (trueUpstreamGeneDistance > trueDownstreamGeneDistance) { \
print tssRegion"|"downstreamGene; \
} \
else { \
print tssRegion"|"upstreamGene; \
} \
}' - \
> answer.bed
Hmm, I had looked into closest-features, but this method won't work on:
TSS.bed
genes.bed
The result associates TSSE with geneF (instead of geneE, as desired).
However, I think by combining Istvan's answer and your answer, I believe I need to use bedops closest-features to get the nearest genes and then roll my own script to pick correctly between the two nearest genes.
Neither of those two inputs are correct BED (per specification). You'll probably want to fix them before passing them to apps or scripts.
Oops, fixed now. (I'll need to add "chr" in, run bedops, strip out chr later so I can combine the data with my vcf) :P But... that's a separate issue.
It's not clear to me what your rules are, such that
geneE
would be picked. But you can certainly add logic inside theawk
block (or whichever language you prefer instead) to pick an element based on rules of your choice.Since TSSE has no strand info, it could be acting on either the + or - strand? TSSE is upstream of geneE on the - strand and upstream of geneF on the + strand, but it is closer to geneE on the - strand than it is to geneF on the + strand.
That's easy, then. You use a
split()
call ontssElement
,upstreamGene
anddownstreamGene
within theawk
block (the TSS and both genes are delimited with tab characters), look at the strand and start/stop coordinate values of each element, and then decide accordingly and print.Thanks for this solution, very helpful. Though it seems like awk cannot take the vertical bar (
|
) as a field separator. I get the following error when I try to run the first simple script you've written above:Is this just a matter of having to change how the closest-features output is delimited?
You might try a
BEGIN { FS = "|"; }
block in theawk
script. You could also certainly choose a different delimiter inclosest-features
.Dear Alex, I have Active Enhancer peaks and I want to assign each peak to nearest TSS (downstream or upstream). My sorted TSS.bed looks like:
My sorted genes.bed (active enhancer peak file) looks like:
I tried closest-features TSS.bed genes.bed > answer.bed and result are:
So as you can see 1 entry from genes.bed is assigned against multiple entries of TSS.bed whereas what I want is 1-to-1 mapping. Although it is possible that 2 or 3 different TSS are close to one gene but in my case initial 25 TSS have the same gene and I think I am doing something wrong here (I actually don't know how to modify the awk for my case). Thank you.
If you want the output of nearest TSS to gene/peak, then reverse the order of arguments:
closest-features genes.bed TSS.bed
:In the reversed case, your
input-file
becomesgenes.bed
andquery-file
becomesTSS.bed
. Each gene/peak gets two elements fromTSS.bed
that fall nearest to it, upstream and downstream.In your first example, there are two peaks shown for each TSS. The first is
NA
because there are no peaks upstream of the first few TSS sites; the nearest peaks are all downstream. In any case, it sounds like you want the other result: the nearest TSSs to each peak, so reversing the order of file arguments should address that.