Perhaps the following script (with modifications specific to the genes you want to use) will help.
This bash
script combines BEDOPS bedmap
, bedops
, and gtf2bed
with awk
processing steps to conditionally process promoter regions, where there are overlaps with gene annotations.
#!/bin/bash
GENES=Arabidopsis_thaliana.TAIR10.37.genes.bed
wget -qO- ftp://ftp.ensemblgenomes.org/pub/release-37/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.37.gtf.gz | gunzip -c | awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' | gtf2bed - | awk '($8=="gene")' > ${GENES}
awk -vFS="\t" -vOFS="\t" '($6=="+")' ${GENES} > ${GENES}.for
awk -vFS="\t" -vOFS="\t" '($6=="-")' ${GENES} > ${GENES}.rev
TSS=Arabidopsis_thaliana.TAIR10.37.tss.bed
awk -vFS="\t" -vOFS="\t" '{ print $1, ($2-1), $2, $4, $5, $6, $7, $8, $9, $10 }' ${GENES}.for > ${TSS}.for
awk -vFS="\t" -vOFS="\t" '{ print $1, $3, ($3+1), $4, $5, $6, $7, $8, $9, $10 }' ${GENES}.rev > ${TSS}.rev
WINDOW=2000
PROMOTERS=Arabidopsis_thaliana.TAIR10.37.promoters.bed
bedops --range -${WINDOW}:0 --everything ${TSS}.for > ${PROMOTERS}.for
bedops --range 0:${WINDOW} --everything ${TSS}.rev > ${PROMOTERS}.rev
UPSTREAM_FILTERED_PROMOTERS=Arabidopsis_thaliana.TAIR10.37.promoters.upstreamFiltered.bed
bedmap --count --echo --echo-map-range ${PROMOTERS}.for ${GENES}.for | awk -vFS="|" -vOFS="\t" '{ if ($1==0) { print $2; } else { m=split($2,a,"\t"); tssStart=a[2]; tssEnd=a[3]; n=split($3,b,"\t"); geneEnd=b[3]; if ((geneEnd < tssEnd) && (geneEnd >= tssStart)) { print a[1], geneEnd, tssEnd, a[4], a[5], a[6], a[7], a[8], a[9], a[10]; } } }' > ${UPSTREAM_FILTERED_PROMOTERS}.for
bedmap --count --echo --echo-map-range ${PROMOTERS}.rev ${GENES}.rev | awk -vFS="|" -vOFS="\t" '{ if ($1==0) { print $2; } else { m=split($2,a,"\t"); tssStart=a[2]; tssEnd=a[3]; n=split($3,b,"\t"); geneStart=b[2]; if ((geneStart >= tssStart) && (geneStart < tssEnd)) { print a[1], tssStart, geneStart, a[4], a[5], a[6], a[7], a[8], a[9], a[10]; } } }' > ${UPSTREAM_FILTERED_PROMOTERS}.rev
#
# cleanup
#
bedops --everything ${UPSTREAM_FILTERED_PROMOTERS}.for ${UPSTREAM_FILTERED_PROMOTERS}.rev > ${UPSTREAM_FILTERED_PROMOTERS}
rm ${GENES}.for ${GENES}.rev ${TSS}.for ${TSS}.rev ${PROMOTERS}.for ${PROMOTERS}.rev ${UPSTREAM_FILTERED_PROMOTERS}.for ${UPSTREAM_FILTERED_PROMOTERS}.rev
The input to this pipeline is a BED file called ${GENES}
. You could replace the Ensembl A. thaliana annotations with any BED file of genes.
Note: This input BED file comes from a GTF file and so has ten columns. If this input BED file has fewer than ten columns, then the awk
statements will need adjustment to report fewer columns.
The output is a set of promoter regions in ${UPSTREAM_FILTERED_PROMOTERS}
, where the upstream region of the 2kb promoter is clipped by any overlapping genes, and the gene is extended to the upstream-most edge of the promoter.
The definition of upstream here is strand-specific. So forward- and reverse-stranded annotations will have respective directionality of upstream.
This script breaks everything down by strand and by annotation category. You could comment out the cleanup
section if you want to investigate the contents of each step, to verify that this works for your problem.
Thanks for your reply. I know how to remove any coding region from the upstream regions, but the problem is that I only want the region from the transcription start site up to the closest upstream gene, and nothing else upstream. I would need some way to "anchor" the feature to the TSS but I do not know how to do that.
example:
upstream region:
overlapping gene:
If I remove it, I am left with:
And I want to retrieve only:
Do you have a situation where a 2kb window upstream of a gene can overlap another 2kb window from another gene?
Yes, that is my problem. Maybe it's not so common in human genome, but I can find examples very easily in Arabidopsis.
Okay, I provided a solution that should work reasonably generically with any set of annotations, including yours. Feel free to give it a shot if you like.