You can apply asymmetric padding around strand-separated TSSs with bedops --range:
$ gff2bed < genes.gff > genes.bed
$ awk -vFS="\t" -vOFS="\t" '($6 == "+"){ print $1, ($2 - 1), $2, $4, $5, $6; }' genes.bed \
| bedops --range -400:50 --everything - \
> promoters.for.bed
$ awk -vFS="\t" -vOFS="\t" '($6 == "-"){ print $1, $3, ($3 + 1), $4, $5, $6; }' genes.bed \
| sort-bed - \
| bedops --range -50:400 --everything - \
> promoters.rev.bed
$ bedops --everything promoters.for.bed promoters.rev.bed > promoters.bed
One advantage of using a toolkit like BEDOPS over awk alone is that bedops deals with the left-edge case for you: where padding would push the upstream edge below zero, you do not have to write your own bounds check.
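To make the left-edge case concrete, here is a minimal sketch of the padding done by hand in awk, including the bounds check you would otherwise have to remember. The function name pad_tss and its two arguments (upstream and downstream distance) are hypothetical, chosen only for illustration, and the input is assumed to be six-column, single-base TSS elements like those produced by the awk steps above.

```shell
# Hypothetical helper: pad single-base TSS elements by hand, strand-aware,
# and clamp negative starts to zero (the manual left-edge bounds check).
# Usage: pad_tss UP DOWN < tss.bed
pad_tss() {
  awk -v FS="\t" -v OFS="\t" -v up="$1" -v down="$2" '
    {
      if ($6 == "+") { start = $2 - up;   end = $3 + down; }  # upstream is to the left
      else           { start = $2 - down; end = $3 + up;   }  # upstream is to the right
      if (start < 0) start = 0;                               # left-edge bounds check
      print $1, start, end, $4, $5, $6;
    }'
}

# A TSS 99 bases into chr1 cannot be padded 400 bases upstream
printf 'chr1\t99\t100\tgeneA\t.\t+\n' | pad_tss 400 50
```

Without the clamp, the example element would come out with a start of -301, which is not a valid BED coordinate.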
The right edge rarely matters, although you could filter results against a genome-specific bounds file, generated with the UCSC Kent Utilities tool fetchChromSizes, to remove elements that extend past the chromosome ends. For instance, for hg38:
$ fetchChromSizes hg38 \
| awk -vOFS="\t" '{ print $1, "0", $2; }' \
| grep -vE '_' \
| sort-bed - \
> hg38.nuc.bed
$ bedops --element-of 100% promoters.bed hg38.nuc.bed > promoters.filtered.bed
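If you want to see what that right-edge filter amounts to, the following is a rough awk-only sketch of the same idea. It assumes a two-column, tab-delimited file of chromosome name and size (the layout fetchChromSizes writes); the function name filter_in_bounds is hypothetical. It is meant as an illustration of the logic, not a replacement for the bedops call.

```shell
# Hypothetical helper: keep only BED elements that lie entirely within
# chromosome bounds, given a two-column "name<TAB>size" file.
# Usage: filter_in_bounds chrom.sizes < promoters.bed
filter_in_bounds() {
  awk -v FS="\t" -v OFS="\t" '
    NR == FNR { size[$1] = $2 + 0; next }   # first file: chromosome sizes
    ($1 in size) && ($2 + 0 >= 0) && ($3 + 0 <= size[$1])
  ' "$1" -
}
```

Elements on chromosomes missing from the sizes file are dropped as well, which mirrors what happens to promoters on the underscore-named scaffolds that the grep step removes from hg38.nuc.bed.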
BEDOPS is (purposefully) agnostic about genomes, which keep changing. Doing this manually is a little more work, but then you know exactly how your data was generated, which leads to fewer questions about results.
You would then run your BED-formatted promoters (filtered or not) through something like scripted calls to samtools faidx to convert them to strand-aware, FASTA-formatted sequence, using your assembly of choice. I outline one such approach in my Stack Exchange answer here: https://bioinformatics.stackexchange.com/a/5374/776
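As a rough sketch of what such scripted calls might look like (not necessarily the approach in the linked answer): the two function names below are hypothetical, the FASTA path is whatever assembly you pass in, and the reverse-strand branch assumes a samtools recent enough to support the -i (--reverse-complement) option of faidx.

```shell
# Hypothetical helper: turn six-column BED lines into samtools regions.
# BED is 0-based, half-open; samtools regions are 1-based, inclusive,
# so only the start coordinate shifts by one.
bed_to_region() {
  awk -v FS="\t" -v OFS="\t" '{ print $1 ":" ($2 + 1) "-" $3, $6 }'
}

# Hypothetical wrapper: print strand-aware FASTA for each BED element.
# Usage: bed_to_fasta genome.fa < promoters.bed > promoters.fa
bed_to_fasta() {
  bed_to_region | while IFS="$(printf '\t')" read -r region strand; do
    if [ "$strand" = "-" ]; then
      # reverse-complement elements on the minus strand
      samtools faidx -i "$1" "$region" < /dev/null
    else
      samtools faidx "$1" "$region" < /dev/null
    fi
  done
}
```

Calling samtools once per element is slow for large inputs, but it keeps the strand handling easy to follow and audit.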
You can define your promoters any way that you like. I've seen literature go 1kb to 5kb upstream and 0 to 500nt downstream of the TSS. It is up to you and the problem you are trying to explore.
Polyadenylation is a post-transcriptional modification, which means the genome does not contain a poly(A) tail for each transcript. Poly(A) tails are added to the transcript by dedicated enzymes rather than transcribed directly from the genome. It therefore cannot be the cause of any nucleotide enrichment you see. Maybe some more details about the motif would help.
The title of this question is misleading given the contents of the post. You may want to change it to reflect the actual question.
It sounds like the question you have is about interval selection being of arbitrary size.