If you have the sequences in UCSC BED format, you can use BEDOPS convert2bed to convert gene annotations to BED, bedops to make a file of gene promoters (say, a region 500 nt upstream of the gene TSS), and bedmap to associate sequences with the promoters of genes.
For example, to get some gene annotations and write them to a BED-formatted file:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.basic.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3 == "gene"' \
| convert2bed -i gff - \
> genes.bed
Or replace this step with whatever annotation source you prefer for your reference genome.
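Make a file of promoter regions from the genes. The reverse-strand intervals start at the gene end, so their order can differ from the start-sorted order of genes.bed; they are re-sorted with sort-bed before being passed to bedops:
$ awk -v OFS="\t" '($6 == "+") { print $1, $2, ($2+1), $4; }' genes.bed | bedops --range -500:0 --everything - > promoters.for.bed
$ awk -v OFS="\t" '($6 == "-") { print $1, ($3 - 1), $3, $4; }' genes.bed | sort-bed - | bedops --range 0:500 --everything - > promoters.rev.bed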
$ bedops --everything promoters.for.bed promoters.rev.bed > promoters.bed
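To sanity-check the coordinate arithmetic, the two strand-specific steps can be sketched as a single awk pass over a toy input. This is only an illustration and not part of BEDOPS: the function name `promoter_windows`, the gene names, and the coordinates are made up, and it assumes six-column BED input with the strand in column 6. Its output is not guaranteed to be sorted, so it would still need to go through sort-bed before being used with bedmap.

```shell
# Hypothetical helper (not a BEDOPS tool): for each gene in a six-column
# BED stream, print the 500 nt upstream of the TSS plus the TSS base itself.
promoter_windows() {
  awk -v OFS="\t" -v up=500 '
    # Forward strand: TSS is the start coordinate; clamp the window at zero.
    $6 == "+" { s = $2 - up; if (s < 0) s = 0; print $1, s, $2 + 1, $4 }
    # Reverse strand: TSS is the last base of the interval.
    $6 == "-" { print $1, $3 - 1, $3 + up, $4 }
  '
}

# Toy input: one forward-strand and one reverse-strand gene (made-up values).
printf 'chr1\t1000\t2000\tgeneA\t.\t+\nchr1\t5000\t6000\tgeneB\t.\t-\n' \
  | promoter_windows
```

For the toy input this gives the window 500-1001 for geneA and 5999-6500 for geneB, which matches what the awk and bedops --range commands above compute for the same intervals.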
Sort your BED-formatted sequences:
$ sort-bed sequences.unsorted.bed > sequences.bed
Map sequences to gene promoters:
$ bedmap --echo --echo-map --delim '\t' promoters.bed sequences.bed > answer.bed
Each line of the file answer.bed contains a promoter region, its associated gene ID, and any sequences that overlap the gene's promoter.
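As a rough sketch of how such a file could be summarized, the helper below prints each gene ID with the number of sequences overlapping its promoter. It assumes the promoter part of each line has exactly four columns (chromosome, start, end, gene ID), that multiple mapped sequences are joined with bedmap's default `;` separator, and that no ID contains a `;`. The function name `count_overlaps` and the sample lines are hypothetical.

```shell
# Hypothetical helper: summarize bedmap --echo --echo-map --delim '\t' output
# as "gene ID <tab> number of overlapping sequences".
count_overlaps() {
  awk -F'\t' -v OFS="\t" '{
    n = 0
    if (NF > 4 && $5 != "") {
      # Mapped elements are joined by ";", so hits = separators + 1.
      tmp = $0
      n = gsub(/;/, ";", tmp) + 1
    }
    print $4, n
  }'
}

# Made-up sample: geneA overlaps two sequences, geneB overlaps none.
printf 'chr1\t500\t1001\tgeneA\tchr1\t600\t650\tseq1;chr1\t700\t750\tseq2\nchr1\t5999\t6500\tgeneB\t\n' \
  | count_overlaps
```

On the real file this would be run as `count_overlaps < answer.bed`. For the sample lines it reports 2 for geneA and 0 for geneB.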