I'm not sure there is any one tool that will do all of this for you. Perhaps some of the following might help.
Download gene annotations for mm10, then construct a list of windows upstream of or centered on TSSs, and overlap or associate them with your coordinates. To get the genes:
$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3 == "gene"' - \
| convert2bed -i gff - \
> gencode.vM25.genes.bed
Then:
$ bedmap --echo --echo-map-id --skip-unmapped windowsAroundMySequences.bed gencode.vM25.genes.bed > answer.bed
How you define windowsAroundMySequences.bed is up to you. You could do something like the following, say, to make strand-specific 5 kb proximal promoter windows:
$ awk -v FS="\t" -v OFS="\t" '($6=="+"){ s = ($2 > 5000) ? $2-5000 : 0; print $1, s, $2, $4, $5, $6 }($6=="-"){ print $1, $3, $3+5000, $4, $5, $6 }' mySequences.bed | sort-bed - > windowsAroundMySequences.bed
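If you would rather have windows centered on the TSSs of the annotated genes, a minimal sketch along the same lines follows. The function name tss_windows and the 5000-base default are my own placeholders, and the sketch assumes the first six columns of the input are standard BED (chromosome, start, end, ID, score, strand), as in convert2bed output. It clamps the window start at zero but does not clip windows to chromosome ends.

```shell
# Hypothetical helper: read a gene BED file on stdin and write a window of
# +/- W bases around each TSS (W is the first argument, default 5000).
tss_windows() {
  awk -v FS="\t" -v OFS="\t" -v w="${1:-5000}" '
    {
      # A "-" strand gene begins transcription at its end coordinate;
      # anything else is treated as "+" and begins at its start coordinate.
      tss = ($6 == "-") ? $3 : $2
      s = tss - w
      if (s < 0) s = 0    # do not run off the start of the chromosome
      print $1, s, tss + w, $4, $5, $6
    }'
}

# Possible usage (re-sort, since shifting coordinates can change the order):
#   tss_windows 5000 < gencode.vM25.genes.bed | sort-bed - > tssWindows.bed
```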
Depending on your mouse experiment, the Gorkin et al. fetal dataset housed on the epilogos site might be of interest for demarcating enhancers. A tabix-based data download is available from the top-right corner of the page, which you can query for regions where the enhancer chromatin states (columns 5-9 in the score data portion of the query result) have locally or globally high surprisal values.
Or perhaps get the database for mm9 at http://www.enhanceratlas.org/ and use liftOver to get mm10 enhancer regions. Once you have those, you can use bedops or bedmap to do overlap or association queries between enhancers and your windows-of-interest.
As to enrichment, you could use all genes as background, count overlap events over a subset of genes of interest and over the background, and use a hypergeometric test to calculate the probability of observing at least that many overlaps in the subset by chance. You'd need to decide which genes are interesting, however.
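As a rough sketch of that calculation, the awk function below computes the upper tail of the hypergeometric distribution from four counts. The function name and argument order are my own invention: N is the number of background genes, K the number of background genes with an overlap, n the number of genes of interest, and k the number of genes of interest with an overlap. It assumes the genes of interest are a subset of the background and that all four counts are sensible non-negative integers.

```shell
# Hypothetical helper: P(X >= k) for a hypergeometric distribution.
# Usage: hypergeom_upper_tail N K n k
hypergeom_upper_tail() {
  awk -v N="$1" -v K="$2" -v n="$3" -v k="$4" '
    # Log of the binomial coefficient C(a, b), summed term by term so that
    # large gene counts do not overflow.
    function lchoose(a, b,    i, s) {
      s = 0
      for (i = 1; i <= b; i++) s += log(a - b + i) - log(i)
      return s
    }
    BEGIN {
      hi = (n < K) ? n : K          # largest possible number of overlaps
      denom = lchoose(N, n)
      p = 0
      for (i = k; i <= hi; i++) {
        if (n - i > N - K) continue # impossible outcome, contributes nothing
        p += exp(lchoose(K, i) + lchoose(N - K, n - i) - denom)
      }
      printf "%.6g\n", p
    }'
}
```

A small p-value from something like `hypergeom_upper_tail 55000 1200 300 25` (made-up counts, for illustration only) would suggest more overlaps among your genes of interest than the background rate predicts.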
Or perhaps you'd synthesize a population of sequences with a similar distribution to what you are starting with, and you would count how many times such random sequences overlap your TSS windows, to measure the probability that your specific sequences overlap their TSSs by chance.
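However you generate the random sequences, the last step could look something like this sketch. It assumes you have already written one overlap count per random trial, one per line, and it takes the observed count as its argument. The function name is hypothetical, and the add-one correction is just one common convention that keeps the estimate from being exactly zero.

```shell
# Hypothetical helper: empirical p-value from per-trial overlap counts on stdin.
# Usage: empirical_pvalue OBSERVED < randomCounts.txt
empirical_pvalue() {
  awk -v obs="$1" '
    # Count the trials, and how many of them overlap at least as often as
    # the real sequences did; blank lines are skipped.
    NF { n++; if ($1 + 0 >= obs + 0) hits++ }
    END { printf "%.6g\n", (hits + 1) / (n + 1) }'
}
```

With only a handful of trials the estimate is coarse, so you would probably want at least a thousand or so random sets before reading much into it.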