Question

Given a set of genomic sequences, how to find potentially enriched genes?

3.1 years ago

chipolino ▴ 150

Hello!

I have around 2,000 mouse genomic sequences (mm10). They all have the same length (200 bp) and I know their coordinates. I would like to know the following things:

- if these sequences overlap any functional elements (TSS regions, enhancers, etc.);
- if these sequences are enriched in TSS regions, then what genes do they correspond to?

It would be amazing if there was a tool or any other simple way of doing that, thanks.

enrichment genes mouse • 1.1k views

updated 3.1 years ago by EagleEye 7.6k • written 3.1 years ago by chipolino ▴ 150

score 1 · Answer 1 · 2021-10-26

I'm not sure there is any one tool that will do all of this for you. Perhaps some of the following might help.

After downloading genes for mm10, construct a list of windows upstream or centered on TSSs, and overlap them or associate them with your coordinates:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' - \
    | convert2bed -i gff - \
    > gencode.vM25.genes.bed

Then:

$ bedmap --echo --echo-map-id --skip-unmapped windowsAroundMySequences.bed gencode.vM25.genes.bed > answer.bed

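How you define windowsAroundMySequences.bed is up to you. You could do something like the following, say, to make strand-specific 5kb proximal promoter windows. This assumes mySequences.bed is a six-column BED file with the strand in column 6. Start coordinates are clamped at zero, and the result is passed through sort-bed because bedmap expects sorted input:

$ awk -v FS="\t" -v OFS="\t" '
    ($6=="+"){ s = ($2 > 5000) ? $2 - 5000 : 0; print $1, s, $2, $4, $5, $6 }
    ($6=="-"){ print $1, $3, $3+5000, $4, $5, $6 }
  ' mySequences.bed \
    | sort-bed - \
    > windowsAroundMySequences.bed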
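
The command above builds upstream windows. For the "centered on TSSs" variant mentioned earlier, a minimal sketch might look like the following. It assumes the input is BED6-style output such as gencode.vM25.genes.bed (strand in column 6); the function name tss_windows and the half-width argument are made up for illustration and are not part of BEDOPS.

```shell
# Hypothetical helper: print a window of +/- HALF bases around each gene TSS.
# Usage: tss_windows HALF < genes.bed
# Input columns (assumed): chrom, start, end, id, score, strand
tss_windows() {
  awk -v FS="\t" -v OFS="\t" -v half="$1" '
    {
      # 0-based TSS: gene start on the "+" strand, last base of the gene on "-"
      tss = ($6 == "-") ? $3 - 1 : $2
      start = tss - half
      if (start < 0) start = 0          # clamp at the chromosome start
      print $1, start, tss + half + 1, $4, $5, $6
    }'
}

# Example (file names other than gencode.vM25.genes.bed are placeholders):
#   tss_windows 1000 < gencode.vM25.genes.bed | sort-bed - > tssWindows.bed
#   bedmap --echo --echo-map-id --skip-unmapped mySequences.bed tssWindows.bed > answer.bed
```

Shifting minus-strand genes can change the sort order, which is why the example pipes the windows through sort-bed before any bedmap query.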

Depending on your mouse experiment, the Gorkin et al. fetal dataset housed on the epilogos site might be of interest for demarcating enhancers. A tabix-based data download is available from the top-right corner of the page. You can query it for regions with locally or globally high surprisal values in the enhancer chromatin-state columns (columns 5-9 in the score data portion of the query result).

Or perhaps get the database for mm9 at http://www.enhanceratlas.org/ and use liftOver to get mm10 enhancer regions. Once you have those, you can use bedops or bedmap to do overlap or association queries between enhancers and your windows-of-interest.

As to enrichment, you could use all genes as background and count overlap events over a subset of genes of interest and over background, using a hypergeometric test to calculate the probability of observing at least that many overlaps by chance. You'd need to decide what genes are interesting, however.
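
As a rough sketch of that calculation, the upper tail of the hypergeometric distribution can be computed directly in awk. The parameterisation here is an assumption about how you would set up the counts: N is the number of background genes, K the number of genes of interest, n the number of genes whose TSS windows are hit by any of your sequences, and k the number of hit genes that are also genes of interest. The function name hypergeom_tail is made up for this example.

```shell
# Hypothetical helper: P(X >= k) for X ~ Hypergeometric(N, K, n).
# Usage: hypergeom_tail N K n k
hypergeom_tail() {
  awk -v N="$1" -v K="$2" -v n="$3" -v k="$4" '
    # log of the binomial coefficient C(a, b), from the log-factorial table lf
    function lchoose(a, b) { return lf[a] - lf[b] - lf[a - b] }
    BEGIN {
      N += 0; K += 0; n += 0; k += 0     # force numeric values
      lf[0] = 0
      for (i = 1; i <= N; i++) lf[i] = lf[i - 1] + log(i)
      hi = (K < n) ? K : n               # largest possible overlap count
      p = 0
      for (i = k; i <= hi; i++)
        if (n - i <= N - K)              # skip impossible outcomes
          p += exp(lchoose(K, i) + lchoose(N - K, n - i) - lchoose(N, n))
      printf "%.6g\n", p
    }'
}

# Example call with made-up counts:
#   hypergeom_tail 20000 500 2000 80
```

Working in log space keeps the binomial coefficients from overflowing at genome-scale counts. A small printed value suggests the genes of interest are hit more often than expected from the background.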

Or perhaps you'd synthesize a population of sequences with a similar distribution to what you are starting with, and you would count how many times such random sequences overlap your TSS windows, to measure the probability that your specific sequences overlap their TSSs by chance.
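
A toy version of this resampling idea, written in plain awk, could look like the sketch below. It makes several simplifying assumptions that are not from the answer above: random sequences keep the length and chromosome of the originals and are placed uniformly along the chromosome, the overlap search is a naive loop suitable only for small inputs, and the function name perm_pvalue and its arguments are invented for illustration. For real data, a dedicated tool such as bedtools shuffle combined with bedops or bedmap would be a better fit.

```shell
# Hypothetical helper: empirical p-value for "my sequences hit the windows at least this often".
# Usage: perm_pvalue WINDOWS.bed SEQS.bed CHROM_SIZES REPS SEED
# Prints: observed_hit_count <TAB> empirical_p_value
perm_pvalue() {
  awk -v reps="$4" -v seed="$5" '
    # Number of sequences (c[i], s[i], e[i]) that overlap at least one window
    function hits(    i, j, h, k) {
      h = 0
      for (i = 1; i <= ns; i++) {
        k = nw[c[i]] + 0
        for (j = 1; j <= k; j++)
          if (s[i] < we[c[i], j] && e[i] > ws[c[i], j]) { h++; break }
      }
      return h
    }
    FILENAME == ARGV[1] { j = ++nw[$1]; ws[$1, j] = $2; we[$1, j] = $3; next }  # windows
    FILENAME == ARGV[2] { ns++; c[ns] = $1; s[ns] = $2; e[ns] = $3; len[ns] = $3 - $2; next }  # sequences
    { size[$1] = $2 }                                                           # chromosome sizes
    END {
      srand(seed)
      obs = hits()
      ge = 0
      for (r = 1; r <= reps; r++) {
        # Re-place every sequence uniformly at random on its own chromosome
        for (i = 1; i <= ns; i++) {
          s[i] = int(rand() * (size[c[i]] - len[i] + 1))
          e[i] = s[i] + len[i]
        }
        if (hits() >= obs) ge++
      }
      # Add-one correction so the estimate is never exactly zero
      printf "%d\t%.4f\n", obs, (ge + 1) / (reps + 1)
    }' "$1" "$2" "$3"
}

# Example (file names are placeholders):
#   perm_pvalue tssWindows.bed mySequences.bed mm10.chrom.sizes 1000 42
```

Uniform placement ignores mappability, GC content, and gaps, so a null model closer to how the sequences were actually selected would give a more trustworthy p-value.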

score 0 · Answer 2 · 2021-10-27

3.1 years ago

EagleEye 7.6k

You can give HOMER your coordinate file (as the peak file) together with a mouse GTF annotation file that matches the genome version you are using, for example via its annotatePeaks.pl script, to get this done.
