Hi, I have a list of Arabidopsis gene symbols, how to do the enrichment analysis to identify their Transcription factors? It is possible to do this by TFBStools package? thanks
Hi, I have a list of Arabidopsis gene symbols, how to do the enrichment analysis to identify their Transcription factors? It is possible to do this by TFBStools package? thanks
One possible approach is described below, which involves some manual work, but which facilitates a bit more control over inputs, outputs, and parameters:
Get the annotations of your genes. The positions of these annotations should, ideally, match the assembly version you are using for FIMO calls, described below.
These annotations should be formatted in a sorted BED6+ file, or converted to one via gtf2bed
, gff2bed
or other conversion tools that output sorted BED, with the ID in the fourth column and the strand information in the sixth column.
Pad out -1000/+200 of the annotation TSSs, by strand, via, e.g. bedops
:
$ awk '($6 == "+"){ print $1, ($2-1), $2, $4 }' annotations.bed | bedops --range -1000:200 --everything - > tss.pad.for.bed
$ awk '($6 == "-"){ print $1, $3, ($3+1), $4 }' annotations.bed | bedops --range -200:1000 --everything - > tss.pad.rev.bed
$ bedops --everything tss.pad.for.bed tss.pad.rev.bed > promoters.bed
You might change these bounds depending on what you define as a promoter, or other regulatory region where TFs would bind to and regulate gene activity.
Do a FIMO scan at 1e-4 or other p-value threshold against your plant TF database(s) of choice (TRANSFAC, JASPAR, Athamap and CIS-BP are possibilities, for instance).
I have an answer on the Bioinformatics SE site that explains how to do a FIMO scan for hg19
(human), against the non-redundant JASPAR vertebrate TF model database: https://bioinformatics.stackexchange.com/a/2491/776
If you use FIMO, you would repeat this or something similar for your assembly of Arabidopsis and for published TF model databases for Arabidopsis.
The output of FIMO will be a collection of TF binding sites (TFBS) over your chosen assembly of Arabidopsis, in BED format. (Make sure that this result is sorted per sort-bed
, as described in the SE answer.)
Look for overlaps of, say, three or more bases between the file of padded TSSs (promoters.bed
) and the TFBS that came out of running FIMO (fimo.bed
):
$ bedmap --echo --echo-map-id-uniq --delim '\t' --bp-ovr 3 promoters.bed fimo.bed > answer.bed
Or if you want the full TFBS annotation, and not just the TF model names:
$ bedmap --echo --echo-map --delim '\t' --bp-ovr 3 promoters.bed fimo.bed > answer.bed
Repeat steps 1-4 of this analysis for background ("random") selections of genes over the whole genome. You could use shuf -n
or sample
or similar to get a random sample of genes from a text-formatted annotations file, then convert them to background promoters.
Once you have a collection of TF model names for your genes-of-interest and for a random selection of genes-over-background, you could use a hypergeometric test to determine if any particular TFs are enriched, given the genes-of-interest.
The following answer may help describe the use of this test in a more concrete way, for a similar scenario: A: Calculate if the co-occurring of two TFBSs is higher than one would expect by ch
Hi Alex, thanks a lot for pointing me to the answer! Couple questions: 1. step4 is filtering to keep TFBS overlapping with annotated promoter regions? I am not very familiar with FIMO yet, why would it output TFBS that are outside of the provided promoter region? 2. step5 is essentially what Homer is doing as the last step to identify enriched motif? If I have one gene, and trying to find the TF binding motif for that gene, can I still use this approach?
Some possible answers:
In the approach I describe, FIMO scans the entire genome for binding sites (in practice, minus masked regions) — more than just promoters. So using bedmap
restricts the set of binding sites to those which overlap by at least (in this case) three bases with a promoter region. You could specify other overlap criteria — something more or less relaxed — depending on how stringent you want to be. The idea is to connect promoters for your gene of interest, by filtering the TF binding sites that FIMO calls for those which overlap your promoters.
Someone more familiar with Homer may have a different answer, but I believe Homer is analogous to the MEME tool, which discovers motifs, not binding sites. MEME scans the regions of interest, looking for patterns of sequences, calling them motifs. The output of MEME is one or more PWMs — positional weight matrices — which give a numerical definition to these motifs. FIMO takes these PWMs as input and looks over the genome for sites where these patterns are found. If the PWMs are made from data for transcription factors, then these sites would be called TFBS. I may be wrong about what Homer does, but I think that's what it does, mainly.
You could certainly use these steps for looking for TFBS in only one gene's promoter. You might still keep the FIMO results around, though, in case you run into other genes-of-interest where you might be interested in regulatory TFs. You might also explore regulatory networks connected to your tissue of interest and potential TFs of interest (http://regulatorynetworks.org, for instance).
This is a computational approach. While FIMO can run some numbers and tell you that a binding site exists with so-and-so statistical significance, ChIP-seq data is also useful for giving experimental, biological confirmation that a TF actually binds to a gene's regulatory regions. Queries to ENCODE might be useful here, also (https://encodeproject.org).
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
What do you mean their transcription factors? TFs that bind to their promoters? Regardless, HOMER or AME from the MEME suite are probably your best bets. From a quick glance, it doesn't seem like the TFBStools package does motif enrichment analyses.
thank you very much, but I am looking for a R package.
PWMenrich may work for you then, though I've never used it and can't vouch for its results/ease of use.
thank you, I will try it.