Question

Filtering the result of transcription factor binding matrix prediction from PWM with FIMO software

1

7.6 years ago

bharata1803 ▴ 560

Hello,

I have a list of PWMs for TFs from Transfac Pro. I have mapped each matrix name to its corresponding gene via the Ensembl gene ID. In total, 1082 genes are assigned to the PWMs. Some PWMs represent heterodimers, so one matrix can be assigned to two or more genes. My goal is to build a network among these transcription factors based on their binding matrices.

My workflow is:

I extract the promoter region of all 1082 genes, defined as the 500 nucleotides immediately upstream of the first nucleotide of the first exon.
I use FIMO from the MEME Suite to predict binding sites, with the p-value threshold set to 1e-4.

In the end, I want to know which TFs regulate which other TFs; the end result would be a network of TFs that regulate each other.

I have finished the FIMO run, and after mapping the matrices to gene names I have around 200,000 TF-target pairs.

When I check the number of TFs per gene and the number of targets per TF, the numbers don't look right: there are far too many.

For example, one TF can have almost 900 target genes, and one gene can have more than 900 TFs. I understand that this calculation only checks whether the PWM matches some pattern in the DNA sequence of the binding region.

My question is: is there any way to filter the TF-target pairs using not only the p-value but also other factors, such as TF combinations? It is impossible for 900 TFs to bind when the binding region is only 500 nucleotides long.

Thank you for your opinion and suggestion.

transcription binding matrix • 3.0k views

ADD COMMENT • link updated 7.6 years ago by Petr Ponomarenko ★ 2.8k • written 7.6 years ago by bharata1803 ▴ 560

score 1 · Answer 1 · 2017-05-11

1

7.6 years ago

Petr Ponomarenko ★ 2.8k

First, check whether you have PWMs that are very similar and group them, because multiple factors cannot bind at the same place at the same time. Second, what are the species and tissue type? Third, do you know the TSS locations? (I do not know where your exon set comes from, but it may lack the 5' UTR, or its quality may not be good enough.) Fourth, you could use other related, well-annotated species to find at least an approximate window where each TF tends to bind, and which types of genes it targets, for example using GO terms.
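One rough way to act on the first point, without comparing the matrices themselves, is to collapse hits that overlap on the same promoter and keep only the most significant one. The sketch below is a stand-in for proper PWM grouping (which would compare the matrices directly, for example with a motif comparison tool such as Tomtom), not an implementation of it. The `Hit` fields and the function names are hypothetical; coordinates are assumed to be 1-based and inclusive.

```python
from collections import defaultdict, namedtuple

# Hypothetical record for one predicted site; field names are illustrative only.
Hit = namedtuple("Hit", "motif seq start stop pvalue")

def overlaps(a, b):
    """True if two hits on the same sequence share at least one position."""
    return a.start <= b.stop and b.start <= a.stop

def collapse_overlapping(hits):
    """Greedily keep the lowest p-value hit among mutually overlapping hits."""
    by_seq = defaultdict(list)
    for h in hits:
        by_seq[h.seq].append(h)
    kept = []
    for seq_hits in by_seq.values():
        chosen = []
        # Visit the most significant hits first so they win any overlap.
        for h in sorted(seq_hits, key=lambda h: h.pvalue):
            if not any(overlaps(h, c) for c in chosen):
                chosen.append(h)
        kept.extend(sorted(chosen, key=lambda h: h.start))
    return kept

hits = [
    Hit("A", "g1", 10, 20, 1e-5),
    Hit("B", "g1", 15, 25, 1e-6),  # overlaps A and is more significant
    Hit("C", "g1", 30, 40, 5e-5),
    Hit("A", "g2", 10, 20, 1e-5),
]
for h in collapse_overlapping(hits):
    print(h.seq, h.motif, h.start, h.stop)
```

Keeping a single winner per location is deliberately crude: it throws away genuinely competing factors, so it is probably better treated as an upper bound on how much redundancy overlapping matrices contribute.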

I am just curious: why do you use FIMO, and specifically the 1e-4 cutoff? What is wrong with MATCH from Transfac Pro and its "reduce false discovery rate" parameter?

Let's say your 500 nt sequence is random and the p-value cutoff for a given PWM is 1e-4. There are almost 500 candidate positions in a 500 nt region, so the cutoff will be passed on average about 500 × 1e-4 = 0.05 times per strand purely by chance (about 0.1 if both strands are scanned). Multiplied across roughly a thousand promoters and many matrices, chance matches alone add up quickly. In a real promoter region you will find even more potential TFBSs for a given PWM, because both the sequence and the PWM are nonrandom, and promoters come from a much smaller set of 500 nt sequences than random DNA does.
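The chance-hit estimate can be checked with a few lines of arithmetic. This is only a sketch under simplifying assumptions: positions are treated as independent, the motif width of 10 and the library size of 1,000 matrices are made-up illustration values, and both strands are assumed to be scanned.

```python
def expected_chance_hits(seq_len, motif_width, p_cutoff, both_strands=True):
    """Expected number of positions passing p_cutoff in a random sequence."""
    positions = max(seq_len - motif_width + 1, 0)  # windows the motif fits in
    strands = 2 if both_strands else 1
    return positions * strands * p_cutoff

def prob_at_least_one_hit(seq_len, motif_width, p_cutoff, both_strands=True):
    """Probability that a random sequence has one or more chance hits."""
    positions = max(seq_len - motif_width + 1, 0)
    tests = positions * (2 if both_strands else 1)
    return 1.0 - (1.0 - p_cutoff) ** tests

# Assumed values: 500 nt promoter, 10 nt motif, p < 1e-4, 1,000 matrices.
per_pair = expected_chance_hits(500, 10, 1e-4)
print(f"expected chance hits per matrix per promoter: {per_pair:.4f}")
print(f"expected chance hits per promoter, 1,000 matrices: {per_pair * 1000:.1f}")
print(f"P(at least one chance hit) per pair: {prob_at_least_one_hit(500, 10, 1e-4):.3f}")
```

Under these assumptions, a single matrix-promoter pair rarely matches by chance, but the counts become large once they are summed over every matrix and every promoter.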

Another thing to consider is that some TFs prefer to bind downstream of the TSS, while others bind only upstream. A TF rarely binds on both sides of the TSS and is functional on both.

Some TFs, such as TBP, bind in a very narrow window relative to the TSS (mostly because their binding actually defines where the transcription machinery assembles). This can also be taken into account. It may be easier to start from such TFs and then add more.
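A positional filter of this kind could look roughly like the sketch below. It assumes the promoter sequences were extracted in transcript orientation (so position 1 is the most upstream base and the TSS is +1, with no position 0), and the window given for TBP is a placeholder value chosen for illustration, not a curated one. All names here are hypothetical.

```python
from collections import namedtuple

# Hypothetical record; start/stop are 1-based positions within the promoter sequence.
Hit = namedtuple("Hit", "motif seq start stop")

def to_tss_relative(pos, upstream_len):
    """Map a position in the extracted sequence to a TSS-relative coordinate.

    Positions 1..upstream_len map to -upstream_len..-1; later positions map
    to +1, +2, ... (there is no position 0 in this convention).
    """
    if pos <= upstream_len:
        return pos - upstream_len - 1
    return pos - upstream_len

def filter_by_window(hits, windows, upstream_len=500):
    """Keep hits that lie fully inside their motif's allowed window.

    Motifs without an entry in `windows` are left unfiltered.
    """
    kept = []
    for h in hits:
        if h.motif not in windows:
            kept.append(h)
            continue
        lo, hi = windows[h.motif]
        rel_start = to_tss_relative(h.start, upstream_len)
        rel_stop = to_tss_relative(h.stop, upstream_len)
        if lo <= rel_start and rel_stop <= hi:
            kept.append(h)
    return kept

windows = {"TBP": (-40, -20)}  # placeholder window, for illustration only
hits = [
    Hit("TBP", "g1", 465, 472),  # -36..-29, inside the window
    Hit("TBP", "g1", 100, 107),  # -401..-394, outside the window
    Hit("OTHER", "g1", 100, 107),  # no window defined, kept as-is
]
print(filter_by_window(hits, windows))
```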

Last question, also out of curiosity: how are you planning to validate your model?

ADD COMMENT • link 7.6 years ago by Petr Ponomarenko ★ 2.8k

0

The species is human. I derive the TSS from the gene annotation and the human reference genome. Basically, for each gene in my list I take the location of the first exon and then take the 500 nt before it. For example, if the GTF file says gene X is on chr1 with its first exon at 100,000-101,000, my promoter region for gene X is chr1:99,500-99,999. Then I use bedtools to extract the sequence from the Ensembl human reference genome (hg38). That is all.

I have MATCH from Transfac but haven't tried it yet; I am more familiar with FIMO. The cutoff is the FIMO default.

As for validation, that is exactly what I want to work on. I want to use this as a base model and integrate other data, for example mutations and gene expression levels. In a sense, I will use gene expression as a validator of the TF-target relationships. I want to develop a mathematical model that can do this, and this is my research topic.
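For reference, the coordinate arithmetic for the gene X example can be sketched as follows. This is a minimal illustration, not the exact procedure used: the function name and the `downstream` parameter are hypothetical, GTF coordinates are taken as 1-based inclusive, and the output is a BED-style interval (0-based start, exclusive end). For a minus-strand gene the TSS is at the end coordinate of the first exon, so "upstream" lies at higher genomic coordinates and the extracted sequence still has to be reverse complemented (for example with strand-aware extraction in bedtools).

```python
def promoter_interval(chrom, exon_start, exon_end, strand, upstream=500, downstream=0):
    """Return a BED-style (chrom, start, end, strand) promoter interval.

    exon_start/exon_end are the 1-based inclusive coordinates of the first
    exon as given in a GTF file. `downstream` counts bases starting at the
    TSS itself.
    """
    if strand == "+":
        tss = exon_start                 # TSS is the first base of the exon
        start = tss - 1 - upstream       # convert to 0-based, then go upstream
        end = tss - 1 + downstream
    elif strand == "-":
        tss = exon_end                   # TSS is the last genomic base of the exon
        start = tss - downstream
        end = tss + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return (chrom, max(start, 0), end, strand)

# Plus strand: 1-based 99,500-99,999 becomes BED 99,499-99,999.
print(promoter_interval("chr1", 100_000, 101_000, "+"))
# Minus strand: the 500 nt upstream are 1-based 101,001-101,500.
print(promoter_interval("chr1", 100_000, 101_000, "-"))
```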

ADD REPLY • link 7.6 years ago by bharata1803 ▴ 560

1

Very cool research topic indeed. Your approach is good as a starting point (did you take into account that genes on the reverse strand have to be looked up from the other direction and their sequences reverse complemented?). You will be better off adding a bit of sequence downstream of the TSS, for the reasons mentioned above. For human, you then want to remove false positives to get a proper model, so think about ways to do that. One is narrowing down the list of TFs, as I said; another is to narrow down the areas of the promoter where TFs usually bind. There are many ideas to choose from: nucleosome binding, histone modifications, SNP positions and cytosine methylation, physical DNA conformation, and kinetic models of TFs binding to DNA and to other TFs. Otherwise, for a research topic you might be better off with stable homozygous lines (e.g. mouse models) or with self-pollinating plants such as Medicago or Arabidopsis.

ADD REPLY • link 7.6 years ago by Petr Ponomarenko ★ 2.8k

0

Also select carefully between NCBI RefSeq and Ensembl. Consider who is going to use your research findings and how.

ADD REPLY • link 7.6 years ago by Petr Ponomarenko ★ 2.8k

1

To generate the candidate promoter regions, I use bedtools, which I already force to return the reverse-strand sequence, so I think there is no problem there.

What do you mean by downstream of the TSS? Could you please explain? I have heard a lot about downstream regulatory regions but I still do not understand them. Is it the region after the last exon?

For narrowing down, I think MCAST from the MEME Suite is a good option. That tool can report clusters of TFs, although the problem is that I cannot search thousands of patterns in one go. I need to reduce the number of candidate TFs first and then generate the TF clusters with MCAST.

As for narrowing down where TFs usually bind, unfortunately I don't have enough data such as methylation, histone modifications, or SNPs in the promoter regions. I plan to produce this as an alternative output of my research: it will give the predicted TF configuration and, if no good result can be obtained, I want to be able to give some reason why. The reason may be these modifications, mRNA, or something else.

ADD REPLY • link 7.6 years ago by bharata1803 ▴ 560

0

Hi! How was your experience with FIMO?
I am curious about:
* How did you select the window for FIMO, e.g. 1, 2, or 5 kb, or >5 kb regions for promoters? I am using RSAT tools for that.
* Which parameters did you set? (I think 1e-4 as a p-value threshold is a bit lenient; I am considering applying 1e-4 to the q-value instead with the --qv-thresh option, but I'm not sure that is the way to go.)
* How did you validate your model?

Thanks in advance!

ADD REPLY • link 3.9 years ago by lessismore ★ 1.4k