Hello,
I have a list of PWMs for TFs from Transfac Pro. I have mapped each matrix name to its corresponding gene via Ensembl gene ID. In total, I have 1082 genes assigned to PWMs. Some PWMs represent heterodimers, so one matrix can be assigned to two or more genes. My goal is to build a network among these transcription factors based on their binding matrices.
My workflow is:
I extract the promoter region of all 1082 genes. I define this as the 500 nucleotides upstream of the first nucleotide of the first exon.
I use FIMO from the MEME Suite to predict binding sites, with the p-value threshold set to 1e-4.
In the end, I want to know which TF regulates which other TF, so the final result would be a network of TFs that regulate each other.
I have finished the FIMO run, and after mapping the matrices to gene names I have around 200,000 TF-target pairs.
After checking the number of TFs per gene and the number of targets per TF, the numbers don't look right. There are far too many.
For example, one TF can have almost 900 target genes, and one gene can have more than 900 TFs. I understand that this calculation only tests whether the PWM matches some pattern in the DNA sequence of the binding region.
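To put a rough number on how many of these matches could be expected by chance alone, here is a minimal sketch. It assumes that each scanned position is an independent test at the p-value threshold, that both strands are scanned, and that the motif is 10 nt wide (a made-up width for illustration). The function names are hypothetical and not part of FIMO.

```python
import math

def expected_chance_hits(seq_len, motif_width, p_threshold, both_strands=True):
    """Expected number of chance matches of one motif in one sequence."""
    # Number of start positions where a motif of this width fits.
    positions = max(seq_len - motif_width + 1, 0)
    if both_strands:
        positions *= 2
    return positions * p_threshold

def prob_at_least_one(expected):
    """Poisson approximation: P(at least one hit) given the expected count."""
    return 1.0 - math.exp(-expected)

# Values from the post: 500 nt promoters, threshold 1e-4, 1082 promoters.
# The motif width of 10 is an assumption.
per_promoter = expected_chance_hits(500, 10, 1e-4)
p_hit = prob_at_least_one(per_promoter)
print(per_promoter, p_hit, p_hit * 1082)
```

Under these assumptions a single motif is expected to hit roughly 9% of the promoters, or about 100 of the 1082 genes, purely by chance. A TF with several matrices would collect more, so very large target lists are not surprising at this threshold.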
My question is: is there any way to filter the TF-target pairs using not only the p-value but also other factors, such as TF combinations? It is impossible for 900 TFs to bind when the binding region is only 500 nucleotides long.
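As one simple starting point for such a filter, here is a minimal sketch that keeps only hits below a q-value cutoff and then counts targets per TF and TFs per target. It assumes the tab-separated FIMO output has the columns `motif_id`, `sequence_name` and `q-value`, and that sequence names are gene names. The `motif_to_genes` mapping, the cutoff of 0.05 and the toy rows are all made up for illustration.

```python
import csv
from collections import defaultdict

def tf_target_edges(fimo_lines, motif_to_genes, q_max=0.05):
    """Return the set of (TF gene, target gene) pairs with q-value <= q_max."""
    # Skip blank lines and the '#' comment lines at the end of the file.
    rows = (ln for ln in fimo_lines if ln.strip() and not ln.startswith("#"))
    edges = set()
    for row in csv.DictReader(rows, delimiter="\t"):
        q = row.get("q-value") or ""
        if not q or float(q) > q_max:
            continue
        # A heterodimer matrix maps to several genes; each one gets an edge.
        for tf in motif_to_genes.get(row["motif_id"], ()):
            edges.add((tf, row["sequence_name"]))
    return edges

def degree_counts(edges):
    """Count distinct targets per TF and distinct TFs per target."""
    targets, regulators = defaultdict(set), defaultdict(set)
    for tf, target in edges:
        targets[tf].add(target)
        regulators[target].add(tf)
    return ({k: len(v) for k, v in targets.items()},
            {k: len(v) for k, v in regulators.items()})

# Toy input (invented rows, not real FIMO results).
demo = [
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n",
    "M1\tTFA\tGENE1\t10\t19\t+\t15.2\t1e-6\t0.01\tACGTACGTAC\n",
    "M1\tTFA\tGENE2\t40\t49\t-\t9.1\t9e-5\t0.60\tACGTTCGTAC\n",
    "M2\tTFB::TFC\tGENE1\t100\t111\t+\t14.0\t2e-6\t0.02\tTTGACGTCAATT\n",
]
edges = tf_target_edges(demo, {"M1": ["TFA"], "M2": ["TFB", "TFC"]})
targets_per_tf, tfs_per_target = degree_counts(edges)
print(sorted(edges), targets_per_tf, tfs_per_target)
```

For a real run the list would be replaced by an open file handle on the FIMO output. A q-value cutoff only controls chance matches. It says nothing about whether a site is actually bound, so it is a first filter rather than a validation.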
Thank you for your opinion and suggestion.
So, the species is human. I generate the TSS from the gene annotation and the human reference genome. Basically, for each gene in my list I extract the location of the first exon and then take the 500 nt before it. For example, from the GTF file I get gene X on chr1, with the first exon at 100,000-101,000. My promoter window for gene X is then chr1:99,500-99,999. Then I use bedtools to extract the sequence from the human reference genome hg38 from Ensembl. That is all.
I have MATCH from Transfac but I haven't tried it yet. I am more familiar with FIMO. As for the cutoff, it is the FIMO default.
As for validation, this is what I want to do: I want to use this as a base model and integrate other things, for example mutations and gene expression levels. In a sense, you can think of gene expression level as the validator for the TF-target relationships. I want to develop a mathematical model that can do this, and this is my research topic.
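The window calculation described here can be sketched as follows, with the strand taken into account when choosing the coordinates. This is a minimal sketch, not the exact procedure used in the post. It assumes a 1-based TSS coordinate as in GTF and returns a 0-based, half-open interval as in BED. The function name and the optional `downstream` parameter are hypothetical.

```python
def promoter_window(tss, strand, upstream=500, downstream=0):
    """Return a BED-style (start, end) window around a 1-based TSS.

    'upstream' bases lie before the TSS in the direction of transcription.
    'downstream' bases start at the TSS itself and run into the gene.
    """
    if strand == "+":
        # Upstream means lower coordinates on the forward strand.
        start = tss - 1 - upstream
        end = tss - 1 + downstream
    elif strand == "-":
        # On the reverse strand the TSS is the highest exon coordinate,
        # and upstream means higher coordinates.
        start = tss - downstream
        end = tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end

# Gene X from the post: first exon at 100,000-101,000 on chr1.
print(promoter_window(100000, "+"))  # forward-strand gene, TSS = exon start
print(promoter_window(101000, "-"))  # reverse-strand gene, TSS = exon end
```

For a forward-strand gene this reproduces the 99,500-99,999 window from the example. For a reverse-strand gene the window lies after the end of the first exon, so reverse complementing the extracted sequence is not enough on its own. The coordinates have to be chosen on the other side of the exon as well.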
Very cool research topic indeed. Your approach is good as a starting point (did you take into account that genes on the reverse strand have to be looked up from the other direction and the sequences reverse complemented?). You will be better off adding a bit of sequence downstream of the TSS, for the reasons mentioned above. For human, you then want to remove false positives to get a proper model, so you can think of ways to do that. One is to narrow down the list of TFs, as I said. Another is to narrow down the areas of the promoter where TFs usually bind. There are many ideas to choose from: nucleosome binding, histone modifications, SNP positions and cytosine methylation, physical DNA conformation, and kinetic models of TFs binding to DNA and to other TFs. For a research topic, you might otherwise be better off with stable homozygous lines (e.g. mouse models) or with self-pollinating plants such as Medicago or Arabidopsis.
Also select carefully between NCBI RefSeq and Ensembl. Consider who is going to use your research findings and how.
To generate the "promoter region candidates", I use bedtools, which already returns the reverse-strand sequence for genes on the reverse strand, so I think there is no problem there.
As for downstream of the TSS, can you please explain what that is? I have heard a lot about downstream regulatory regions but I still don't understand them. Is it the region after the last exon?
For narrowing down, I think MCAST from the MEME Suite is a good option. That tool can report clusters of TFs, although the problem is that I cannot search thousands of patterns in one go. I need to reduce the number of TF candidates first and then generate the TF clusters with MCAST.
As for narrowing down where TFs usually bind, unfortunately I don't have enough data such as methylation, histone modifications or SNPs in the promoter region. I plan to produce this as an alternative output of my research. So, my research will give the predicted configuration of TFs and, if no good result can be obtained, I want to be able to give some reason why. The reason may be these modifications, mRNA, or something else.
Hi! How was your experience with FIMO?
I am curious about:
* How did you select the window for FIMO, e.g. 1, 2 or 5 kb, or regions longer than 5 kb, for promoters? I am using RSAT-tools for that.
* Which parameters did you set? I think 1e-4 for the p-value is a bit lenient. I am considering applying 1e-4 to the q-value instead, with the --qv-thresh option, but I am not sure that is the way to go.
* How did you validate your model?
Thanks in advance!