Hello,
I need your help regarding transcription factors and their gene targets (association with protein coding genes).
What do I have?
A list of several protein-coding genes and experimentally confirmed variants located in their cis-regulatory elements (CREs), up to 3 kbp from the TSS, in cancer cell lines.
What do I want to achieve?
I'd like to check the influence of the variants on transcription factor binding sites (TFBS). An example result: the wild-type version of gene A is a target of (is regulated by) some transcription factor, let's say MYC. However, the mutated version of this gene (carrying sequence variants near the TSS, in the regulatory region) is regulated by a different transcription factor, or the mutations weaken or even remove the association between this gene and MYC. We could then hypothesize that the expression of gene A is altered in cancer cells due to the variants in its cis-regulatory elements. After this in silico analysis I'd like to perform functional analysis in the wet lab. I was thinking of taking the best candidates and doing siRNA silencing of the transcription factor, ChIP-qPCR, transfection and cloning.
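To make the idea concrete, here is a minimal sketch in Python of what "a variant weakens a binding site" means in terms of motif scores. The count matrix and both sequences are made up for illustration (this is not a real MYC motif), a uniform background of 0.25 per base is assumed, and only the forward strand is scanned.

```python
import math

# Hypothetical 4-position count matrix (one dict per motif position).
# The counts are invented for illustration and do not describe a real TF.
COUNTS = [
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 17},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds(site):
    """Sum of per-position log2(frequency / background) for one site."""
    score = 0.0
    for position, base in zip(COUNTS, site):
        frequency = position[base] / sum(position.values())
        score += math.log2(frequency / BACKGROUND)
    return score


def best_score(sequence):
    """Best log-odds score over all forward-strand windows of the sequence."""
    width = len(COUNTS)
    windows = (sequence[i:i + width] for i in range(len(sequence) - width + 1))
    return max(log_odds(window) for window in windows)


wild_type = "TTCAGTAA"  # toy sequence containing the preferred site CAGT
mutant = "TTCATTAA"     # same sequence with a G>T substitution inside the site

delta = best_score(mutant) - best_score(wild_type)
print(f"wild-type: {best_score(wild_type):.2f}")
print(f"mutant:    {best_score(mutant):.2f}")
print(f"delta:     {delta:.2f}")  # negative means the site is weakened
```

A negative delta corresponds to the "weakened or removed" case; a variant that creates a better match for another motif would show a positive delta for that motif instead.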
What have I done so far?
I used two tools for this analysis:
- motifbreakR (https://bioconductor.org/packages/release/bioc/html/motifbreakR.html)
- RSAT (http://rsat.sb-roscoff.fr/)
Regarding motifbreakR, I set the p-value threshold relatively high, namely 0.05. Only 9 of 55 variants were returned by the software as interesting.
Then I used matrix-scan (RSAT). I ran the analysis with cisBP Homo_sapiens - [6607 motifs] (2019-06_v2.00) in TRANSFAC format. The input sequence was the region (CRE) near the TSS of the gene. I ran matrix-scan twice: first with the wild-type sequence and then with the mutated sequence (I inserted the variants manually). I obtained two sets of transcription factor binding sites with their sequences. I compared these two sets in R and obtained the TFs that are specific to the wild-type or to the mutated sequence (those that differ between the sets). So technically I did what I wanted.
However, I performed the same analysis using ENCODE (Human TFs) - [2065 motifs] (2018-03) instead of cisBP, and the results are completely different (cisBP vs ENCODE). Moreover, when I check the region of interest in the UCSC Genome Browser with the tracks JASPAR 2020 TFBS and Transcription Factor ChIP-seq Clusters (161 factors) from ENCODE with Factorbook Motifs, there is no overlap of TFBS between ENCODE and JASPAR.
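The comparison step described above can be sketched as follows. This is a minimal illustration in Python, not the R code actually used; the function name `compare_hits` is hypothetical, and the TF names and hit lists are placeholders rather than real results. It also shows one way to keep only the changes that both motif collections agree on.

```python
def compare_hits(wild_hits, mutant_hits):
    """Split TF hits into those lost, gained, or shared after mutation."""
    wild, mutant = set(wild_hits), set(mutant_hits)
    return {
        "lost": sorted(wild - mutant),      # predicted only for wild-type
        "gained": sorted(mutant - wild),    # predicted only for the mutant
        "shared": sorted(wild & mutant),    # unaffected by the variants
    }


# Placeholder hit lists (TF names per scan); not real output from any tool.
cisbp = compare_hits(["MYC", "TF_A", "TF_B"], ["TF_B", "TF_C"])
encode = compare_hits(["MYC", "TF_D"], ["TF_D", "TF_C", "TF_E"])

# Keep only the changes supported by both motif collections.
lost_in_both = sorted(set(cisbp["lost"]) & set(encode["lost"]))
gained_in_both = sorted(set(cisbp["gained"]) & set(encode["gained"]))

print("lost in both:", lost_in_both)
print("gained in both:", gained_in_both)
```

Comparing by TF name only works if the collections use the same naming for the same factor, which is an assumption here; in practice the names may need to be harmonized first.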
Unfortunately, I need reliable results before I start work in the wet lab, and I'd like to avoid false positives. So my questions are:
Is my strategy correct?
Why does every database of TFBS and TFs show different results compared to the others?
Is there a database with accurate and reliable results? Which one should I trust? Is ENCODE enough?