I have a list of genes and I am trying to determine whether they contain specific TF binding sites in their promoter region.
I am retrieving 3kb upstream of TSS using getPromoterSeq
and then use TFBSTools
and JASPAR2020
to look for TF binding sites.
To simplify things, let's take an example gene / TF pair
transcripts <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
sequence <- getPromoterSeq(query = transcripts$`19109`, subject = Mmusculus,
upstream = 0, downstream = 3000)
pfm <- getMatrixByID(JASPAR2020, "MA0141.2")
res <- searchSeq(toPWM(pfm), sequence, min.score = "80%")
res.df <- data.frame(writeGFF3(res))
res.df$score <- relScore(res)[[1]]
res.df$pval <- pvalues(res)[[1]]
This returns four hits, as seen below
seqname source feature start end score strand frame
1 1 TFBS TFBS 45 62 0.8484427 + .
2 2 TFBS TFBS 42 59 0.8484427 + .
3 3 TFBS TFBS 41 58 0.8484427 + .
4 4 TFBS TFBS 38 55 0.8484427 + .
attributes pval
1 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
2 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
3 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
4 TF=ESR1;class=Nuclear receptors with C4 zinc fingers;sequence=AGCAGTCACCATGACCAT 1.311513e-05
If I perform the same analysis on the JASPAR2020 website I only get 1 hit, from position 45 to 62, but with a slightly lower score of 0.81323616527. In general, when I use multiple sequences and matrices, R always returns more hits than the website, and with consistently higher scores.
I have never done this type of analysis before, and I am wondering whether I am doing something wrong. In the simple example above, it is clear that (likely because of some redundancy in the motif) a lot of the hits overlap, and I am wondering whether the JASPAR website gets rid of those? This, however, does not explain some of the missing hits, as they are not overlapping.
This brings me to my next question, which is about extra filtering. It generally looks like I get a lot of hits. The JASPAR website warns that "this type of analysis has a high sensitivity but abysmal selectivity", so I am wondering what the best approach is to further validate hits. They talk about phylogenetic footprinting, but it is not too clear to me how to implement it.
Would a bootstrapping approach using randomly shuffled matrices be good?
Thank you very much, Alex, I will read through that and see if I can get my way through it!
On top of that (seconding the
fimo
suggestion) I would narrow down the promoters to like 500bp, not 3kb to avoid excessive motif matches by chance. By experience with ATAC-seq data the actual open chromatin region upstream of TSS is by far not 3kb, rather like 500bp or so. This eventually helps reducing the multiple-testing burden (if you choose to use the q-values of fimo).