I'm using the R package TFBSTools to predict TF binding sites:
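library(TFBSTools)
# tf, pfm and seq are defined earlier in my script
pwm = PWMatrix(ID="Unknown", name=tf, matrixClass="Unknown", strand="+",
               bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(),
               profileMatrix=as.matrix(pfm))
# keep hits scoring at least 80% of the maximum possible score
peaks = searchSeq(pwm, seq, min.score="80%", mc.cores=10L)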
I'm unsure what background probabilities I should use for the nucleotides (above I have simply used 0.25 for each of the four). I can't find an official reference of this kind for the GRCh38 genome anywhere.
In the R package MEET I found a reference probability list: c(A=0.32,T=0.32,G=0.18,C=0.18).
However, I am not certain that this profile is suitable here, given that not all regions of the genome maintain these ratios (e.g. genes are GC-rich while non-coding regions are AT-rich).
Does anyone know whether I should just stick with 0.25 for each nucleotide, or is a tailored profile more appropriate?
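
To make "tailored profile" concrete, here is a minimal base-R sketch of what I have in mind: estimating the background from the very sequences being scanned instead of using a genome-wide constant. The helper name background_freqs, its pseudocount and symmetric arguments, and the toy sequence are all made up for illustration; nothing here comes from TFBSTools or MEET.

```r
# Hypothetical helper (not part of TFBSTools or MEET): estimate background
# nucleotide frequencies from the sequences that will be scanned.
# seqs: character vector of DNA sequences (use as.character() on Biostrings objects).
background_freqs <- function(seqs, pseudocount = 1, symmetric = FALSE) {
  bases <- c("A", "C", "G", "T")
  chars <- strsplit(toupper(paste(seqs, collapse = "")), "")[[1]]
  # Characters other than A/C/G/T (e.g. N) become NA and are not counted
  counts <- as.numeric(table(factor(chars, levels = bases)))
  freqs <- setNames((counts + pseudocount) / sum(counts + pseudocount), bases)
  if (symmetric) {
    # Force A = T and C = G, which is reasonable when both strands are scanned
    at <- (freqs[["A"]] + freqs[["T"]]) / 2
    gc <- (freqs[["C"]] + freqs[["G"]]) / 2
    freqs[c("A", "T")] <- at
    freqs[c("C", "G")] <- gc
  }
  freqs
}

# Toy GC-rich sequence, invented for this example
toy_seq <- "GCGCGGATCCGCGTAACGCG"
bg <- background_freqs(toy_seq)
print(round(bg, 3))
```

The resulting named vector has the same shape as the bg argument above. My understanding, which I have not verified, is that the background actually affects the scores at the point where the frequency matrix is converted to a PWM (for example with TFBSTools' toPWM and its bg argument), so that is probably where a tailored vector would have to be supplied.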