background nucleotide probability when finding TF binding sites in GRCh38
2.2 years ago

I'm using the R package TFBSTools to predict TF binding sites:

library(TFBSTools)

# tf, pfm and seq are defined elsewhere
pwm <- PWMatrix(ID = "Unknown", name = tf, matrixClass = "Unknown", strand = "+",
                bg = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), tags = list(),
                profileMatrix = as.matrix(pfm))
peaks <- searchSeq(pwm, seq, min.score = "80%", mc.cores = 10L)

I'm curious what I should use as the background probability for the nucleotides (as you can see, I have simply used 0.25 for all four). I can't find an official reference of this kind for the GRCh38 genome anywhere. In the R package MEET I found a reference probability list: c(A=0.32, T=0.32, G=0.18, C=0.18).

However, I'm not certain this profile is suitable here, given that not all regions of the genome maintain these ratios (e.g. genes are GC-rich while non-coding regions are AT-rich).

Does anyone know whether I should just stick to the 0.25 probability split four ways, or is a tailored profile more appropriate?
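For reference, one way to get a tailored profile is to estimate it from the very sequences being scanned rather than from the whole genome. The sketch below uses base R only; `estimate_bg` and the toy `seqs` vector are hypothetical names made up for illustration, and with Biostrings objects something like `letterFrequency(..., as.prob = TRUE)` would probably be the more idiomatic route.

```r
# Estimate nucleotide background frequencies from the sequences to be scanned.
# `seqs` is a hypothetical character vector standing in for the real input.
estimate_bg <- function(seqs) {
  bases <- strsplit(toupper(paste(seqs, collapse = "")), "")[[1]]
  # Keep only unambiguous bases so N and other IUPAC codes do not skew the estimate
  bases <- bases[bases %in% c("A", "C", "G", "T")]
  counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
  bg <- as.numeric(counts) / sum(counts)
  names(bg) <- c("A", "C", "G", "T")
  bg
}

seqs <- c("ATGCGCGCTA", "GGGCCCATNN")  # toy data, not real genomic sequence
bg <- estimate_bg(seqs)
print(round(bg, 3))
```

The resulting named vector has the same shape as the `bg` argument in the call above, so it could be passed in place of the uniform 0.25 values.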

2.2 years ago

Usually, this kind of information can be found in the vignettes that accompany Bioconductor packages. You can at least see whether they bother to use any particular values.

My gut feeling is that it doesn't matter too much, since this information is just used for the pseudocount calculation. You could also generate two PWMs with different bg values and check whether the number of resulting peaks differs much.
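A quick way to try this out without any package is to convert the same count matrix to log-odds scores under two backgrounds and compare. The sketch below is base R and uses a common log2 probability-ratio formula in which the background appears both in the pseudocount and in the denominator of the ratio; whether this matches what TFBSTools does internally is an assumption to check against its documentation. The function `pfm_to_logodds`, the toy `pfm`, and the pseudocount of 0.8 are illustrative choices, and the AT-rich values are the ones quoted in the question.

```r
# Convert a count matrix (PFM) to log-odds scores under a given background.
# Assumption: a standard log2 probability-ratio form, not necessarily the
# exact computation used by any particular package.
pfm_to_logodds <- function(pfm, bg, pseudocount = 0.8) {
  bg <- bg[rownames(pfm)] / sum(bg)   # align to matrix rows and normalise
  n <- colSums(pfm)                   # number of sites per column
  # Background-weighted pseudocount, then column-wise normalisation
  p <- sweep(pfm + bg * pseudocount, 2, n + pseudocount, "/")
  log2(p / bg)
}

# Toy 3-column motif; each column sums to 10 sites
pfm <- matrix(c(8, 1, 0, 1,
                0, 9, 1, 0,
                1, 0, 9, 0),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

uniform <- c(A = 0.25, C = 0.25, G = 0.25, T = 0.25)
at_rich <- c(A = 0.32, C = 0.18, G = 0.18, T = 0.32)

pwm_u  <- pfm_to_logodds(pfm, uniform)
pwm_at <- pfm_to_logodds(pfm, at_rich)

# Per-cell score shift caused by changing only the background
print(round(pwm_at - pwm_u, 2))
```

Under this formulation the shift is not limited to the pseudocount term: G and C matches score higher and A and T matches score lower against the AT-rich background, so a relative threshold such as "80%" may select somewhat different hits.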
