CENTIPEDE integrates experimental evidence with prior information to determine whether a particular genome location is bound by a transcription factor (or other DNA-binding protein). The X matrix holds the experimental evidence, for example the cuts inferred from DNaseI-seq. The Y matrix holds the prior information, including how well that region matches the TF binding site (the score obtained from matching the TF's PWM to that position) and the conservation of that genomic position (obtained, e.g., from phastCons scores). I think that is what NRSF_Anno[, 5] and NRSF_Anno[, 6] represent. I remember the documentation was a bit confusing, but I don't have it with me at the moment to check this in more detail.
EDIT
I took a look at the package; here are the first rows of NRSF_Anno:
head(NRSF_Anno)
  chrom hg18Start hg18End Strand PWMscore ConsScore TSSdist
1  chr1     90336   90356      - 16.69222   0.03875   31393
2  chr1    141061  141081      + 19.73801   0.18760   82118
3  chr1    236650  236670      - 16.69222   0.02165  120861
4  chr1    398305  398325      + 16.69222   0.29235   40794
5  chr1    571868  571888      - 16.69222   0.10220   40019
6  chr1    676751  676771      + 19.73801   0.05410   64864
As you can see, NRSF_Anno[, 5] is the PWMscore and NRSF_Anno[, 6] is the ConsScore (conservation score). In their paper the authors also used the distance to the TSS (TSSdist) in the model.
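To make the column mapping concrete, here is a minimal sketch that rebuilds the six rows shown above as a plain data frame and assembles a Y matrix from them. The leading column of ones (an intercept) and the `Xlist`/`Y` argument names follow the package's example usage as I remember it, so treat them as assumptions to check against the documentation; `my_cuts` is a hypothetical cut-count matrix with one row per site.

```r
# The six rows printed by head(NRSF_Anno), re-entered by hand so this runs
# without the package installed.
anno <- data.frame(
  chrom     = rep("chr1", 6),
  hg18Start = c(90336, 141061, 236650, 398305, 571868, 676751),
  hg18End   = c(90356, 141081, 236670, 398325, 571888, 676771),
  Strand    = c("-", "+", "-", "+", "-", "+"),
  PWMscore  = c(16.69222, 19.73801, 16.69222, 16.69222, 16.69222, 19.73801),
  ConsScore = c(0.03875, 0.18760, 0.02165, 0.29235, 0.10220, 0.05410),
  TSSdist   = c(31393, 82118, 120861, 40794, 40019, 64864),
  stringsAsFactors = FALSE
)

# Prior matrix: intercept, PWM score (column 5), conservation (column 6).
# One row per candidate site, in the same order as the rows of the X matrix.
Y <- cbind(Intercept = 1, PWMscore = anno[, 5], ConsScore = anno[, 6])

# Assumed call shape (not run here; my_cuts is hypothetical):
# library(CENTIPEDE)
# fit <- fitCentipede(Xlist = list(DNase = as.matrix(my_cuts)), Y = Y)
```

Adding TSSdist as a fourth column would work the same way if you want to include it in the prior.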
EDIT 2
A useful source of information on CENTIPEDE usage might be this tutorial on GitHub.
NOTE
This is off-topic but might be useful for others interested in this package. It seems CENTIPEDE can no longer be installed in R 3.3.2:
install.packages("CENTIPEDE", repos="http://R-Forge.R-project.org", type = "source")
Warning in install.packages :
package ‘CENTIPEDE’ is not available (for R version 3.3.2)
I solved this by downloading the source from the SVN repository (from here) and creating an empty file named NAMESPACE in the root of the package. The package can then be installed properly.
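The workaround could be scripted roughly like this. It is only a sketch: `pkg_dir` is a hypothetical path standing in for wherever you checked out the package source (a temporary directory is used here so the snippet runs on its own), and the install call is left commented out.

```r
# Hypothetical location of the package source checked out from SVN.
pkg_dir <- file.path(tempdir(), "CENTIPEDE")
dir.create(pkg_dir, showWarnings = FALSE, recursive = TRUE)

# Create an empty NAMESPACE file in the package root if it is missing.
ns_file <- file.path(pkg_dir, "NAMESPACE")
if (!file.exists(ns_file)) {
  file.create(ns_file)
}

# Then install from the local source directory (not run here):
# install.packages(pkg_dir, repos = NULL, type = "source")
```

One caveat: an empty NAMESPACE declares no exports, so if a function is not visible after library(CENTIPEDE) you may need to call it as CENTIPEDE:::fitCentipede(...).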
Thanks very much for the information. However, I guess what I am confused by is where NRSF_Anno even comes from. I obtained a single matrix from bwtool that shows the coverage in my DNase bigWigs and how it is oriented around the center of the MEME motif output. So I am just trying to figure out what to build the annotation file from. Thanks again.
Rob. (I was following instructions pulled from a paper; I do have the phyloP bigWig but am not sure how to integrate it.)
These count matrices were then used by CENTIPEDE along with conservation levels at corresponding positions (phyloP score from the placental subset of the UCSC 60-way genome alignment; Karolchik et al., 2014) to learn motif-specific models of Tn5 insertion density and predict the likelihood that each motif instance across the genome is bound. We used sites predicted with greater than 95% posterior probability to be occupied as our footprint set.
The following is from the Genome Research paper describing CENTIPEDE and is how they arrive at the NRSF_Anno file. I am just not sure how to pull the data out once I have the PWM matches (that is, the MEME motif positions in mm10 / my open chromatin data):
For each candidate, we extracted genomic information that would be included in the model prior: sequence conservation (Pollard et al. 2010); quality of the PWM match; and distance to the nearest transcription start site; as well as experimental data in a 200–400-bp window around the site to be used in the likelihood—DNase I sensitivity and ChIP-seq data on seven histone modifications, all from LCLs.
The idea behind CENTIPEDE, as I understand it, is that you start with the predicted locations of the DNA binding sites for one or more TFs. From those locations you then build both the X and the Y matrix. For the X matrix you use the experimental evidence from DNase-seq (or ATAC-seq, or even histone marks). For the Y matrix you use the prior information. For example, the PWMscore associated with a binding location is the score you obtained from MEME. For the conservation you have one value per nucleotide; what I did (if I recall correctly) was to compute the mean conservation over that location. For the distance you would compute the distance to every TSS on the chromosome and keep the closest one, and so on.
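As a rough illustration of those steps, here is a base-R sketch on toy data. Everything in it is made up for the example: the site coordinates, the PWM scores, the per-base conservation vector `cons` (in practice you would extract these values from your phyloP bigWig, one chromosome at a time) and the TSS positions `tss`. Taking the distance from the site midpoint is also my assumption.

```r
# Toy motif hits on one chromosome (hypothetical, 1-based inclusive coordinates).
sites <- data.frame(
  chrom    = c("chr1", "chr1", "chr1"),
  start    = c(11, 46, 71),
  end      = c(20, 55, 80),
  PWMscore = c(12.5, 9.8, 15.1),
  stringsAsFactors = FALSE
)

# Toy per-base conservation for positions 1..100 of chr1:
# positions 1-50 score 0.1, positions 51-100 score 0.9.
cons <- rep(c(0.1, 0.9), each = 50)

# Toy TSS positions on the same chromosome.
tss <- c(5, 60)

# Mean conservation over each motif location.
sites$ConsScore <- mapply(function(s, e) mean(cons[s:e]),
                          sites$start, sites$end)

# Distance from each site midpoint to the closest TSS.
mid <- (sites$start + sites$end) / 2
sites$TSSdist <- sapply(mid, function(m) min(abs(tss - m)))

# Prior matrix with an intercept column, one row per site.
Y <- cbind(Intercept = 1,
           PWMscore  = sites$PWMscore,
           ConsScore = sites$ConsScore)
```

With several chromosomes you would repeat the conservation and TSS steps per chromosome, keeping the rows in the same order as the rows of the coverage matrix you got from bwtool.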