Question

Tool: pytfmpval: Efficient, accurate p-value computation for position weight matrices

7.2 years ago

jared.andrews07 ★ 19k

Overview

pytfmpval is a Python package that wraps the excellent TFM-pvalue program for convenient, high-throughput use. I found this program to be one of the better ones for motif thresholding. There is also an R package for it.

It lets users determine the log-likelihood ratio score threshold that corresponds to a specific p-value for a given transcription factor position frequency matrix. Naturally, it can also do the reverse, quickly calculating an accurate p-value from a score for a given motif matrix. This is useful for setting thresholds for TF motifs before scanning for them, or for assessing whether genetic variants perturb or create them.
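To make the two directions concrete, here is a minimal brute-force sketch of what a score and its p-value mean for a tiny motif. The 3-position count matrix, the pseudocount of 1, the uniform background, the natural log, and the helper names are all illustrative assumptions. TFM-pvalue uses a much faster algorithm and its own conventions, so these numbers will not match pytfmpval output.

```python
import itertools
import math

BASES = "ACGT"
BACKGROUND = 0.25  # assumed uniform background frequency

# Hypothetical 3-position count matrix, rows in A, C, G, T order.
counts = [
    [8, 0, 1],  # A
    [0, 9, 1],  # C
    [1, 0, 7],  # G
    [1, 1, 1],  # T
]

def log_odds(counts, pseudocount=1.0):
    """Convert counts to a log-likelihood ratio matrix (natural log)."""
    width = len(counts[0])
    pwm = [[0.0] * width for _ in BASES]
    for j in range(width):
        total = sum(row[j] for row in counts) + 4 * pseudocount
        for i in range(4):
            freq = (counts[i][j] + pseudocount) / total
            pwm[i][j] = math.log(freq / BACKGROUND)
    return pwm

def score(pwm, seq):
    """Sum the per-position log-odds values for one sequence."""
    return sum(pwm[BASES.index(base)][j] for j, base in enumerate(seq))

def pvalue(pwm, threshold):
    """Fraction of all equally likely k-mers scoring >= threshold."""
    width = len(pwm[0])
    hits = sum(
        1 for kmer in itertools.product(BASES, repeat=width)
        if score(pwm, kmer) >= threshold
    )
    return hits / 4 ** width

pwm = log_odds(counts)
best = score(pwm, "ACG")
print(best, pvalue(pwm, best))
```

Only ACG reaches the top score in this toy matrix, so the p-value at that threshold is 1/64. Exhaustive enumeration like this grows as 4^width, which is why a dedicated algorithm is needed for real motifs.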

The package is hosted on PyPI and is easily installed with pip install pytfmpval.

Usage Example

JASPAR is a popular transcription factor motif database from which motif count matrices can be downloaded for a large variety of organisms and transcription factors. There are numerous other motif databases as well (TRANSFAC, CIS-BP, MEME, HOMER, WORMBASE, etc.), most of which use formats that are similar but different enough to be annoying. Typically, a motif file consists of four rows or columns, with each position in a given row or column corresponding to a base within the motif. Sometimes there is a header line starting with >. The row or column order is always A, C, G, T. In this example, the motif consists of four rows covering the 16 positions of the motif, with counts for each base at each position. We first calculate the score threshold for a chosen p-value, then show that this score maps back to approximately the same p-value.

>>> from pytfmpval import tfmp
>>> m = tfmp.create_matrix("MA0045.pfm")
>>> tfmp.pval2score(m, 0.00001)
8.773708000000001
>>> tfmp.score2pval(m, 8.7737)
9.992625564336777e-06

This could also be done by building a string for the matrix by concatenating its rows (or columns) and passing it to the read_matrix() function. This approach is sometimes easier, as it lets the user parse the motif file however is necessary to ensure proper input.

>>> from pytfmpval import tfmp
>>> mat = (" 3  7  9  3 11 11 11  3  4  3  8  8  9  9 11  2"
...        " 5  0  1  6  0  0  0  3  1  4  5  1  0  5  0  7"
...        " 4  3  1  4  3  2  2  2  8  6  1  4  2  0  3  0"
...        " 2  4  3  1  0  1  1  6  1  1  0  1  3  0  0  5"
...       )
>>> m = tfmp.read_matrix(mat)
>>> tfmp.pval2score(m, 0.00001)
8.773708000000001
>>> tfmp.score2pval(m, 8.7737)
9.992625564336777e-06
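The parsing step mentioned above could look something like the following sketch. It assumes a row-oriented, JASPAR-style file that may contain a > header line, row labels, and brackets. The function name motif_text_to_string and the example motif are hypothetical, and other databases will need their own handling.

```python
def motif_text_to_string(text):
    """Flatten JASPAR-style motif text into one whitespace-separated string.

    Skips '>' header lines and drops row labels and brackets such as
    'A  [ 3 7 9 ]', keeping only the numeric tokens in their original order.
    """
    values = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        for token in line.replace("[", " ").replace("]", " ").split():
            try:
                float(token)
            except ValueError:
                continue  # row label such as "A" or "C"
            values.append(token)
    if len(values) % 4 != 0:
        raise ValueError("expected four equal-length rows or columns")
    return " ".join(values)

# Hypothetical motif text, not a real JASPAR entry.
example = """>MA0000.0 hypothetical
A [ 3 7 9 ]
C [ 5 0 1 ]
G [ 4 3 1 ]
T [ 2 4 3 ]
"""
mat = motif_text_to_string(example)
# mat can then be handed to tfmp.read_matrix(mat) as in the example above.
```

Keeping the parsing separate from pytfmpval makes it easy to check the flattened string by eye before building the matrix.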

Reference

Efficient and accurate P-value computation for Position Weight Matrices

H. Touzet and J.S. Varré. Algorithms for Molecular Biology 2007, 2:15

ChIP-Seq p-value motifs python pwm • 3.5k views

updated 2.4 years ago by Ram 45k • written 7.2 years ago by jared.andrews07 ★ 19k

Hi Jared, which tool would you recommend for scanning sequences for position weight matrices using the threshold that this tool provides? Thanks

6.3 years ago by Marouen Ben Guebila • 0

FIMO is very easy to use and is my general default for scanning sequences for motifs. It will require a little manual Python coding to get all of the motifs into one file (if you have several).
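A minimal sketch of that kind of glue code, assuming count matrices already loaded as four rows in A, C, G, T order and the minimal MEME motif text format as the target. The function name counts_to_meme, the uniform background, and the toy motif are illustrative assumptions, so check the output against the MEME Suite format documentation before relying on it.

```python
def counts_to_meme(motifs, background=(0.25, 0.25, 0.25, 0.25)):
    """Render {name: 4 count rows in A, C, G, T order} as minimal MEME text."""
    lines = [
        "MEME version 4", "",
        "ALPHABET= ACGT", "",
        "strands: + -", "",
        "Background letter frequencies",
        "A {:.3f} C {:.3f} G {:.3f} T {:.3f}".format(*background), "",
    ]
    for name, rows in motifs.items():
        width = len(rows[0])
        nsites = sum(row[0] for row in rows)  # column total of position 1
        lines.append(f"MOTIF {name}")
        lines.append(
            f"letter-probability matrix: alength= 4 w= {width} "
            f"nsites= {nsites} E= 0"
        )
        # One output line per motif position, probabilities in A, C, G, T order.
        for j in range(width):
            total = sum(row[j] for row in rows)
            lines.append(" ".join(f"{row[j] / total:.6f}" for row in rows))
        lines.append("")
    return "\n".join(lines)

# Hypothetical two-position motif used only to show the output shape.
text = counts_to_meme({"toy": [[3, 7], [5, 0], [4, 3], [2, 4]]})
print(text)
```

Writing the returned text to a single file gives one input containing every motif passed in the dictionary.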

6.3 years ago by jared.andrews07 ★ 19k

Hi, Jared. I have tried your Python package and obtained the threshold score. I am a freshman; can you give me some suggestions for calculating a sequence match score? Thanks!