You should try MOODS: it's a suite of algorithms for matching position weight matrices (PWMs) against DNA sequences. I use it daily; it's a very good piece of software, written in C++ with a Python interface (a simple import MOODS and off you go!). The most difficult step is converting your query sequences into PWMs. Here is the function I use:
def primer2pwm(primer):
    """
    Write a primer sequence as a position weight matrix.
    """
    # Create 4 lists (rows A, C, G, T) of length equal to the primer's length.
    matrix = [[0] * len(primer) for _ in range(4)]
    # IUPAC code -> nucleotides it stands for.
    IUPAC = {
        "A" : ["A"],
        "C" : ["C"],
        "G" : ["G"],
        "T" : ["T"],
        # U has no row of its own in the matrix, so treat it as T.
        "U" : ["T"],
        "R" : ["G", "A"],
        "Y" : ["T", "C"],
        "K" : ["G", "T"],
        "M" : ["A", "C"],
        "S" : ["G", "C"],
        "W" : ["A", "T"],
        "B" : ["C", "G", "T"],
        "D" : ["A", "G", "T"],
        "H" : ["A", "C", "T"],
        "V" : ["A", "C", "G"],
        "N" : ["A", "C", "G", "T"]
    }
    # Row of each nucleotide in the PWM.
    dico = {"A" : 0, "C" : 1, "G" : 2, "T" : 3}
    # Read each IUPAC letter in the primer (upper case expected;
    # an unknown letter raises KeyError).
    for index, letter in enumerate(primer):
        for nuc in IUPAC[letter]:
            i = dico[nuc]
            matrix[i][index] = 1
    return matrix
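
To make it clearer what this 0/1 matrix means, here is a naive pure-Python scan, offered only as a rough sketch. It is not MOODS's algorithm or API (MOODS is much faster and has its own calls); the names scan_with_pwm and max_mismatches are made up for this illustration. Because every allowed nucleotide scores 1, a window's score is simply the number of positions that agree with the primer, so lowering the threshold below the primer length tolerates that many mismatches.

```python
def scan_with_pwm(sequence, matrix, max_mismatches=0):
    """Return (start, score) for each window that agrees with the 0/1 matrix.

    Hypothetical helper for illustration only; not part of MOODS.
    """
    rows = {"A": 0, "C": 1, "G": 2, "T": 3}
    width = len(matrix[0])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        score = 0
        for offset, base in enumerate(window):
            row = rows.get(base)
            # Bases outside A/C/G/T (e.g. N) simply score 0 here.
            if row is not None:
                score += matrix[row][offset]
        # A perfect match scores `width`; each mismatch costs 1.
        if score >= width - max_mismatches:
            hits.append((start, score))
    return hits


# Matrix for the primer "ACGY", written out by hand (rows: A, C, G, T).
pwm = [
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
exact = scan_with_pwm("TTACGTAACGCAAGGC", pwm)
approx = scan_with_pwm("TTACGTAACGCAAGGC", pwm, max_mismatches=1)
print(exact)   # [(2, 4), (7, 4)]
print(approx)  # [(2, 4), (7, 4), (12, 3)]
```

This only scans the forward strand; for the reverse strand you would scan the reverse complement of the sequence (or of the primer) as well.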
Hi Abhi,
Were you able to implement approximate searching?
Thanks, Jitendra