5.6 years ago
bioinfo1234
Hi, I would like to find all occurrences of a nucleotide pattern in a multi-FASTA genome, allowing up to n mismatches in one part of the pattern and up to m mismatches in the rest.
For example:
AGCAGCATAGCAGCAAGCAGT[up to 4 mismatches]GCAGACGCA[UP TO 2 MISMATCHES]
Does anyone know how to search for this type of complex pattern? Is there an existing tool, or a Perl/Python script? Support for ambiguity symbols such as N and R would also be much needed.
Thanks for posting your answers.
Hi, I'm not aware of any tool that does this out of the box (I would be happy to be corrected).
I have two suggestions:
1) Implement a fast scoring function and slide it along the sequence, counting mismatches in each part of the pattern separately.
2) Use a regex library that supports approximate (fuzzy) matching, e.g. https://pypi.org/project/regex/
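
To illustrate suggestion 1, here is a minimal pure-Python sketch, not a tuned implementation. It assumes that the two parts of the pattern are directly adjacent, that only substitutions count (no insertions or deletions), and that only the forward strand is searched. The names `IUPAC`, `compatible`, `count_mismatches`, `find_two_part` and `read_fasta` are made up for this example.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(p, s):
    """True if pattern symbol p and sequence symbol s share at least one base."""
    return bool(set(IUPAC.get(p, p)) & set(IUPAC.get(s, s)))


def count_mismatches(pattern, window):
    """Number of positions where pattern and window are not compatible."""
    return sum(not compatible(p, s) for p, s in zip(pattern, window))


def find_two_part(seq, part1, max1, part2, max2):
    """Yield (start, mismatches1, mismatches2) for every window in which
    part1 matches with <= max1 mismatches and the directly following
    part2 matches with <= max2 mismatches (assumption: no gap, no indels)."""
    seq, part1, part2 = seq.upper(), part1.upper(), part2.upper()
    n1, n2 = len(part1), len(part2)
    for i in range(len(seq) - n1 - n2 + 1):
        m1 = count_mismatches(part1, seq[i:i + n1])
        if m1 > max1:
            continue  # first part already over budget, skip the second
        m2 = count_mismatches(part2, seq[i + n1:i + n1 + n2])
        if m2 <= max2:
            yield i, m1, m2


def read_fasta(lines):
    """Yield (name, sequence) records from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

For the pattern in the question, you would call `find_two_part(seq, "AGCAGCATAGCAGCAAGCAGT", 4, "GCAGACGCA", 2)` on each record returned by `read_fasta(open("genome.fa"))`, where `genome.fa` is a placeholder file name. To cover the reverse strand, you could run the same search on the reverse complement of each record. This scan is O(sequence length × pattern length), so it may be slow on large genomes; for suggestion 2, the `regex` module's fuzzy syntax such as `(?:...){s<=4}` (substitutions only) is probably the place to start, though I have not checked how it handles ambiguity codes.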