Hello,
I need to search for a pattern within a multiple sequence alignment, allowing any number of - or . symbols to be included between the characters of the pattern. For example, I want to search for the string pattern RAGTLQYD (see bold characters) within the alignment below, and to do so I have to ignore any number of - and . symbols that appear between the characters of the pattern. Also, I want to print out the position in the alignment where the first character of the pattern is located. So far I got to this:
from re import search, IGNORECASE
import pandas as pd
df1 = pd.read_csv(multiple_sequence_alignment_file, delimiter="\t")
matchseq = pd.read_csv(file_of_patterns)  # all the patterns I want to search
for seq in matchseq:
    if search(seq, df1, IGNORECASE):
        print(seq, df1)
This works only for the patterns that do not have any - or . symbols in between. I couldn't find in the re.search manual how to specify to ignore some characters in the search. Any guidance would be really helpful.
-..-------------------------------HSLKYDKLYS.SKN..SLCYVLLIWLLTLAAVLPNLRAGTL.--.. QYDPR........IYSCTFAQSV..........SSAYTIAVVVFHFLV.PMIIVIFCYLRIWILVLQV-----------.
Here is the regex that works with bash. Try building this regex in Python:
But if you are handling biological sequences, I would recommend using an established bio library such as Biopython.
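One possible way to build such a pattern in Python is to put an optional run of gap characters between every pair of residues in the motif. This is only a sketch: the function names gapped_pattern and find_motif are made up for illustration, the aligned string is a shortened toy example rather than the full alignment from the question, and it assumes each aligned sequence is available as a plain Python string (not a DataFrame).

```python
import re

def gapped_pattern(motif, gap_chars="-."):
    """Compile a regex matching `motif` with any number of gap characters
    allowed between consecutive residues (hypothetical helper)."""
    # Character class of gap symbols, repeated zero or more times.
    gap = "[" + re.escape(gap_chars) + "]*"
    # Escape each residue and join them with the optional gap run.
    return re.compile(gap.join(re.escape(c) for c in motif), re.IGNORECASE)

def find_motif(motif, aligned_seq):
    """Return the 1-based alignment column of the motif's first residue,
    or None if the motif is not found (hypothetical helper)."""
    match = gapped_pattern(motif).search(aligned_seq)
    return match.start() + 1 if match else None

# Toy aligned sequence, shortened for illustration.
aligned = "--HSL.RAG-TL.--..QYDPR.."
print(find_motif("RAGTLQYD", aligned))  # 7
```

The returned position counts alignment columns, gaps included, so it refers to the column in the alignment rather than the residue number in the ungapped sequence. If the alignment text contains stray whitespace, that would need to be stripped first or added to gap_chars.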