Search for pattern within multiple sequence alignment
3.5 years ago
jmungar2 ▴ 10

Hello,

I need to search for a pattern within a multiple sequence alignment, allowing any number of - or . symbols between the characters of the pattern. For example, I want to search for the string pattern RAGTLQYD (see bold characters) within the alignment below, and to do so I have to ignore any number of - and . symbols that appear between the characters of the pattern. I also want to print the position in the alignment where the first character of the pattern is located. So far I have this:

from re import search, IGNORECASE
import pandas as pd

df1 = pd.read_csv(multiple_sequence_alignment_file, delimiter = "\t")
matchseq = pd.read_csv(file_of_patterns) # all the patterns I want to search
for seq in matchseq:
    if search(seq, df1, IGNORECASE):
        print(seq, df1)

This works only for the patterns that do not have any - or . symbols in between. I couldn't find in the re.search manual how to specify to ignore some characters in the search. Any guidance would be really helpful.

-..-------------------------------HSLKYDKLYS.SKN..SLCYVLLIWLLTLAAVLPNLRAGTL.--.. QYDPR........IYSCTFAQSV..........SSAYTIAVVVFHFLV.PMIIVIFCYLRIWILVLQV-----------.

python search regex sequence-alignment pandas • 762 views

Here is a regex that works with grep in bash. Try building the same regex in Python:

$ cat test.txt 
PARCTR-----.........A...G-T.......LQYDRDTCG
MRDTCR..A..G..T..lq....--YdRAGTLQYD

$ grep -iPo R\[-.\]\*A\[-.\]\*G\[-.\]\*T\[-.\]\*L\[-.\]\*Q\[-.\]\*Y\[-.\]\*D test.txt
R-----.........A...G-T.......LQYD
R..A..G..T..lq....--Yd
RAGTLQYD

But if you are handling biological sequences, I would recommend using established bio libraries such as Biopython.
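
For reference, a minimal Python sketch of the same regex idea, assuming each aligned row is available as a plain string. The helper names `build_gapped_regex` and `find_gapped` are made up for illustration and are not part of any library. It puts `[-.]*` between the pattern characters and reports the 1-based alignment column of the first matched character:

```python
import re

GAP_CHARS = "-."  # gap symbols tolerated between pattern characters


def build_gapped_regex(pattern, gap_chars=GAP_CHARS):
    """Compile a case-insensitive regex that allows gap symbols between characters."""
    gap = "[" + re.escape(gap_chars) + "]*"
    # Escape every pattern character, then join them with the optional-gap class.
    return re.compile(gap.join(re.escape(ch) for ch in pattern), re.IGNORECASE)


def find_gapped(pattern, aligned_seq):
    """Return (1-based start column, matched text) for every non-overlapping hit."""
    regex = build_gapped_regex(pattern)
    return [(m.start() + 1, m.group()) for m in regex.finditer(aligned_seq)]


# The two test rows from the grep example above.
rows = [
    "PARCTR-----.........A...G-T.......LQYDRDTCG",
    "MRDTCR..A..G..T..lq....--YdRAGTLQYD",
]

for row in rows:
    for start, text in find_gapped("RAGTLQYD", row):
        print(start, text)
```

The start column counts gap symbols too, so it is a position in the alignment rather than in the ungapped sequence. To check many patterns, loop over them and call `find_gapped` on each row of the alignment.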
