Regex motif question - 2 or more residues out of XXXX are D/E?
10.5 years ago

I would like to use regular expressions to identify a motif in an amino acid sequence. Part of the motif is described as '2 or more out of XXXX are D or E'. Is there a way to specify this part directly with a regular expression, instead of writing out all the alternatives or using a more iterative approach?

I'm actually using this in the find box of my editor (Sublime Text), which accepts regex (I'm not sure which regex flavour or extensions it supports). Otherwise, I would implement this with Perl regular expressions.

Thanks!

edit: changed title slightly

edit: changed question to include or more.

motif regular-expressions perl • 3.0k views

What makes you think a regular expression captures such a soft rule? There's not much regular about it. Regex are for phone numbers and email addresses. This could be solved quickly with a sweep procedure looking at all 4-mers along the sequence.
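A sweep like the one suggested here could look something like the following minimal Perl sketch. The sub name `de_rich_windows` and its parameters are made up for illustration. It slides a window along the sequence and counts D/E residues in each window with `tr`.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based start positions of every window
# of length $width that contains at least $min residues that are D or E.
sub de_rich_windows {
    my ($seq, $width, $min) = @_;
    $width //= 4;
    $min   //= 2;
    my @starts;
    for my $i (0 .. length($seq) - $width) {
        my $window = substr($seq, $i, $width);
        # tr/// with an empty replacement list only counts the listed
        # characters; it leaves $window unchanged.
        my $count = ($window =~ tr/DE//);
        push @starts, $i if $count >= $min;
    }
    return @starts;
}

my $seq = 'XXDXXDXXXEDX';
print join(',', de_rich_windows($seq)), "\n";    # prints 2,7,8
```

The three hits are the windows DXXD, XXED and XEDX. Overlapping windows are reported separately, which may or may not be what you want downstream.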


I agree, this problem (N out of M == X) can't be solved with a regular expression unless you use a regex that enumerates all possible placements. For example, for "2+ out of 4 == A", six alternatives are enough, because . also matches A and so covers the windows with three or four As:

/AA..|A.A.|A..A|.AA.|.A.A|..AA/
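Writing the alternatives out by hand is easy to get wrong, so the enumeration could also be generated. The following is only a sketch: `at_least_regex` is a hypothetical helper, not something from a module. It builds one branch per way of choosing which positions must match, and leaves the other positions as `.`.

```perl
use strict;
use warnings;

# Hypothetical helper: build an alternation with one branch per way of
# placing $min required residues in a window of $width positions.
# Because '.' also matches the required residues, the branches for
# "exactly $min" already cover windows with more hits.
sub at_least_regex {
    my ($class, $width, $min) = @_;
    my @branches;
    # Enumerate position subsets as bitmasks; keep those with $min bits set.
    for my $mask (0 .. 2**$width - 1) {
        my @bits = map { ($mask >> $_) & 1 } 0 .. $width - 1;
        my $set  = grep { $_ } @bits;    # number of required positions
        next unless $set == $min;
        push @branches, join '', map { $_ ? $class : '.' } @bits;
    }
    my $alternation = join '|', @branches;
    return qr/(?:$alternation)/;
}

my $re = at_least_regex('[DE]', 4, 2);
print "match\n" if 'AXDXEXA' =~ $re;    # window DXEX contains D and E
```

For `at_least_regex('[DE]', 4, 2)` this yields six branches, one per pair of positions. The number of branches grows as "M choose N", so this is only practical for small windows.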

I think I've figured it out now, using lookahead (?=pattern) to link two regular expressions like an AND:

(?=.?[DE]{1,4}.?[DE]{1,2}.?).{4}

The first part (the lookahead, in parentheses) requires that the text starting at the current position contains at least two Ds or Es, which may have other characters before, after or between them. The second part (after the parentheses) then consumes exactly four characters.

EDIT: PLUS an alternate with two wildcard characters in the middle

(?=.?[DE]{1,4}.?[DE]{1,2}.?|[DE]..[DE]).{4}

I'm not sure how this would deal with overlapping motifs (I only came across regular expressions recently) but this is adequate for my needs now.
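On overlapping motifs: with a global match, the regex engine resumes after the end of the previous match, so a second window that overlaps the first is skipped. A common workaround is to make the whole match zero-width by putting the pattern inside a lookahead and capturing it. The sketch below assumes the six-branch enumeration of "at least 2 of 4 are D or E" rather than the expression above, and `overlapping_hits` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, 4-mer] pairs for every qualifying
# window, including windows that overlap each other.
sub overlapping_hits {
    my ($seq) = @_;
    # One branch per pair of positions; '.' also matches D and E.
    my $window = qr/[DE][DE]..|[DE].[DE].|[DE]..[DE]|.[DE][DE].|.[DE].[DE]|..[DE][DE]/;
    my @hits;
    # The lookahead consumes nothing, so the next attempt starts one
    # character further along instead of after the captured 4-mer.
    while ($seq =~ /(?=($window))/g) {
        push @hits, [ $-[0], $1 ];    # $-[0] is the start of the match
    }
    return @hits;
}

for my $hit (overlapping_hits('XDDXXXEXD')) {
    printf "%d\t%s\n", @$hit;
}
```

For `XDDXXXEXD` this reports XDDX at position 0, DDXX at position 1 and XEXD at position 5. The first two overlap.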


Unfortunately this is incorrect. You can test your regex like so (it reads one sequence per line from standard input and prints 1 for a match, 0 otherwise):

perl -ne ' @x = /(?=.?[DE]{1,4}.?[DE]{1,2}.?).{4}/; print scalar @x,"\n"; '

It fails on patterns such as DXXD, DXXDX, XDXXDX, etc.
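Beyond spot checks, one could compare a candidate expression against a plain count over every 4-mer of a small alphabet. This is a sketch only: `disagreements` is a hypothetical helper, and the three-letter alphabet D, E, X is an assumption that stands in for "D, E, or anything else".

```perl
use strict;
use warnings;

# Hypothetical helper: list every 4-mer over the alphabet D, E, X on which
# the candidate regex and a direct D/E count disagree.
sub disagreements {
    my ($re) = @_;
    my @alphabet = qw(D E X);
    my @bad;
    for my $n (0 .. 3**4 - 1) {
        # Write $n in base 3 to pick the four residues.
        my $kmer = join '', map { $alphabet[ int($n / 3**$_) % 3 ] } 0 .. 3;
        my $expected = ($kmer =~ tr/DE//) >= 2 ? 1 : 0;    # ground truth
        my $got      = $kmer =~ $re ? 1 : 0;
        push @bad, $kmer if $got != $expected;
    }
    return @bad;
}

# The expression under test in the one-liner above, copied verbatim.
my $posted = qr/(?=.?[DE]{1,4}.?[DE]{1,2}.?).{4}/;
print join(' ', disagreements($posted)), "\n";
```

Each 4-mer is tested in isolation, which is stricter than searching a longer sequence, where a neighbouring window may still produce a match.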


Thanks! Very observant.

I've now added an inelegant alternative that mops those up :(


Try Perl's transliteration operator:

use strict;
use warnings;

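while ( my $string = <DATA> ) {
    chomp $string;
    # tr/// with an empty replacement list counts the listed characters
    # (here over the whole line, not per 4-residue window)
    my $count = $string =~ tr/deDE//;
    # flag lines that contain two or more D/E residues
    my $twoPlus = $count > 1 ? '*' : '';
    print "$string: $count$twoPlus\n";
}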

__DATA__
XXXXXXXXXX
DDXXXXXXDX
XXEXXXDXEX
XXDXXXXXXX
DEDEDEDEDE
DXXXXXXXXE

Output:

XXXXXXXXXX: 0
DDXXXXXXDX: 3*
XXEXXXDXEX: 3*
XXDXXXXXXX: 1
DEDEDEDEDE: 10*
DXXXXXXXXE: 2*

Hope this helps!
