Question

Perl Regex / Substitution For Masking Sequence

6

Entering edit mode

14.3 years ago

Harry Palmer ▴ 90

Hi all,

I can't figure out how to get perl to substitute the same number of characters AND at the same position as were matched by a regex.

For example, the string "AAAAAAANNNNNNNNNNAAANNNNNNNNNNN";

I want to mask out any contiguous sequence of "A" less than 5 long, where ever they appear.

Thanks for all help, Harry

perl • 6.7k views

ADD COMMENT • link updated 13.9 years ago by Neilfws 49k • written 14.3 years ago by Harry Palmer ▴ 90

Ram · Answer 1 · 2010-08-16

4

Entering edit mode

14.3 years ago

Jts ★ 1.4k

There is probably a cleaner way to do it but this is what I came up with:

perl -pe "s/([^A])(A{1,4})([^A])/\1\L\2\E\3/g"

Here is the output for the example you provided:

echo "AAAAAAANNNNNNNNNNAAANNNNNNNNNNN" | perl -pe "s/([^A])(A{1,4})([^A])/\1\L\2\E\3/g"
AAAAAAANNNNNNNNNNaaaNNNNNNNNNNN

The first part matches any non-A character, then at most 4 As followed by another non-A character. All three matches are captured in groups 1, 2, 3. The second part prints out the matches after converting the second match (the run of As) to lower case (using L and E).

Edit: As Michael and Lars point out, there are a few corner cases. It might be simplest to first convert all the As to lower case, then use a much simpler regex to convert the long runs of a back to upper case:

echo "AAAANANANANANAAAAAAANNAAAA" | perl -pe "s/A/a/g" | perl -pe "s/a{5,}/\U$&\E/g"
aaaaNaNaNaNaNAAAAAAANNaaaa

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 14.3 years ago by Jts ★ 1.4k

1

Entering edit mode

This works, except for the case of trailing As. See below.

ADD REPLY • link 14.3 years ago by Michael 55k

0

Entering edit mode

This will also not work with leading As or for cases like NANANAN where only the first and the third instance will be replaced. See below.

ADD REPLY • link 14.3 years ago by Lars Juhl Jensen 11k

Ram · Answer 2 · 2010-08-17

4

Entering edit mode

14.3 years ago

Michael Kuhn 5.0k

After programming more Python than Perl, I'd propose to first isolate all the A's (split) and then decide if a string of A's is <5 chars. The map statement applies the regex to each list item. At the end, glue everything together.

#!/usr/bin/perl

use strict;
use warnings;

my $s = "AAANNNNNNNNNNAAANNNNANNNNNNAAAAAAAAAAANA";
my $t = join "", map { s/^(A{1,4})$/\L$1\E/; $_ } split(/(A+)/, $s);

print "$s\n$t\n";

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 14.3 years ago by Michael Kuhn 5.0k

0

Entering edit mode

Seeing that the logic is written backwards, I wonder how this would like in Ruby. Could anyone please translate it?

ADD REPLY • link 14.3 years ago by Michael Kuhn 5.0k

0

Entering edit mode

I like this solution too. And it's guaranteed to be correct.

ADD REPLY • link 14.3 years ago by Michael 55k

0

Entering edit mode

Good solution. I'll add a Ruby version as an answer.

ADD REPLY • link 14.3 years ago by Neilfws 49k

Ram · Answer 3 · 2010-08-16

I think it's not possible to do this in a single and standard POSIX regexp because you have to count and take the result for the replacement. So I take as a kind-of code golf using extended perl regular expressions. The perldoc says this is experimental, and it might be low performance, but:

s/(A+)(?{$r = (length($1) > 4) ? $1 : lc($1)})/$r/g

And here the test comparing to the answer from jts:

$_ = "AAAGAAAAANNNANNNNAANNANNNAAAAAAAAAANNCGTAAA";
s/(A+)(?{$r = (length($1) > 4) ? $1 : lc($1)})/$r/g;
print $_, "\n";
$_ = "AAAGAAAAANNNANNNNAANNANNNAAAAAAAAAANNCGTAAA";
s/([^A])(A{1,4})([^A])/\1\L\2\E\3/g;
print $_, "\n";

Output:

$ perl regexp.pl 
aaaGAAAAANNNaNNNNaaNNaNNNAAAAAAAAAANNCGTaaa
AAAGAAAAANNNaNNNNaaNNaNNNAAAAAAAAAANNCGTAAA

Ram · Answer 4 · 2010-08-18

In response to Michael's request for a Ruby version of his code:

#!/usr/bin/ruby

s = "AAANNNNNNNNNNAAANNNNANNNNNNAAAAAAAAAAANA"
t = s.split(/(A+)/).map { |e| /^(A{1,4})$/.match(e) ? e.downcase : e }.join("")
puts "#{s}\n#{t}"

Quite similar: split the string on (A+) into an array, substitute each element for lower case if it matches "AAAA", then glue back together using join.

This is a surprisingly difficult problem using regex alone; I'd be interested to see if there is a "simple" solution.

Ram · Answer 5 · 2010-08-16

You can handle both leading, trailing, and intervening stretches less than 5 long with the following regular expression substitution:

s/(^|[^A])(A{1,4})([^A]|$)/\1\L\2\E\3/g

There is one problem, though. In case of the sequence NANA only the first A will be replaced. The reason for this is that the second N is consumed by the third parenthesis in the regular expression, for which reason it cannot be matched by [^A] in the first parenthesis to produce a second match. The only solution that I can think of is to apply the regular expression twice, which is sufficient to ensure that all matches will be masked. In this case one has to remember to make sure that [^A] does not match the a produced by the first round of matching, which is done with a trivial modification:

s/(^|[^Aa])(A{1,4})([^Aa]|$)/\1\L\2\E\3/g

Applying this regular expression twice should do the job. It is a bit of a hack, I admit.

Edit: An alternative solution is to first mask all A-stretches and then subsequently unmask the ones that are 5 or longer:

s/A/a/g;
s/(a{5,})/\U$1\E/;

Ram · Answer 6 · 2010-08-18

1

Entering edit mode

14.3 years ago

Neilfws 49k

Here's a Perl solution using the special regex variables @- and @+.

#!/usr/bin/perl -w

use strict;
my $s = "AAANNNNNNNNNNAAANNNNANNNNNNAAAAAAAAAAANA";
print "$s\n";

while($s =~/(A+)/g) {
  if($+[0] - $-[0] < 5) {
    substr($s, $-[0], $+[0] - $-[0], lc($&));
  }
}

print "$s\n";

$-[0] gives the start position of the match (zero-based), $+[0] the end position. We can use those variables to calculate whether the match length is 4 or less. Then we can use them again to supply the OFFSET and LEN arguments to substr(), replacing the string of A with a by converting the match to lowercase.

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 14.3 years ago by Neilfws 49k

0

Entering edit mode

wow, I didn't know about this special variable (only about $1, $2 etc.) -- but I guess no-one knows all the Perl variables. Doc: http://perldoc.perl.org/perlvar.html

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 14.3 years ago by Michael Kuhn 5.0k