Perl Regex / Substitution For Masking Sequence
6
6
Entering edit mode
14.4 years ago
Harry Palmer ▴ 90

Hi all,

I can't figure out how to get perl to substitute the same number of characters AND at the same position as were matched by a regex.

For example, the string "AAAAAAANNNNNNNNNNAAANNNNNNNNNNN";

I want to mask out any contiguous sequence of "A" less than 5 long, where ever they appear.

Thanks for all help, Harry

perl • 6.8k views
ADD COMMENT
4
Entering edit mode
14.4 years ago
Jts ★ 1.4k

There is probably a cleaner way to do it but this is what I came up with:

perl -pe "s/([^A])(A{1,4})([^A])/\1\L\2\E\3/g"

Here is the output for the example you provided:

echo "AAAAAAANNNNNNNNNNAAANNNNNNNNNNN" | perl -pe "s/([^A])(A{1,4})([^A])/\1\L\2\E\3/g"
AAAAAAANNNNNNNNNNaaaNNNNNNNNNNN

The first part matches any non-A character, then at most 4 As followed by another non-A character. All three matches are captured in groups 1, 2, 3. The second part prints out the matches after converting the second match (the run of As) to lower case (using L and E).

Edit: As Michael and Lars point out, there are a few corner cases. It might be simplest to first convert all the As to lower case, then use a much simpler regex to convert the long runs of a back to upper case:

echo "AAAANANANANANAAAAAAANNAAAA" | perl -pe "s/A/a/g" | perl -pe "s/a{5,}/\U$&\E/g"
aaaaNaNaNaNaNAAAAAAANNaaaa
ADD COMMENT
1
Entering edit mode

This works, except for the case of trailing As. See below.

ADD REPLY
0
Entering edit mode

This will also not work with leading As or for cases like NANANAN where only the first and the third instance will be replaced. See below.

ADD REPLY
4
Entering edit mode
14.4 years ago

After programming more Python than Perl, I'd propose to first isolate all the A's (split) and then decide if a string of A's is <5 chars. The map statement applies the regex to each list item. At the end, glue everything together.

#!/usr/bin/perl

use strict;
use warnings;

my $s = "AAANNNNNNNNNNAAANNNNANNNNNNAAAAAAAAAAANA";
my $t = join "", map { s/^(A{1,4})$/\L$1\E/; $_ } split(/(A+)/, $s);

print "$s\n$t\n";
ADD COMMENT
0
Entering edit mode

Seeing that the logic is written backwards, I wonder how this would like in Ruby. Could anyone please translate it?

ADD REPLY
0
Entering edit mode

I like this solution too. And it's guaranteed to be correct.

ADD REPLY
0
Entering edit mode

Good solution. I'll add a Ruby version as an answer.

ADD REPLY
2
Entering edit mode
14.4 years ago
Michael 55k

I think it's not possible to do this in a single and standard POSIX regexp because you have to count and take the result for the replacement. So I take as a kind-of code golf using extended perl regular expressions. The perldoc says this is experimental, and it might be low performance, but:

s/(A+)(?{$r = (length($1) > 4) ? $1 : lc($1)})/$r/g

And here the test comparing to the answer from jts:

$_ = "AAAGAAAAANNNANNNNAANNANNNAAAAAAAAAANNCGTAAA";
s/(A+)(?{$r = (length($1) > 4) ? $1 : lc($1)})/$r/g;
print $_, "\n";
$_ = "AAAGAAAAANNNANNNNAANNANNNAAAAAAAAAANNCGTAAA";
s/([^A])(A{1,4})([^A])/\1\L\2\E\3/g;
print $_, "\n";

Output:

$ perl regexp.pl 
aaaGAAAAANNNaNNNNaaNNaNNNAAAAAAAAAANNCGTaaa
AAAGAAAAANNNaNNNNaaNNaNNNAAAAAAAAAANNCGTAAA
ADD COMMENT
2
Entering edit mode
14.4 years ago
Neilfws 49k

In response to Michael's request for a Ruby version of his code:

#!/usr/bin/ruby

s = "AAANNNNNNNNNNAAANNNNANNNNNNAAAAAAAAAAANA"
t = s.split(/(A+)/).map { |e| /^(A{1,4})$/.match(e) ? e.downcase : e }.join("")
puts "#{s}\n#{t}"

Quite similar: split the string on (A+) into an array, substitute each element for lower case if it matches "AAAA", then glue back together using join.

This is a surprisingly difficult problem using regex alone; I'd be interested to see if there is a "simple" solution.

ADD COMMENT
1
Entering edit mode
14.4 years ago

You can handle both leading, trailing, and intervening stretches less than 5 long with the following regular expression substitution:

s/(^|[^A])(A{1,4})([^A]|$)/\1\L\2\E\3/g

There is one problem, though. In case of the sequence NANA only the first A will be replaced. The reason for this is that the second N is consumed by the third parenthesis in the regular expression, for which reason it cannot be matched by [^A] in the first parenthesis to produce a second match. The only solution that I can think of is to apply the regular expression twice, which is sufficient to ensure that all matches will be masked. In this case one has to remember to make sure that [^A] does not match the a produced by the first round of matching, which is done with a trivial modification:

s/(^|[^Aa])(A{1,4})([^Aa]|$)/\1\L\2\E\3/g

Applying this regular expression twice should do the job. It is a bit of a hack, I admit.

Edit: An alternative solution is to first mask all A-stretches and then subsequently unmask the ones that are 5 or longer:

s/A/a/g;
s/(a{5,})/\U$1\E/;
ADD COMMENT
1
Entering edit mode
14.4 years ago
Neilfws 49k

Here's a Perl solution using the special regex variables @- and @+.

#!/usr/bin/perl -w

use strict;
my $s = "AAANNNNNNNNNNAAANNNNANNNNNNAAAAAAAAAAANA";
print "$s\n";

while($s =~/(A+)/g) {
  if($+[0] - $-[0] < 5) {
    substr($s, $-[0], $+[0] - $-[0], lc($&));
  }
}

print "$s\n";

$-[0] gives the start position of the match (zero-based), $+[0] the end position. We can use those variables to calculate whether the match length is 4 or less. Then we can use them again to supply the OFFSET and LEN arguments to substr(), replacing the string of A with a by converting the match to lowercase.

ADD COMMENT
0
Entering edit mode

wow, I didn't know about this special variable (only about $1, $2 etc.) -- but I guess no-one knows all the Perl variables. Doc: http://perldoc.perl.org/perlvar.html

ADD REPLY

Login before adding your answer.

Traffic: 1637 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6