Having some regex problems capturing strings with special chars. Could use some help.
4.2 years ago

Having a bit of trouble reformatting this messed-up run log. I want to remove the strings of characters that did not translate correctly from Linux terminal stdout into the log file, and replace each of those strings with a \t, a \n, or whitespace. I'm doing this for a large number of files, so I need a command-line solution.

Log sample:

The following malformed strings repeat for every entry in the log:

  • ^[[3J^[[H^[[2J^[[1;33m
  • ^[[0m^[[0;33m
  • ^[[0m^[[1;33m
  • ^[[0m|^H/^H-^H^H
  • ^[[1;37m
  • ^[[0m^[[0;37m
  • ^[[0m^[[1;37m
  • ^[[0m^[[0;37m
  • ^[[0m^[[0;37m^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^[[0m^[[0;37m
  • ^[[0m^[[1;32m
  • ^[[0m^[[0;32m

I've tried numerous GNU sed regexes to capture these with escaped special chars, but I keep getting 's/ ' unterminated errors (I think mainly due to that opening ^ in the strings?). Any pointers on how to go about doing this with sed or awk? Is there an easier way, perhaps with some sort of find-and-replace Python/Perl script?

This is my current regex:

sed 's/\^\[\[3J\^\[\[H\^\[\[2J\^\[\[1;33m//g; s/\^\[\[0m\^\[\[0;33m//g; s/\^\[\[0m\^\[\[1;33m//g; s/\^\[\[0m|\^H\/\^H\-\^H\^H//g; s/\^\[\[1;37m//g; s/\^\[\[0m\^\[\[0;37m//g; s/\^\[\[0m//g; s/\^H//g; s/\^\[\[1;32m//g; s/\^\[\[0;32m//g' run.log > run_clean.log
sed awk regex • 1.8k views

I tried your command on a sample file and it worked for me.

Fatima-MacBook-Pro:~ Fatima$ cat tmp
^[[3J^[[H^[[2J^[[1;33m
^[[0m^[[0;33m
^[[0m^[[1;33m
^[[0m|^H/^H-^H^H
^[[1;37m
^[[0m^[[0;37m
^[[0m^[[1;37m
^[[0m^[[0;37m
^[[0m^[[0;37m^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^[[0m^[[0;37m
^[[0m^[[1;32m
^[[0m^[[0;32m

Fatima-MacBook-Pro:~ Fatima$ sed 's/\^\[\[3J\^\[\[H\^\[\[2J\^\[\[1;33m//g; s/\^\[\[0m\^\[\[0;33m//g; s/\^\[\[0m\^\[\[1;33m//g; s/\^\[\[0m|\^H\/\^H\-\^H\^H//g; s/\^\[\[1;37m//g; s/\^\[\[0m\^\[\[0;37m//g; s/\^\[\[0m//g; s/\^H//g; s/\^\[\[1;32m//g; s/\^\[\[0;32m//g' tmp

This link might help:

https://unix.stackexchange.com/questions/14684/removing-control-chars-including-console-codes-colours-from-script-output


Helpful to know it works for you and that my regex is at least correct. Something else is going wrong then I suppose.
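
One way to check what is actually in the file, as a rough sketch: editors and `cat -v` display the escape byte (0x1B) as `^[` and the backspace byte (0x08) as `^H`, so a pattern with an escaped literal caret can match a pasted sample while matching nothing in the raw log. The file name `sample.log` and its contents below are made up for illustration; they are an assumption about what the real log holds.

```shell
# Build a tiny sample containing real control bytes (hypothetical file name).
printf '\033[1;33mStart\033[0m|\b/\b-\b\b done\n' > sample.log

# Count lines holding a real ESC byte versus a literal caret-bracket pair.
# grep -c exits non-zero when the count is 0, hence the "|| true".
esc_lines=$(grep -c "$(printf '\033')" sample.log || true)
caret_lines=$(grep -c -F '^[' sample.log || true)
echo "lines with ESC byte:   $esc_lines"
echo "lines with literal ^[: $caret_lines"

# cat -v renders control bytes in caret notation, like the editor does.
cat -v sample.log
```

If the first count is non-zero and the second is zero on the real log, the patterns need to match the control bytes themselves rather than the characters `^` and `[`.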

Based on your suggestion about color codes, I think the cause might be that sed is a stream editor and these are terminal ANSI codes. If you cat the log file, the progress bar representations and colors show up as shown below.

https://pasteboard.co/JvYUOyh.png

So sed can't recognize the codes because it is essentially reading the file like cat.
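
A minimal sketch of how sed could still handle this, assuming the log holds real escape (0x1B) and backspace (0x08) bytes rather than the visible `^[` and `^H` text: put the bytes themselves into the pattern. The function name `strip_ansi` and the sample input are hypothetical, and the pattern only covers sequences of the form ESC `[` digits/semicolons letter, which is what the strings listed in the question look like.

```shell
# Sketch: strip colour/cursor sequences and backspaces from stdin,
# assuming the input contains real ESC and backspace bytes.
ESC=$(printf '\033')   # escape byte, displayed as ^[ by editors
BS=$(printf '\b')      # backspace byte, displayed as ^H

strip_ansi() {
  # 1) delete ESC [ <digits and semicolons> <letter>, e.g. ESC[1;33m or ESC[2J
  # 2) delete every backspace byte
  sed -e "s/${ESC}\[[0-9;]*[A-Za-z]//g" -e "s/${BS}//g"
}

# Made-up sample line resembling the sequences in the question.
printf '\033[3J\033[H\033[2J\033[1;33mRunning\033[0m\033[0;37m\b\b\b done\033[0m\n' | strip_ansi
```

On the sample line this prints `Running done`. To substitute a tab or a space instead of deleting, the empty replacement in either `s///` can be changed accordingly. Spinner characters such as `|/-` are left in place by this sketch.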


Is this a bioinformatics question?


More of a raw data skills question, sure. I'm working on a bioinformatics pipeline for mtDNA deletion calling using the eKLIPse deletion caller. So yes, it is related to bioinformatics in that I'm trying to clean up the eKLIPse logs.

4.2 years ago

I can think of simplifying the regex a little bit using perl, in case it helps:

perl -pe 's/\^\[\[(2J|3J|H|[01](;3[237])?m)//g; s/\^H//g; s/\|\/-//' run.log > run_clean.log

Thanks jorge, I'll try it out.
