How To Find All Repeated Patterns In One String Using A Regular Expression
12.1 years ago
cauyrd ▴ 20

I have the following string:

'0\t36aa, >HWUSI-EAS614_4:1:62:3987:7089:0:1:1... *1\t36aa, >HWUSI-EAS614_4:1:65:3993:10262:0:1:1... at 100.00%2\t36aa, >ILLUMINA-EAS295_5:3:99:4680:15673:0:1:1... at 100.00%3\t36aa, >HWUSI-EAS614_4:1:63:11191:7359:0:1:1... at 94.44%'

I want to find every substring between '>' and '...', which are:

HWUSI-EAS614_4:1:62:3987:7089:0:1:1
HWUSI-EAS614_4:1:65:3993:10262:0:1:1
ILLUMINA-EAS295_5:3:99:4680:15673:0:1:1
HWUSI-EAS614_4:1:63:11191:7359:0:1:1

How do I write the regular expression? Python syntax is preferred.

12.1 years ago
JC 13k
perl -lane 'if (m/>(.+?)\.\.\./) { print $1; }' < FILE

edit: ok, no cat

edit2: the original question had one sequence ID per line; now it's shown with all IDs on one line. To satisfy the comments below:

perl -lane 'print $1 while (m/>(.+?)\.\.\./g)' < FILE
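Since Python was preferred in the question, here is a rough Python counterpart of the `while (m/.../g)` loop, as a sketch only. It uses `re.finditer` to walk the string match by match; the variable names `text` and `ids` are just for illustration, and the input is the string from the question.

```python
import re

# Sample input copied from the question: all IDs on a single line.
text = ("0\t36aa, >HWUSI-EAS614_4:1:62:3987:7089:0:1:1... "
        "*1\t36aa, >HWUSI-EAS614_4:1:65:3993:10262:0:1:1... at 100.00%"
        "2\t36aa, >ILLUMINA-EAS295_5:3:99:4680:15673:0:1:1... at 100.00%"
        "3\t36aa, >HWUSI-EAS614_4:1:63:11191:7359:0:1:1... at 94.44%")

# finditer() yields one match object per non-overlapping match,
# much like repeatedly evaluating m/.../g in a Perl while loop.
ids = [m.group(1) for m in re.finditer(r">(.+?)\.\.\.", text)]

for seq_id in ids:
    print(seq_id)
```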

Useless use of redirection also, plus it doesn't do what the OP requested. The answer from @qiyunzhu below will print all the matches.


When I wrote my answer, the question showed the string with one ID per line, not all on one line.


Okay. It's hard to tell when there have been edits, but that first solution would work then.

12.1 years ago

Use cut with the flexible -d parameter.

Assuming each line contains a single ID that starts with > and ends with ..., and that the ID itself contains no period, then

cut -f2 -d'>' FILE | cut -f1 -d'.'

gives you what you want.

Another option, using sed:

sed -e 's/.*>\(.*\)\.\.\..*/\1/g' FILE
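Both commands depend on the one-ID-per-line assumption. To illustrate why, here is a small Python sketch comparing a greedy pattern shaped like the sed expression with a lazy one. The two-ID `line` is a made-up example, not data from the question.

```python
import re

# Hypothetical line holding two IDs, to mimic the all-on-one-line input.
line = ">ID_A... at 100.00% >ID_B... at 94.44%"

# Greedy, sed-style: the leading .* swallows everything up to the LAST '>',
# so only the final ID survives the substitution.
greedy = re.sub(r".*>(.*)\.\.\..*", r"\1", line)

# Lazy capture with findall(): each '>' ... '...' pair is returned separately.
lazy = re.findall(r">(.+?)\.\.\.", line)

print(greedy)
print(lazy)
```

With one ID per line the two approaches agree; with several IDs on a line only the lazy pattern returns all of them.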
12.1 years ago
qiyunzhu ▴ 430

Here's a Perl solution. The trick is to remove one match at a time.

print $1 while s/\>(.+?)\.\.\.//;
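The same remove-one-match-at-a-time idea can be sketched in Python. This is only an illustration: `extract_ids` is a hypothetical helper name, and in practice `re.findall` would probably be the simpler choice.

```python
import re

def extract_ids(text):
    """Collect IDs by repeatedly deleting the first match (mirrors `while s/...//`)."""
    pattern = re.compile(r">(.+?)\.\.\.")
    ids = []
    while True:
        match = pattern.search(text)
        if match is None:
            break  # nothing left to remove
        ids.append(match.group(1))
        # Cut the matched span out so the next search finds the following ID.
        text = text[:match.start()] + text[match.end():]
    return ids

print(extract_ids(">a... x >b... y"))
```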
12.1 years ago
Whetting ★ 1.6k

since python syntax was requested...

import re 

string = "0\t36aa, >HWUSI-EAS614_4:1:62:3987:7089:0:1:1... *1\t36aa, >HWUSI-EAS614_4:1:65:3993:10262:0:1:1... at 100.00%2\t36aa, >ILLUMINA-EAS295_5:3:99:4680:15673:0:1:1... at 100.00%3\t36aa, >HWUSI-EAS614_4:1:63:11191:7359:0:1:1... at 94.44%"

# findall() with one capture group returns only the captured text of each match
print(re.findall(r">(.+?)\.\.\.", string))
12.1 years ago
import re    # Python's regular expression module

# No capture group here, so findall() returns each full match, including '>' and '...'
myPattern = re.compile(r'>.+?\.{3}')
myText = '0\t36aa, >HWUSI-EAS614_4:1:62:3987:7089:0:1:1... *1\t36aa, >HWUSI-EAS614_4:1:65:3993:10262:0:1:1... at 100.00%2\t36aa, >ILLUMINA-EAS295_5:3:99:4680:15673:0:1:1... at 100.00%3\t36aa, >HWUSI-EAS614_4:1:63:11191:7359:0:1:1... at 94.44%'
myPatRes = myPattern.findall(myText)
print(myPatRes)

['>HWUSI-EAS614_4:1:62:3987:7089:0:1:1...', '>HWUSI-EAS614_4:1:65:3993:10262:0:1:1...', '>ILLUMINA-EAS295_5:3:99:4680:15673:0:1:1...', '>HWUSI-EAS614_4:1:63:11191:7359:0:1:1...']

I guess in your case you want to strip the leading '>' and the trailing '...':

for res in myPatRes:
    print(res[1:-3])

HWUSI-EAS614_4:1:62:3987:7089:0:1:1

HWUSI-EAS614_4:1:65:3993:10262:0:1:1

ILLUMINA-EAS295_5:3:99:4680:15673:0:1:1

HWUSI-EAS614_4:1:63:11191:7359:0:1:1
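One detail of `re.findall` is worth spelling out: when the pattern contains exactly one capture group, it returns only the captured text, and when it contains none, it returns the full match. A minimal sketch on a made-up input (`text` and the `ID_A`/`ID_B` names are illustrative only):

```python
import re

text = "x >ID_A... y >ID_B... z"  # hypothetical short input

# One capture group: only the text inside the parentheses is returned.
with_group = re.findall(r">(.+?)\.{3}", text)

# No capture group: the whole match is returned, '>' and '...' included.
without_group = re.findall(r">.+?\.{3}", text)

print(with_group)
print(without_group)
```

So slicing with `[1:-3]` is only needed for the full-match form; with a capture group the IDs come back already trimmed.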


I realize that while I was writing this, Whetting posted the short (and good!) version. Here is the longer story ;-)
