Question

How to use a regular expression to capture the text before "GO:" or "#N/A"

6.4 years ago

Yingzi Zhang ▴ 90

Hi all, I don't know whether it's polite to ask such a simple, direct question on Biostars, but it has troubled me the whole day.

My raw data is like this:

ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1    #N/A
ENSSSCP00000041172.1    Ras family      Small GTPase superfamily        GO:0003924|GO:0005525
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM      Molybdate-anion transporter     GO:0015098|GO:0015689|GO:0016021
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  GO:0015293|GO:0016021
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic   GO:0003690|GO:0006355

I want to capture all the text before "GO:" or "#N/A". The expected result looks like this:

ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1
ENSSSCP00000041172.1    Ras family      Small GTPase superfamily
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM      Molybdate-anion transporter    
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic

The script I wrote was:

for line in rawdata:
    value1 = re.search("^(^'GO:']+)",line)
    value2 = re.search("^(^'#N/A']+)",line)
    if value1:
        print(value1.group(1))
    if value2:
        print(value2.group(1))

No error is reported, but the output is empty. How can I fix this? Thank you for your patience.

Yingzi

python • 1.4k views

ADD COMMENT • link 6.4 years ago by Yingzi Zhang ▴ 90

for line in rawdata:
    value1 = re.search("GO.*",line)
    value2 = re.search("#N/A.*",line)
    if value1:
        print(value1.group(0))
    if value2:
        print(value2.group(0))

ADD REPLY • link 6.4 years ago by mohammadhassanj ▴ 260
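
Note that the snippet above prints `group(0)`, which is the matched annotation itself (from "GO" or "#N/A" to the end of the line), not the text before it. A minimal sketch of one way to keep the part before the marker is to slice the line up to where the match starts. The function name `text_before_marker` is hypothetical, and the sketch assumes tab-separated columns and that "GO:" or "#N/A" does not occur earlier in a description column.

```python
import re

# Either marker that introduces the annotation column.
MARKER = re.compile(r"GO:|#N/A")

def text_before_marker(line):
    """Return the text before the first "GO:" or "#N/A" (hypothetical helper)."""
    m = MARKER.search(line)
    if m is None:
        # No marker found: return the line unchanged, minus the newline.
        return line.rstrip("\n")
    # m.start() is the index where the marker begins; drop the trailing tab.
    return line[:m.start()].rstrip("\t")

rawdata = [
    "ENSSSCP00000055957.1\tProtein of unknown function (DUF1466)\tRCS1\t#N/A\n",
    "ENSSSCP00000041172.1\tRas family\tSmall GTPase superfamily\tGO:0003924|GO:0005525\n",
]
for line in rawdata:
    print(text_before_marker(line))
```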

Yingzi Zhang, although you asked for a Python solution, here is another one in bash:

$ grep -Po '.*\t(?=GO:|#N/A)' test.txt

ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1    
ENSSSCP00000041172.1    Ras family  Small GTPase superfamily    
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM  Molybdate-anion transporter 
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic

If your text is well formatted, you can simply drop the last column (based on the OP's text), something like this:

$ awk '{$NF=""}1' test.txt 
ENSSSCP00000055957.1 Protein of unknown function (DUF1466) RCS1 
ENSSSCP00000041172.1 Ras family Small GTPase superfamily 
ENSSSCP00000041839.1 Sugar-tranasporters, 12 TM Molybdate-anion transporter 
ENSSSCP00000004168.3 Sodium:dicarboxylate symporter family Sodium:dicarboxylate symporter 
ENSSSCP00000040645.1 mTERF Transcription termination factor, mitochondrial/chloroplastic
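
With its default field splitting, awk breaks each line on any whitespace and rejoins the fields with single spaces, which is why the tabs are gone in the output above. As a rough Python counterpart that drops only the last tab-separated column and leaves the remaining tabs intact, something like the following could work. The helper name `drop_last_column` is hypothetical, and the sketch assumes the columns are tab-separated.

```python
def drop_last_column(line, sep="\t"):
    """Remove the last sep-delimited field, keeping the other separators."""
    body = line.rstrip("\n")
    # rpartition splits at the LAST occurrence of sep.
    head, found, _last = body.rpartition(sep)
    # If sep does not occur at all, return the line unchanged.
    return head if found else body

line = "ENSSSCP00000040645.1\tmTERF\tTranscription termination factor, mitochondrial/chloroplastic\tGO:0003690|GO:0006355\n"
print(drop_last_column(line))
```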

In Python you can use zero-length assertions (lookaheads) as well (mind the indentation; test.txt contains the text from the OP):

import re
with open("test.txt", "r") as f:
    test = f.readlines()
# Greedy match up to the last tab that is followed by G, O, |, \ or #
out = [re.search(r'.*\t(?=[GO|\\#])', i).group(0) for i in test]
print(*out, sep='\n')


ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1    
ENSSSCP00000041172.1    Ras family  Small GTPase superfamily    
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM  Molybdate-anion transporter 
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic
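
One caveat about the lookahead above: square brackets form a character class, so `[GO|\\#]` accepts any single one of the characters G, O, |, \ or #, rather than the words "GO" or "#". Also, `re.search(...)` returns `None` for a line without a match, so calling `.group(0)` on it would raise an `AttributeError`. A slightly stricter sketch, assuming tab-separated input, uses an alternation inside the lookahead and falls back to the whole line when nothing matches. The function name `before_annotation` is hypothetical.

```python
import re

# The tab must be followed by the literal text "GO:" or "#N/A".
LOOKAHEAD = re.compile(r".*\t(?=GO:|#N/A)")

def before_annotation(lines):
    """Return, for each line, the text up to and including the tab before the marker."""
    out = []
    for line in lines:
        m = LOOKAHEAD.match(line)
        # Keep the line as-is (minus the newline) when no marker is found.
        out.append(m.group(0) if m else line.rstrip("\n"))
    return out

sample = [
    "ENSSSCP00000055957.1\tProtein of unknown function (DUF1466)\tRCS1\t#N/A\n",
    "a line without any annotation column\n",
]
print(*before_annotation(sample), sep="\n")
```

As in the original snippet, the matched text still ends with the tab that preceded the annotation column; call `.rstrip("\t")` on each result if that is unwanted.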

ADD REPLY • link 6.4 years ago by cpad0112 21k

Thank you so much. :)

ADD REPLY • link 6.4 years ago by Yingzi Zhang ▴ 90

Another solution using bash:

$ cut -f1-3 test.txt
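
For comparison, a rough Python counterpart of `cut -f1-3` could look like the sketch below. It assumes tab-separated input, which is also the default delimiter of `cut`; the helper name `first_fields` is hypothetical.

```python
def first_fields(line, n=3, sep="\t"):
    """Keep only the first n sep-delimited fields of a line."""
    fields = line.rstrip("\n").split(sep)
    # Slicing is safe even when the line has fewer than n fields.
    return sep.join(fields[:n])

line = "ENSSSCP00000041172.1\tRas family\tSmall GTPase superfamily\tGO:0003924|GO:0005525\n"
print(first_fields(line))
```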

Why did you post your answer as a comment and not as an answer, @cpad0112?

fin swimmer

ADD REPLY • link 6.4 years ago by finswimmer 16k

Thank you for the suggestion. I couldn't find an answer button; I could only see "ADD REPLY", so I clicked it. Sorry, are you able to change this?

Yingzi

ADD REPLY • link 6.4 years ago by Yingzi Zhang ▴ 90