How to use a regular expression to extract the text before "GO:" or "#N/A"
6.3 years ago
Yingzi Zhang ▴ 90

Hi all, I don't know whether it's polite to ask such a simple, direct question on Biostars, but it has troubled me the whole day.

My raw data is like this:

ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1    #N/A
ENSSSCP00000041172.1    Ras family      Small GTPase superfamily        GO:0003924|GO:0005525
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM      Molybdate-anion transporter     GO:0015098|GO:0015689|GO:0016021
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  GO:0015293|GO:0016021
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic   GO:0003690|GO:0006355

I want to extract all the text before "GO:" or "#N/A". The expected result looks like this:

ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1
ENSSSCP00000041172.1    Ras family      Small GTPase superfamily
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM      Molybdate-anion transporter    
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic

The script I wrote was:

for line in rawdata:
    value1 = re.search("^(^'GO:']+)",line)
    value2 = re.search("^(^'#N/A']+)",line)
    if value1:
        print(value1.group(1))
    if value2:
        print(value2.group(1))

No error is reported, but the output is empty. How can I fix this? Thank you for your patience.

Yingzi

python • 1.4k views
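import re

for line in rawdata:
    # Capture everything before the whitespace that precedes "GO:" or "#N/A".
    # The lazy (.*?) stops at the first place where the marker follows.
    match = re.search(r"^(.*?)\s*(?:GO:|#N/A)", line)
    if match:
        print(match.group(1))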
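
As for why the script in the question prints nothing: parentheses create a group, not a character class, so "^(^'GO:']+)" looks for the literal text 'GO:'] at the very start of the line, and no line starts that way. Even with square brackets, a negated class such as [^GO:] excludes single characters rather than the whole string "GO:". A small sketch of both effects, assuming the columns are tab-separated (the sample line is copied from the question):

```python
import re

# Sample line from the question; tab-separated columns are an assumption.
line = "ENSSSCP00000041172.1\tRas family\tSmall GTPase superfamily\tGO:0003924|GO:0005525"

# Pattern from the question: it needs the literal text 'GO:'] at the start
# of the line, so re.search finds nothing.
original = re.search("^(^'GO:']+)", line)
print(original)  # None

# A negated character class stops at the first single G, O or ':' it meets,
# here the "G" of "GTPase", long before the real "GO:" column.
partial = re.search("^([^GO:]+)", line).group(1)
print(repr(partial))  # 'ENSSSCP00000041172.1\tRas family\tSmall '
```

So the pattern has to match the whole marker ("GO:" or "#N/A"), for example with a lazy match or a lookahead, instead of excluding individual characters.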

Yingzi Zhang, although you asked for a Python solution, here is another solution in bash:

$  grep -Po '.*\t(?=GO:|#N/A)' test.txt

ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1    
ENSSSCP00000041172.1    Ras family  Small GTPase superfamily    
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM  Molybdate-anion transporter 
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic

If your text is well formatted, you can simply drop the last column (based on the OP's text), something like this:

$ awk '{$NF=""}1' test.txt 
ENSSSCP00000055957.1 Protein of unknown function (DUF1466) RCS1 
ENSSSCP00000041172.1 Ras family Small GTPase superfamily 
ENSSSCP00000041839.1 Sugar-tranasporters, 12 TM Molybdate-anion transporter 
ENSSSCP00000004168.3 Sodium:dicarboxylate symporter family Sodium:dicarboxylate symporter 
ENSSSCP00000040645.1 mTERF Transcription termination factor, mitochondrial/chloroplastic

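In Python you can use zero-width assertions (lookaheads) as well (mind the indentation; test.txt contains the text from the OP):

import re

# test.txt holds the tab-separated text from the OP
with open("test.txt", "r") as f:
    test = f.readlines()

# Keep everything up to the tab that is followed by "GO:" or "#N/A"
out = [re.search(r'.*\t(?=GO:|#N/A)', i).group(0) for i in test]
print(*out, sep='\n')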


ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1    
ENSSSCP00000041172.1    Ras family  Small GTPase superfamily    
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM  Molybdate-anion transporter 
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic
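
One caveat with the list comprehension above: re.search returns None for any line without a match (a header or a blank line, say), and calling .group(0) on None raises AttributeError. A minimal sketch of a more forgiving variant that removes the trailing column with a substitution and leaves other lines untouched. The helper name drop_annotation and the third sample line are made up for illustration, and tab-separated input is an assumption:

```python
import re

# Trailing annotation column: "#N/A" or a "GO:" list, with surrounding whitespace.
ANNOTATION = re.compile(r"\s*(?:GO:\S*|#N/A)\s*$")

def drop_annotation(line):
    """Hypothetical helper: strip the last column if it is a GO list or #N/A."""
    return ANNOTATION.sub("", line.rstrip("\n"))

lines = [
    "ENSSSCP00000040645.1\tmTERF\tTranscription termination factor, mitochondrial/chloroplastic\tGO:0003690|GO:0006355\n",
    "ENSSSCP00000055957.1\tProtein of unknown function (DUF1466)\tRCS1\t#N/A\n",
    "a header line without any annotation\n",  # invented line with no marker
]

cleaned = [drop_annotation(line) for line in lines]
print(*cleaned, sep="\n")
```

Unlike the lookahead version, this one also drops the trailing tab, so the output has exactly three columns.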

Thank you so much. :)


Another solution using bash:

$ cut -f1-3 test.txt
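
Since the question asked for Python, the same column-based idea can be sketched there too, without any regular expression. This assumes the file is tab-separated with the annotation always in the fourth column; first_three_columns is a made-up helper name and the sample lines are copied from the question:

```python
# Sample lines from the question; tab-separated columns are an assumption.
lines = [
    "ENSSSCP00000055957.1\tProtein of unknown function (DUF1466)\tRCS1\t#N/A\n",
    "ENSSSCP00000041172.1\tRas family\tSmall GTPase superfamily\tGO:0003924|GO:0005525\n",
]

def first_three_columns(line):
    """Hypothetical helper: keep the first three tab-separated fields."""
    return "\t".join(line.rstrip("\n").split("\t")[:3])

trimmed = [first_three_columns(line) for line in lines]
print(*trimmed, sep="\n")
```

Like cut, this relies only on the column position, so it keeps working even if the last column ever holds something other than a GO list or #N/A.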

Why did you post your answer as a comment and not as an answer, @cpad0112?

fin swimmer


Thank you for the suggestion. I couldn't find an answer button; I could only see "ADD REPLY", so I clicked it. Sorry, are you able to change this?

Yingzi
