6.3 years ago
Yingzi Zhang
Hi all, I don't know whether it's polite to ask such a simple, direct question on Biostars, but it has troubled me the whole day.
My raw data is like this:
ENSSSCP00000055957.1 Protein of unknown function (DUF1466) RCS1 #N/A
ENSSSCP00000041172.1 Ras family Small GTPase superfamily GO:0003924|GO:0005525
ENSSSCP00000041839.1 Sugar-tranasporters, 12 TM Molybdate-anion transporter GO:0015098|GO:0015689|GO:0016021
ENSSSCP00000004168.3 Sodium:dicarboxylate symporter family Sodium:dicarboxylate symporter GO:0015293|GO:0016021
ENSSSCP00000040645.1 mTERF Transcription termination factor, mitochondrial/chloroplastic GO:0003690|GO:0006355
I want to capture all the text before "GO:" or "#N/A". The expected result should look like this:
ENSSSCP00000055957.1 Protein of unknown function (DUF1466) RCS1
ENSSSCP00000041172.1 Ras family Small GTPase superfamily
ENSSSCP00000041839.1 Sugar-tranasporters, 12 TM Molybdate-anion transporter
ENSSSCP00000004168.3 Sodium:dicarboxylate symporter family Sodium:dicarboxylate symporter
ENSSSCP00000040645.1 mTERF Transcription termination factor, mitochondrial/chloroplastic
The script I wrote was:
for line in rawdata:
    value1 = re.search("^(^'GO:']+)",line)
    value2 = re.search("^(^'#N/A']+)",line)
    if value1:
        print(value1.group(1))
    if value2:
        print(value2.group(1))
No error is reported, but the output is empty. How can I fix this? Thank you for your patience.
Yingzi
Yingzi Zhang, although you asked for a Python solution, here is an alternative in bash:
If your text is well formatted, you could simply exclude the last column (based on the text in the original post), something like this:
In Python you can use zero-length assertions as well (mind the indentation; test.txt contains the text from the original post):
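A minimal Python sketch of dropping the last column, assuming the annotation field (the GO terms or #N/A) never contains whitespace. `drop_last_column` is a hypothetical helper name, not something from the thread:

```python
def drop_last_column(line):
    """Return the line without its last whitespace-separated field."""
    # rsplit with maxsplit=1 splits once, starting from the right-hand side.
    parts = line.rstrip().rsplit(None, 1)
    # Lines with fewer than two fields are returned unchanged.
    return parts[0] if len(parts) == 2 else line.rstrip()


# Two sample lines copied from the question.
sample = [
    "ENSSSCP00000055957.1 Protein of unknown function (DUF1466) RCS1 #N/A",
    "ENSSSCP00000041839.1 Sugar-tranasporters, 12 TM Molybdate-anion transporter GO:0015098|GO:0015689|GO:0016021",
]
for line in sample:
    print(drop_last_column(line))
```

This works here only because the GO terms are joined with "|" rather than spaces; if the last field could contain spaces, a pattern-based approach would be safer.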
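A minimal sketch of the zero-length (lookahead) approach, assuming the annotation always starts with "GO:" or "#N/A" and is preceded by whitespace. `PATTERN` and `strip_annotation` are illustrative names, and the sample lines are inlined here rather than read from test.txt:

```python
import re

# Lazily match from the start of the line up to (but not including) the
# whitespace that precedes "GO:" or "#N/A". The lookahead consumes nothing.
PATTERN = re.compile(r"^.*?(?=\s+(?:GO:|#N/A))")


def strip_annotation(line):
    """Return the part of the line before the GO terms or the #N/A marker."""
    match = PATTERN.search(line)
    # Fall back to the whole line when neither marker is present.
    return match.group(0) if match else line.rstrip()


# Two sample lines copied from the question.
sample = [
    "ENSSSCP00000055957.1 Protein of unknown function (DUF1466) RCS1 #N/A",
    "ENSSSCP00000041172.1 Ras family Small GTPase superfamily GO:0003924|GO:0005525",
]
for line in sample:
    print(strip_annotation(line))
```

Requiring whitespace before the marker keeps words such as "GTPase" from being mistaken for the start of a GO term.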
Thank you so much. :)
Another solution using bash:
Why did you post your answer as a comment and not as an answer, @cpad0112?
fin swimmer
Thank you for the suggestion. I couldn't find an answer button; I could only see "ADD REPLY", so I clicked it. Sorry, are you able to change this?
Yingzi