Hi everyone :)
I have a file that contains information in each line like:
ID=1234_1 Name=First size_aa=7890 start_type=none Value=0.123
ID=1234_2 Name=Second size_aa=7969 start_type=none Value=0.122
ID=1233 Name=Third size_aa=753 start_type=ft Value=0.223
ID=445 Class=ED size_aa=4653 start_type=fp Value=0.223
(Each space ' ' above stands in for a tab.)
...
I would like to split it and get a file like:
ID Name size_aa start_type Value
1234_1 First 7890 none 0.123
1234_2 Second 7969 none 0.122
1233 Third 753 ft 0.223
445 Class=ED 4653 fp 0.223
I have tried different things but never quite get there. Since I got really nice tips the last two times I asked for help on Biostars, I decided to ask again. I hope you can help me out! Any help will be appreciated :)
P.S.: My approaches so far were built on the idea of splitting the file into two files. One part would be used to work on the header, the other to work on the data. Once everything between a '=' and a tab was deleted, only the headers would remain. Then I would look for these tab-separated words within the second part of the file and delete occurrences of these strings (including the '='), leaving only the values behind.
This seems overly complicated to me though... There is probably an easier solution!?
Thank you!
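A possible shortcut that avoids the two-file detour, as a minimal Python sketch and not a tested solution for the real data: split each line on tabs, split each field at its first '=', take the column names from the first line, and keep a field unchanged (e.g. "Class=ED") when its key does not match the expected column. The function name `key_value_lines_to_table` is made up for illustration, and the sketch assumes every line has the same number of tab-separated fields in the same order.

```python
def key_value_lines_to_table(lines, sep="\t"):
    """Turn lines of 'key=value' fields into a header row plus value rows.

    Assumption: all lines carry the same number of fields in the same order.
    The header is taken from the keys of the first non-empty line.
    """
    header = None
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(sep)
        # partition() splits at the first '=' only, so values may contain '='
        pairs = [field.partition("=") for field in fields]
        if header is None:
            header = [key for key, _, _ in pairs]
        row = []
        for expected, field, (key, _, value) in zip(header, fields, pairs):
            # Keep the whole 'key=value' text when the key is unexpected
            row.append(value if key == expected else field)
        rows.append(row)
    return [header] + rows


if __name__ == "__main__":
    sample = [
        "ID=1234_1\tName=First\tsize_aa=7890\tstart_type=none\tValue=0.123",
        "ID=445\tClass=ED\tsize_aa=4653\tstart_type=fp\tValue=0.223",
    ]
    for row in key_value_lines_to_table(sample):
        print("\t".join(row))
```

To run it on a real file, an open file handle can be passed in place of `sample`, since the function only iterates over lines.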
The approach I tossed in my answer will only solve part of your question, I just noticed.
So you want the "column" names only once in the output file? Are you aware that some columns apparently have different names?
Yes, I want the "column" names only once in the output file. The case that some columns have different names is a worst-case scenario. It should not happen, but it is possible. That is why, in the case of an anomaly (like "Class=ED" instead of "Name=XY" in line 4), I would like to keep the identifier 'Something=' with the value, or even create a new column for this information at the end of the table and leave a blank here. This is also why I would like the column to be named "Name" even though the entry in row 4 suggests differently.
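The second option mentioned here (a blank in the regular column and the odd 'Something=value' moved to a new column at the end) could look roughly like the sketch below. It is only an illustration under assumptions: the function name `split_with_extra_column` and the column name "Other" are invented, the header comes from the first line, and all lines are assumed to have the same number of tab-separated fields.

```python
def split_with_extra_column(lines, sep="\t", extra_name="Other"):
    """Build a table; fields with an unexpected key go to a final extra column.

    Assumptions: the first non-empty line defines the column names, and
    'Other' is just a placeholder name for the extra column.
    """
    header = None
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(sep)
        pairs = [field.partition("=") for field in fields]
        if header is None:
            header = [key for key, _, _ in pairs]
        row = []
        extras = []
        for expected, field, (key, _, value) in zip(header, fields, pairs):
            if key == expected:
                row.append(value)
            else:
                row.append("")        # leave the regular column blank
                extras.append(field)  # keep 'key=value' for the last column
        # Several anomalies in one line are joined with ';'
        row.append(";".join(extras))
        rows.append(row)
    return [header + [extra_name]] + rows
```

With this variant, row 4 of the example would have an empty "Name" cell and "Class=ED" in the last column, while regular rows get an empty last cell.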