Hi all,
I am trying to modify the format of a big file.
The file is tab-delimited. Here is what the file looks like:
AB11.1 CB:0078_0.53 CB:0044464_0.42 CB:0005623_0.466
AB10.1
AB01.2 CB:0036_0.4 CB:0003824_0.4 CB:0005575_0.7 CB:0005622_0.2 CB:0005623_0.6
AB01.2 CB:0036_0.3 CB:0003824_0.43 CB:0005575_0.7 CB:0005622_0.1
Please note that the number of columns is not the same for every row. A row can have more than 400 columns or only 1, and a few rows contain nothing but the ID, like AB10.1.
I want to modify the format, first by removing all characters that come after the symbol _ in each column, including the symbol itself.
Then modify the separators:
1- Only the first column is followed by a tab
2- From the second column to the last, the columns should be separated by a comma and then a space
So the output file should look like this:
AB11.1 CB:0078, CB:0044464, CB:0005623
AB10.1
AB01.2 CB:0036, CB:0003824, CB:0005575, CB:0005622, CB:0005623
AB01.2 CB:0036, CB:0003824, CB:0005575, CB:0005622
How can I do that in a bash script (I have only very basic knowledge)? Or maybe in Python (I have never used it)?
Yes, I managed to do it with awk and sed;
To remove the last 6 characters from each column of a file:
awk '{for(i=1;i<=NF;i++) sub(/......$/,X,$i)}1'
That assumes you need to remove exactly 6 characters from each field, which doesn't seem to be the case: the part after the _ varies in length (for example _0.53, _0.4 and _0.466). Please be careful with such assumptions.
Because of this, the first column is removed entirely, as the IDs are only 6 characters long.
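A sketch that does not depend on a fixed length could look like the following. It cuts each field at its first _ instead of counting characters, and builds the output line by hand so that only the first separator is a tab. The function name reformat_ids is made up for illustration. It assumes the input really is tab-separated, that the first column should be left untouched, and that empty fields (for example after a trailing tab) can be skipped.

```shell
# reformat_ids is a hypothetical helper name; it reads tab-delimited text on stdin.
reformat_ids() {
  awk -F'\t' '
  {
    line = $1                      # first column is kept exactly as it is
    sep = "\t"                     # a tab goes between column 1 and column 2 only
    for (i = 2; i <= NF; i++) {
      field = $i
      if (field == "") continue    # assumption: empty fields are dropped
      sub(/_.*/, "", field)        # remove the first "_" and everything after it
      line = line sep field
      sep = ", "                   # later columns are joined with comma + space
    }
    print line                     # a row with only an ID prints just the ID
  }'
}
```

It could be run as reformat_ids < input.tsv > output.tsv, where both file names are placeholders. Writing to a new file rather than editing in place keeps the original available for comparison.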