Question

How to remove the characters after a specific symbol in every column, and separate all columns except the first with a comma and a space

4.5 years ago

Hann ▴ 110

Hi all,

I am trying to modify the format of a big file.

The file is tab-delimited. Here is what it looks like:

AB11.1  CB:0078_0.53    CB:0044464_0.42   CB:0005623_0.466
AB10.1  
AB01.2  CB:0036_0.4   CB:0003824_0.4       CB:0005575_0.7    CB:0005622_0.2 CB:0005623_0.6
AB01.2  CB:0036_0.3   CB:0003824_0.43      CB:0005575_0.7    CB:0005622_0.1

Please note that the number of columns is not the same for every row. A row can have more than 400 columns or only 1, and a few rows are empty, such as the one for the ID AB10.1.

I want to modify the format, first by removing all characters that come after the symbol _, including the symbol itself. Then I want to change the separators:

1- The first column should be followed by a tab

2- From the second column to the last, the columns should be separated by a comma and then a space

So the output file should look like this:

AB11.1    CB:0078, CB:0044464, CB:0005623
AB10.1  
AB01.2    CB:0036, CB:0003824, CB:0005575, CB:0005622, CB:0005623
AB01.2    CB:0036, CB:0003824, CB:0005575, CB:0005622

How can I do that in a bash script (I have very basic knowledge), or maybe in Python (which I have never used)?

bash • 841 views

updated 4.5 years ago by Ram 44k • written 4.5 years ago by Hann ▴ 110

score 0 · Answer 1 · 2020-06-01

4.5 years ago

Ram 44k

Use sed for requirement 1. You want to remove all _\S+ (or, if your format only has numbers and . following the underscore, remove all _[0-9.]+).
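As a rough illustration of the sed step, here is a minimal sketch. It assumes that every suffix is an underscore followed only by digits and dots, and that no ID in the first column contains such a pattern. The function name strip_scores is made up for this example, and -E is used so that + works as "one or more".

```shell
#!/usr/bin/env bash
# strip_scores is a hypothetical helper name, not something from the thread.
# Assumption: each suffix is "_" followed only by digits and dots (e.g. _0.53).
strip_scores() {
  # -E enables extended regular expressions; the g flag handles every field on the line
  sed -E 's/_[0-9.]+//g' "$@"
}
```

It can read a file given as an argument (strip_scores input.tsv) or standard input. The tabs between columns are left untouched at this stage.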

Use awk or perl for the second requirement. It will be a bit tricky (you may have to loop from 2 to NF), but it will be easier than using R or learning Python.
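The loop from 2 to NF could look something like the sketch below. It assumes the input is strictly tab-delimited and that the suffixes have already been removed. The function name join_terms is hypothetical, and skipping empty fields is an assumption made so that rows with an ID but no terms do not produce stray commas.

```shell
#!/usr/bin/env bash
# join_terms is a hypothetical helper name, not something from the thread.
# Assumption: input is tab-delimited and the "_..." suffixes are already gone.
join_terms() {
  awk 'BEGIN { FS = "\t" }
  {
    out = ""
    for (i = 2; i <= NF; i++) {              # loop from 2 to NF
      if ($i == "") continue                 # skip empty fields (rows with no terms)
      out = (out == "" ? $i : out ", " $i)   # comma and space between terms
    }
    print $1 "\t" out                        # a tab only after the first column
  }' "$@"
}
```

Like any awk command, it reads a file argument or standard input, so it can sit at the end of a pipe after the suffix-stripping step. A row that has only an ID is printed as the ID followed by a tab.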


Yes, I managed to do it with awk and sed.

To remove the last 6 characters from each field: awk '{for(i=1;i<=NF;i++) sub(/......$/,X,$i)}1'

4.5 years ago by Hann ▴ 110

That assumes you need to remove exactly 6 characters from each field, which doesn't seem to be the case. Please be careful with such assumptions.
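To make the difference concrete, here is a small sketch comparing the two approaches on fields whose suffixes differ in length. Both function names are invented for this illustration. Unlike the one-liner in the reply above, both versions split on tabs only and start at field 2 so that the ID column is left alone. Those are assumptions about what is wanted, based on the example in the question.

```shell
#!/usr/bin/env bash
# Both function names are hypothetical, chosen for this illustration.

# Fixed-length approach: drop the last 6 characters of every field after the first.
strip_fixed() {
  awk 'BEGIN { FS = OFS = "\t" }
       { for (i = 2; i <= NF; i++) sub(/......$/, "", $i) } 1' "$@"
}

# Pattern-based approach: drop everything from the first underscore onwards.
strip_pattern() {
  awk 'BEGIN { FS = OFS = "\t" }
       { for (i = 2; i <= NF; i++) sub(/_.*$/, "", $i) } 1' "$@"
}
```

For a field such as CB:0078_0.53, where the suffix is only 5 characters long, strip_fixed cuts into the identifier and leaves CB:007, while strip_pattern leaves CB:0078. For CB:0005623_0.466 the suffix happens to be 6 characters, so both give CB:0005623.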