I have a large tab-delimited file; part of it looks like this:
25 M X A A X S
25_a M K A A R S
25_b M A A A V S
31 M A A A V S
31_a M A A A V S
31_b M A A A V S
I want to process the file three rows at a time: the first row holds a reference sequence (the actual sequence), and the next two rows are its variants. I am trying to do two things:
First, I want to find every position in the reference row (25) that contains the character X and keep only the characters at those positions in the two variant rows (25_a, 25_b), to get something like this:
25 M X A A X S
25_a K R
25_b A V
Second, if the reference row (31) contains no X, I want to remove its two variant rows (31_a, 31_b), to get something like this:
31 M A A A V S
The final output should be:
25 M X A A X S
25_a K R
25_b A V
31 M A A A V S
I have tried the sed command, which let me remove the data after the X character within the same row, but I am struggling to get the desired output. I have also posted the question here, but it was closed because I was not able to explain it well. Any help will be highly appreciated.
I assume you don't know a programming language?
A Python 3 solution would be something like this (not tested):
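A minimal sketch of that approach, under some assumptions not stated in the question: every reference row is followed by exactly two variant rows, the first column is the row ID, and all rows in a group have the same number of columns. The function name `filter_variants` and its parameters are made up for illustration.

```python
def filter_variants(lines, marker="X", variants_per_ref=2, sep="\t"):
    """Keep each reference row; keep variant rows only at marker positions.

    Assumes rows come in groups: one reference row followed by
    `variants_per_ref` variant rows, with the row ID in the first column.
    """
    # Split on any whitespace so tab- and space-separated input both work.
    rows = [line.split() for line in lines if line.strip()]
    step = variants_per_ref + 1
    out = []
    for i in range(0, len(rows), step):
        ref, *variants = rows[i:i + step]
        out.append(sep.join(ref))
        # Column indexes (skipping the ID column) where the reference has the marker.
        hits = [j for j, ch in enumerate(ref[1:], start=1) if ch == marker]
        if not hits:
            # No marker in the reference row: drop its variant rows entirely.
            continue
        for var in variants:
            kept = [var[j] for j in hits if j < len(var)]
            out.append(sep.join([var[0]] + kept))
    return out


sample = [
    "25\tM\tX\tA\tA\tX\tS",
    "25_a\tM\tK\tA\tA\tR\tS",
    "25_b\tM\tA\tA\tA\tV\tS",
    "31\tM\tA\tA\tA\tV\tS",
    "31_a\tM\tA\tA\tA\tV\tS",
    "31_b\tM\tA\tA\tA\tV\tS",
]
print("\n".join(filter_variants(sample)))
```

For a real file, the same function can be given an open file handle instead of the `sample` list, since it only iterates over lines. If the groups are not strictly three rows each, grouping by the ID prefix (the part before `_`) would be a safer variation.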
Edit: semantics