2.8 years ago
Confused_human
Hi everyone,
I have a dataset like this:
column 1 - protein IDs
column 2 - domain
Agarivorans_aestuarii_peg_2571 NODOMAIN
Agarivorans_aestuarii_peg_2572 Wzy_C
Agarivorans_aestuarii_peg_2573 NODOMAIN
Agarivorans_aestuarii_peg_2574 Polysacc_synt_C
Agarivorans_aestuarii_peg_2575 NODOMAIN
Agarivorans_aestuarii_peg_2576 Caps_syn_GfcC_N Caps_syn_GfcC_C
Agarivorans_aestuarii_peg_2577 Polysacc_synt_2
Agarivorans_aestuarii_peg_2578 NODOMAIN
--
Aliivibrio_salmonicida_peg_0270 Wzy_C
Aliivibrio_salmonicida_peg_0271 NODOMAIN
Aliivibrio_salmonicida_peg_0272 NODOMAIN
Aliivibrio_salmonicida_peg_0273 NODOMAIN
Aliivibrio_salmonicida_peg_0274 Caps_syn_GfcC_N Caps_syn_GfcC_C
Aliivibrio_salmonicida_peg_0275 NODOMAIN
Aliivibrio_salmonicida_peg_0278 Poly_export Caps_syn_GfcC_C
Aliivibrio_salmonicida_peg_079 Wzz
Aliivibrio_salmonicida_peg_0280 NODOMAIN
The output should look like this:
Agarivorans_aestuarii_peg_2572 Wzy_C
Agarivorans_aestuarii_peg_2573 NODOMAIN
Agarivorans_aestuarii_peg_2574 Polysacc_synt_C
Agarivorans_aestuarii_peg_2575 NODOMAIN
Agarivorans_aestuarii_peg_2576 Caps_syn_GfcC_N Caps_syn_GfcC_C
Agarivorans_aestuarii_peg_2577 Polysacc_synt_2
--
Aliivibrio_salmonicida_peg_0270 Wzy_C
Aliivibrio_salmonicida_peg_0271 NODOMAIN
Aliivibrio_salmonicida_peg_0272 NODOMAIN
Aliivibrio_salmonicida_peg_0273 NODOMAIN
Aliivibrio_salmonicida_peg_0274 Caps_syn_GfcC_N Caps_syn_GfcC_C
Aliivibrio_salmonicida_peg_0275 NODOMAIN
Aliivibrio_salmonicida_peg_0278 Poly_export Caps_syn_GfcC_C
Aliivibrio_salmonicida_peg_079 Wzz
Within each cluster, everything from the first known domain to the last known domain should be printed, and everything else should be removed.
And if a cluster looks like this:
Atlantibacter_hermannii_peg_1261 NODOMAIN
Atlantibacter_hermannii_peg_1262 NODOMAIN
Atlantibacter_hermannii_peg_1263 NODOMAIN
Atlantibacter_hermannii_peg_1264 Wzz
Atlantibacter_hermannii_peg_1265 NODOMAIN
then the output should be:
Atlantibacter_hermannii_peg_1264 Wzz
Please tell me how this can be done with Linux shell scripting or any other scripting language.
I have tried sort and uniq, but they did not work for this because the protein peg IDs are all different.
There is also a separator line ("--") between clusters, as shown in the example data above.
Thank you
The post is confusing. Can you rephrase it with a smaller example and with only one query per post?
Okay.
So my dataset has two columns:
1st column - protein IDs
2nd column - domain
Input -
Output -
It should print all the entries between two known domains. In this case, for example, everything between wzz and wza will get printed.
This should be applied to all the entries.
The protein clusters are separated by "--".
And if it's like this -
it will give the output
because there is only one known domain in that cluster.
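One way to read the rule described above: within each cluster, keep everything from the first line with a known domain to the last such line, and drop the NODOMAIN lines before and after. Below is a minimal Python sketch under that reading. The function names (`has_known_domain`, `trim_cluster`, `filter_clusters`) are made up for illustration. It assumes the columns are whitespace-separated, that "--" sits on a line of its own, and that a cluster with no known domain at all should be dropped (the post does not say what should happen in that case).

```python
NO_DOMAIN = "NODOMAIN"  # placeholder the post uses for "no known domain"
SEPARATOR = "--"        # line separating clusters in the post's example


def has_known_domain(line):
    """True if any field after the protein ID is something other than NODOMAIN."""
    fields = line.split()
    return any(field != NO_DOMAIN for field in fields[1:])


def trim_cluster(lines):
    """Keep only the span from the first to the last line with a known domain."""
    known = [i for i, line in enumerate(lines) if has_known_domain(line)]
    if not known:
        return []  # assumption: clusters without any known domain are dropped
    return lines[known[0]:known[-1] + 1]


def filter_clusters(text):
    """Split the text into clusters on '--' lines, trim each one, and rejoin."""
    clusters, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line == SEPARATOR:
            clusters.append(current)
            current = []
        elif line:
            current.append(line)
    clusters.append(current)

    kept = [trimmed for trimmed in map(trim_cluster, clusters) if trimmed]
    return ("\n" + SEPARATOR + "\n").join("\n".join(c) for c in kept)


# Single-known-domain cluster taken from the post.
example = """\
Atlantibacter_hermannii_peg_1261 NODOMAIN
Atlantibacter_hermannii_peg_1262 NODOMAIN
Atlantibacter_hermannii_peg_1263 NODOMAIN
Atlantibacter_hermannii_peg_1264 Wzz
Atlantibacter_hermannii_peg_1265 NODOMAIN
"""
print(filter_clusters(example))
# prints: Atlantibacter_hermannii_peg_1264 Wzz
```

To run it on a real file, the text could be read with something like `open("clusters.txt").read()` (the file name here is hypothetical) and passed to `filter_clusters`. Interior NODOMAIN lines are kept on purpose, since the expected output in the post keeps them.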