linux shell script which can do this task
2.8 years ago

Hi everyone,

I have a dataset like this:

column1 - protein IDs

column2 - domain

Agarivorans_aestuarii_peg_2571  NODOMAIN
Agarivorans_aestuarii_peg_2572  Wzy_C   
Agarivorans_aestuarii_peg_2573  NODOMAIN
Agarivorans_aestuarii_peg_2574  Polysacc_synt_C 
Agarivorans_aestuarii_peg_2575  NODOMAIN
Agarivorans_aestuarii_peg_2576  Caps_syn_GfcC_N Caps_syn_GfcC_C
Agarivorans_aestuarii_peg_2577  Polysacc_synt_2 
Agarivorans_aestuarii_peg_2578  NODOMAIN
--
Aliivibrio_salmonicida_peg_0270 Wzy_C   
Aliivibrio_salmonicida_peg_0271         NODOMAIN
Aliivibrio_salmonicida_peg_0272       NODOMAIN
Aliivibrio_salmonicida_peg_0273        NODOMAIN
Aliivibrio_salmonicida_peg_0274      Caps_syn_GfcC_N Caps_syn_GfcC_C
Aliivibrio_salmonicida_peg_0275        NODOMAIN
Aliivibrio_salmonicida_peg_0278       Poly_export     Caps_syn_GfcC_C
Aliivibrio_salmonicida_peg_079         Wzz     
Aliivibrio_salmonicida_peg_0280        NODOMAIN

The output should look like this:

Agarivorans_aestuarii_peg_2572  Wzy_C   
Agarivorans_aestuarii_peg_2573  NODOMAIN
Agarivorans_aestuarii_peg_2574  Polysacc_synt_C 
Agarivorans_aestuarii_peg_2575  NODOMAIN
Agarivorans_aestuarii_peg_2576  Caps_syn_GfcC_N Caps_syn_GfcC_C
Agarivorans_aestuarii_peg_2577  Polysacc_synt_2 
--
Aliivibrio_salmonicida_peg_0270 Wzy_C   
Aliivibrio_salmonicida_peg_0271         NODOMAIN
Aliivibrio_salmonicida_peg_0272       NODOMAIN
Aliivibrio_salmonicida_peg_0273        NODOMAIN
Aliivibrio_salmonicida_peg_0274      Caps_syn_GfcC_N Caps_syn_GfcC_C
Aliivibrio_salmonicida_peg_0275        NODOMAIN
Aliivibrio_salmonicida_peg_0278       Poly_export     Caps_syn_GfcC_C
Aliivibrio_salmonicida_peg_079         Wzz

Within each cluster, everything from the first known domain to the last known domain should be printed, and everything outside that range should be removed.

And if a cluster looks like this:

Atlantibacter_hermannii_peg_1261        NODOMAIN
Atlantibacter_hermannii_peg_1262        NODOMAIN
Atlantibacter_hermannii_peg_1263        NODOMAIN
Atlantibacter_hermannii_peg_1264        Wzz
Atlantibacter_hermannii_peg_1265        NODOMAIN

then the output should be:

Atlantibacter_hermannii_peg_1264        Wzz

Please tell me how this can be done with Linux shell scripting or any other scripting language.

I have tried sort and uniq, but they did not work for this because the protein peg IDs are all different.

There is a separator line, "--", between clusters, as shown in the example data above.

Thank you

shell-script Linux • 625 views

The post is confusing. Can you rephrase it with a smaller example and only one query per post?


Okay.

My dataset has two columns:

1st col - protein IDs

2nd col - domain

Input:

Abiotrophia_defectiva_peg_1828  NODOMAIN
Abiotrophia_defectiva_peg_1829  wzz
Abiotrophia_defectiva_peg_1830  NODOMAIN
Abiotrophia_defectiva_peg_1831  wzx
Abiotrophia_defectiva_peg_1832 wza
Abiotrophia_defectiva_peg_1833 NODOMAIN

Output:

Abiotrophia_defectiva_peg_1829  wzz
Abiotrophia_defectiva_peg_1830  NODOMAIN
Abiotrophia_defectiva_peg_1831  wzx
Abiotrophia_defectiva_peg_1832 wza

It should print all entries from the first known domain to the last known domain in a cluster. In this case, everything from wzz to wza is printed.

This should be applied to every cluster.

The protein clusters are separated by "--".

And if a cluster looks like this:

Atlantibacter_hermannii_peg_1262 NODOMAIN
Atlantibacter_hermannii_peg_1263 NODOMAIN
Atlantibacter_hermannii_peg_1264 Wzz
Atlantibacter_hermannii_peg_1265 NODOMAIN

the output should be:

Atlantibacter_hermannii_peg_1264 Wzz

because that cluster contains only one known domain.
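
One possible way to do this is with awk, buffering each cluster and printing only the span from its first to its last known domain. This is a minimal sketch, not a tested pipeline, and it rests on a few assumptions that are not stated in the post: a "known domain" is any line whose second field is something other than NODOMAIN, fields are separated by spaces or tabs, a cluster with no known domain is dropped together with its separator, and the function name `trim_clusters` is just an illustrative choice.

```shell
#!/bin/sh
# trim_clusters (hypothetical name): reads clusters separated by "--" lines
# and prints, for each cluster, only the lines from the first known domain
# to the last known domain. Assumption: "known" means field 2 != NODOMAIN.
trim_clusters() {
    awk '
    # Print the kept span of the current cluster, then reset the buffer state.
    function emit(   i) {
        if (first) {
            if (printed) print "--"      # separator between printed clusters
            for (i = first; i <= last; i++) print buf[i]
            printed = 1
        }
        n = 0; first = 0; last = 0
    }
    $1 == "--" { emit(); next }          # separator line closes a cluster
    NF == 0    { next }                  # skip blank lines
    {
        buf[++n] = $0                    # remember the line unchanged
        if (NF >= 2 && $2 != "NODOMAIN") {
            if (!first) first = n        # first known domain in this cluster
            last = n                     # most recent known domain so far
        }
    }
    END { emit() }                       # the last cluster has no trailing "--"
    ' "$@"
}
```

It could be used as `trim_clusters < domains.txt > trimmed.txt`, where both file names are placeholders. A cluster with a single known domain reduces to that one line, because the first and last known domain are then the same line.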
