Hi everyone,
I have a dataset like the one below.
The first column has protein IDs from different organisms and the second column has domain names.
Protein Ids domain
Abiotrophia_defectiva_peg_0144 wzz
Abiotrophia_defectiva_peg_0198 wxy
Abiotrophia_defectiva_peg_0200 wzz
Abiotrophia_defectiva_peg_0215 wca
Abyssicoccus_albus_123_peg_1185 wzz
Abyssicoccus_albus_123_peg_1189 wzx
Abyssicoccus_albus_123_peg_1200 wza
Abyssicoccus_albus_123_peg_1322 wca
Abyssicoccus_albus_123_peg_1324 wbb
Bradyrhizobium_elkanii_peg_6717 wac
Bradyrhizobium_elkanii_peg_6718 wzx
Bradyrhizobium_elkanii_peg_6721 waa
Bradyrhizobium_elkanii_peg_6752 wca
Bradyrhizobium_elkanii_peg_6780 wvx
I want to know which proteins are near each other. By that I mean: going by the "peg" numbers, proteins of the same organism whose peg numbers are within +/- 5 of each other count as nearby, and they form a cluster.
The output should look like this:
Abiotrophia_defectiva_peg_0198 wxy
Abiotrophia_defectiva_peg_0200 wzz
----------------------------------------
Abyssicoccus_albus_123_peg_1185 wzz
Abyssicoccus_albus_123_peg_1189 wzx
Abyssicoccus_albus_123_peg_1322 wca
Abyssicoccus_albus_123_peg_1324 wbb
-----------------------------------------
Bradyrhizobium_elkanii_peg_6717 wac
Bradyrhizobium_elkanii_peg_6718 wzx
Bradyrhizobium_elkanii_peg_6721 waa
-------------------------------------------
There should be a separator line between each cluster.
Is there any way I can do this for my data? I could do it manually in an Excel sheet, but my dataset is very large, so I need a script for this.
Please let me know. Thanks.
You already asked this: Cluster of neighboring genes by index (Looking for linux shell Script)
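For what it's worth, here is a minimal Python sketch of one way to read the question. It rests on several assumptions that are not confirmed in the post: rows are grouped by the organism name before `_peg_`, two proteins are "nearby" when consecutive peg numbers of the same organism differ by at most 5 (so a cluster can chain beyond 5 in total), and proteins with no neighbour are dropped. The names `parse`, `clusters` and `MAX_GAP` are made up for illustration. Note that under this reading 1185/1189 and 1322/1324 come out as two separate clusters, whereas the sample output shows them in one block, so adjust it if the separator is meant per organism.

```python
import re
import sys
from itertools import groupby

MAX_GAP = 5  # assumed meaning of "+/- 5": largest allowed gap between neighbours

# Assumed ID layout: <organism>_peg_<number>
PEG_RE = re.compile(r"^(?P<organism>.+)_peg_(?P<num>\d+)$")


def parse(lines):
    """Yield (organism, peg_number, original_line) for each data row."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        match = PEG_RE.match(fields[0])
        if match is None:
            continue  # header or row that does not fit the assumed layout
        yield match.group("organism"), int(match.group("num")), line.rstrip()


def clusters(lines, max_gap=MAX_GAP):
    """Return a list of clusters, each a list of the original lines."""
    rows = sorted(parse(lines), key=lambda r: (r[0], r[1]))
    result = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        current = []
        for row in group:
            # A gap larger than max_gap closes the current run.
            if current and row[1] - current[-1][1] > max_gap:
                if len(current) > 1:  # singletons are not clusters
                    result.append([r[2] for r in current])
                current = []
            current.append(row)
        if len(current) > 1:
            result.append([r[2] for r in current])
    return result


if __name__ == "__main__":
    # Input file path is taken from the command line.
    with open(sys.argv[1]) as handle:
        for cluster in clusters(handle):
            print("\n".join(cluster))
            print("-" * 40)
```

It could be run as `python cluster_pegs.py proteins.txt`, where both file names are placeholders. The whole table is sorted in memory, which should be fine for a large spreadsheet-sized file but is something to keep in mind for very big inputs.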