Question

Linux shell script which can do this task


2.8 years ago

Confused_human ▴ 30

Hi everyone,

I have a dataset like this:

The first column has protein IDs from different organisms and the second column has domain names.

Protein Ids domain
Abiotrophia_defectiva_peg_0144  wzz
Abiotrophia_defectiva_peg_0198  wxy
Abiotrophia_defectiva_peg_0200  wzz
Abiotrophia_defectiva_peg_0215  wca
Abyssicoccus_albus_123_peg_1185 wzz
Abyssicoccus_albus_123_peg_1189 wzx
Abyssicoccus_albus_123_peg_1200 wza
Abyssicoccus_albus_123_peg_1322 wca
Abyssicoccus_albus_123_peg_1324 wbb
Bradyrhizobium_elkanii_peg_6717 wac
Bradyrhizobium_elkanii_peg_6718 wzx
Bradyrhizobium_elkanii_peg_6721 waa
Bradyrhizobium_elkanii_peg_6752 wca
Bradyrhizobium_elkanii_peg_6780 wvx

I want to know which proteins are near each other. Going by the "peg" numbers, proteins of the same organism whose numbers are within +/- 5 of each other count as nearby, and they form a cluster.

The output should look like this:

Abiotrophia_defectiva_peg_0198 wxy
Abiotrophia_defectiva_peg_0200 wzz
----------------------------------------
Abyssicoccus_albus_123_peg_1185 wzz
Abyssicoccus_albus_123_peg_1189 wzx
Abyssicoccus_albus_123_peg_1322 wca
Abyssicoccus_albus_123_peg_1324 wbb
-----------------------------------------
Bradyrhizobium_elkanii_peg_6717 wac
Bradyrhizobium_elkanii_peg_6718 wzx
Bradyrhizobium_elkanii_peg_6721 waa
-------------------------------------------

There should be a partition line between clusters.

Is there any way to do this for my data? I can do it manually in an Excel sheet, but my dataset is very large, so I need a script for this.

Please let me know. Thanks.

shell-scripting Linux • 781 views

updated 2.8 years ago by mti193 ▴ 10 • written 2.8 years ago by Confused_human ▴ 30

You already asked this: Cluster of neighboring genes by index (Looking for linux shell Script)

2.8 years ago by Pierre Lindenbaum 164k

score 0 · Answer 1 · 2022-03-23

You can do this with Python pretty easily.

1) Loop through the file line by line.
2) Take the first column of each line, split it on "_", and select the last item, which in this case is the number after "peg_". The Python syntax for this is: peg_number = line.split()[0].split("_")[-1]. This grabs the numbers you want (0144, 0200, etc.) as strings, so convert them with int() before comparing.
3) Store each number in a list (append) or a dictionary, then check whether any of the following peg numbers are within +/- 5 of that value.
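
The steps above could be sketched roughly as follows. This is only one reading of the question, and it makes several assumptions: the columns are whitespace-separated, every protein ID ends in "_peg_<number>", the text before "_peg_" identifies the organism, and a cluster is a run of proteins from the same organism in which each peg number is at most 5 above the previous one. Proteins with no neighbour are dropped, as in the expected output. The function names and the max_gap parameter are made up for illustration.

```python
import sys
from itertools import groupby


def parse_line(line):
    """Return (organism, peg_number, protein_id, domain) for one data line."""
    protein_id, domain = line.split()[:2]
    # Assumption: everything before "_peg_" names the organism.
    organism, _, peg = protein_id.rpartition("_peg_")
    return organism, int(peg), protein_id, domain


def find_clusters(lines, max_gap=5):
    """Group records of the same organism whose adjacent peg numbers differ by <= max_gap."""
    records = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header row (no "_peg_" in the first column).
        if len(fields) < 2 or "_peg_" not in fields[0]:
            continue
        records.append(parse_line(line))

    # Sort by organism, then by peg number, so neighbours end up adjacent.
    records.sort(key=lambda r: (r[0], r[1]))

    clusters = []
    for _, group in groupby(records, key=lambda r: r[0]):
        current = []
        for rec in group:
            # A gap larger than max_gap closes the current run.
            if current and rec[1] - current[-1][1] > max_gap:
                if len(current) > 1:
                    clusters.append(current)
                current = []
            current.append(rec)
        # Keep the last run only if it has at least two members.
        if len(current) > 1:
            clusters.append(current)
    return clusters


def format_clusters(clusters):
    """Render each cluster followed by a dashed partition line."""
    out = []
    for cluster in clusters:
        for _, _, protein_id, domain in cluster:
            out.append(f"{protein_id} {domain}")
        out.append("-" * 40)
    return "\n".join(out)


# Hypothetical usage: python cluster_pegs.py domains.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        print(format_clusters(find_clusters(handle)))
```

Note that this prints a partition line after every cluster, so the two Abyssicoccus_albus_123 runs (1185/1189 and 1322/1324) come out as separate clusters. If you want one block per organism instead, group the clusters by organism before printing.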