Hi everyone,
I have a dataset like this.
The 1st column has protein IDs from different organisms and the 2nd column has domain names.
Protein Ids domain
Abiotrophia_defectiva_peg_0144 wzz
Abiotrophia_defectiva_peg_0198 wxy
Abiotrophia_defectiva_peg_0200 wzz
Abiotrophia_defectiva_peg_0215 wca
Abyssicoccus_albus_peg_1185 wzz
Abyssicoccus_albus_peg_1189 wzx
Abyssicoccus_albus_peg_1200 wza
Abyssicoccus_albus_peg_1322 wca
Abyssicoccus_albus_peg_1324 wbb
Bradyrhizobium_elkanii_peg_6717 wac
Bradyrhizobium_elkanii_peg_6718 wzx
Bradyrhizobium_elkanii_peg_6721 waa
Bradyrhizobium_elkanii_peg_6752 wca
Bradyrhizobium_elkanii_peg_6780 wvx
I want to know which proteins are nearby.
By that I mean, going by the "peg" numbers: if two protein numbers are within +/- 5 of each other, they are nearby and form a cluster.
The output should look like:
Abiotrophia_defectiva_peg_0198 wxy
Abiotrophia_defectiva_peg_0200 wzz
Abyssicoccus_albus_peg_1185 wzz
Abyssicoccus_albus_peg_1189 wzx
Abyssicoccus_albus_peg_1322 wca
Abyssicoccus_albus_peg_1324 wbb
Bradyrhizobium_elkanii_peg_6717 wac
Bradyrhizobium_elkanii_peg_6718 wzx
Bradyrhizobium_elkanii_peg_6721 waa
Is there any way I can do this for my data? I could do it manually in an Excel sheet, but my dataset is very large, so I need a script for this.
Please let me know.
Thanks
So the desired output is to filter and retain only those genes that are within +/-5 genes of any other gene in the same organism?
While the task at hand is relatively easy, a word of caution: are you sure your protein-encoding genes are always numbered consistently? That is often the case, but not always. Some genes might be on plasmids; the numbering then breaks and you will get false positives, so if possible also include the replicon ID. Also, bacterial replicons are mostly circular, so you might miss the occasional cluster that spans the origin.
Also, we need to know how many genes there are per organism on average. If there are too many, the pairwise distance calculation becomes a problem and we should use interval overlap instead.
Hi, yes, my protein dataset is in sequential order. I have 3982 organisms and 72553 proteins in total. I want the clusters to be made per organism, like I showed in the desired output. And yes, you are absolutely right, but currently I am not concerned about the plasmid part. Even if the numbering breaks, that's fine. I just want to make clusters of nearby proteins that are already in my dataset.
Ok, so that's ~18 proteins per organism, i.e. ~165 pairwise comparisons per organism (roughly n²/2) and approximately 660,000 operations in total. That should be feasible with a naive implementation. I can give you a Perl script that does that.
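To make the naive approach concrete, here is a minimal sketch in Python (not the Perl script mentioned above). It assumes every ID ends in `_peg_<number>` and that everything before that suffix is the organism name. The names `ID_PATTERN`, `filter_nearby` and `window` are made up for this illustration.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumption: IDs look like "<organism>_peg_<number>".
ID_PATTERN = re.compile(r"^(.+)_peg_(\d+)$")


def filter_nearby(rows, window=5):
    """Keep rows whose peg number is within `window` of another peg
    from the same organism (naive all-vs-all comparison per organism)."""
    by_org = defaultdict(list)
    for idx, (protein_id, _domain) in enumerate(rows):
        match = ID_PATTERN.match(protein_id)
        if match is None:
            continue  # skip header lines or IDs that do not fit the pattern
        by_org[match.group(1)].append((int(match.group(2)), idx))

    keep = set()
    for members in by_org.values():
        # Compare every pair of proteins within one organism.
        for (peg_a, idx_a), (peg_b, idx_b) in combinations(members, 2):
            if abs(peg_a - peg_b) <= window:
                keep.update((idx_a, idx_b))
    # Return the surviving rows in their original order.
    return [rows[i] for i in sorted(keep)]


sample = [
    ("Abiotrophia_defectiva_peg_0144", "wzz"),
    ("Abiotrophia_defectiva_peg_0198", "wxy"),
    ("Abiotrophia_defectiva_peg_0200", "wzz"),
    ("Abiotrophia_defectiva_peg_0215", "wca"),
    ("Abyssicoccus_albus_peg_1185", "wzz"),
    ("Abyssicoccus_albus_peg_1189", "wzx"),
    ("Abyssicoccus_albus_peg_1200", "wza"),
    ("Abyssicoccus_albus_peg_1322", "wca"),
    ("Abyssicoccus_albus_peg_1324", "wbb"),
    ("Bradyrhizobium_elkanii_peg_6717", "wac"),
    ("Bradyrhizobium_elkanii_peg_6718", "wzx"),
    ("Bradyrhizobium_elkanii_peg_6721", "waa"),
    ("Bradyrhizobium_elkanii_peg_6752", "wca"),
    ("Bradyrhizobium_elkanii_peg_6780", "wvx"),
]

result = filter_nearby(sample, window=5)
for protein_id, domain in result:
    print(protein_id, domain)
```

On the example rows from the question this keeps the nine proteins listed in the desired output. The all-vs-all loop is quadratic per organism, which should be fine at ~18 proteins each but would not scale to whole genomes.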
Hi, sorry to bother you again,
but can you modify the script accordingly?
I have a dataset like this.
The 1st column has protein IDs from different organisms and the 2nd column has domain names.
Protein Ids domain
Abiotrophia_defectiva_peg_0144 wzz
Abiotrophia_defectiva_peg_0198 wxy
Abiotrophia_defectiva_peg_0200 wzz
Abiotrophia_defectiva_peg_0215 wca
Abyssicoccus_albus_123_peg_1185 wzz
Abyssicoccus_albus_123_peg_1189 wzx
Abyssicoccus_albus_123_peg_1200 wza
Abyssicoccus_albus_123_peg_1322 wca
Abyssicoccus_albus_123_peg_1324 wbb
Bradyrhizobium_elkanii_peg_6717 wac
Bradyrhizobium_elkanii_peg_6718 wzx
Bradyrhizobium_elkanii_peg_6721 waa
Bradyrhizobium_elkanii_peg_6752 wca
Bradyrhizobium_elkanii_peg_6780 wvx
I want to know which proteins are nearby. By that I mean, going by the "peg" numbers: if two protein numbers are within +/- 5 of each other, they are nearby and form a cluster. Please keep the +/-5 flexible, as I might need +/-10 later.
The output should look like:
Abiotrophia_defectiva_peg_0198 wxy
Abiotrophia_defectiva_peg_0200 wzz
Abyssicoccus_albus_123_peg_1185 wzz
Abyssicoccus_albus_123_peg_1189 wzx
Abyssicoccus_albus_123_peg_1322 wca
Abyssicoccus_albus_123_peg_1324 wbb
Bradyrhizobium_elkanii_peg_6717 wac
Bradyrhizobium_elkanii_peg_6718 wzx
Bradyrhizobium_elkanii_peg_6721 waa
There should be a separating line between clusters.
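One possible way to cover these requirements, sketched in Python under a few assumptions: the organism is whatever precedes the final `_peg_<number>` (so `Abyssicoccus_albus_123` works too), a cluster is a run of sorted peg numbers in which each consecutive gap is at most `window`, and clusters are separated by a `--` line. The function names and the separator are hypothetical choices, not part of any existing script.

```python
import re
from itertools import groupby

# Assumption: IDs look like "<organism>_peg_<number>"; the organism part
# may itself contain underscores and digits.
ID_PATTERN = re.compile(r"^(.+)_peg_(\d+)$")


def parse(lines):
    """Turn 'ID domain' lines into (organism, peg, original_line) tuples."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete line
        match = ID_PATTERN.match(fields[0])
        if match:  # the header line does not match and is skipped
            records.append((match.group(1), int(match.group(2)), line.strip()))
    return records


def cluster_lines(lines, window=5):
    """Group lines into clusters of neighbouring pegs per organism.
    Proteins without any neighbour within `window` are dropped."""
    records = sorted(parse(lines), key=lambda r: (r[0], r[1]))
    clusters = []
    for _org, group in groupby(records, key=lambda r: r[0]):
        current = []
        for rec in group:
            # A gap larger than `window` closes the current cluster.
            if current and rec[1] - current[-1][1] > window:
                if len(current) > 1:
                    clusters.append(current)
                current = []
            current.append(rec)
        if len(current) > 1:
            clusters.append(current)
    return [[rec[2] for rec in c] for c in clusters]


def format_clusters(clusters, separator="--"):
    """Join clusters into one text block with a separator line between them."""
    return f"\n{separator}\n".join("\n".join(c) for c in clusters)


lines = [
    "Protein Ids domain",
    "Abiotrophia_defectiva_peg_0144 wzz",
    "Abiotrophia_defectiva_peg_0198 wxy",
    "Abiotrophia_defectiva_peg_0200 wzz",
    "Abiotrophia_defectiva_peg_0215 wca",
    "Abyssicoccus_albus_123_peg_1185 wzz",
    "Abyssicoccus_albus_123_peg_1189 wzx",
    "Abyssicoccus_albus_123_peg_1200 wza",
    "Abyssicoccus_albus_123_peg_1322 wca",
    "Abyssicoccus_albus_123_peg_1324 wbb",
    "Bradyrhizobium_elkanii_peg_6717 wac",
    "Bradyrhizobium_elkanii_peg_6718 wzx",
    "Bradyrhizobium_elkanii_peg_6721 waa",
    "Bradyrhizobium_elkanii_peg_6752 wca",
    "Bradyrhizobium_elkanii_peg_6780 wvx",
]

clusters = cluster_lines(lines, window=5)
print(format_clusters(clusters))
```

Changing `window=5` to `window=10` is the only edit needed for the wider range. Because the pegs are sorted first, each organism is scanned once instead of comparing all pairs. Note that chaining consecutive gaps means a cluster can span more than `window` in total (e.g. pegs 1, 5 and 9 end up together with `window=5`).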