Hello everyone, I have been looking for some time for a way to determine (using bioinformatics) the C-terminal and N-terminal regions of the proteins of a non-model organism (in this case a plant). The proteome has more than 30,000 proteins.
I read some questions here in the group, but found nothing that could help me with this. One methodology I came up with uses InterProScan to find functional signatures (InterPro signatures or member-database annotations) whose description contains the term N-terminal or C-terminal.
For any protein with a signature containing the term N-terminal, I take the end position of that signature and consider everything from the start of the protein to that position to be the N-terminal region (if more than one signature contains the term N-terminal, the highest end position is used). Thus, if a signature containing the term N-terminal ends at 600, the N-terminal region runs from 1 to 600. For any protein with a signature containing the term C-terminal, I take the start position of that signature and consider everything from that position to the end of the protein to be the C-terminal region (if more than one signature contains the term C-terminal, the lowest start position is used). Therefore, if a signature containing the term C-terminal begins at 400, the C-terminal region begins at 400 and ends at the end of the sequence.
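To make the rule above concrete, here is a minimal Python sketch of it. It assumes the InterProScan hits have already been parsed into simple tuples of (protein ID, protein length, signature description, start, end) with 1-based inclusive coordinates; the function name `terminal_regions` and the tuple layout are my own illustration, not part of InterProScan.

```python
def terminal_regions(hits):
    """Derive N- and C-terminal regions from signature hits.

    hits: iterable of (protein_id, protein_length, description, start, end),
    1-based inclusive coordinates (assumed layout, not an InterProScan format).
    Returns {protein_id: {"N": (1, end) or None, "C": (start, length) or None}}.
    """
    lengths = {}
    n_end = {}    # highest end position of any "N-terminal" signature
    c_start = {}  # lowest start position of any "C-terminal" signature

    for protein_id, length, description, start, end in hits:
        lengths[protein_id] = length
        text = description.lower()
        if "n-terminal" in text:
            n_end[protein_id] = max(n_end.get(protein_id, 0), end)
        if "c-terminal" in text:
            c_start[protein_id] = min(c_start.get(protein_id, length), start)

    regions = {}
    for protein_id, length in lengths.items():
        n_region = (1, n_end[protein_id]) if protein_id in n_end else None
        c_region = (c_start[protein_id], length) if protein_id in c_start else None
        regions[protein_id] = {"N": n_region, "C": c_region}
    return regions


# Made-up example hits, only to show the behaviour of the rule.
example_hits = [
    ("P1", 1000, "Some domain, N-terminal", 10, 300),
    ("P1", 1000, "Another N-terminal domain", 200, 600),
    ("P1", 1000, "Some domain, C-terminal", 400, 900),
    ("P2", 500, "Unrelated domain", 1, 100),
]
print(terminal_regions(example_hits))
```

Note that with this rule the two regions can overlap in the same protein (in the made-up example, N is 1 to 600 and C is 400 to 1000), which is one of the reasons I am unsure about it.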
However, I am not very convinced by this approach, because the results of my analysis (I am counting how many of certain specific amino acids occur in each region) are very different from what I have seen in scientific papers on proteomes dealing with the same subject; those researchers, however, used well-characterized model organisms.
My organism has more than 30,000 proteins, but only about 2,000 of them had a signature containing the term N-terminal or C-terminal.
I do not know if the way I determined the approximate size of the N and C regions in the proteins is correct, since the resulting regions can be very large or very small (secretion signal signatures, for example, often cover fewer than 40 aa at the N-terminal end).
A colleague suggested that I either scale the depth with protein length (choose a value, for example analyze 100 residues at the N and C ends for proteins longer than 1000 residues and 200 for proteins longer than 2000), or use a fixed value for each region, for example 100 residues at each end.
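As I understand my colleague's suggestion, it would look roughly like the Python sketch below. The function names, the `tiers` parameter and especially the `default` window of 50 residues for shorter proteins are assumptions of mine for illustration (the suggestion did not say what to do for proteins of 1000 residues or fewer), and the residues counted (`"KR"`) are just placeholders.

```python
from collections import Counter


def window_for_length(length, tiers=((2000, 200), (1000, 100)), default=50):
    """Pick a terminal window size from the protein length.

    tiers: (minimum_length, window) pairs; the first tier whose minimum is
    exceeded wins. `default` is a hypothetical value for shorter proteins.
    """
    for min_length, window in sorted(tiers, reverse=True):
        if length > min_length:
            return window
    return default


def terminal_windows(seq, window=100):
    """Return (N-terminal, C-terminal) slices of `window` residues each.

    The window is capped at half the sequence so the two ends never overlap.
    """
    w = min(window, len(seq) // 2)
    return seq[:w], seq[len(seq) - w:]


def residue_counts(region, residues="KR"):
    """Count the chosen residues (one-letter codes) in a region."""
    counts = Counter(region)
    return {aa: counts[aa] for aa in residues}


# Fixed-window variant on a toy sequence.
n_term, c_term = terminal_windows("MKKRAAAAAAGGGGGGDDEE", window=5)
print(n_term, residue_counts(n_term))
print(c_term, residue_counts(c_term))
```

For the length-dependent variant, the window would be chosen per protein with `terminal_windows(seq, window_for_length(len(seq)))`.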
Can anyone tell me if there is a way to use bioinformatics to help me solve this problem?