WGS projects provide a lot of data. NCBI holds proteome data for many organisms, and most of the proteins are hypothetical. The proteins range in length from 50 to 13,000 aa. During my literature search I found that most research papers select hypothetical proteins at random and then annotate them. I want to annotate all hypothetical proteins of a particular pathogen, but many of them are only 50 to 200 aa long. What is an appropriate minimum length for a hypothetical protein to be worth annotating further: 150 aa or >200 aa?
Thank you so much, JRJ.Healey.
Actually, I downloaded the hypothetical protein data for a protozoan from NCBI, and the total number of hypothetical proteins is about 4500. Of these 4500, 628 proteins are shorter than 50 aa, ranging from 33 aa to 49 aa. Small proteins can exist, but 628 of them?
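To see how the lengths are actually distributed before deciding on any cut-off, something like the following could be used. It is only a minimal sketch using the Python standard library; the bin edges (50, 100, 200) and the function names are my own assumptions for illustration, not a recommended threshold.

```python
from io import StringIO


def read_fasta_lengths(handle):
    """Return {sequence_id: length} for a FASTA-formatted text stream."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the ID (fallback if header is empty).
            parts = line[1:].split()
            current = parts[0] if parts else f"seq{len(lengths) + 1}"
            lengths[current] = 0
        elif current is not None:
            # Ignore a trailing '*' that some files use to mark the stop codon.
            lengths[current] += len(line.rstrip("*"))
    return lengths


def bin_lengths(lengths, edges=(50, 100, 200)):
    """Count sequences per length bin; the edges are hypothetical cut-offs."""
    labels = []
    lower = 0
    for edge in edges:
        labels.append(f"{lower}-{edge - 1}")
        lower = edge
    labels.append(f"{lower}+")
    counts = dict.fromkeys(labels, 0)
    for n in lengths.values():
        # Number of edges the length reaches or exceeds gives the bin index.
        idx = sum(n >= edge for edge in edges)
        counts[labels[idx]] += 1
    return counts


if __name__ == "__main__":
    demo = ">p1 hypothetical protein\nMKV\nLLA\n>p2\n" + "M" * 120 + "\n"
    print(bin_lengths(read_fasta_lengths(StringIO(demo))))
```

For a real file, pass an open file handle (for example `open("hypothetical.faa")`, a hypothetical file name) to `read_fasta_lengths` instead of the `StringIO` demo.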
By "randomly selected" I mean that, during my literature search, I found that most research papers on hypothetical protein annotation do not mention the length of the proteins; the hypothetical proteins were selected for annotation at random, i.e. without stating their length.
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This comment belongs under @jrj.healey's answer.
Sorry to contribute to messing up the organisation of the thread, but I'll try to keep the comments all together at least.
My suggestion to you, @jot87c, would be to do some manual curation of the proteins first to see whether you trust them or not. For sure, some of them will probably be false positives, but you should run something like a low stringency PSI-BLAST or similar to see whether any of them show similarity to well known proteins, which would tell you if they're likely to be real or not.
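Once a search like that has been run, the results can be triaged roughly as below. This is only a sketch under assumptions: it expects the search was run separately and saved in the default 12-column BLAST tabular layout (`-outfmt 6`), and the e-value cut-off of 1e-3 and the function name are hypothetical choices for illustration.

```python
def split_by_hits(query_ids, tabular_text, max_evalue=1e-3):
    """Split queries into (with_hit, without_hit) from BLAST tabular output.

    Assumes the default 12 columns: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    supported = set()
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows rather than guess their layout
        qseqid, sseqid, evalue = fields[0], fields[1], float(fields[10])
        # Skip trivial self-hits (only detected when the IDs match exactly).
        if qseqid != sseqid and evalue <= max_evalue:
            supported.add(qseqid)
    with_hit = [q for q in query_ids if q in supported]
    without_hit = [q for q in query_ids if q not in supported]
    return with_hit, without_hit


if __name__ == "__main__":
    hits = (
        "p1\tsp|X\t35.0\t40\t20\t1\t1\t40\t5\t44\t1e-5\t50.1\n"
        "p2\tsp|Y\t30.0\t30\t20\t1\t1\t30\t5\t34\t0.5\t20.0\n"
    )
    print(split_by_hits(["p1", "p2", "p3"], hits))
```

Proteins in the "without hit" list are not necessarily false positives; they are just the ones that deserve a closer manual look.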
Stop focussing on just the length of the protein - it's not really that useful much of the time - you need to be cleverer.
You also need to consider what question you're asking. Does it actually matter if a small fraction of your proteins are false positives? What's the actual research aim?
As far as I understand, you have sequences of several or even many hypothetical proteins from some pathogen?
To annotate them you need to find their known orthologous proteins.
Forget about their length or number for now.
Try the following orthology database:
https://omabrowser.org/oma/home/
Change 'IDENTIFIER' to 'PROTEIN SEQUENCE'
Insert any of your protein sequences
Run it, and OMA will give you some homologous proteins with similar sequences.
Some of them will be real proteins that you can study.
Then try your other proteins.
Hopefully this helps.