I'm working on software that lets users BLAST an entire proteome against a manually curated database, in order to quantify the composition of that organism's proteins relative to our database. Usually several organisms are analyzed this way, the results are contrasted between organisms, and relationships are hypothesized.
However, to me this seems like rather shallow data. I'd like to find a more quantifiable metric so that I can say there is a good relationship between the query and hit proteins. I've started by counting the hydrophilic residues in each protein, since that is an important metric in my database. However, I'm having a difficult time scoring the results using the BLAST e-value together with this count.
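As a rough illustration of the residue count described above, a minimal Python sketch might look like the following. The residue set `HYDROPHILIC` and the helper names `hydrophilic_fraction` and `fraction_difference` are assumptions made for this example, not part of BLAST or of any existing tool. The count is expressed as a fraction of sequence length so that proteins of different sizes can be compared.

```python
# Assumed definition of "hydrophilic": polar or charged residues.
# Swap in whatever residue set your database actually uses.
HYDROPHILIC = frozenset("RNDQEHKST")


def hydrophilic_fraction(sequence: str) -> float:
    """Return the fraction of residues in `sequence` found in HYDROPHILIC."""
    seq = sequence.strip().upper()
    if not seq:
        return 0.0
    hits = sum(1 for residue in seq if residue in HYDROPHILIC)
    return hits / len(seq)


def fraction_difference(query: str, hit: str) -> float:
    """Absolute difference in hydrophilic fraction between query and hit."""
    return abs(hydrophilic_fraction(query) - hydrophilic_fraction(hit))
```

A value such as `fraction_difference(query_seq, hit_seq)` could then be reported next to the e-value of each hit rather than merged with it into a single score, since how the two should be weighted is exactly the open question here.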
Can anyone point me to papers that have done something similar, or suggest methods for further analysis?
Why do you want to do this with BLAST? There is dedicated software.
Michael, I am not sure what kind of software you mean. Assuming he really has protein sequences and not protein identifiers, BLAST (tblastn) might be a good way to annotate the sequences. On the other hand, many proteomics approaches already give identifiers, in which case running BLAST doesn't make sense.
True, if he indeed has protein sequences. However, proteomics is mostly done with MS methods that give peptide sequences and not whole proteins. Inferring protein quantities from those alone is mostly guesswork IMO.
I'm using a "complete proteome" from EMBL-EBI, e.g.: http://www.ebi.ac.uk/integr8/FtpSearch.do;jsessionid=CBD21F2A85C32065DD7A40EA1823B4AE?orgProteomeId=137