Hi,
I am trying to analyze MS data for novel proteins. So far my workflow has been to convert mzML files to MGF files and then search the MGF files against a FASTA file of annotated protein sequences from UniProt (~20,000 proteins). I run the search with SearchGUI.
This gives me txt files listing the proteins that were found and the spectra that matched them. What I want to do next is search the unmatched spectra against a customised database in order to discover novel proteins, similar to how this paper (Erady, 2020) describes it:
In order to evade the increase in false-positive rates, MS data is first mapped to known proteins in UniProt database, and then the unmatched spectra are mapped to the custom proteogenomic database as done by us previously in Prabakaran et al.
This seems like a pretty common thing to do as I've seen a number of papers describe it. However, I can't figure out a way to do it. Does anyone have some experience doing this?
Unfortunately no. I've only created databases of useful and relevant organisms matching what I expected in the sample. I understand the desire not to concatenate multiple FASTA files into one huge database that would interfere with PSM probabilities. But what if some spectra actually match better to proteins in the second database? With a two-step search they would never be tested against it, because they were already assigned in the first pass. I don't really like this method. Is it commonly used?
Let me know if you come up with a solution :(
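For what it's worth, the one step that looks purely mechanical is pulling the unmatched spectra out of the MGF file so they can be searched again. Here is a minimal sketch in plain Python, not something SearchGUI provides. It assumes you can get the titles of the confidently matched spectra out of the txt export as a set of strings, and that those titles are identical to the TITLE= lines in the MGF. The function names and the example data are made up for illustration.

```python
def iter_mgf_spectra(lines):
    """Yield (title, block_lines) for each BEGIN IONS ... END IONS block."""
    block, title = None, None
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines between or inside blocks
        if stripped == "BEGIN IONS":
            block, title = [stripped], None
        elif block is not None:
            block.append(stripped)
            if stripped.startswith("TITLE="):
                title = stripped[len("TITLE="):]
            elif stripped == "END IONS":
                yield title, block
                block = None


def unmatched_mgf(lines, matched_titles):
    """Return MGF text with only the spectra whose title is NOT in matched_titles."""
    kept = [
        "\n".join(block)
        for title, block in iter_mgf_spectra(lines)
        if title not in matched_titles
    ]
    return "\n\n".join(kept) + ("\n" if kept else "")


# Tiny made-up example: two spectra, one of which was matched in the first search.
example_mgf = """BEGIN IONS
TITLE=scan_1
PEPMASS=500.25
CHARGE=2+
100.1 10
200.2 20
END IONS

BEGIN IONS
TITLE=scan_2
PEPMASS=600.30
CHARGE=2+
150.1 5
END IONS
"""

matched = {"scan_1"}  # hypothetical: titles taken from the first-pass results
remaining = unmatched_mgf(example_mgf.splitlines(), matched)
print(remaining)
```

For real files you would read the lines from the MGF on disk and write `remaining` to a new MGF, then point the second search at that file and the custom database. Two things to check before trusting it: whether the titles in the results export are written exactly as in the MGF (some tools re-encode them), and which spectra you count as "matched", since that should probably be only the PSMs that pass your FDR threshold.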