Hey there, I am facing an issue that must be as old as time.
I have a list of hits with UniProt IDs (edit: identified by mass spectrometry by a platform). When I use UniProt's ID mapping, DAVID, or your favorite mapper, it returns multiple matches for many IDs. For example, the H4_HUMAN protein is associated with 14 genes/GeneIDs, and SMN_HUMAN is associated with both SMN1 and SMN2.
Question 1. For downstream analysis, you usually have to select a single gene, sequence, etc. How can I objectively "choose the best one" without looking through every individual duplicate, and how can I justify that choice?
Question 2. What would be the best way to reduce the information to a single table with one line per entry, ideally with the duplicated GeneIDs separated by commas in a single cell? I am more fluent in Excel but willing to get my hands dirty in R. I have done it by hand before, but I would prefer an automated, streamlined process.
Cheers!
Can you give some insight into how you came into possession of these IDs? What kind of analysis are you doing? H4_HUMAN points to the histone H4 genes, of which there are plenty of copies, so that is why you have multiple genes associated with that UniProt ID.
Sure! I have a list of proteins identified by MS and provided by a platform.
So you are unlikely to know which exact gene copy a hit came from, since the spectra do not represent full-length proteins. You can simply use "Histone H4" as the name entry and leave it at that.
Yes, absolutely. However, I would like to use a single ID for each hit in various downstream applications that require a specific database ID, typically a GeneID.
In that case you could simply pick one (first one if you like) and make a note that you are making that choice arbitrarily.
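For the one-line-per-entry table from Question 2, here is a minimal sketch in Python (standard library only, not R). It assumes the mapper output is a two-column, tab-separated list of protein ID and gene ID with one pair per line; the function name `collapse_mapping`, the column names, and the `EXAMPLE_HUMAN`/`GENE_X` row are made up for illustration. It joins all gene IDs for an entry with commas and keeps the first one seen as the arbitrarily chosen representative, as suggested above.

```python
import csv
import io


def collapse_mapping(pairs):
    """Collapse (protein_id, gene_id) pairs to one record per protein_id.

    Gene IDs are de-duplicated and kept in first-seen order; the first one
    is reported as an arbitrary representative (a convention, not a ranking).
    """
    grouped = {}  # plain dicts keep insertion order in Python 3.7+
    for protein_id, gene_id in pairs:
        genes = grouped.setdefault(protein_id, [])
        if gene_id and gene_id not in genes:
            genes.append(gene_id)
    return [
        {
            "protein_id": protein_id,
            "gene_ids": ",".join(genes),
            "representative": genes[0] if genes else "",
            "n_genes": len(genes),
        }
        for protein_id, genes in grouped.items()
    ]


# Hypothetical mapper output: protein ID <TAB> gene ID, one pair per line.
mapping_tsv = (
    "SMN_HUMAN\tSMN1\n"
    "SMN_HUMAN\tSMN2\n"
    "EXAMPLE_HUMAN\tGENE_X\n"
    "SMN_HUMAN\tSMN1\n"
)
pairs = [tuple(row) for row in csv.reader(io.StringIO(mapping_tsv), delimiter="\t")]
table = collapse_mapping(pairs)

# Write the collapsed table as tab-separated text (opens directly in Excel).
out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=["protein_id", "gene_ids", "representative", "n_genes"],
    delimiter="\t",
)
writer.writeheader()
writer.writerows(table)
print(out.getvalue())
```

The `n_genes` column makes it easy to see afterwards which entries had an arbitrary pick, so that choice can be noted in the methods. To run it on a real file, replace the `io.StringIO(...)` objects with opened files.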