Question

How to deal with duplicates between two database IDs?

16 months ago

benoahb ▴ 40

Hey there, I'm facing an issue that must be as old as time.

I have a list of hits with UniProt IDs (edit: identified by mass spectrometry by a platform). When I use UniProt's ID mapping, DAVID, or your favorite mapper, many IDs come back with duplicates. For example, the H4_HUMAN protein is associated with 14 genes/GeneIDs, and SMN_HUMAN is associated with SMN1 and SMN2.

Question 1. For downstream analysis, you usually have to select one gene/sequence etc. How can I objectively "choose the best one" without looking through all the individual duplicates, and how can I justify that choice?

Question 2. What would be the best way to reduce the information to a single table, with one line per entry, ideally listing the duplicated GeneIDs separated by commas in a single cell? I am more fluent in Excel but willing to get my hands dirty in R. I have done it by hand before, but I would prefer an automated, streamlined process.

Cheers!

Identifiers Database • 1.8k views

Can you give some insight into where these IDs came from and what kind of analysis you are doing? H4_HUMAN points to histone genes, of which there are plenty of copies, which is why you have multiple genes associated with that UniProt ID.

ADD REPLY • link 16 months ago by GenoMax 148k

Sure! I have a list of proteins identified by MS and provided by a platform.

ADD REPLY • link 16 months ago by benoahb ▴ 40

So you are unlikely to know which exact copy it came from, since the spectra do not represent full-length proteins. You can simply use Histone H4 as the name entry and leave it at that.

ADD REPLY • link 16 months ago by GenoMax 148k

Yes, absolutely. However I would like to use "a" single ID for each given hit in various downstream applications that require a specific database ID, typically a GeneID.

ADD REPLY • link 16 months ago by benoahb ▴ 40

However I would like to use "a" single ID for each given hit

In that case you could simply pick one (the first one, if you like) and make a note that you are making that choice arbitrarily.
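If you go this route, a minimal base R sketch of "pick the first and flag it" might look like the following. The `mapping` data frame, its column names, the `H4_gene_*` placeholders and the `PROT_X`/`GENE_X` entry are all hypothetical stand-ins for whatever your ID mapper exports; only the SMN1/SMN2 pair comes from the question.

```r
# Hypothetical mapper output: one row per UniProt entry / gene pair.
# Gene names other than SMN1/SMN2 are placeholders, not real identifiers.
mapping <- data.frame(
  uniprot = c("H4_HUMAN", "H4_HUMAN", "H4_HUMAN",
              "SMN_HUMAN", "SMN_HUMAN", "PROT_X"),
  gene    = c("H4_gene_1", "H4_gene_2", "H4_gene_3",
              "SMN1", "SMN2", "GENE_X"),
  stringsAsFactors = FALSE
)

# Keep the first listed gene for each UniProt entry.
first_hit <- mapping[!duplicated(mapping$uniprot), ]

# Record how many candidates there were, so the arbitrary picks stay visible.
n_candidates <- table(mapping$uniprot)
first_hit$n_candidates <- as.integer(n_candidates[first_hit$uniprot])
first_hit$arbitrary_choice <- first_hit$n_candidates > 1

print(first_hit)
```

Keeping the `arbitrary_choice` column in the table makes it easy to report in the methods which hits were resolved arbitrarily, and to treat them with care downstream.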

ADD REPLY • link 16 months ago by GenoMax 148k

score 1 · Answer 1 · 2023-08-16

Question 1: I have no experience with protein MS, but it may well be that you just can't determine the protein source more precisely. If the measured fragment is present in multiple isoforms, you simply can't tell without additional data. If you have such data, for example gene expression data from your cells, you could use it to eliminate the candidates that are very unlikely to be expressed, but you should state this clearly in your methods. Also consider the type of analysis you are doing. For some analyses, there might be no need to narrow it down any further.
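As a rough illustration of the expression-based filtering idea, here is a base R sketch. Everything in it is assumed for the example: the `mapping` and `expression` tables, their column names, the made-up TPM values, and the `min_tpm` cutoff, which you would need to choose and justify for your own data.

```r
# Hypothetical mapper output (gene names other than SMN1/SMN2 are placeholders).
mapping <- data.frame(
  uniprot = c("H4_HUMAN", "H4_HUMAN", "H4_HUMAN",
              "SMN_HUMAN", "SMN_HUMAN", "PROT_X"),
  gene    = c("H4_gene_1", "H4_gene_2", "H4_gene_3",
              "SMN1", "SMN2", "GENE_X"),
  stringsAsFactors = FALSE
)

# Hypothetical expression table for the same cells; the values are made up.
expression <- data.frame(
  gene = c("H4_gene_1", "H4_gene_2", "H4_gene_3", "SMN1", "SMN2", "GENE_X"),
  tpm  = c(50, 0, 0.2, 12, 8, 30),
  stringsAsFactors = FALSE
)

min_tpm <- 1  # assumed cutoff for "likely expressed"

# Attach expression to each candidate; genes without a measurement become NA.
annotated <- merge(mapping, expression, by = "gene", all.x = TRUE)

# Drop candidates that are unmeasured or below the cutoff.
expressed <- annotated[!is.na(annotated$tpm) & annotated$tpm >= min_tpm, ]

# Entries that still have more than one plausible gene after filtering.
remaining <- table(expressed$uniprot)
still_ambiguous <- names(remaining)[remaining > 1]

print(expressed)
print(still_ambiguous)
```

In this toy data the filter resolves H4_HUMAN to a single candidate but leaves SMN_HUMAN ambiguous, which is the kind of case where an arbitrary, documented choice is still needed.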

Question 2: I suppose it is also possible in Excel, but in R you do it like this:

Base R:

aggregate(. ~ Species, data = iris, FUN = paste, collapse = ",")

Tidyverse:

library(dplyr)
iris %>% group_by(Species) %>% summarise(across(everything(), ~ paste(.x, collapse = ",")))

The iris dataset is just used as a default example; you can also use palmerpenguins or the like.
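Applied to an ID mapping table rather than iris, the same base R approach could look roughly like this. The `mapping` data frame and its column names are assumptions about what the mapper export looks like, and the gene names other than SMN1/SMN2 are placeholders.

```r
# Hypothetical mapper output: one row per UniProt entry / gene pair.
mapping <- data.frame(
  uniprot = c("H4_HUMAN", "H4_HUMAN", "H4_HUMAN",
              "SMN_HUMAN", "SMN_HUMAN", "PROT_X"),
  gene    = c("H4_gene_1", "H4_gene_2", "H4_gene_3",
              "SMN1", "SMN2", "GENE_X"),
  stringsAsFactors = FALSE
)

# One row per UniProt entry, with all mapped genes joined by commas.
collapsed <- aggregate(
  gene ~ uniprot,
  data = mapping,
  FUN  = function(x) paste(unique(x), collapse = ",")
)

# Count the genes per entry so multi-mapped hits are easy to spot.
collapsed$n_genes <- lengths(strsplit(collapsed$gene, ",", fixed = TRUE))

print(collapsed)
```

To continue in Excel, the result can be saved with `write.csv(collapsed, "collapsed_mapping.csv", row.names = FALSE)` (the file name is just an example); `write.csv` quotes the comma-separated cell, so it stays in a single column when opened.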

score 0 · Answer 2 · 2023-08-16

16 months ago

barslmn ★ 2.3k

Both SMN1 and SMN2 play a role in the phenotype. While SMN1 causes the disease, the number of SMN2 copies can impact its severity. Those associations are not redundant.

I don't think there is a way of selecting a gene, let alone an "objective" way. You can prioritize transcripts such as MANE or Ensembl's canonical ones, but you must keep in mind the tissue and time point you're looking at.

It would be better to first identify the causes of the duplications and make a decision based on that.

The issue is that the identification is being done using spectral information (which is for a peptide fragment). It appears that the program used for the search relies on Swiss-Prot as a reference, where multiple genes can be annotated under one entry.

The only way around this issue would be to do the initial spectral search against a database other than Swiss-Prot. One could use MANE transcripts, but depending on how much sequence is shared by SMN1/SMN2, a peptide fragment may still map to multiple proteins.

ADD REPLY • link 16 months ago by GenoMax 148k

I hadn't refreshed to see the edit. I'm not that familiar with MS, but I am guessing that differentiating SMN1 and SMN2 from peptide fragments might require a different experimental approach.

ADD REPLY • link 16 months ago by barslmn ★ 2.3k

That's entirely true. From a quick search: "SMN1 is the disease gene because it produces FL SMN protein. The SMN2 allele is the disease-modifying gene because of a single nucleotide difference in exon 7 that results in alternative processing of its mRNA and editing out of exon 7." I reckon the chances of distinguishing those two are very slim, and by extension the same goes for any of the other hits with multiple associated genes, regardless of the database used.

At this point, I guess that I should just pick one at random and tread carefully in any downstream application focusing on those hits.

ADD REPLY • link 16 months ago by benoahb ▴ 40