Question

Which Ensembl protein id (ENSP) should I use? (Which id is used by string-db)


9.6 years ago

Soheil ▴ 110

Hi,

I have a list of gene names (symbols) and I want to convert them into Ensembl protein IDs (ENSP).

There are many tools, such as BioMart, DAVID and bioDBnet. However, all of them return multiple Ensembl protein IDs for a single gene, probably because a gene can have several transcripts (alternative splicing). If I want to use one of these IDs, which one should I use?

I need these IDs to extract PPI networks from the string-db database files.

string-db uses Ensembl protein IDs in its database files, and I don't know which Ensembl ID it uses for each gene.

Does anyone have any idea?

gene symbol ensembl id string-db id convert • 14k views



7.3 years ago

Nitro_Shade ▴ 30

Hey!

Not sure if it's still relevant or not, but I made a very small utility to do this extraction. It can be found here.



7.3 years ago

Abhik ▴ 30

I found that the protein alias file does not contain gene symbols. One way to convert an ENSP to a HUGO (HGNC) gene symbol is the biomaRt script below:

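library(biomaRt)

# Connect to the GRCh37 Ensembl BioMart and select the human gene dataset.
# Newer biomaRt releases may require the host as "https://grch37.ensembl.org".
mart <- useMart(host = "grch37.ensembl.org", biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
mart <- useDataset("hsapiens_gene_ensembl", mart = mart)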

ensembl_genes <- "ENSP00000000233"

gene_names <- getBM(
    filters= "ensembl_peptide_id", 
    attributes= c("ensembl_peptide_id","hgnc_symbol","description"),
    values= ensembl_genes,
    mart= mart)

ensembl_peptide_id hgnc_symbol                                            description
> ENSP00000000233        ARF5 ADP-ribosylation factor 5 [Source:HGNC Symbol;Acc:658]

Hope this helps.



7.3 years ago

mahmoud.s.fahmy • 0

The previous answers would work. An easier way is to use the get_aliases method of the STRINGdb R package directly.


Ram · Accepted Answer · 2015-04-26

I am assuming that you want to build a network from STRING data. string-db provides PPI interaction data, so its files describe the network at the protein level (ENSPs), which you cannot use directly to build a network of gene symbols. You do not need to go to any site other than string-db to obtain the ENSP-to-gene-symbol mappings. Here's what you should do:

1. On the string download page, select the organism for which you want to download the data, then look at the "General flatfiles & full database dumps" section. The list includes a "protein aliases" file link; download that file.
2. The file contains the species id, protein id (ENSP), alias (the gene symbol is found in this column) and source. Because the file holds many other identifier mappings as well, filter on the source column, using sources such as BLAST_UniProt_GN, Ensembl_UniProt_GN or any other source you want to include, to keep only the lines that map an ENSP to a gene symbol. Note that a gene symbol may map to more than one ENSP, and hence to multiple rows.
3. Once you have the curated mapping list, use the string "protein links" file to obtain the network interaction data and replace the protein identifiers in that file with their mapped gene symbols. You now have the string data in terms of gene symbols.
4. Create a network using your list of gene symbols.
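The filtering and replacement steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the two inline excerpts are made up (only the ENSP00000000233/ARF5 pair comes from this thread; the other identifiers, symbols and scores are hypothetical placeholders), the aliases file is assumed to be tab-separated with the four columns described above, and the links file is assumed to be space-separated with a taxon-prefixed id such as `9606.ENSP...`. The exact layout differs between STRING releases, so check the header of the files you download.

```python
import csv
import io

# Hypothetical excerpts standing in for the downloaded files (assumed layout).
ALIASES_TSV = (
    "## species_ncbi_taxon_id\tprotein_id\talias\tsource\n"
    "9606\tENSP00000000233\tARF5\tBLAST_UniProt_GN Ensembl_UniProt_GN\n"
    "9606\tENSP00000000233\tSOME_ACCESSION\tEnsembl_UniProt_AC\n"
    "9606\tENSP_PLACEHOLDER_B\tGENEB\tEnsembl_UniProt_GN\n"
)
LINKS_TXT = (
    "protein1 protein2 combined_score\n"
    "9606.ENSP00000000233 9606.ENSP_PLACEHOLDER_B 490\n"
    "9606.ENSP00000000233 9606.ENSP_PLACEHOLDER_C 150\n"
)

# Sources named in the answer as carrying gene symbols.
GENE_NAME_SOURCES = {"BLAST_UniProt_GN", "Ensembl_UniProt_GN"}


def load_symbol_map(alias_text, wanted_sources=GENE_NAME_SOURCES):
    """Return {ENSP: gene symbol}, using only rows from gene-name sources."""
    mapping = {}
    for row in csv.reader(io.StringIO(alias_text), delimiter="\t"):
        # Skip blank lines, comment/header lines and malformed rows.
        if not row or row[0].startswith("#") or len(row) < 4:
            continue
        _species, protein_id, alias, source = row[:4]
        # The source field may list several sources separated by spaces.
        if wanted_sources & set(source.split()):
            mapping.setdefault(protein_id, alias)  # keep the first symbol seen
    return mapping


def links_as_symbols(links_text, mapping):
    """Yield (symbol1, symbol2, score) for links whose two ends are both mapped."""
    for line in links_text.splitlines()[1:]:  # skip the header line
        p1, p2, score = line.split()
        # Drop the "taxon." prefix, e.g. "9606.ENSP..." -> "ENSP..."
        s1 = mapping.get(p1.split(".", 1)[-1])
        s2 = mapping.get(p2.split(".", 1)[-1])
        if s1 and s2:
            yield s1, s2, int(score)


symbol_map = load_symbol_map(ALIASES_TSV)
edges = list(links_as_symbols(LINKS_TXT, symbol_map))
print(edges)
```

For real downloads you would read the (usually gzip-compressed) files line by line instead of from inline strings. Links with an unmapped end are dropped here; whether to drop them or keep the raw ENSP is a choice that depends on your analysis.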

Note: It seems that the string-db files are currently being updated to v10. If a file is not available right now, check again after some time to get the updated data.