I have a list of gene names (symbols) and I want to convert them into Ensembl protein IDs (ENSP).
There are many tools, such as BioMart, DAVID, and bioDBnet. However, all of them return multiple Ensembl protein IDs for a single gene, probably because a gene can have several transcripts (splice variants). If I want to use only one of these IDs, which one should I use?
In fact, I need these IDs to extract PPI networks from the STRING (string-db) database files.
STRING uses Ensembl protein IDs in its database files, and I don't know which Ensembl ID it uses for each gene.
I am assuming that you want to create a network using STRING data. STRING provides protein-protein interaction (PPI) data, so its files describe the network at the level of individual protein products (ENSPs), which you cannot use directly to build a network of gene symbols. You do not need to go to any site other than string-db to obtain the ENSP-to-gene-symbol mappings. Here's what you should do:
On the STRING download page, select the organism for which you want to download the data and then look at the "General flatfiles & full database dumps" section. You will see a "protein aliases" file link in the list; download that file.
It contains four columns: species ID, protein ID (ENSP), alias (the gene symbol is found in this column), and source. The file holds many other identifier mappings as well, so filter on the source column, using sources such as BLAST_UniProt_GN, Ensembl_UniProt_GN, or any other source you want to include, to keep only the lines that map ENSPs to gene symbols. Note that a gene symbol may map to multiple ENSPs and will then appear in multiple rows.
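The filtering step can be sketched in Python as below. This is only a minimal sketch, assuming the aliases file is tab-separated with the four columns described above, that comment/header lines start with "#", and that the source column may list several sources separated by spaces; check these assumptions against the file you actually download. The function name load_symbol_map and the toy rows (ENSP_TOY_1, GENE_A, Some_Other_Source, etc.) are invented for illustration and are not real STRING entries.

```python
import io
from collections import defaultdict

# Sources named in the answer; extend this set as needed.
WANTED_SOURCES = {"BLAST_UniProt_GN", "Ensembl_UniProt_GN"}

def load_symbol_map(handle, wanted_sources=WANTED_SOURCES):
    """Return {protein_id: set of gene symbols} from an aliases table.

    Assumed layout (tab-separated): species_id, protein_id, alias, source.
    """
    ensp_to_symbols = defaultdict(set)
    for line in handle:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _species, protein_id, alias, source = fields[:4]
        # Keep the row if any of its sources is one we trust for gene symbols.
        if wanted_sources & set(source.split()):
            ensp_to_symbols[protein_id].add(alias)
    return dict(ensp_to_symbols)

# Toy input with invented identifiers, standing in for the real file.
sample = io.StringIO(
    "## species_id\tprotein_id\talias\tsource\n"
    "9606\tENSP_TOY_1\tGENE_A\tEnsembl_UniProt_GN\n"
    "9606\tENSP_TOY_1\tX12345\tSome_Other_Source\n"
    "9606\tENSP_TOY_2\tGENE_A\tBLAST_UniProt_GN Ensembl_UniProt_GN\n"
    "9606\tENSP_TOY_3\tGENE_B\tBLAST_UniProt_GN\n"
)
mapping = load_symbol_map(sample)
print(mapping)
```

For a real file you would pass an open file handle (for example from gzip.open(path, "rt") if the download is gzipped) instead of the toy io.StringIO object.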
Once you have the curated mapping list, use the STRING "protein links" file to obtain the interaction data and replace the protein identifiers in that file with their mapped gene symbols. You now have the STRING data in terms of gene symbols.
Finally, create the network by keeping the interactions that involve the gene symbols in your list.
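The last two steps (translating the links file and restricting it to your genes) might look like the following. Again this is a sketch under assumptions: the links file is taken to be whitespace-separated with the two protein identifiers in the first two columns and an integer score in the third, and identifiers may carry a numeric taxon prefix such as "9606." that has to be stripped before lookup. The helper names and all toy identifiers are hypothetical. When several ENSP pairs collapse onto the same pair of symbols, this sketch keeps the highest score, which is one possible choice and not something STRING prescribes.

```python
import io

def strip_taxon(protein_id):
    """Drop a leading numeric 'taxon.' prefix (e.g. '9606.') if present."""
    head, sep, tail = protein_id.partition(".")
    return tail if sep and head.isdigit() else protein_id

def symbol_edges(links_handle, ensp_to_symbols, genes_of_interest):
    """Return {(symbol_a, symbol_b): score} for edges among the given genes."""
    wanted = set(genes_of_interest)
    edges = {}
    for line in links_handle:
        fields = line.split()
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip the header line and malformed rows
        symbols_a = ensp_to_symbols.get(strip_taxon(fields[0]), ())
        symbols_b = ensp_to_symbols.get(strip_taxon(fields[1]), ())
        score = int(fields[2])
        for a in symbols_a:
            for b in symbols_b:
                if a == b or a not in wanted or b not in wanted:
                    continue  # drop self-loops and genes outside the list
                key = tuple(sorted((a, b)))  # undirected edge
                edges[key] = max(score, edges.get(key, 0))
    return edges

# Toy mapping and links data with invented identifiers.
toy_mapping = {
    "ENSP_TOY_1": {"GENE_A"},
    "ENSP_TOY_2": {"GENE_A"},
    "ENSP_TOY_3": {"GENE_B"},
    "ENSP_TOY_4": {"GENE_C"},
}
toy_links = io.StringIO(
    "protein1 protein2 combined_score\n"
    "9606.ENSP_TOY_1 9606.ENSP_TOY_3 700\n"
    "9606.ENSP_TOY_2 9606.ENSP_TOY_3 900\n"
    "9606.ENSP_TOY_3 9606.ENSP_TOY_4 500\n"
    "9606.ENSP_TOY_1 9606.ENSP_TOY_2 800\n"
)
edges = symbol_edges(toy_links, toy_mapping, ["GENE_A", "GENE_B"])
print(edges)
```

The resulting dictionary of symbol pairs can be written out as an edge list or loaded into whatever network tool you prefer.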
Note: The string-db files currently appear to be in the process of being updated to v10. If a file is not available right now, check again later to get the updated data.
updated 2.4 years ago by Ram • written 9.6 years ago by Uma A