Question

Loading the ID mapping file from Uniprot

2.6 years ago

MB • 0

I am trying to find a downloadable file that lists STRING IDs (e.g. 9606.ENSP00000293677) together with their corresponding entry IDs (e.g. CASPE_HUMAN). The help desk told me that the information I need is in this UniProt ID mapping file: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping.dat.gz

After downloading the .dat file, I tried to load it into a pandas DataFrame, expecting to see more than 15 columns holding the various identifiers for each UniProt entry. Instead, pandas gives me a DataFrame with only three columns, and the last method I tried just gave me an array of numbers.

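import numpy as np
import pandas as pd
from io import StringIO

# Attempt 1: three columns; without header=None the first data row is used as the header
df3 = pd.read_csv('HUMAN_9606_idmapping.dat', sep='\t', nrows=10)

# Attempt 2: StringIO wraps the file name itself, so pandas parses that string, not the file
df3 = pd.read_csv(StringIO('HUMAN_9606_idmapping.dat'),
                  sep="\t",
                  index_col=0,  # use the first column as the index
                  header=None)  # the file has no header row

# Attempt 3: genfromtxt defaults to float, so text fields come back as nan
df3 = np.genfromtxt('HUMAN_9606_idmapping.dat', unpack=True, delimiter='\t')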

If anyone can help me, I would be very appreciative!

ENSEMBL python UniProt Uniprot • 1.2k views

updated 2.6 years ago by Elisabeth Gasteiger ★ 2.4k • written 2.6 years ago by MB • 0

score 0 · Answer 1 · 2022-08-08

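The file has three tab-separated columns per line (UniProtKB accession, identifier type, identifier), with one line per identifier rather than one wide row per entry. If you extract all lines from this file that contain "STRING", you will obtain a list of this form: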

P31946  STRING  9606.ENSP00000361930
P62258  STRING  9606.ENSP00000264335
Q04917  STRING  9606.ENSP00000248975
P61981  STRING  9606.ENSP00000306330
P31947  STRING  9606.ENSP00000340989
P27348  STRING  9606.ENSP00000371267
P63104  STRING  9606.ENSP00000379287
Q96QU6  STRING  9606.ENSP00000263776
Q4AC99  STRING  9606.ENSP00000368109
Q15172  STRING  9606.ENSP00000261461
Q15173  STRING  9606.ENSP00000164133
Q14738  STRING  9606.ENSP00000417963
Q16537  STRING  9606.ENSP00000337641
Q13362  STRING  9606.ENSP00000412324
P30153  STRING  9606.ENSP00000324804
P30154  STRING  9606.ENSP00000311344
P63151  STRING  9606.ENSP00000325074
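
To get from that list to the pairing asked for in the question, one option is to also collect the "UniProtKB-ID" lines (the entry names, such as CASPE_HUMAN) and join the two on the accession in the first column. The following is only a minimal sketch using the standard library; the function name string_to_entry_name is made up for illustration, and the in-memory sample simply mimics the three-column layout so the snippet runs without the download.

```python
import io


def string_to_entry_name(lines):
    """Pair each STRING ID with the UniProtKB entry name of its accession.

    `lines` is any iterable of text lines in the three-column layout:
    accession <TAB> identifier type <TAB> identifier.
    """
    entry_names = {}  # accession -> entry name, e.g. "1433B_HUMAN"
    string_ids = []   # (accession, STRING ID) pairs, in file order

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        accession, id_type, value = fields
        if id_type == "UniProtKB-ID":
            entry_names[accession] = value
        elif id_type == "STRING":
            string_ids.append((accession, value))

    # One row per STRING ID; the entry name is None if no
    # UniProtKB-ID line was seen for that accession.
    return [(sid, acc, entry_names.get(acc)) for acc, sid in string_ids]


# Small illustrative sample in the same layout as the real file.
sample = io.StringIO(
    "P31946\tUniProtKB-ID\t1433B_HUMAN\n"
    "P31946\tGene_Name\tYWHAB\n"
    "P31946\tSTRING\t9606.ENSP00000361930\n"
    "P62258\tUniProtKB-ID\t1433E_HUMAN\n"
    "P62258\tSTRING\t9606.ENSP00000264335\n"
)
rows = string_to_entry_name(sample)
for row in rows:
    print(row)
```

For the real file, the same function should work on an open text handle, for example one obtained with gzip.open("HUMAN_9606_idmapping.dat.gz", "rt"). If a DataFrame is wanted, the returned rows can be passed to pd.DataFrame(rows, columns=["string_id", "accession", "entry_name"]); those column names are just a suggestion.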