Hello friends,
I have a dataset of genes associated with drugs derived from DrugBank. I wish to simply translate all the drugbank IDs to drug names readable by a human.
The drugbank vocabulary (.csv) looks like this: [DBvocabulary.csv]
DrugBank.ID,Common.name
DB00001,Lepirudin
DB00002,Cetuximab
DB00003,Dornase alfa
DB00004,Denileukin diftitox
DB00005,Etanercept
DB00006,Bivalirudin
My dataset (.csv) has 15 columns but the important ones are:
[all_ph_active.csv]
Gene.Name,DrugBank.ID
F8,DB09130
TCN2,DB00200
LDLR,DB09270; DB11251; DB14003
ALB,DB00070; DB00137; DB00159; DB00162; DB00214
As you can see some genes are linked to multiple or even hundreds of drugs. the multiple Drug IDs are separated by semicolons, in the same comma-delimited "column" The R studio match or merge function only work for the first identifier in each column, thus effectively deleting the remainder in the same column "cell".
Any advice is welcome, thanks in advance!
There are several ways (like join etc). But I would suggest you sqldf package in R. However, these can be done outside R as well.