My data set looks like this:
symbol synonyms
ACP2 1_8U;DSPA2b;IP15
ACTR2 12CC4;HSPC215;MFH;QRF1;hFKH1B
ADAM15 19A;CD319;CRACC;CS1
ADAT2 1R20;BL34;HEL_S_87;IER1;IR20
ADCY3 3_HAO;HAO
What I want to do is split the synonyms column so that I can see how many times each individual synonym occurs.
So after I call summary on the dataframe, I want to end up with something like this:
symbol synonyms hgnc_id
ACP2 : 1 NA's : 65 NA's : 16
ACTR2 : 1 1_8U : 9 HGNC:1001 : 1
ADAM15 : 1 12CC4 : 2 HGNC:10223: 1
ADAT2 : 1 19A : 21 HGNC:10433: 1
Instead of what I get now, which is:
symbol synonyms hgnc_id
ACP2 : 1 NA's : 65 NA's : 16
ACTR2 : 1 1_8U;DSPA2b;IP15 : 1 HGNC:1001 : 1
ADAM15 : 1 12CC4;HSPC215;MFH;QRF1;hFKH1B: 1 HGNC:10223: 1
ADAT2 : 1 19A;CD319;CRACC;CS1 : 1 HGNC:10433: 1
ADCY3 : 1 1R20;BL34;HEL_S_87;IER1;IR20 : 1 HGNC:10449: 1
ADO : 1 3_HAO;HAO : 1 HGNC:10473: 1
I'm loading my data like this:
df <- read.csv("synonyms.csv", header = T, sep = '\t')
And I've been playing around with this:
dt <- read.table(df)
out <- dt[, list(synonyms=unlist(strsplit(synonyms, char))), by=symbol]
But I get this error:
Error in `[.data.frame`(dt, , list(synonyms = unlist(strsplit(synonyms, : unused argument (by = symbol)
Any help?
There are several things I don't understand. First, I think in the expected output ACP2 should map to 1_8U (or your input file is wrong). Second, I have no idea how the counts are calculated. In any case, I think you are looking for something that can be done with unnest from the tidyr package.
It's just sample data. I copied and pasted what I found at random, just to give the basic idea.
Let's start with ACP2. What would the synonyms count be in our output? How would it be calculated? Why is ACP2 associated with 1_8U in the input, and with ACTR2 in the output?
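To make the unnest suggestion concrete, here is a minimal sketch. It assumes the separator is ";" (as in the sample rows) and uses a small made-up data frame rather than the real file, so treat it as an illustration, not a tested solution for the full data.

```r
library(tidyr)

# Small stand-in for the real file (values taken from the sample rows above)
df <- data.frame(
  symbol   = c("ACP2", "ACTR2"),
  synonyms = c("1_8U;DSPA2b;IP15", "12CC4;HSPC215"),
  stringsAsFactors = FALSE
)

# Turn each "a;b;c" string into a character vector (a list column) ...
df$synonyms <- strsplit(df$synonyms, ";", fixed = TRUE)

# ... then give every synonym its own row
long <- unnest(df, synonyms)

# Frequency of each individual synonym
syn_counts <- table(long$synonyms)
```

After this, long has one row per (symbol, synonym) pair, so summary() or table() works on single synonyms instead of on the joined strings.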
It is unclear how the code you show produces the output you mention. It would be better to clearly write the code used, the output of that code, and the output you want. Also, don't forget to give the values of the variables you use. For example, what is the value of "char" in the call to strsplit()? Why do you read the data with read.csv() and then pass that to read.table()? Another comment: if I understood your intention, you are interested in the synonyms' counts. In that case it would be better to include example data that is actually similar to the real data and contains synonyms that appear more than once. Right now all of them are unique. But I may have misunderstood your goal.
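For what it's worth, the by = argument inside [ ] is data.table syntax, which would explain the "unused argument (by = symbol)" error on a plain data.frame. A rough sketch of what the snippet may have been aiming for, assuming "char" was meant to be ";" and using made-up inline data instead of the real file:

```r
library(data.table)

# Stand-in for the real data; with the actual file, something like
# as.data.table(df) on the result of read.csv() would be the starting point.
dt <- data.table(
  symbol   = c("ACP2", "ACTR2", "ADCY3"),
  synonyms = c("1_8U;DSPA2b;IP15", "12CC4;HSPC215;MFH", "3_HAO;HAO")
)

# One row per synonym, grouped by gene symbol.
# as.character() guards against the column having been read as a factor.
out <- dt[, list(synonyms = unlist(strsplit(as.character(synonyms), ";", fixed = TRUE))),
          by = symbol]
```

The only real changes from the original snippet are that dt is a data.table rather than a data.frame and that the separator is spelled out.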
Okay, the data I copied-pasted are wrong. I can't copy-paste 41000+ genes and their synonyms in this thread for obvious reasons.
I just took random genes and random synonyms, to present an example.
The file is generated by a Python script, so I can't post the code for that. What it basically does is iterate over a list of genes that I was given, compare them to a database I found, and append the synonyms to each gene.
What I want to do is split the column with the synonyms so I can determine how many times each synonym appears.
For example, if a gene named DSPA2b appears as a synonym 8 times, I would like to know this, and to be able to see which genes it is a synonym of.
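In other words, something along the lines of the base R sketch below. The gene names GENE_A to GENE_C are made up for illustration, the separator is assumed to be ";", and rows without synonyms are assumed to be NA.

```r
# Made-up example in which DSPA2b is a synonym of two different genes
df <- data.frame(
  symbol   = c("GENE_A", "GENE_B", "GENE_C"),
  synonyms = c("DSPA2b;IP15", "DSPA2b;HAO", NA),
  stringsAsFactors = FALSE
)

# Split every synonyms string on ";" (an NA entry stays a single NA)
parts <- strsplit(df$synonyms, ";", fixed = TRUE)

# Long format: repeat each symbol once per synonym it has
long <- data.frame(
  symbol  = rep(df$symbol, lengths(parts)),
  synonym = unlist(parts),
  stringsAsFactors = FALSE
)
long <- long[!is.na(long$synonym), ]  # drop genes without synonyms

# How many times each synonym appears, most frequent first
counts <- sort(table(long$synonym), decreasing = TRUE)

# Which genes each synonym belongs to
genes_by_synonym <- split(long$symbol, long$synonym)
```

Here counts[["DSPA2b"]] would give the number of occurrences, and genes_by_synonym[["DSPA2b"]] would list the genes it is attached to.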