Colapse column values to multiple rows for further analysis.
1
1
Entering edit mode
8.2 years ago
Jack ▴ 120

my data set is like this

  symbol             synonyms   
  ACP2               1_8U;DSPA2b;IP15 
  ACTR2              12CC4;HSPC215;MFH;QRF1;hFKH1B
  ADAM15             19A;CD319;CRACC;CS1
  ADAT2              1R20;BL34;HEL_S_87;IER1;IR20
  ADCY3              3_HAO;HAO

And what I want to do is split all the synonyms rows, in a way that I will be able to see how many Synonyms occur how many times.

So after I call summary on the dataframe, I want to end up with something like this:

 symbol        synonyms                            hgnc_id
 ACP2   :  1   NA's                         : 65   NA's       : 16  
 ACTR2  :  1   1_8U                         :  9   HGNC:1001 :  1  
 ADAM15 :  1   12CC4                        :  2   HGNC:10223:  1  
 ADAT2  :  1   19A                          :  21  HGNC:10433:  1

Instead of what I get now, which is:

 symbol        synonyms                            hgnc_id   
 ACP2   :  1   NA's                         : 65   NA's      : 16  
 ACTR2  :  1   1_8U;DSPA2b;IP15             :  1   HGNC:1001 :  1  
 ADAM15 :  1   12CC4;HSPC215;MFH;QRF1;hFKH1B:  1   HGNC:10223:  1  
 ADAT2  :  1   19A;CD319;CRACC;CS1          :  1   HGNC:10433:  1  
 ADCY3  :  1   1R20;BL34;HEL_S_87;IER1;IR20 :  1   HGNC:10449:  1  
 ADO    :  1   3_HAO;HAO                    :  1   HGNC:10473:  1

I'm loading my code like this:

df <- read.csv("synonyms.csv", header = T, sep = '\t')

And I've been playing around with this

dt <- read.table(df)
out <- dt[, list(synonyms=unlist(strsplit(synonyms, char))), by=symbol]

But I get this error:

Error in `[.data.frame`(dt, , list(synonyms = unlist(strsplit(synonyms,  : unused argument (by = symbol)

Any help?

r csv • 2.3k views
ADD COMMENT
1
Entering edit mode

several things I don't understand. First, I think in the expected output ACP2 should map to 1_8U (or your input file is wrong). Second, no idea how the counts are calculated. In any case, I think you are looking at something that can be done with unnest from the tidyr package.

ADD REPLY
0
Entering edit mode

It's just sample data. I copied pasted what I found at random. Just to give the basic idea.

ADD REPLY
0
Entering edit mode

Let's start with ACP2. What would be the synonyms count in our output? How would it be calculated? Why ACP2 is associated to 1_8U in the input, and to ACTR2 in the output?

ADD REPLY
0
Entering edit mode

It is unclear how the code you show produces the output you mention. It would be better to clearly write the code used, the output of that code and the output you want. Also, don't forget to write the value of used variables. For example, what is the value of "char" in the call to strsplit()? Why do you read the data with read.csv() then pass that to read.table()?

ADD REPLY
0
Entering edit mode

Another comment, If I understood your intention you are interested in synonyms' counts. Then it would be better to include an example data that actually is similar to the real data and contains synonyms that appear more than once. Right now all are unique. But I may have misunderstood your goal.

ADD REPLY
0
Entering edit mode

Okay, the data I copied-pasted are wrong. I can't copy-paste 41000+ genes and their synonyms in this thread for obvious reasons.

I just took random genes and random synonyms, to present an example.

The file is generated through a Python script, so I can't post the code for that, because what it basically does is iterate a list of genes that I was given and compare them to a database I found, and simply appends the synonyms to each gene.

What I want to do is split the column with the synonyms so I can determine how many times each synonym appears.

For example if a gene named DSPA2b appears as a synonym 8 times, I would like to know this, and be able to see to which genes it is a synonym to.

ADD REPLY
0
Entering edit mode
8.2 years ago
ddiez ★ 2.0k

Let assume the following (fake) data:

d <- data.frame(
  symbol = c("A", "B", "C", "D"),
  synonyms = c("AA;A1;FOO", "BB;B1;FOO;GOO", "CC;GOO", "DD;FOO"),
  stringsAsFactors = FALSE
)
d
  symbol      synonyms
1      A     AA;A1;FOO
2      B BB;B1;FOO;GOO
3      C        CC;GOO
4      D        DD;FOO

This breaks the synonyms into vectors (of different lengths) for each symbol, and stores them into a list. The length of the list is the same as the number of rows:

syns <- strsplit(d$synonyms, ";")
names(syns) <- d$symbol
syns
$A
[1] "AA"  "A1"  "FOO"

$B
[1] "BB"  "B1"  "FOO" "GOO"

$C
[1] "CC"  "GOO"

$D
[1] "DD"  "FOO"

We create a matrix to store the number of times a synonym occurs, with rows as synonyms and columns as symbols:

m <- matrix(0L, ncol = length(syns), nrow = length(unique(unlist(syns))))
rownames(m) <- unique(unlist(syns))
colnames(m) <- names(syns)
m
    A B C D
AA  0 0 0 0
A1  0 0 0 0
FOO 0 0 0 0
BB  0 0 0 0
B1  0 0 0 0
GOO 0 0 0 0
CC  0 0 0 0
DD  0 0 0 0

Then we iterate over the symbols and increase the count for the corresponding synonyms. Finally we compute the total count

for(s in names(syns)) {
  m[syns[[s]], s] <- m[syns[[s]], s] + 1
}
m <- cbind(total = rowSums(m), m)
m
    total A B C D
AA      1 1 0 0 0
A1      1 1 0 0 0
FOO     3 1 1 0 1
BB      1 0 1 0 0
B1      1 0 1 0 0
GOO     2 0 1 1 0
CC      1 0 0 1 0
DD      1 0 0 0 1

Disclaimer: I am not sure how this will scale for >40,000 symbols and >>40,000 synonyms.

EDIT 1

A small simulation with ~40,000 genes and synonyms taken at random from a list of 80,000 synonyms confirms my concerns in the disclaimer above by creating a matrix m > 22 Gb size. Given that for a given gene the number of synonyms will be small compared to the total number, and that for a given synonym the number of genes it appears in will be small compared to the total number, the matrix m will be sparse. So an alternative could be to use sparse matrix in the package Matrix. This leads to a matrix with a ~10 Mb footprint in my simulation.

library(Matrix)
m <- Matrix(0L, ncol = length(syns), nrow = length(unique(unlist(syns))), sparse = TRUE)

EDIT 2

Using the for() loop in the solution above (even when using sparse matrices) is very slow. This solution is a lot faster:

syns <- strsplit(d$synonyms, ";")
names(syns) <- d$symbol
m <- data.frame(
  synonym = unlist(syns),
  symbol = rep(names(syns), times = sapply(syns, length))
)

# Create contingency table. This could be done with table() but it
# fails for a large dataset like the one at hand here. With xtable()
# we can specify sparse=TRUE to obtain a sparse matrix.
m <- xtabs(~synonym+symbol, m, sparse = TRUE)
m <- cbind(total = rowSums(m), m)
m[1:10,1:10]
10 x 10 sparse Matrix of class "dgCMatrix"
   [[ suppressing 10 column names ‘total’, ‘G-1’, ‘G-10’ ... ]]

S-1     4 . . . . . . . . .
S-10    2 . . . . . . . . .
S-100   1 . . . . . . . . .
S-1000  6 . . . . . . . . .
S-10000 2 . . . . . . . . .
S-10001 2 . . . . . . . . .
S-10002 3 . . . . . . . . .
S-10003 3 . . . . . . . . .
S-10004 1 . . . . . . . . .
S-10005 4 . . . . . . . . .

dim(m)
[1] 74964 40001
ADD COMMENT

Login before adding your answer.

Traffic: 2062 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6