Question

Data Manipulation

0

Entering edit mode

3.4 years ago

science03 ▴ 50

Hello everyone,

I have an excel file which has three columns. In the first column there are drug names (with repetition); In the second column there are cell line names in which these drugs are tested on. The same drug names as well cell line names can be found in multiple rows. In the third column I have the AUC value. The thing is that each drug has been tested on different nr of cell lines. I want to format this file so that i have a final matrix in which :

Drug names should be first column (one row to each drug), cell line names should be on the first row and the cells should be filled with AUC values. In those cells where a drug hasn't been tested on that specific cell line I want "NA" values.

Thank you in advance!

Excel Analysis R Function Data • 2.4k views

ADD COMMENT • link 3.4 years ago by science03 ▴ 50

2

Entering edit mode

enter image description here

mandatory picture...

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 165k

0

Entering edit mode

Any R suggestions are welcomed as well. I am not necessarily looking to solve this on Excel.

ADD REPLY • link 3.4 years ago by science03 ▴ 50

0

Entering edit mode

See related StackOverflow post, using R, tidyr::complete:

Fill missing combinations in a dataframe

ADD REPLY • link 3.4 years ago by zx8754 12k

2

Entering edit mode

3.4 years ago

dariober 15k

This is in part a rant... I don't understand why the "tidyverse" ecosystem has so many functions that seem to do the same thing. I mean, why is there spread, complete, gather, etc...? I don't use tidyr but I work with ggplot2 a lot and I have the same feeling. With data.table I can do pretty much everything dplyr does without learning a number of new functions. For example:

library(data.table)

df <- data.table(
    drug=rep(paste0('drug',seq(5)),5),
    cell=rep(paste0('cell',seq(5)),each=5)
)
set.seed(1234)
df$auc <- rbinom(25, 100, 0.5)

# Remove some observations
df <- df[auc < 50]

# Crosstabulate and use -1 to for missing values
dcast(data= df, drug ~ cell, value.var= 'auc', fill= -1)

    drug cell1 cell2 cell3 cell4 cell5
1: drug1    47    -1    -1    -1    46
2: drug2    40    -1    42    -1    -1
3: drug3    48    -1    -1    48    48
4: drug4    47    -1    47    49    -1
5: drug5    48    47    -1    49    46

ADD COMMENT • link 3.4 years ago by dariober 15k

1

Entering edit mode

spread and gather are now pivot_wider and pivot_longer, which is same as data.table dcast and melt. complete has a different functionality - in data.table it is CJ - see data.table equivalent of tidyr::complete().

So it is 3 functions in tidyverse and 3 functions in data.table.

(Yes, I prefer data.table, too)

ADD REPLY • link 3.4 years ago by zx8754 12k

0

Entering edit mode

Thanks for the comment - I don't doubt there is a strong reasoning behind tidyverse since otherwise it wouldn't be so popular - it's probably just me. I think this post summarises my feelings pretty well even if I don't agree with everything in there.

ADD REPLY • link 3.4 years ago by dariober 15k

0

Entering edit mode

3.4 years ago

Pierre Lindenbaum 165k

you want something like GNU datamash crosstab https://www.gnu.org/software/datamash/manual/html_node/Crosstab.html

ADD COMMENT • link 3.4 years ago by Pierre Lindenbaum 165k

score 2 · Accepted Answer · 2021-10-12

2

Entering edit mode

3.4 years ago

benformatics 4.1k

df <- data.frame(
'drug'=rep(paste0('drug',seq(5)),5),
'cell'=rep(paste0('cell',seq(5)),each=5)
)
df$auc <- rbinom(25, 100, 0.5)

> head(df)
   drug  cell auc
1 drug1 cell1  50
2 drug2 cell1  43
3 drug3 cell1  53
4 drug4 cell1  50
5 drug5 cell1  45
6 drug1 cell2  48

library(tidyr)
output <- df %>% spread(cell,auc)
output

   drug cell1 cell2 cell3 cell4 cell5
1 drug1    50    48    53    51    47
2 drug2    43    42    55    53    52
3 drug3    53    42    46    46    50
4 drug4    50    54    51    54    53
5 drug5    45    41    46    47    49

ADD COMMENT • link 3.4 years ago by benformatics 4.1k

0

Entering edit mode

If you want to read excel files as input (in my example I just made up some data) - use library(readxl); df <- read_excel('your_file.xls')

ADD REPLY • link 3.4 years ago by benformatics 4.1k

0

Entering edit mode

I think they want complete.

ADD REPLY • link 3.4 years ago by zx8754 12k

0

Entering edit mode

Unclear from their post but spread fills unknown cells with NA anyway

ADD REPLY • link 3.4 years ago by benformatics 4.1k

0

Entering edit mode

spread and gather are pivot_wider and pivot_longer now.

ADD REPLY • link 3.4 years ago by rpolicastro 13k

0

Entering edit mode

This was very useful. Thank you so much!

ADD REPLY • link 3.4 years ago by science03 ▴ 50