Question

Extracting observations from one dataframe to populate another

14 months ago

Rachel • 0

I have a very large dataframe with 396212 observations (rows) and 13 variables (columns), including organism name, antibiotic name, gene name and location.

I want to extract the unique values of one variable (e.g. organism name) to populate another dataframe: essentially, create a new dataframe with one row per unique antibiotic and one column per unique organism, filled with yes/no according to whether that antibiotic covers that organism.

Example data frame

```r
df <- data.frame(Organism = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"),
                 Antibiotic = c("X", "Y", "Z", "X", "X", "X", "X", "Y", "Y", "Z", "X", "Y"))
```

I have made a new data frame with the unique antibiotics as rows and the organism names as columns, filled with NA, but I don't know how to extract information from the first data frame to populate the second.

```r
path_abx <- data.frame(Antibiotic = unique(df$Antibiotic))
path_abx$A <- NA
path_abx$B <- NA
path_abx$C <- NA
```

My question is which pathogen is affected/targeted by each antibiotic. 

In the new dataframe (`path_abx`) I want to fill in the value for each antibiotic and pathogen/organism pair with either 'yes' or the organism name, based on whether the organism appears in the original dataframe (`df`) in the same row as the antibiotic name. The actual dataframe has over 300,000 observations (including 39 antibiotics and 11 organisms), so I can't just do it manually.

I have tried using `unique`, `select`, `filter`, `n_distinct`, `if/then` and `for` loops but I can't get what I want and don't know what to do. I'm sure the if/then or for loop is the way to go but I don't really know where to start with this.

```r
test <- df %>% group_by(Organism) %>% filter(Antibiotic=="X" & Organism =="A", ignore.case = TRUE)

test <-       
  if (df$Organism(grepl("B", ignore.case = TRUE)))
{
  print(df$Ecoli, "E.coli")
}
```

Etc. (I know the syntax is wrong.)

Once I've figured this out I then need to do the same thing (which pathogen is affected?) for each gene (1400 genes).

I'm really stumped so would appreciate any pointers!

Thank you :)

R tibble stringr dplyr tidyr • 542 views

updated 14 months ago by Ram 45k • written 14 months ago by Rachel • 0

Ram · Answer 1 · 2024-02-05

14 months ago

rpolicastro 13k

Assuming I understand your desired output, `table` would be the easiest way to approach this.

```r
as.data.frame.matrix(ifelse(t(table(df)) > 0, "yes", "no"))
```

The resulting data.frame:

```
    A   B   C
X yes yes yes
Y  no yes yes
Z yes  no yes
```
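One caveat for the real data: `table(df)` cross-tabulates every column it is given, so on a 13-column data frame the two columns of interest would need to be picked out first. A minimal base R sketch of that idea follows. The helper name `coverage_table` and its arguments `row_var` and `col_var` are illustrative only and do not come from the thread.

```r
# Sketch: yes/no coverage table built from two named columns of a wider data frame.
# `coverage_table`, `row_var` and `col_var` are hypothetical names for illustration.
coverage_table <- function(data, row_var, col_var) {
  # Cross-tabulate only the two columns of interest
  counts <- table(data[[row_var]], data[[col_var]])
  # A count above zero means the pair occurs at least once in the data
  as.data.frame.matrix(ifelse(counts > 0, "yes", "no"))
}

# Example data from the question
df <- data.frame(
  Organism   = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"),
  Antibiotic = c("X", "Y", "Z", "X", "X", "X", "X", "Y", "Y", "Z", "X", "Y")
)

# Rows are antibiotics, columns are organisms
path_abx <- coverage_table(df, "Antibiotic", "Organism")
path_abx
```

The same call should carry over to the gene question by swapping the row variable, e.g. `coverage_table(real_df, "Gene", "Organism")`, assuming the full data frame is called `real_df` and has a column named `Gene` (both names are guesses, since the thread does not give them).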

updated 14 months ago by Ram 45k • written 14 months ago by rpolicastro 13k