Extracting observations from one dataframe to populate another
9 months ago
Rachel • 0

I have a very large dataframe with 396212 observations (rows) and 13 variables (columns), including organism name, antibiotic name, gene name and location.

I want to extract the unique values of one variable (organism name) and use them to populate another dataframe: essentially, a new dataframe with one row per unique antibiotic and one column per unique organism, filled with yes/no according to whether the antibiotic covers that organism.

Example data frame

```r
df <- data.frame(Organism = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"),
                 Antibiotic = c("X", "Y", "Z", "X", "X", "X", "X", "Y", "Y", "Z", "X", "Y"))
```

I have made a new data frame with the unique antibiotics as rows and the organism names as columns, filled with NA, but I don't know how to extract information from the first data frame to populate the second:

```r
path_abx <- data.frame(Antibiotic = unique(df$Antibiotic))
path_abx$A <- NA
path_abx$B <- NA
path_abx$C <- NA
```

My question is which pathogen is affected/targeted by each antibiotic. 

In the new dataframe (path_abx) I want to fill in each antibiotic/organism cell with either 'yes' or the organism name, based on whether the organism appears in the original dataframe (df) in the same row as the antibiotic name. The actual dataframe has over 300,000 observations (including 39 antibiotics and 11 organisms), so I can't just do it manually.

I have tried using `unique`, `select`, `filter`, `n_distinct`, `if/then` and `for` loops but I can't get what I want and don't know what to do. I'm sure the if/then or for loop is the way to go but I don't really know where to start with this.

```r
test <- df %>% group_by(Organism) %>% filter(Antibiotic=="X" & Organism =="A", ignore.case = TRUE)

test <-
  if (df$Organism(grepl("B", ignore.case = TRUE)))
{
  print(df$Ecoli, "E.coli")
}
```

Etc. (I know the syntax is wrong.)

Once I've figured this out I then need to do the same thing (which pathogen is affected?) for each gene (1400 genes).

I'm really stumped so would appreciate any pointers!

Thank you :)

R tibble stringr dplyr tidyr • 395 views
9 months ago

Assuming I understand your desired output, `table` would be the easiest way to approach this.

```r
as.data.frame.matrix(ifelse(t(table(df)) > 0, "yes", "no"))
```

The resulting data.frame:

```
    A   B   C
X yes yes yes
Y  no yes yes
Z yes  no yes
```
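
The one-liner relies on `df` having exactly two columns in the right order, which will not hold for the full 13-column dataframe. Below is a minimal sketch of the same `table` approach with the columns chosen by name, so it could be reused for the gene question as well. `presence_table` is a hypothetical helper name, not a function from any package, and the column names are taken from the example data rather than the real dataframe.

```r
# Example data from the question.
df <- data.frame(
  Organism   = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"),
  Antibiotic = c("X", "Y", "Z", "X", "X", "X", "X", "Y", "Y", "Z", "X", "Y")
)

# Hypothetical helper: one row per unique value of `row_var`, one column per
# unique value of `col_var`, filled with "yes"/"no".
presence_table <- function(data, row_var, col_var) {
  # Count how often each row_var/col_var pair occurs in the same row.
  counts <- table(data[[row_var]], data[[col_var]])
  # A count above zero means the pair appears at least once.
  flags <- ifelse(counts > 0, "yes", "no")
  # Calling the matrix method directly keeps the wide layout; plain
  # as.data.frame() on a table would return long format instead.
  as.data.frame.matrix(flags, stringsAsFactors = FALSE)
}

path_abx <- presence_table(df, "Antibiotic", "Organism")
print(path_abx)
```

Because the row variable is passed explicitly, no `t()` is needed. For the gene table, something like `presence_table(full_df, "Gene", "Organism")` should work in the same way, assuming the real dataframe is called `full_df` and has a column named `Gene` (both names are guesses).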