Question

Mutating columns by changing units - R/dplyr solution

0

Entering edit mode

4.1 years ago

ellieuk ▴ 40

Hello, I'm struggling with some data I have - I have multiple columns of blood results corresponding to different units and I'd like to make a single column of values in the same units.

Dummy data:

Units   Other   Score   Score2
Not measured            
ug/L                     29.9
ng/L              5 
ng/L             109    
ng/L            
Not measured            
ng/L               5    

ng/L               2    
ng/L            
ng/L            
ng/L            
ng/L            
ng/L            
Other   pg/ml             11.3

What I'd like to do is convert all scores to ug/L. This will involve transforming ng/L to ug/L (x 0.001) and converting pg/ml to ug/L (x 0.001). What I would like at the end is:

Units   Other   Score   Score2  Score_ug/L
Not measured                
ug/L                      29.9     29.9
ng/L              5                0.005
ng/L             109               0.109
ng/L                
Not measured                
ng/L              5                0.005

ng/L              2                0.002
ng/L                
ng/L                
ng/L                
ng/L                
ng/L                
Other   pg/ml             11.3     0.0113

I'd like to retain blank rows and also the rows that have "Not measured".

I don't know if this is something dplyr is capable of? What I can't manage to do is mutate the units column when it varies by unit and some units aren't measured and other units are in a separate column. I tried to use the summarise across columns function, but I became unstuck when the last row had units in a different column :-(.

If anyone can point me in the right direction, I'd be extremely grateful.

Pls go easy, I'm a clinician not a bioinformatician....

R • 2.9k views

ADD COMMENT • link updated 4.1 years ago by Dunois ★ 2.9k • written 4.1 years ago by ellieuk ▴ 40

0

Entering edit mode

Could you please give us some representative data to work with?

ADD REPLY • link 4.1 years ago by Dunois ★ 2.9k

0

Entering edit mode

Can you not see the data in the post? This is representative of the dataset... ! Values have been changed, but otherwise this is a snapshot of what I'm working with. Blank lines need to be retained because there are other (unshown) columns with values in. Dataset is over 1000 columns, so can't possibly share. I'm only showing the relevant bits.

ADD REPLY • link 4.1 years ago by ellieuk ▴ 40

0

Entering edit mode

Apologies. Is that completely representative? Are there any aspects to the data that isn't addressed by the example in your OP?

ADD REPLY • link 4.1 years ago by Dunois ★ 2.9k

0

Entering edit mode

For what I need it is completely representative. If it helps, you can assume every row has a unique ID. Otherwise it is unfortunately what I'm dealing with as the output!

ADD REPLY • link 4.1 years ago by ellieuk ▴ 40

0

Entering edit mode

To share your data, please paste in the results of dput(head(data)) into your original question. Thanks.

ADD REPLY • link 4.1 years ago by 4galaxy77 2.9k

score 3 · Accepted Answer · 2021-01-18

3

Entering edit mode

4.1 years ago

Dunois ★ 2.9k

Here's one potential solution. Assuming your data is stored in a data.frame named df:

library(stringr)
library(dplyr)
library(magrittr)

#Converting all blanks into NAs for easier handling.
df %<>% mutate(across(everything(), ~ifelse(str_detect(., "^$"), NA, .)))

#New score column.
df$score_ugl <- NA

#Looping through and calculating the scores.
for(i in 1:nrow(df)){

  if(!is.na(df$Units[i])){

    if(df$Units[i] == "ug/L"){
      df$score_ugl[i] <- df$Score2[i]
    }
    if(df$Units[i] == "ng/L"){
      df$score_ugl[i] <- df$Score[i] * 0.001
    }
    if(df$Units[i] == "Other" & df$Other[i] == "pg/ml"){
      df$score_ugl[i] <- df$Score2[i] * 0.001
    }

  }

}

#Output
df


#         Units Other Score Score2 score_ugl
#1 Not measured  <NA>    NA     NA        NA
#2         ug/L  <NA>    NA   29.2   29.2000
#3         ng/L  <NA>     5     NA    0.0050
#4         ng/L  <NA>   109     NA    0.1090
#5         <NA>  <NA>    NA     NA        NA
#6        Other pg/ml    NA   11.3    0.0113

Using a for loop will be frowned upon by more adept R practitioners, but it should suffice in this case (and, IMHO, also happens to be more pliant to modification, should you have to deal with additional cases and what not).

ADD COMMENT • link 4.1 years ago by Dunois ★ 2.9k

2

Entering edit mode

Your answer and for loops are fine, the only problem being that it would run slow with a lot of data because it's not vectorized. When you have a lot of simple conditionals like this it's generally more compact and readable to use a switch/case statement, such as the functions switch or case_when in R. These methods are usually vectorized too.

Example data.

df <- structure(list(Units = c("not_measured", "ug/L", "ng/L", "Other"
), Other = c(NA, NA, NA, "pg/ml"), Score = c(NA, NA, 5, NA), 
    Score2 = c(NA, 29.9, NA, 11.3)), class = "data.frame", row.names = c(NA, 
-4L))

> df
         Units Other Score Score2
1 not_measured  <NA>    NA     NA
2         ug/L  <NA>    NA   29.9
3         ng/L  <NA>     5     NA
4        Other pg/ml    NA   11.3

Using mutate and case_when from dplyr.

library("dplyr")

df <- df %>%
  mutate(score_ugl=case_when(
    Units == "ug/L" ~ Score2,
    Units == "ng/L" ~ Score * 0.001,
    Units == "Other" & Other == "pg/ml" ~ Score2 * 0.001
  ))

> df
         Units Other Score Score2 score_ugl
1 not_measured  <NA>    NA     NA        NA
2         ug/L  <NA>    NA   29.9   29.9000
3         ng/L  <NA>     5     NA    0.0050
4        Other pg/ml    NA   11.3    0.0113

ADD REPLY • link 4.1 years ago by rpolicastro 13k

2

Entering edit mode

Never realized case_when() is a thing. TIL. Thank you!!

ADD REPLY • link 4.1 years ago by Dunois ★ 2.9k