Hi everybody, I have a huge data set (from different locations) that I have to harmonized and do some data cleaning, as an example below:
ID loc1_age proce_age loc1_dof proce_dof loc1_bmi proce_bmi site4_wight pl3_length
1 20 NA 1999 NA 54 NA NA NA
2 15 NA 2000 NA 46 NA NA NA
3 NA 9 NA 1998 NA 24 NA NA
4 NA 15 NA 2002 NA 36 NA NA
...
505 NA NA NA NA NA NA 78 NA
506 NA NA NA NA NA NA 94 NA
...
1956 NA NA NA NA NA NA NA 176
1957 NA NA NA NA NA NA NA 189
final data:
ID age dof bmi site4_wight pl3_length
1 20 1999 54 NA NA
2 15 2000 46 NA NA
3 9 1998 24 NA NA
4 15 2002 36 NA NA
...
505 NA NA NA 78 NA
506 NA NA NA 94 NA
...
1956 NA NA NA NA 176
1957 NA NA NA NA 189
For doing this, I'm using R, in more detail df %>% mutate(age = coalesce(loc1_age, proce_age))
My concern is, with a few of variables it works manually, but I have like around 41 variables to coalesce
#get common variables for two subset
sub_loc= sub("^loc_", "", loc1)
sub_proce= sub("^proce", "", proce)
ii = intersect(sub_loc, sub_proce) # this give to me 41 variables in common
length(ii)
what I've tried so far to implement a loop in R and create the new variables is this:
for (i in ii) {
df_populated <- df %>%
mutate(paste("loc", i, sep = "") = coalesce(paste("loc", i, sep = ""), paste("proce", i, sep = "")
}
But it doesn't work at all. Any ideas to do life easier with data cleaning? Thanks
Is it correct to assume that every ID only has one value, i.e. if there's an entry in one of the location columns, all the other ones will have NA?
Then this could work:
or without data.table:
Thank for the answer, but this is not what I asked. As I know how to merge the variables (I explained it in my post), I do not know how to do with a list of variables in one step.
The last line of the code above should be:
then if produces the output you wrote you wanted from the example data you gave. If this is not what you want, then please clarify.
EDIT: I now understand you want to process multiple variables in the same way, eventually in one step. Repeating the given code for each variable in a loop should work.
But this do it only for one variable. I need to do it for a list of variables
I don't understand. The
melt
command will work irrespective of the number of additional columns beyond ID.