Question

How do you collapse multiple rows based on multiple columns in r?

0

Entering edit mode

5.4 years ago

ngcatung0 ▴ 20

So basically I have a dataframe that kinda looks like this:

Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24 
Akutan city   NA    NA         NA      NA  NA    NA    71
Alcan Border  NA    NA         2       NA  NA    NA    NA               
Alcan Border  NA    NA         NA      NA  NA    2     NA            
Alcan Border  NA    NA         NA      NA  5     NA    NA
Ambler City   224   NA         NA      NA  NA    NA    NA
Ambler City   NA    NA         NA      17  NA    NA    NA

Is there a simple way to combine multiple rows based on multiple column data? I've seen a few scripts that say you can combine one duplicate variable in a column based on one or two data columns but I need to do it more large scale (I have ~400 rows with duplicates and ~30 columns (and each column has a large name). Ideally it would look like:

Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24 
Akutan city   NA    NA         NA      NA  NA    NA    71              
Alcan Border  NA    NA         2       NA  5     2     NA            
Ambler City   224   NA         NA      17  NA    NA    NA

I tried using the following code but im getting an error:

setDT(df)[,lapply(.SD, function(x) ifelseallis.na(x)), NA_integer_, sum("x", na.rm = T))), 
    by = sample_id]
data2 <- column_to_rownames(data2, 'Community')

Error in sum("x", na.rm = T) : invalid 'type' (character) of argument

R • 12k views

ADD COMMENT • link updated 5.4 years ago by Kevin Blighe 89k • written 5.4 years ago by ngcatung0 ▴ 20

0

Entering edit mode

Hi, After using df %>% group_by(Community) %>% summarise_if( is.numeric, sum, na.rm = TRUE )

It gave me the following output:

Community Akutan city
Alcan Border
Alcan Border
Alcan Border Ambler City
Ambler City

Please assist!

ADD REPLY • link 5.4 years ago by ngcatung0 ▴ 20

0

Entering edit mode

Help me help you. Could you use dput on a subset of your data and put the result into your original question: so that I have some example data to work with

ADD REPLY • link 5.4 years ago by russhh 5.8k

0

Entering edit mode

Example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

ADD REPLY • link 5.4 years ago by russhh 5.8k

score 0 · Answer 1 · 2020-03-05

0

Entering edit mode

5.4 years ago

russhh 5.8k

I believe you should be able to do the following:

library(dplyr)
df %>%
  group_by(Community) %>%
  summarise_if(
    is.numeric,
    sum,
    na.rm = TRUE
  )

But, I'm not sure I understand why you would have multiple rows for the same Community in this data-frame, and would recommend you remove the median age column from the output (because there's no sensible way to summarise the median ages unless they're identical in all rows for a given Community).

ADD COMMENT • link 5.4 years ago by russhh 5.8k

0

Entering edit mode

Hi, After using df %>% group_by(Community) %>% summarise_if( is.numeric, sum, na.rm = TRUE ) It gave me the following output: Community Akutan city Alcan Border Alcan Border Alcan Border Ambler City Ambler City Please assist!

ADD REPLY • link 5.4 years ago by ngcatung0 ▴ 20

score 0 · Answer 2 · 2020-03-06

You can use aggregate() to indirectly get what you want, I think:

x
     Community Pop_Total Median_Age Under_5 X5.9 X10.14 X15.19 X20.24
1  Akutan city        NA         NA      NA   NA     NA     NA     71
2 Alcan Border        NA         NA       2   NA     NA     NA     NA
3 Alcan Border        NA         NA      NA   NA     NA      2     NA
4 Alcan Border        NA         NA      NA   NA      5     NA     NA
5  Ambler City       224         NA      NA   NA     NA     NA     NA
6  Ambler City        NA         NA      NA   17     NA     NA     NA

aggregate(x[,2:ncol(x)], by = x[1], function(x) sum(x, na.rm = TRUE))
     Community Pop_Total Median_Age Under_5 X5.9 X10.14 X15.19 X20.24
1  Akutan city         0          0       0    0      0      0     71
2 Alcan Border         0          0       2    0      5      2      0
3  Ambler City       224          0       0   17      0      0      0

----------

To change those zeros to NA values:

x <- aggregate(x[,2:ncol(x)], by = x[1], function(x) sum(x, na.rm = TRUE))

x[x == 0] <- NA