I have about 40 columns that I want to summarize by group using dplyr::summarize()
. But each column is going to be summarized with a different function: some with mean()
, some with sum()
, some with min()
, and some with max()
. Only one function will be applied to each column. I have a lookup table to match each column name with the appropriate function. What is the best way to do this with dplyr::summarize()
without having to write out each statement individually?
Here is an example lookup table using mtcars. The code below it does not give the desired result. It applies all ten summary functions to all ten columns for a total of 100 summaries, where I only want ten summaries, one for each column.
library(dplyr)
lookup_table <- tibble(
variable = c('mpg', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'),
summ_fn = list(mean, max, min, sum, mean, mean, median, mean, sum, max)
)
mtcars %>%
group_by(cyl) %>%
summarize(across(all_of(lookup_table$variable), .fns = lookup_table$summ_fn))
I have about 40 columns that I want to summarize by group using dplyr::summarize()
. But each column is going to be summarized with a different function: some with mean()
, some with sum()
, some with min()
, and some with max()
. Only one function will be applied to each column. I have a lookup table to match each column name with the appropriate function. What is the best way to do this with dplyr::summarize()
without having to write out each statement individually?
Here is an example lookup table using mtcars. The code below it does not give the desired result. It applies all ten summary functions to all ten columns for a total of 100 summaries, where I only want ten summaries, one for each column.
library(dplyr)
lookup_table <- tibble(
variable = c('mpg', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'),
summ_fn = list(mean, max, min, sum, mean, mean, median, mean, sum, max)
)
mtcars %>%
group_by(cyl) %>%
summarize(across(all_of(lookup_table$variable), .fns = lookup_table$summ_fn))
Share
Improve this question
edited Mar 24 at 7:05
Darren Tsai
36.3k5 gold badges25 silver badges57 bronze badges
asked Mar 24 at 0:11
qdreadqdread
4,0652 gold badges26 silver badges43 bronze badges
5 Answers
Reset to default 7Instead of a tibble, initiate your lookup table to a named list where the names are the columns (i.e, lookup_table <- list("mpg" = mean, "disp" = max,...)
. Then you can use dplyr::curr_column
to index the function to the current column:
(Note that since I was a bit lazy and you (thankfully) provided the lookup table, I just used setNames
to assign the names.)
lookup_table <- setNames(list(mean, max, min, sum, mean, mean, median, mean, sum, max),
c('mpg', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'))
mtcars %>%
summarize(across(names(lookup_table),
~ lookup_table[[cur_column()]](.)),
.by = cyl)
As @G. Grothendieck points out in the comments, if you already have the lookup table in tibble form, you can create a little helper function and achieve the same goal:
lookup <- function(x) {
with(lookup_table, summ_fn[[match(x, variable)]])
}
mtcars %>%
summarize(across(all_of(lookup_table$variable),
~ lookup(cur_column())(.)),
.by = cyl)
Output (for both approaches):
# cyl mpg disp hp drat wt qsec vs am gear carb
# 1 6 19.74286 258.0 105 25.10 3.117143 17.97714 1 0.4285714 27 6
# 2 4 26.66364 146.7 52 44.78 2.285727 19.13727 1 0.7272727 45 2
# 3 8 15.10000 472.0 150 45.21 3.999214 16.77214 0 0.1428571 46 8
You can create a list of defused expressions, and inject it into summarize()
using the injection operator !!!
.
lookup_table <- setNames(list('mean', 'max', 'min', 'sum', 'mean', 'mean', 'median', 'mean', 'sum', 'max'),
c('mpg', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'))
exprs <- Map(\(x, y) call(x, as.name(y)), lookup_table, names(lookup_table))
# $mpg
# mean(mpg)
#
# $disp
# max(disp)
# .
# .
# .
mtcars %>%
summarize(!!!exprs, .by = cyl)
# cyl mpg disp hp drat wt qsec vs am gear carb
# 1 6 19.74286 258.0 105 25.10 3.117143 17.97714 1 0.4285714 27 6
# 2 4 26.66364 146.7 52 44.78 2.285727 19.13727 1 0.7272727 45 2
# 3 8 15.10000 472.0 150 45.21 3.999214 16.77214 0 0.1428571 46 8
Without dplyr
, using mapply
directly.
> mapply(\(f, x) tapply(mtcars[[x]], mtcars$cyl, f), lookup_table, names(lookup_table))
mpg disp hp drat wt qsec vs am gear carb
4 26.66364 146.7 52 44.78 2.285727 19.13727 1 0.7272727 45 2
6 19.74286 258.0 105 25.10 3.117143 17.97714 1 0.4285714 27 6
8 15.10000 472.0 150 45.21 3.999214 16.77214 0 0.1428571 46 8
Here's a variation that adds to your original lookup_table
:
library(dplyr)
library(rlang)
library(purrr)
summ_by <- function(fun, var, by) {
exec(fun, filter(mtcars, cyl==x) %>% pull(var))
}
Reduce(inner_join,
map(unique(mtcars[,'cyl']), \(by) {
lookup_table %>%
rowwise() %>%
mutate("cyl_{{by}}":=summ_by(summ_fn, variable, by))})
)
Giving
# A tibble: 10 × 5
# Rowwise:
variable summ_fn cyl_6 cyl_4 cyl_8
<chr> <list> <dbl> <dbl> <dbl>
1 mpg <fn> 19.7 26.7 15.1
2 disp <fn> 258 147. 472
3 hp <fn> 105 52 150
4 drat <fn> 25.1 44.8 45.2
5 wt <fn> 3.12 2.29 4.00
6 qsec <fn> 18.0 19.1 16.8
7 vs <fn> 1 1 0
8 am <fn> 0.429 0.727 0.143
9 gear <fn> 27 45 46
10 carb <fn> 6 2 8
To pass a different variable to group by (instead of "cyl"):
summarize_by <- function(data, by, lookup) {
library(purrr)
summ_by <- function(f, var, by, x)
exec(f, filter(data, !!sym(by)==x) %>% pull(var))
by <- deparse(substitute(by))
Reduce(inner_join,
map(unique(data[,by]), \(x) {
varname <- paste0(by, "_", x)
lookup %>%
rowwise() %>%
mutate(!!varname:=summ_by(summ_fn, variable, by, x))}))
}
Call this function:
summarize_by(mtcars, cyl, lookup_table)
# As above
summarize_by(mtcars, am, lookup_table)
# A tibble: 10 × 4
# Rowwise:
variable summ_fn am_1 am_0
<chr> <list> <dbl> <dbl>
1 mpg <fn> 24.4 17.1
2 disp <fn> 351 472
3 hp <fn> 52 62
4 drat <fn> 52.6 62.4
5 wt <fn> 2.41 3.77
6 qsec <fn> 17.4 18.2
7 vs <fn> 1 0
8 am <fn> 1 0
9 gear <fn> 57 61
10 carb <fn> 8 4
And without any group_by, we can just do:
lookup_table %>%
rowwise() %>%
mutate(result=exec(summ_fn, mtcars[,variable]))
# A tibble: 10 × 3
# Rowwise:
variable summ_fn result
<chr> <list> <dbl>
1 mpg <fn> 20.1
2 disp <fn> 472
3 hp <fn> 52
4 drat <fn> 115.
5 wt <fn> 3.22
6 qsec <fn> 17.8
7 vs <fn> 0
8 am <fn> 0.406
9 gear <fn> 118
10 carb <fn> 8
Here is another base R option, using by
+ Map
with(
lookup_table,
do.call(
rbind,
by(
mtcars[variable],
mtcars$cyl,
\(x) as.data.frame(
Map(\(f, y) f(y), summ_fn, x),
col.names = variable
)
),
)
)
which gives
mpg disp hp drat wt qsec vs am gear carb
4 26.66364 146.7 52 44.78 2.285727 19.13727 1 0.7272727 45 2
6 19.74286 258.0 105 25.10 3.117143 17.97714 1 0.4285714 27 6
8 15.10000 472.0 150 45.21 3.999214 16.77214 0 0.1428571 46 8
发布者:admin,转转请注明出处:http://www.yc00.com/questions/1744265282a4565837.html
评论列表(0条)