I have an R data frame for which I need to perform a random binomial draw for each row. The n = argument of the draw is based on a value in a column of that row. Further, this operation should happen inside a case_when() based on a condition in the data.
Note: rowwise() from the tidyverse is much too slow; the data frame is large and the operation is performed at each timestep of a simulation model. Is there a way to do this quickly and efficiently?
Example:
library(tidyverse)
df = data.frame(condition = c("A","B","A","B","C"),
                number = c(1000,1000,1000,1000,1))
prob1 = 0.000517143
prob2 = 0.000213472
set.seed(1)
df = df %>%
  mutate(output = case_when(condition == "A" ~ sum(rbinom(n = number,
                                                          size = 1,
                                                          prob = prob1)),
                            condition == "B" ~ sum(rbinom(n = number,
                                                          size = 1,
                                                          prob = prob2)),
                            TRUE ~ 0))
print(df)
#>   condition number output
#> 1         A   1000      0
#> 2         B   1000      0
#> 3         A   1000      0
#> 4         B   1000      0
#> 5         C      1      0
Here, it looks like the random binomial draws are being reused and returning all zeros.
As a check, here it is sampled repeatedly. In expectation, sum(df$output) should be about 1.46 per draw (2 × 1000 × prob1 + 2 × 1000 × prob2), so ten zeros in a row is very unlikely.
for(i in 1:10){
  df = df %>%
    mutate(output = case_when(condition == "A" ~ sum(rbinom(n = number,
                                                            size = 1,
                                                            prob = prob1)),
                              condition == "B" ~ sum(rbinom(n = number,
                                                            size = 1,
                                                            prob = prob2)),
                              TRUE ~ 0))
  print(sum(df$output))}
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
Unsure of the way forward.
Asked Feb 3 at 2:26 by geoscience123
3 Answers
Why are you summing draws of size 1? Refer to Wikipedia:
In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p).
Thus, you can sample once per row, passing number as size, and don't need to sum. Since rbinom is vectorized over size and prob, you don't need a loop.
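# Attach each row's probability by condition. Note that merge() sorts the
# result by the key, so the rows come back grouped by condition.
df <- merge(df, data.frame(condition = c("A", "B"),
                           prob = c(0.000517143, 0.000213472)),
            by = "condition", all.x = TRUE)
# Conditions without a listed probability (here "C") get prob = 0.
df[is.na(df$prob), "prob"] <- 0
set.seed(1)
# One binomial draw per row, with that row's number as size.
df$output <- with(df, rbinom(length(number), size = number, prob = prob))
#  condition number        prob output
#1         A   1000 0.000517143      0
#2         A   1000 0.000517143      0
#3         B   1000 0.000213472      0
#4         B   1000 0.000213472      1
#5         C      1 0.000000000      0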
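The all-zero column in the question is consistent with how rbinom() treats a vector passed as n: when length(n) > 1, only the length of the vector is used as the number of draws. The following is a minimal sketch of that behaviour, using only the number column from the question; the names draws, total and per_row are made up for illustration.

```r
number <- c(1000, 1000, 1000, 1000, 1)

set.seed(1)
# When `n` has length > 1, rbinom() uses length(n) as the number of draws,
# so this returns 5 Bernoulli draws in total, not 1000 per row.
draws <- rbinom(n = number, size = 1, prob = 0.000517143)
length(draws)
#> [1] 5

# sum() collapses those 5 draws into a single scalar. Inside mutate() and
# case_when(), a length-1 result is recycled to every row, which is why
# all rows show the same value.
total <- sum(draws)
length(total)
#> [1] 1

# Passing the column as `size` instead gives one Binomial(number, prob)
# draw per row.
per_row <- rbinom(n = length(number), size = number, prob = 0.000517143)
length(per_row)
#> [1] 5
```

With a success probability near 0.0005, five Bernoulli draws almost always sum to zero, which matches the repeated zeros in the question's loop.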
You could use mapply:
set.seed(1)
df['output'] <- mapply(function(cond, num) sum(rbinom(n = num,
                                                      size = 1,
                                                      prob = ifelse(cond=="A", prob1,
                                                                    ifelse(cond=="B", prob2, 0)))),
                       cond=df$condition, num=df$number)
df
  condition number output
1         A   1000      1
2         B   1000      0
3         A   1000      1
4         B   1000      0
5         C      1      0
For a larger data frame (one with 100,000 rows), the above command takes about 5 seconds on my machine.
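For comparison on a frame of that size, the per-row function call can be replaced with a single vectorized rbinom() call. The sketch below assumes a synthetic 100,000-row data frame (big, made up for illustration) and a hypothetical named lookup vector probs; it makes no timing claim, so wrap the draw in system.time() to measure it on your own machine.

```r
# Synthetic 100,000-row frame, made up for illustration only.
set.seed(1)
n_rows <- 100000
big <- data.frame(
  condition = sample(c("A", "B", "C"), n_rows, replace = TRUE),
  number = 1000
)

# Hypothetical lookup vector: one success probability per condition.
probs <- c(A = 0.000517143, B = 0.000213472, C = 0)

# Index the lookup vector by name to get one probability per row.
# as.character() guards against `condition` being stored as a factor.
p <- unname(probs[as.character(big$condition)])

# One binomial draw per row: `size` and `prob` are both vectors.
big$output <- rbinom(n = n_rows, size = big$number, prob = p)
```

Because the lookup and the draw each run once over the whole column, there is no per-row R function call as there is with mapply.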
You don't really need tidyverse for this problem.
You can avoid slow ifelse/case_when calls using a probs list. Another advantage is improved clarity.
> probs <- list(A=0.000517143, B=0.000213472, C=0)
>
> set.seed(1)
> mapply(\(x, y) rbinom(n=x, size=1, prob=probs[[y]]) |> sum(),
+        df$number, df$condition)
[1] 1 0 1 0 0
Altogether:
> set.seed(1)
> df |>
+   transform(
+     out=mapply(\(x, y) rbinom(n=x, size=1, prob=probs[[y]]) |> sum(),
+                number, condition)
+   )
  condition number out
1         A   1000   1
2         B   1000   0
3         A   1000   1
4         B   1000   0
5         C      1   0
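For readers who want to stay inside a dplyr pipeline with case_when(), as the question asked, one possible sketch is to use case_when() only to choose each row's probability and then make a single vectorized rbinom() call. The intermediate column name prob is an assumption for illustration, not something from the question.

```r
library(dplyr)

df <- data.frame(condition = c("A", "B", "A", "B", "C"),
                 number = c(1000, 1000, 1000, 1000, 1))
prob1 <- 0.000517143
prob2 <- 0.000213472

set.seed(1)
df <- df %>%
  mutate(
    # Choose the per-row success probability; every branch is a double,
    # so the case_when() result types agree.
    prob = case_when(condition == "A" ~ prob1,
                     condition == "B" ~ prob2,
                     TRUE ~ 0),
    # One draw per row: n() is the number of rows, and each row uses its
    # own `number` as size and its own `prob`.
    output = rbinom(n = n(), size = number, prob = prob)
  )
```

This avoids rowwise() entirely, since both case_when() and rbinom() operate on whole columns.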
Are prob1 and prob2 constant, as in your example data, or do they vary? How many unique values does number take, and is it substantially less than the number of rows? – zephryl, commented Feb 3 at 3:05