I've always heard for loops are to be avoided in R, but am not sure about how to make this better.
Here, I have this dataframe "with_gauge":
site_id site_name gauge sample_datetime result
1 99020 Port Melbourne burnley_229621A 2020-02-18 09:30:00 41
2 99020 Port Melbourne burnley_229621A 2020-02-25 09:30:00 86
3 99020 Port Melbourne burnley_229621A 2020-03-03 09:54:00 30
I also have a tibble "rainfall_data" with the hourly rainfall for each gauge:
datetime altona_587047 spotswood_230112A burnley_229621A st_kilda_marina_229670A
<dttm> <dbl> <dbl> <dbl> <dbl>
1 2012-11-01 01:00:00 0 0 0 0
2 2012-11-01 02:00:00 0 0 0 0
3 2012-11-01 03:00:00 0 0 0 0
4 2012-11-01 04:00:00 0 0 0 0
5 2012-11-01 05:00:00 0.2 0.2 1 0.6
6 2012-11-01 06:00:00 1.8 1 1.8 1.4
I want to sum the rainfall that fell in the 24h before each sample, so I have written the following for loop to do this. It works, but I'm sure there are better ways of doing this, and in terms of improving my R skills I would like to learn them. Does anyone have any advice?
library(dplyr)      # filter, %>%
library(lubridate)  # ceiling_date, hours

time_hours <- 24  # look back 24 hours before each sample

for (i in 1:nrow(with_gauge)) {
  date_ <- with_gauge[i, "sample_datetime"]
  date_ <- ceiling_date(date_, "hour")
  gauge_var <- with_gauge[i, "gauge"]
  rain_subset <- rainfall_data[c("datetime", gauge_var)] %>%
    filter(rainfall_data$datetime <= date_ &
           rainfall_data$datetime >= date_ - hours(time_hours))
  rain_sum <- sum(rain_subset[, gauge_var])
  with_gauge$rain_sum[i] <- rain_sum
}
3 Answers
This can easily be done with a range (non-equi) join. Using the sample data provided by Gwang-Jin Kim (thanks!):
with_gauge <- structure(list(site_id = c(96014, 96613, 91555, 98152, 93392), site_name = c("St Kilda", "St Kilda", "St Kilda", "Docklands", "St Kilda"), gauge = c("spotswood_230112A", "burnley_229621A", "st_kilda_marina_229670A", "st_kilda_marina_229670A", "spotswood_230112A"), sample_datetime = structure(c(1580846400, 1581116400, 1580893200, 1581253200, 1580979600), class = c("POSIXct", "POSIXt"), tzone = ""), result = c(65, 26, 23, 28, 36)), class = "data.frame", row.names = c(NA, -5L))
rainfall_data <- structure(list(datetime = structure(c(1580792400, 1580796000, 1580799600, 1580835600, 1580857200, 1580896800, 1580925600, 1580961600, 1580994000, 1581012000), class = c("POSIXct", "POSIXt"), tzone = ""), burnley_229621A = c(0.2692, 0.0103, 0.0201, 0.5738, 0.1512, 0.0427, 0.1674, 0.3791, 0.1138, 0.1525), altona_587047 = c(0.0291, 0.4796, 0.1237, 0.4694, 0.1174, 0.0573, 0.3422, 0.3729, 0.2967, 0.0863), spotswood_230112A = c(0.1115, 0.3026, 0.207, 0.171, 0.0091, 0.52, 0.1149, 0.0235, 0.319, 0.3026), st_kilda_marina_229670A = c(0.2612, 0.0134, 0.1488, 0.3269, 0.0467, 0.0761, 0.3837, 0.0188, 0.2908, 0.0506)), class = "data.frame", row.names = c(NA, -10L))
We can do this in dplyr/tidyr:
library(dplyr)
library(tidyr) # pivot_longer
library(lubridate) # ceiling_date
time_before <- 24*3600
with_gauge |>
  mutate(prev24 = ceiling_date(sample_datetime, "hour") - time_before) |>
  left_join(pivot_longer(rainfall_data, cols = -datetime, names_to = "gauge"),
            join_by(gauge, between(y$datetime, x$prev24, x$sample_datetime))) |>
  summarize(.by = names(with_gauge), rain = sum(value))
# site_id site_name gauge sample_datetime result rain
# 1 96014 St Kilda spotswood_230112A 2020-02-04 15:00:00 65 0.7921
# 2 96613 St Kilda burnley_229621A 2020-02-07 18:00:00 26 NA
# 3 91555 St Kilda st_kilda_marina_229670A 2020-02-05 04:00:00 23 0.3736
# 4 98152 Docklands st_kilda_marina_229670A 2020-02-09 08:00:00 28 NA
# 5 93392 St Kilda spotswood_230112A 2020-02-06 04:00:00 36 0.6584
And in data.table:
library(data.table)
setDT(with_gauge)
setDT(rainfall_data)
time_before <- 24*3600
with_gauge[, prev24 := lubridate::ceiling_date(sample_datetime, "hour") - time_before ]
melt(rainfall_data, id.vars = "datetime", variable.name = "gauge", variable.factor = FALSE) |>
  _[with_gauge, on = .(gauge, datetime >= prev24, datetime <= sample_datetime)] |>
  setnames(c("datetime", "datetime.1"), c("sample_datetime", "prev24")) |>
  _[, .(rain = sum(value)), by = names(with_gauge)] |>
  _[, prev24 := NULL]
# site_id site_name gauge sample_datetime result rain
# <num> <char> <char> <POSc> <num> <num>
# 1: 96014 St Kilda spotswood_230112A 2020-02-03 15:00:00 65 0.7921
# 2: 96613 St Kilda burnley_229621A 2020-02-06 18:00:00 26 NA
# 3: 91555 St Kilda st_kilda_marina_229670A 2020-02-04 04:00:00 23 0.3736
# 4: 98152 Docklands st_kilda_marina_229670A 2020-02-08 08:00:00 28 NA
# 5: 93392 St Kilda spotswood_230112A 2020-02-05 04:00:00 36 0.6584
(Note that using setDT converts both frames to class "data.table". While a data.table still inherits from "data.frame", there are some nuances that can be different enough to be frustrating if you aren't used to them. Feel free to do setDF(with_gauge) after the above calcs, or alternatively add ... |> as.data.frame() at the end to do it inline; same for rainfall_data.)
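For example, a minimal sketch of the conversion back (assuming the pipeline above has already run):

# convert back to plain data.frames in place:
setDF(with_gauge)
setDF(rainfall_data)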
You can try pivoting the rainfall data and then joining it to the gauge data like this:
library(dplyr)
library(tidyr)
library(lubridate)
with_gauge %>%
  mutate(hour = ceiling_date(sample_datetime, "hour"),
         hours = hour - hours(24)) %>%
  inner_join(
    pivot_longer(rainfall_data, -datetime, names_to = "gauge"),
    join_by(gauge, hour >= datetime, hours <= datetime)) %>%
  summarise(rain_sum = sum(value), .by = c(site_id, site_name, gauge, sample_datetime))
Giving:
site_id site_name gauge sample_datetime rain_sum
1 99020 Port Melbourne burnley_229621A 2020-02-18 09:30:00 2.8
2 99020 Port Melbourne burnley_229621A 2020-02-25 09:30:00 2.8
3 99020 Port Melbourne burnley_229621A 2020-03-03 09:54:00 2.8
Which (I think) matches your for-loop code.
___
Data:
with_gauge <- structure(list(site_id = c(99020, 99020, 99020), site_name = c("Port Melbourne",
"Port Melbourne", "Port Melbourne"), gauge = c("burnley_229621A",
"burnley_229621A", "burnley_229621A"), sample_datetime = structure(c(1581989400,
1582594200, 1583200440), class = c("POSIXct", "POSIXt"), tzone = ""),
result = c(41, 86, 30), rain_sum = c(2.8, 2.8, 2.8)), row.names = c(NA,
-3L), class = "data.frame")
rainfall_data <- structure(list(datetime = structure(c(1351702800, 1351706400,
1351710000, 1351713600, 1351717200, 1351720800), class = c("POSIXct",
"POSIXt"), tzone = ""), altona_587047 = c(0, 0, 0, 0, 0.2, 1.8
), spotswood_230112A = c(0, 0, 0, 0, 0.2, 1), burnley_229621A = c(0,
0, 0, 0, 1, 1.8), st_kilda_marina_229670A = c(0, 0, 0, 0, 0.6,
1.4)), row.names = c(NA, -6L), class = "data.frame")
I am using a more useful toy dataset, one where the sample dates actually overlap with the rainfall data:
with_gauge <- structure(list(
site_id = c(96014, 96613, 91555, 98152, 93392),
site_name = c("St Kilda", "St Kilda", "St Kilda", "Docklands", "St Kilda"),
gauge = c("spotswood_230112A", "burnley_229621A", "st_kilda_marina_229670A",
"st_kilda_marina_229670A", "spotswood_230112A"),
sample_datetime = as.POSIXct(c("2020-02-04 15:00:00", "2020-02-07 18:00:00",
"2020-02-05 04:00:00", "2020-02-09 08:00:00",
"2020-02-06 04:00:00")),
result = c(65, 26, 23, 28, 36)
), class = "data.frame", row.names = c(NA, -5L))
rainfall_data <- structure(list(
datetime = as.POSIXct(c("2020-02-04 00:00:00", "2020-02-04 01:00:00",
"2020-02-04 02:00:00", "2020-02-04 12:00:00",
"2020-02-04 18:00:00", "2020-02-05 05:00:00",
"2020-02-05 13:00:00", "2020-02-05 23:00:00",
"2020-02-06 08:00:00", "2020-02-06 13:00:00")),
burnley_229621A = c(0.2692, 0.0103, 0.0201, 0.5738, 0.1512, 0.0427, 0.1674, 0.3791, 0.1138, 0.1525),
altona_587047 = c(0.0291, 0.4796, 0.1237, 0.4694, 0.1174, 0.0573, 0.3422, 0.3729, 0.2967, 0.0863),
spotswood_230112A = c(0.1115, 0.3026, 0.2070, 0.1710, 0.0091, 0.5200, 0.1149, 0.0235, 0.3190, 0.3026),
st_kilda_marina_229670A = c(0.2612, 0.0134, 0.1488, 0.3269, 0.0467, 0.0761, 0.3837, 0.0188, 0.2908, 0.0506)
), class = "data.frame", row.names = c(NA, -10L))
library(data.table)
library(lubridate)
dt <- as.data.table(with_gauge)
rain_dt <- as.data.table(rainfall_data)
dt before the run:
site_id site_name gauge sample_datetime result
<num> <char> <char> <POSc> <num>
1: 96014 St Kilda spotswood_230112A 2020-02-04 15:00:00 65
2: 96613 St Kilda burnley_229621A 2020-02-07 18:00:00 26
3: 91555 St Kilda st_kilda_marina_229670A 2020-02-05 04:00:00 23
4: 98152 Docklands st_kilda_marina_229670A 2020-02-09 08:00:00 28
5: 93392 St Kilda spotswood_230112A 2020-02-06 04:00:00 36
run:
time_hours <- 24
dt[, rain_sum := mapply(function(sample_time, gauge) {
end_time <- ceiling_date(sample_time, "hour")
start_time <- end_time - hours(time_hours)
rain_dt[datetime >= start_time & datetime <= end_time, sum(get(gauge), na.rm = TRUE)]
}, sample_datetime, gauge)]
dt after the run:
   site_id site_name                   gauge     sample_datetime result rain_sum
     <num>    <char>                  <char>              <POSc>  <num>    <num>
1:   96014  St Kilda       spotswood_230112A 2020-02-04 15:00:00     65   0.7921
2:   96613  St Kilda         burnley_229621A 2020-02-07 18:00:00     26   0.0000
3:   91555  St Kilda st_kilda_marina_229670A 2020-02-05 04:00:00     23   0.3736
4:   98152 Docklands st_kilda_marina_229670A 2020-02-09 08:00:00     28   0.0000
5:   93392  St Kilda       spotswood_230112A 2020-02-06 04:00:00     36   0.6584
end_time <- ceiling_date(sample_time, "hour") rounds the sample time from the with_gauge table up to the hour. start_time <- end_time - hours(time_hours) gives the starting time 24 hours before that.
rain_dt[datetime >= start_time & datetime <= end_time, sum(get(gauge), na.rm = TRUE)] looks for all rows of the rain_dt table that fall within the timespan given by start_time and end_time. get(gauge) takes the current gauge name from the dt (with_gauge) table and selects the matching gauge column in the rain_dt table; those values are then added up by sum(), with NA values ignored.
mapply(fun, input1, input2) takes input1 and input2 - here dt's sample_datetime column and gauge column - and goes through them row by row in parallel, applying fun to each pair.
So what we can conclude: whenever we have to loop through two or more vectors in parallel, we can use mapply(fun, arg1, arg2, ...), where fun is a function that takes the elements of arg1, arg2, ... as its arguments and processes them in parallel.
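As a minimal standalone illustration of that pattern (toy inputs, unrelated to the rainfall data):

# mapply() walks through its argument vectors element by element, in parallel
paste_pair <- function(x, y) paste0(x, "-", y)
mapply(paste_pair, c("a", "b", "c"), 1:3)
#     a     b     c
# "a-1" "b-2" "c-3"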
for loop aversion
In a functional programming language like R, we have the apply family of functions (lapply, sapply, vapply - and, as used here, mapply - as well as tapply and apply) to make loops more "functional".
In addition, vectorization is there to avoid loop syntax.
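For example, here is the same computation written in all three styles (a toy example, not benchmark code):

x <- c(1, 4, 9)

# index-based for loop
out <- numeric(length(x))
for (i in seq_along(x)) out[i] <- sqrt(x[i])

# apply-family version
out <- sapply(x, sqrt)

# vectorized version - no explicit loop at all
out <- sqrt(x)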
One thing the apply family of functions buys you is avoiding off-by-one errors, which are very common with for loops over indexes.
But you can also avoid that kind of error with a more Pythonic for loop that iterates over elements directly:
for (element in myiterable) {
# do sth with element
}
Vectorization is, however, the fastest.
But as I have realized from my own experience: for loops come second in speed, ahead of the apply functions!
All the apply functions rely heavily on looping over lists - and lists are very(!) slow in R, because they are very generic and many type checks have to be done for every element.
So I don't share the for loop aversion of the R folks. As a bioinformatician I deal with gigabytes of raw data, and I have often observed that the apply functions are a speed bottleneck, since lists are (because of their generality) extremely slow.
My rule is: when speed is critical, prefer vectorization over for loops, and for loops over apply functions.
library(microbenchmark)
# Create a list of 100,000 numeric vectors, each of length 10
set.seed(42)
x <- replicate(1e5, runif(10), simplify = FALSE)
# for loop with preallocated result
sum_for <- function(lst) {
out <- numeric(length(lst))
for (i in seq_along(lst)) {
out[i] <- sum(lst[[i]])
}
out
}
# lapply version
sum_lapply <- function(lst) {
unlist(lapply(lst, sum))
}
# Benchmark both
microbenchmark(
for_loop = sum_for(x),
lapply_loop = sum_lapply(x),
times = 10L
)
Unit: milliseconds
expr min lq mean median uq max neval
for_loop 11.08882 11.52239 12.66243 13.03458 13.30823 14.19305 10
lapply_loop 17.65583 18.23795 19.42082 18.75961 19.52637 23.16717 10
Or a little more complicated:
library(microbenchmark)
# Generate 500,000 numeric vectors of length 5
set.seed(42)
x <- replicate(5e5, runif(5), simplify = FALSE)
# For-loop with preallocated result vector
sum_for <- function(lst) {
n <- length(lst)
out <- numeric(n)
for (i in seq_len(n)) {
out[i] <- sum(lst[[i]])
}
out
}
# lapply version
sum_lapply <- function(lst) {
unlist(lapply(lst, sum), use.names = FALSE)
}
# Benchmark both
microbenchmark(
for_loop = sum_for(x),
lapply_loop = sum_lapply(x),
times = 5L
)
Unit: milliseconds
expr min lq mean median uq max neval
for_loop 57.86523 59.72093 61.76307 61.34662 64.82309 65.0595 5
lapply_loop 92.52253 98.72029 110.52539 107.22714 116.78157 137.3754 5
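The benchmarks above compare only the for loop and lapply. When the data has a regular shape, a fully vectorized formulation typically beats both; here is a minimal sketch reusing the x from the last benchmark (exact timings will of course vary by machine):

# Pack the 500,000 length-5 vectors into a 5 x 500,000 matrix once,
# so that colSums() can compute all the per-vector sums in vectorized C code
m <- matrix(unlist(x, use.names = FALSE), nrow = 5)

microbenchmark(
  vectorized = colSums(m),
  times = 5L
)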
And in everyday bioinformatics analyses, I have sometimes observed much bigger differences than in these toy benchmarks.