R optimisation: how can I avoid a for loop in this situation? - Stack Overflow

I've always heard for loops are to be avoided in R, but am not sure about how to make this better.

Here, I have this dataframe "with_gauge":

site_id      site_name              gauge     sample_datetime result
1     99020 Port Melbourne    burnley_229621A 2020-02-18 09:30:00     41
2     99020 Port Melbourne    burnley_229621A 2020-02-25 09:30:00     86
3     99020 Port Melbourne    burnley_229621A 2020-03-03 09:54:00     30

I also have a tibble "rainfall_data" with the hourly rainfall for each gauge:

   datetime            altona_587047 spotswood_230112A burnley_229621A st_kilda_marina_229670A 
   <dttm>                      <dbl>             <dbl>           <dbl>                   <dbl>               
 1 2012-11-01 01:00:00           0                 0               0                       0                    
 2 2012-11-01 02:00:00           0                 0               0                       0                   
 3 2012-11-01 03:00:00           0                 0               0                       0                   
 4 2012-11-01 04:00:00           0                 0               0                       0                   
 5 2012-11-01 05:00:00           0.2               0.2             1                       0.6                 
 6 2012-11-01 06:00:00           1.8               1               1.8                     1.4                 

I want to sum the rainfall that fell in the 24h before each sample, so I have written the following for loop to do this. It works, but I'm sure there is a better way of doing this, and in terms of improving my R skills I would like to learn them. Does anyone have any advice?

for (i in 1:nrow(with_gauge)) {
      date_ <- with_gauge[i, "sample_datetime"]
      date_ <- ceiling_date(date_, "hour")
      gauge_var <- with_gauge[i, "gauge"]
      rain_subset <- rainfall_data[c("datetime",gauge_var)] %>%
          filter(rainfall_data$datetime <= date_ &  
                 rainfall_data$datetime >= date_-hours(time_hours))
      rain_sum <- sum(rain_subset[,gauge_var])
      with_gauge$rain_sum[i] <- rain_sum
}

  • 5 What is time_hours? And please provide the data using dput(). – Edward Commented Mar 21 at 5:41
  • 7 r-bloggers/2016/06/understanding-data-table-rolling-joins – Roland Commented Mar 21 at 6:43
  • 1 Your example data table leads to nowhere because the dates are so far away - could you select some sections of the data tabels where one can see some results together? – Gwang-Jin Kim Commented Mar 21 at 8:22
  • 1 It's not true, unless you are using vectorized functions. Those who use the *apply functions use them primarily for brevity rather than speed. – Carl Witthoft Commented Mar 21 at 13:45

3 Answers

This can easily be done with a range (non-equi) join. Using the sample data provided by Gwang-Jin Kim (thanks!):

with_gauge <- structure(list(site_id = c(96014, 96613, 91555, 98152, 93392), site_name = c("St Kilda", "St Kilda", "St Kilda", "Docklands", "St Kilda"), gauge = c("spotswood_230112A", "burnley_229621A", "st_kilda_marina_229670A", "st_kilda_marina_229670A", "spotswood_230112A"), sample_datetime = structure(c(1580846400, 1581116400, 1580893200, 1581253200, 1580979600), class = c("POSIXct", "POSIXt"), tzone = ""), result = c(65, 26, 23, 28, 36)), class = "data.frame", row.names = c(NA, -5L))
rainfall_data <- structure(list(datetime = structure(c(1580792400, 1580796000, 1580799600, 1580835600, 1580857200, 1580896800, 1580925600, 1580961600, 1580994000, 1581012000), class = c("POSIXct", "POSIXt"), tzone = ""), burnley_229621A = c(0.2692, 0.0103, 0.0201, 0.5738, 0.1512, 0.0427, 0.1674, 0.3791, 0.1138, 0.1525), altona_587047 = c(0.0291, 0.4796, 0.1237, 0.4694, 0.1174, 0.0573, 0.3422, 0.3729, 0.2967, 0.0863), spotswood_230112A = c(0.1115, 0.3026, 0.207, 0.171, 0.0091, 0.52, 0.1149, 0.0235, 0.319, 0.3026),      st_kilda_marina_229670A = c(0.2612, 0.0134, 0.1488, 0.3269, 0.0467, 0.0761, 0.3837, 0.0188, 0.2908, 0.0506)), class = "data.frame", row.names = c(NA, -10L))

We can do this in both dplyr/tidyr and data.table. First, dplyr/tidyr:

library(dplyr)
library(tidyr)     # pivot_longer
library(lubridate) # ceiling_date
time_before <- 24*3600
with_gauge |>
  mutate(prev24 = ceiling_date(sample_datetime, "hour") - time_before) |>
  left_join(pivot_longer(rainfall_data, cols=-datetime, names_to="gauge"), 
            join_by(gauge, between(y$datetime, x$prev24, x$sample_datetime))) |>
  summarize(.by = names(with_gauge), rain = sum(value))
#   site_id site_name                   gauge     sample_datetime result   rain
# 1   96014  St Kilda       spotswood_230112A 2020-02-04 15:00:00     65 0.7921
# 2   96613  St Kilda         burnley_229621A 2020-02-07 18:00:00     26     NA
# 3   91555  St Kilda st_kilda_marina_229670A 2020-02-05 04:00:00     23 0.3736
# 4   98152 Docklands st_kilda_marina_229670A 2020-02-09 08:00:00     28     NA
# 5   93392  St Kilda       spotswood_230112A 2020-02-06 04:00:00     36 0.6584

And in data.table:

library(data.table)
setDT(with_gauge)
setDT(rainfall_data)
time_before <- 24*3600
with_gauge[, prev24 := lubridate::ceiling_date(sample_datetime, "hour") - time_before ]
melt(rainfall_data, id.vars = "datetime", variable.name = "gauge", variable.factor = FALSE) |>
  _[with_gauge, on = .(gauge, datetime >= prev24, datetime <= sample_datetime)] |>
  setnames(c("datetime", "datetime.1"), c("sample_datetime", "prev24")) |>
  _[, .(rain = sum(value)), by = names(with_gauge)] |>
  _[, prev24 := NULL]
#    site_id site_name                   gauge     sample_datetime result   rain
#      <num>    <char>                  <char>              <POSc>  <num>  <num>
# 1:   96014  St Kilda       spotswood_230112A 2020-02-03 15:00:00     65 0.7921
# 2:   96613  St Kilda         burnley_229621A 2020-02-06 18:00:00     26     NA
# 3:   91555  St Kilda st_kilda_marina_229670A 2020-02-04 04:00:00     23 0.3736
# 4:   98152 Docklands st_kilda_marina_229670A 2020-02-08 08:00:00     28     NA
# 5:   93392  St Kilda       spotswood_230112A 2020-02-05 04:00:00     36 0.6584

(Note that using setDT converts both frames to the "data.table" class. While a data.table still inherits "data.frame", there are some nuances that can be different enough to be frustrating if you aren't used to them. Feel free to do setDF(with_gauge) after the above calcs, or alternatively add ... |> as.data.frame() at the end to do it inline; same for rainfall_data.)
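For example, a minimal sketch of converting back afterwards (assuming you want plain data.frames again):

# convert the data.table results back to plain data.frames, by reference
setDF(with_gauge)
setDF(rainfall_data)
class(with_gauge)
# [1] "data.frame"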

You can try pivoting the rainfall data and then joining it to the gauge data like this:

library(dplyr)
library(tidyr)
library(lubridate)
with_gauge %>%
  mutate(hour=ceiling_date(sample_datetime, "hour"),
         hours=hour - hours(24)) %>%   # 24-hour lookback window, as in the question
  inner_join(
    pivot_longer(rainfall_data, -datetime, names_to="gauge"),
    join_by(gauge, hour>=datetime, hours<=datetime)) %>%
  summarise(rain_sum=sum(value), .by=c(site_id, site_name, gauge, sample_datetime))

Giving:

  site_id      site_name           gauge     sample_datetime rain_sum
1   99020 Port Melbourne burnley_229621A 2020-02-18 09:30:00      2.8
2   99020 Port Melbourne burnley_229621A 2020-02-25 09:30:00      2.8
3   99020 Port Melbourne burnley_229621A 2020-03-03 09:54:00      2.8

Which (I think) matches your for-loop code.

___

Data:

with_gauge <- structure(list(site_id = c(99020, 99020, 99020), site_name = c("Port Melbourne", 
"Port Melbourne", "Port Melbourne"), gauge = c("burnley_229621A", 
"burnley_229621A", "burnley_229621A"), sample_datetime = structure(c(1581989400, 
1582594200, 1583200440), class = c("POSIXct", "POSIXt"), tzone = ""), 
    result = c(41, 86, 30), rain_sum = c(2.8, 2.8, 2.8)), row.names = c(NA, 
-3L), class = "data.frame")

rainfall_data <- structure(list(datetime = structure(c(1351702800, 1351706400, 
1351710000, 1351713600, 1351717200, 1351720800), class = c("POSIXct", 
"POSIXt"), tzone = ""), altona_587047 = c(0, 0, 0, 0, 0.2, 1.8
), spotswood_230112A = c(0, 0, 0, 0, 0.2, 1), burnley_229621A = c(0, 
0, 0, 0, 1, 1.8), st_kilda_marina_229670A = c(0, 0, 0, 0, 0.6, 
1.4)), row.names = c(NA, -6L), class = "data.frame")

I am using a more useful toy dataset, where the dates actually overlap:

with_gauge <- structure(list(
  site_id = c(96014, 96613, 91555, 98152, 93392),
  site_name = c("St Kilda", "St Kilda", "St Kilda", "Docklands", "St Kilda"),
  gauge = c("spotswood_230112A", "burnley_229621A", "st_kilda_marina_229670A", 
            "st_kilda_marina_229670A", "spotswood_230112A"),
  sample_datetime = as.POSIXct(c("2020-02-04 15:00:00", "2020-02-07 18:00:00", 
                                 "2020-02-05 04:00:00", "2020-02-09 08:00:00", 
                                 "2020-02-06 04:00:00")),
  result = c(65, 26, 23, 28, 36)
), class = "data.frame", row.names = c(NA, -5L))

rainfall_data <- structure(list(
  datetime = as.POSIXct(c("2020-02-04 00:00:00", "2020-02-04 01:00:00", 
                          "2020-02-04 02:00:00", "2020-02-04 12:00:00", 
                          "2020-02-04 18:00:00", "2020-02-05 05:00:00", 
                          "2020-02-05 13:00:00", "2020-02-05 23:00:00", 
                          "2020-02-06 08:00:00", "2020-02-06 13:00:00")),
  burnley_229621A = c(0.2692, 0.0103, 0.0201, 0.5738, 0.1512, 0.0427, 0.1674, 0.3791, 0.1138, 0.1525),
  altona_587047 = c(0.0291, 0.4796, 0.1237, 0.4694, 0.1174, 0.0573, 0.3422, 0.3729, 0.2967, 0.0863),
  spotswood_230112A = c(0.1115, 0.3026, 0.2070, 0.1710, 0.0091, 0.5200, 0.1149, 0.0235, 0.3190, 0.3026),
  st_kilda_marina_229670A = c(0.2612, 0.0134, 0.1488, 0.3269, 0.0467, 0.0761, 0.3837, 0.0188, 0.2908, 0.0506)
), class = "data.frame", row.names = c(NA, -10L))
library(data.table)
library(lubridate)

dt <- as.data.table(with_gauge)
rain_dt <- as.data.table(rainfall_data)

dt before run:

   site_id site_name                   gauge     sample_datetime result
     <num>    <char>                  <char>              <POSc>  <num>
1:   96014  St Kilda       spotswood_230112A 2020-02-04 15:00:00     65
2:   96613  St Kilda         burnley_229621A 2020-02-07 18:00:00     26
3:   91555  St Kilda st_kilda_marina_229670A 2020-02-05 04:00:00     23
4:   98152 Docklands st_kilda_marina_229670A 2020-02-09 08:00:00     28
5:   93392  St Kilda       spotswood_230112A 2020-02-06 04:00:00     36

run:

time_hours <- 24

dt[, rain_sum := mapply(function(sample_time, gauge) {
  end_time <- ceiling_date(sample_time, "hour")
  start_time <- end_time - hours(time_hours)
  rain_dt[datetime >= start_time & datetime <= end_time, sum(get(gauge), na.rm = TRUE)]
}, sample_datetime, gauge)]

dt after run:

   site_id site_name                   gauge     sample_datetime result
     <num>    <char>                  <char>              <POSc>  <num>
1:   96014  St Kilda       spotswood_230112A 2020-02-04 15:00:00     65
2:   96613  St Kilda         burnley_229621A 2020-02-07 18:00:00     26
3:   91555  St Kilda st_kilda_marina_229670A 2020-02-05 04:00:00     23
4:   98152 Docklands st_kilda_marina_229670A 2020-02-09 08:00:00     28
5:   93392  St Kilda       spotswood_230112A 2020-02-06 04:00:00     36
   rain_sum
      <num>
1:   0.7921
2:   0.0000
3:   0.3736
4:   0.0000
5:   0.6584

end_time <- ceiling_date(sample_time, "hour") rounds the sample time from the with_gauge table up to the next full hour. start_time <- end_time - hours(time_hours) sets the start of the window 24 hours earlier.

The rain_dt[datetime >= start_time & datetime <= end_time, sum(get(gauge), na.rm = TRUE)] call looks for all rows in the rain_dt table that fall within the timespan given by start_time and end_time. get(gauge) takes the current gauge of the dt (with_gauge) row and selects that gauge's column in the rain_dt table; those values are then added up by sum(), with NA values ignored.
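To illustrate get() with a column name held in a variable, here is a minimal sketch on hypothetical toy data (not the question's tables):

library(data.table)
toy <- data.table(x = 1:3, burnley_229621A = c(0.1, NA, 0.4))
col <- "burnley_229621A"                   # column name stored as a character string
toy[x >= 2, sum(get(col), na.rm = TRUE)]   # get(col) resolves it inside the subset
# [1] 0.4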

mapply(fun, input1, input2) takes input1 and input2, which are in this case dt's sample_datetime column and gauge column, goes through them element by element in parallel, and applies fun to each pair.

So what we can conclude is: whenever we have to loop over two or more sequences in parallel, we can use mapply(fun, arg1, arg2, ...), where fun is a function that takes the elements of arg1, arg2, ... as its arguments and processes them pairwise.
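A minimal standalone sketch of that pattern (hypothetical toy vectors, not the question's data):

labels <- c("a", "b", "c")
values <- c(1.5, 2.0, 3.5)
mapply(function(name, value) sprintf("%s = %.1f", name, value),
       labels, values)   # the two vectors are walked through in parallel
#       a         b         c
# "a = 1.5" "b = 2.0" "c = 3.5"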

for loop aversion

In functional programming languages like R, we have the apply family of functions (lapply, sapply, vapply, and, as used here, mapply, as well as the more general tapply and apply) to make loops more "functional". In addition, vectorization avoids loop syntax altogether.

What the apply family of functions does is avoid off-by-one errors, which are very common with for loops over indexes.

But you can also avoid those kinds of errors with a more Pythonic style of for loop:

for (element in myiterable) {
  # do sth with element
}

Vectorization is, however, the fastest.

But as I have found from my own experience: for loops come second in speed, ahead of the apply functions!

All the apply functions rely heavily on looping over lists, and lists are very(!) slow in R, because they are very generic and many type checks have to be done for every element in the list.

So I don't share the for-loop aversion of the R folks. As a bioinformatician I deal with gigabytes of raw data, and I have often observed that the apply functions are a speed bottleneck, since lists are (because of their generality) extremely slow.

My rule is: when speed is critical, prefer vectorization over for loops, and for loops over apply functions.
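As a small illustration of the vectorized end of that rule (a sketch, assuming the vectors have equal length so they can be stacked into a matrix), rowSums() computes every row sum in one call with no R-level loop:

set.seed(42)
m <- matrix(runif(1e5 * 10), ncol = 10)  # same volume as the list benchmark below, but as a matrix
row_totals <- rowSums(m)                 # fully vectorised row sums
head(row_totals)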

library(microbenchmark)

# Create a list of 100,000 numeric vectors, each of length 10
set.seed(42)
x <- replicate(1e5, runif(10), simplify = FALSE)

# for loop with preallocated result
sum_for <- function(lst) {
  out <- numeric(length(lst))
  for (i in seq_along(lst)) {
    out[i] <- sum(lst[[i]])
  }
  out
}

# lapply version
sum_lapply <- function(lst) {
  unlist(lapply(lst, sum))
}

# Benchmark both
microbenchmark(
  for_loop = sum_for(x),
  lapply_loop = sum_lapply(x),
  times = 10L
)
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
    for_loop 11.08882 11.52239 12.66243 13.03458 13.30823 14.19305    10
 lapply_loop 17.65583 18.23795 19.42082 18.75961 19.52637 23.16717    10

Or a little more complicated:

library(microbenchmark)

# Generate 500,000 numeric vectors of length 5
set.seed(42)
x <- replicate(5e5, runif(5), simplify = FALSE)

# For-loop with preallocated result vector
sum_for <- function(lst) {
  n <- length(lst)
  out <- numeric(n)
  for (i in seq_len(n)) {
    out[i] <- sum(lst[[i]])
  }
  out
}

# lapply version
sum_lapply <- function(lst) {
  unlist(lapply(lst, sum), use.names = FALSE)
}

# Benchmark both
microbenchmark(
  for_loop = sum_for(x),
  lapply_loop = sum_lapply(x),
  times = 5L
)
Unit: milliseconds
        expr      min       lq      mean    median        uq      max neval
    for_loop 57.86523 59.72093  61.76307  61.34662  64.82309  65.0595     5
 lapply_loop 92.52253 98.72029 110.52539 107.22714 116.78157 137.3754     5

But in everyday bioinformatics analyses I have sometimes observed much bigger differences.
