r - How to count number of instances over a threshold value between two dates? - Stack Overflow

So I have two data frames I am working with. The first contains information on a cohort of kids. These

So I have two data frames I am working with. The first contains information on a cohort of kids. These same kids have been measured multiple times on different days (so uniqueID AD_1 indicates child AD receiving their first set of measurements, and AD_2 indicates the same child receiving their second set of measurements). We are interested in the date windows between these measurement days, and have made columns start_date and end_date to define the windows, like so:

uniqueID start_date end_date
AD_1 2/22/2023 2/24/2023
KJ_1 2/22/2023 2/26/2023
AD_2 2/24/2023 3/8/2023
KJ_2 2/26/2023 3/2/2023
AD_3 3/8/2023 3/10/2023
KJ_3 3/2/2023 3/9/2023

So I have two data frames I am working with. The first contains information on a cohort of kids. These same kids have been measured multiple times on different days (so uniqueID AD_1 indicates child AD receiving their first set of measurements, and AD_2 indicates the same child receiving their second set of measurements). We are interested in the date windows between these measurement days, and have made columns start_date and end_date to define the windows, like so:

uniqueID start_date end_date
AD_1 2/22/2023 2/24/2023
KJ_1 2/22/2023 2/26/2023
AD_2 2/24/2023 3/8/2023
KJ_2 2/26/2023 3/2/2023
AD_3 3/8/2023 3/10/2023
KJ_3 3/2/2023 3/9/2023

The second data frame contains a continuous range of dates, and maximum daily temperatures for each date:

date maxtemp
2/22/2023 23.3
2/23/2023 18.8
2/24/2023 25.2
2/25/2023 27.0
2/26/2023 19.1
2/27/2023 20.2
2/28/2023 20.7
3/1/2023 26.1
3/2/2023 27.2
3/3/2023 20.8
3/4/2023 22.4
3/5/2023 21.8
3/6/2023 23.9
3/7/2023 24.9
3/8/2023 25.1
3/9/2023 22.7
3/10/2023 25.5

What I need to be able to do is count the number of days over 25° for each growth window (i.e., between each start and end date) as specified in the first data frame, using the temperature data in the second data frame.

I attempted to do this using the following code:

df1 %>% 
  transmute(uniqueID, day = map2(start_date, end_date, seq, by = "1 day")) %>%
  group_by(uniqueID) %>%
  summarise(count = df2$maxtemp >= 25)

However, this seemed to just generate a massive list of T/F values based on whether each day in the date window had a maximum temperature over 25°. Is there a way to have it add up the occurrences of days over 25°, to generate a new column in the first data frame, like this?

uniqueID start_date end_date daysover25
AD_1 2/22/2023 2/24/2023 1
KJ_1 2/22/2023 2/26/2023 2
AD_2 2/24/2023 3/8/2023 5
KJ_2 2/26/2023 3/2/2023 2
AD_3 3/8/2023 3/10/2023 1
KJ_3 3/2/2023 3/9/2023 2
Share Improve this question edited Mar 25 at 9:22 ThomasIsCoding 104k9 gold badges37 silver badges103 bronze badges asked Mar 24 at 23:05 Taylor HarmanTaylor Harman 734 bronze badges
Add a comment  | 

4 Answers 4

Reset to default 8

You can try

df1 %>%
  left_join(
    df2 %>%
      filter(maxtemp >= 25),
    by = join_by(start_date <= date, end_date >= date)
  ) %>%
  reframe(daysover25 = n(), .by = uniqueID:end_date)

which gives

  uniqueID start_date   end_date daysover25
1     AD_1 2023-02-22 2023-02-24          1
2     KJ_1 2023-02-22 2023-02-26          2
3     AD_2 2023-02-24 2023-03-08          5
4     KJ_2 2023-02-26 2023-03-02          2
5     AD_3 2023-03-08 2023-03-10          2
6     KJ_3 2023-03-02 2023-03-09          2

data

> dput(df1)
structure(list(uniqueID = c("AD_1", "KJ_1", "AD_2", "KJ_2", "AD_3",
"KJ_3"), start_date = structure(c(19410, 19410, 19412, 19414,
19424, 19418), class = "Date"), end_date = structure(c(19412,
19414, 19424, 19418, 19426, 19425), class = "Date")), class = "data.frame", row.names = c(NA,
-6L))

> dput(df2)
structure(list(date = structure(c(19410, 19411, 19412, 19413,
19414, 19415, 19416, 19417, 19418, 19419, 19420, 19421, 19422,
19423, 19424, 19425, 19426), class = "Date"), maxtemp = c(23.3,
18.8, 25.2, 27, 19.1, 20.2, 20.7, 26.1, 27.2, 20.8, 22.4, 21.8,
23.9, 24.9, 25.1, 22.7, 25.5)), class = "data.frame", row.names = c(NA,
-17L))

Here's a data.table solution:

library(data.table)
setDT(df1)
setDT(df2)
df1[df2[maxtemp>=25], 
    on = .(start_date<=date, end_date>=date), 
    daysover25 := .N, by=uniqueID][]
___
   uniqueID start_date   end_date daysover25
     <char>     <Date>     <Date>      <int>
1:     AD_1 2023-02-22 2023-02-24          1
2:     KJ_1 2023-02-22 2023-02-26          2
3:     AD_2 2023-02-24 2023-03-08          5
4:     KJ_2 2023-02-26 2023-03-02          2
5:     AD_3 2023-03-08 2023-03-10          2
6:     KJ_3 2023-03-02 2023-03-09          2

You can sum up booleans, where TRUE == 1.

> sum(c(TRUE, TRUE, FALSE))
[1] 2

Avoid using row-wise operations whenever possible, as they are very slow. Exploit vectorization instead, as in this case on start_date and end_date vectors, where mapply() is your friend.

> df1 |>
+   transform(d25=mapply(\(x, y) {
+     sum(with(df2, maxtemp[date >= x & date <= y] >= 25))
+   }, start_date, end_date))
  uniqueID start_date   end_date d25
1     AD_1 2023-02-22 2023-02-24   1
2     KJ_1 2023-02-22 2023-02-26   2
3     AD_2 2023-02-24 2023-03-08   5
4     KJ_2 2023-02-26 2023-03-02   2
5     AD_3 2023-03-08 2023-03-10   2
6     KJ_3 2023-03-02 2023-03-09   2

Note that Date >= x & Date <= y includes both endpoints, which often causes overlap mistakes when a next interval starts at y.

Base R, use merge(), subset(), aggregate() to avoid loops. You should not use such overlapping intervals. Intervals should always be right-open, e.g. see here to start your research.

df2 |>
  merge(df1) |>
  subset(date>=start_date & date<=end_date, -date) |>
  aggregate(maxtemp~uniqueID, data=_, \(x) sum(x>=25)) |> 
  merge(df1) |>
  transform(daysover25=maxtemp, maxtemp=NULL)
  uniqueID start_date   end_date daysover25
1     AD_1 2023-02-22 2023-02-24          1
2     AD_2 2023-02-24 2023-03-08          5
3     AD_3 2023-03-08 2023-03-10          2
4     KJ_1 2023-02-22 2023-02-26          2
5     KJ_2 2023-02-26 2023-03-02          2
6     KJ_3 2023-03-02 2023-03-09          2

We could have subset for maxtemp>=25 earlier and aggregate lengths per uniqueID, i.e.

subset(df2, maxtemp>=25) |>
  merge(df1) |>
  subset(date>=start_date & date<=end_date, -date) |>
  aggregate(maxtemp~uniqueID, data=_, length) |>
  merge(df1) |>
  transform(daysover25=maxtemp, maxtemp=NULL)
  uniqueID start_date   end_date daysover25
1     AD_1 2023-02-22 2023-02-24          1
2     AD_2 2023-02-24 2023-03-08          5
3     AD_3 2023-03-08 2023-03-10          2
4     KJ_1 2023-02-22 2023-02-26          2
5     KJ_2 2023-02-26 2023-03-02          2
6     KJ_3 2023-03-02 2023-03-09          2

The last line is just re-naming (cosmetics). Credits to @ThomasIsCoding for providing reproducible easy-to-copy data.

发布者:admin,转转请注明出处:http://www.yc00.com/questions/1744224555a4563942.html

相关推荐

发表回复

评论列表(0条)

  • 暂无评论

联系我们

400-800-8888

在线咨询: QQ交谈

邮件:admin@example.com

工作时间:周一至周五,9:30-18:30,节假日休息

关注微信