So I have two data frames I am working with. The first contains information on a cohort of kids. These same kids have been measured multiple times on different days (so uniqueID AD_1 indicates child AD receiving their first set of measurements, and AD_2 indicates the same child receiving their second set of measurements). We are interested in the date windows between these measurement days, and have made columns start_date and end_date to define the windows, like so:

uniqueID	start_date	end_date
AD_1	2/22/2023	2/24/2023
KJ_1	2/22/2023	2/26/2023
AD_2	2/24/2023	3/8/2023
KJ_2	2/26/2023	3/2/2023
AD_3	3/8/2023	3/10/2023
KJ_3	3/2/2023	3/9/2023

uniqueID	start_date	end_date
AD_1	2/22/2023	2/24/2023
KJ_1	2/22/2023	2/26/2023
AD_2	2/24/2023	3/8/2023
KJ_2	2/26/2023	3/2/2023
AD_3	3/8/2023	3/10/2023
KJ_3	3/2/2023	3/9/2023

The second data frame contains a continuous range of dates, and maximum daily temperatures for each date:

date	maxtemp
2/22/2023	23.3
2/23/2023	18.8
2/24/2023	25.2
2/25/2023	27.0
2/26/2023	19.1
2/27/2023	20.2
2/28/2023	20.7
3/1/2023	26.1
3/2/2023	27.2
3/3/2023	20.8
3/4/2023	22.4
3/5/2023	21.8
3/6/2023	23.9
3/7/2023	24.9
3/8/2023	25.1
3/9/2023	22.7
3/10/2023	25.5

What I need to be able to do is count the number of days over 25° for each growth window (i.e., between each start and end date) as specified in the first data frame, using the temperature data in the second data frame.

I attempted to do this using the following code:

df1 %>% 
  transmute(uniqueID, day = map2(start_date, end_date, seq, by = "1 day")) %>%
  group_by(uniqueID) %>%
  summarise(count = df2$maxtemp >= 25)

However, this seemed to just generate a massive list of T/F values based on whether each day in the date window had a maximum temperature over 25°. Is there a way to have it add up the occurrences of days over 25°, to generate a new column in the first data frame, like this?

uniqueID	start_date	end_date	daysover25
AD_1	2/22/2023	2/24/2023	1
KJ_1	2/22/2023	2/26/2023	2
AD_2	2/24/2023	3/8/2023	5
KJ_2	2/26/2023	3/2/2023	2
AD_3	3/8/2023	3/10/2023	1
KJ_3	3/2/2023	3/9/2023	2

Share Improve this question edited Mar 25 at 9:22 ThomasIsCoding 104k9 gold badges37 silver badges103 bronze badges asked Mar 24 at 23:05 Taylor Harman 734 bronze badges

Add a comment |

4 Answers 4

Sorted by: Reset to default 8

You can try

df1 %>%
  left_join(
    df2 %>%
      filter(maxtemp >= 25),
    by = join_by(start_date <= date, end_date >= date)
  ) %>%
  reframe(daysover25 = n(), .by = uniqueID:end_date)

which gives

  uniqueID start_date   end_date daysover25
1     AD_1 2023-02-22 2023-02-24          1
2     KJ_1 2023-02-22 2023-02-26          2
3     AD_2 2023-02-24 2023-03-08          5
4     KJ_2 2023-02-26 2023-03-02          2
5     AD_3 2023-03-08 2023-03-10          2
6     KJ_3 2023-03-02 2023-03-09          2

data

> dput(df1)
structure(list(uniqueID = c("AD_1", "KJ_1", "AD_2", "KJ_2", "AD_3",
"KJ_3"), start_date = structure(c(19410, 19410, 19412, 19414,
19424, 19418), class = "Date"), end_date = structure(c(19412,
19414, 19424, 19418, 19426, 19425), class = "Date")), class = "data.frame", row.names = c(NA,
-6L))

> dput(df2)
structure(list(date = structure(c(19410, 19411, 19412, 19413,
19414, 19415, 19416, 19417, 19418, 19419, 19420, 19421, 19422,
19423, 19424, 19425, 19426), class = "Date"), maxtemp = c(23.3,
18.8, 25.2, 27, 19.1, 20.2, 20.7, 26.1, 27.2, 20.8, 22.4, 21.8,
23.9, 24.9, 25.1, 22.7, 25.5)), class = "data.frame", row.names = c(NA,
-17L))

Here's a data.table solution:

library(data.table)
setDT(df1)
setDT(df2)

df1[df2[maxtemp>=25], 
    on = .(start_date<=date, end_date>=date), 
    daysover25 := .N, by=uniqueID][]
___
   uniqueID start_date   end_date daysover25
     <char>     <Date>     <Date>      <int>
1:     AD_1 2023-02-22 2023-02-24          1
2:     KJ_1 2023-02-22 2023-02-26          2
3:     AD_2 2023-02-24 2023-03-08          5
4:     KJ_2 2023-02-26 2023-03-02          2
5:     AD_3 2023-03-08 2023-03-10          2
6:     KJ_3 2023-03-02 2023-03-09          2

You can sum up booleans, where TRUE == 1.

> sum(c(TRUE, TRUE, FALSE))
[1] 2

Avoid using row-wise operations whenever possible, as they are very slow. Exploit vectorization instead, as in this case on start_date and end_date vectors, where mapply() is your friend.

> df1 |>
+   transform(d25=mapply(\(x, y) {
+     sum(with(df2, maxtemp[date >= x & date <= y] >= 25))
+   }, start_date, end_date))
  uniqueID start_date   end_date d25
1     AD_1 2023-02-22 2023-02-24   1
2     KJ_1 2023-02-22 2023-02-26   2
3     AD_2 2023-02-24 2023-03-08   5
4     KJ_2 2023-02-26 2023-03-02   2
5     AD_3 2023-03-08 2023-03-10   2
6     KJ_3 2023-03-02 2023-03-09   2

Note that Date >= x & Date <= y includes both endpoints, which often causes overlap mistakes when a next interval starts at y.

Base R, use merge(), subset(), aggregate() to avoid loops. You should not use such overlapping intervals. Intervals should always be right-open, e.g. see here to start your research.

df2 |>
  merge(df1) |>
  subset(date>=start_date & date<=end_date, -date) |>
  aggregate(maxtemp~uniqueID, data=_, \(x) sum(x>=25)) |> 
  merge(df1) |>
  transform(daysover25=maxtemp, maxtemp=NULL)

  uniqueID start_date   end_date daysover25
1     AD_1 2023-02-22 2023-02-24          1
2     AD_2 2023-02-24 2023-03-08          5
3     AD_3 2023-03-08 2023-03-10          2
4     KJ_1 2023-02-22 2023-02-26          2
5     KJ_2 2023-02-26 2023-03-02          2
6     KJ_3 2023-03-02 2023-03-09          2

We could have subset for maxtemp>=25 earlier and aggregate lengths per uniqueID, i.e.

subset(df2, maxtemp>=25) |>
  merge(df1) |>
  subset(date>=start_date & date<=end_date, -date) |>
  aggregate(maxtemp~uniqueID, data=_, length) |>
  merge(df1) |>
  transform(daysover25=maxtemp, maxtemp=NULL)

  uniqueID start_date   end_date daysover25
1     AD_1 2023-02-22 2023-02-24          1
2     AD_2 2023-02-24 2023-03-08          5
3     AD_3 2023-03-08 2023-03-10          2
4     KJ_1 2023-02-22 2023-02-26          2
5     KJ_2 2023-02-26 2023-03-02          2
6     KJ_3 2023-03-02 2023-03-09          2

The last line is just re-naming (cosmetics). Credits to @ThomasIsCoding for providing reproducible easy-to-copy data.

发布者：admin，转转请注明出处：http://www.yc00.com/questions/1744224555a4563942.html

r - How to count number of instances over a threshold value between two dates? - Stack Overflow

4 Answers 4

data

发表回复

评论列表（0条）

联系我们

400-800-8888

r - How to count number of instances over a threshold value between two dates? - Stack Overflow

4 Answers 4

data

相关推荐

r - How to count number of instances over a threshold value between two dates? - Stack Overflow

发表回复

评论列表（0条）

联系我们

400-800-8888