I have some data from five temperature loggers as exported as csvs. Frustratingly it exports the data into a csv with two different date formats either mm/dd/yyyy hh:mm OR mm/dd/yy hh:mm:ss AM/PM in my DateTime_BST column.
I've tried changing the format of the data directly in in the csv but it does not work, it just reverts back to two different states or doesn't change at all. So I thought I could do it in R but I'm not really sure how.
I have the code below which imports all five files into R and combines them into one dataset
hobo_files <- list.files(path = "HOBO_DAT", pattern = "\\.csv$", full.names = TRUE)
hobo_dat <- rbindlist(lapply(hobo_files, fread), use.names = TRUE, fill = TRUE)
str(hobo_dat)
str(hobo_dat)
Classes ‘data.table’ and 'data.frame': 103375 obs. of 6 variables:
$ hobo_sn : int 21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 ...
$ rec_sn : int 461703 461703 461703 461703 461703 461703 461703 461703 461703 461703 ...
$ DateTime_BST: chr "06/09/2024 14:13" "06/09/2024 14:13" "06/09/2024 14:23" "06/09/2024 14:33" ...
$ Temp : num 19.7 NA 20.9 21.3 21.6 ...
$ DateTime_in : chr "12/06/2024 09:32" "12/06/2024 09:32" "12/06/2024 09:32" "12/06/2024 09:32" ...
$ DateTime_out: chr "06/11/2024 16:05" "06/11/2024 16:05" "06/11/2024 16:05" "06/11/2024 16:05" ...
How do I get a standardised date format for my DateTime_BST column?
I have some data from five temperature loggers as exported as csvs. Frustratingly it exports the data into a csv with two different date formats either mm/dd/yyyy hh:mm OR mm/dd/yy hh:mm:ss AM/PM in my DateTime_BST column.
I've tried changing the format of the data directly in in the csv but it does not work, it just reverts back to two different states or doesn't change at all. So I thought I could do it in R but I'm not really sure how.
I have the code below which imports all five files into R and combines them into one dataset
hobo_files <- list.files(path = "HOBO_DAT", pattern = "\\.csv$", full.names = TRUE)
hobo_dat <- rbindlist(lapply(hobo_files, fread), use.names = TRUE, fill = TRUE)
str(hobo_dat)
str(hobo_dat)
Classes ‘data.table’ and 'data.frame': 103375 obs. of 6 variables:
$ hobo_sn : int 21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 ...
$ rec_sn : int 461703 461703 461703 461703 461703 461703 461703 461703 461703 461703 ...
$ DateTime_BST: chr "06/09/2024 14:13" "06/09/2024 14:13" "06/09/2024 14:23" "06/09/2024 14:33" ...
$ Temp : num 19.7 NA 20.9 21.3 21.6 ...
$ DateTime_in : chr "12/06/2024 09:32" "12/06/2024 09:32" "12/06/2024 09:32" "12/06/2024 09:32" ...
$ DateTime_out: chr "06/11/2024 16:05" "06/11/2024 16:05" "06/11/2024 16:05" "06/11/2024 16:05" ...
How do I get a standardised date format for my DateTime_BST column?
Share Improve this question asked Mar 6 at 14:42 mikejwilliamsonmikejwilliamson 5192 gold badges9 silver badges22 bronze badges 3 |2 Answers
Reset to default 2Revision of my previous answer that addresses y2k ambiguity.
psxfun <- function(x) {
candidates <- c("%d/%m/%y %H:%M:%S", "%d/%m/%y %H:%M", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%d/%m/%Y %H:%M:%S", "%d/%m/%Y %H:%M")
out <- rep(.POSIXct(NA), length = length(x))
for (fmt in candidates) {
if (!length(isna <- is.na(out))) break
out[isna] <- as.POSIXct(x[isna], format = fmt)
}
out
}
The order of formats in candidate_formats
is important; for instance, %y
will not match 2024
but %Y
will match 24
and parse not how we intend here. Also, aside from the 2y/4y thing, it is slightly more efficient to have more-likely candidates first, as it will reduce repeats in the loop and calls to as.POSIXct
; this is mostly a theoretical preference until you're dealing with millions of rows, where the extra calls might be palpable (though still manageable).
Sample data, where the last column has some 2y/4y ambiguity (mixed, even):
hobo_dat <- data.table::as.data.table(structure(list(hobo_sn = c(21863784L, 21863784L, 21863784L, 21863784L, 21863784L), rec_sn = c(461703L, 461703L, 461703L, 461703L, 461703L), DateTime_BST = c("06/09/2024 14:13", "06/09/2024 14:13", "06/09/2024 14:23", "06/09/2024 14:33", NA), Temp = c(19.7, NA, 20.9, 21.3, 21.6), DateTime_in = c("12/06/2024 09:32", "12/06/2024 09:32", "12/06/2024 09:32", "12/06/2024 09:32", NA), DateTime_out = c("06/11/24 16:05", "06/11/24 16:05", "06/11/2024 16:05", "06/11/2024 16:05", NA)), row.names = c(NA, -5L ), class = c("data.table", "data.frame")))
hobo_dat
# hobo_sn rec_sn DateTime_BST Temp DateTime_in DateTime_out
# <int> <int> <char> <num> <char> <char>
# 1: 21863784 461703 06/09/2024 14:13 19.7 12/06/2024 09:32 06/11/24 16:05
# 2: 21863784 461703 06/09/2024 14:13 NA 12/06/2024 09:32 06/11/24 16:05
# 3: 21863784 461703 06/09/2024 14:23 20.9 12/06/2024 09:32 06/11/2024 16:05
# 4: 21863784 461703 06/09/2024 14:33 21.3 12/06/2024 09:32 06/11/2024 16:05
# 5: 21863784 461703 <NA> 21.6 <NA> <NA>
Each column, manually:
psxfun(hobo_dat$DateTime_BST)
# [1] "2024-09-06 14:13:00 EDT" "2024-09-06 14:13:00 EDT" "2024-09-06 14:23:00 EDT" "2024-09-06 14:33:00 EDT" NA
psxfun(hobo_dat$DateTime_in)
# [1] "2024-06-12 09:32:00 EDT" "2024-06-12 09:32:00 EDT" "2024-06-12 09:32:00 EDT" "2024-06-12 09:32:00 EDT" NA
psxfun(hobo_dat$DateTime_out)
# [1] "2024-11-06 16:05:00 EST" "2024-11-06 16:05:00 EST" "2024-11-06 16:05:00 EST" "2024-11-06 16:05:00 EST" NA
Since this is a data.table
, assuming 1.16.0 or newer (for the names(.SD) :=
part):
hobo_dat[, names(.SD) := lapply(.SD, psxfun), .SDcols = c("DateTime_BST", "DateTime_in", "DateTime_out")]
# hobo_sn rec_sn DateTime_BST Temp DateTime_in DateTime_out
# <int> <int> <POSc> <num> <POSc> <POSc>
# 1: 21863784 461703 2024-09-06 14:13:00 19.7 2024-06-12 09:32:00 2024-11-06 16:05:00
# 2: 21863784 461703 2024-09-06 14:13:00 NA 2024-06-12 09:32:00 2024-11-06 16:05:00
# 3: 21863784 461703 2024-09-06 14:23:00 20.9 2024-06-12 09:32:00 2024-11-06 16:05:00
# 4: 21863784 461703 2024-09-06 14:33:00 21.3 2024-06-12 09:32:00 2024-11-06 16:05:00
# 5: 21863784 461703 <NA> 21.6 <NA> <NA>
though it can be done just as easily with either of
## base R, even on a `data.table` object
for (nm in c("DateTime_BST", "DateTime_in", "DateTime_out")) hobo_dat[[nm]] <- psxfun(hobo_dat[[nm]])
## dplyr
library(dplyr)
hobo_dat <- mutate(hobo_dat, across(c(DateTime_BST, DateTime_in, DateTime_out), ~ psxfun(.x)))
I ended using an ifelse function to detect where rows contained an AM/PM and then specifying format using that, which works pretty well I think.
hobo_dat$DateTime_BST_cor <- ifelse(
grepl("AM|PM", hobo_dat$DateTime_BST), # If the datetime string contains AM/PM
as.POSIXct(hobo_dat$DateTime_BST, format = "%m/%d/%y %I:%M:%S %p"), # Parse with 12-hour format (AM/PM)
as.POSIXct(hobo_dat$DateTime_BST, format = "%m/%d/%Y %H:%M") # Parse with 24-hour format
)
发布者:admin,转转请注明出处:http://www.yc00.com/questions/1744968879a4603846.html
as.POSIXct(hobo_dat$DateTime_BST, format = "%m/%d/%Y %H:%M")
should give you that column as a standard R Date-time object for the example data shown. The ones that are in mm/dd/yy hh:mm:ss am/pm format you can convert withas.POSIXct(hobo_dat$DateTime_BST, format = "%m/%d/%y %H:%M:%S %p")
– Allan Cameron Commented Mar 6 at 15:06POSIXt
. – r2evans Commented Mar 6 at 15:060024
instead of2024
. If I read the post correctly, we could pick out the two formats by noting that they have a different number of characters (fixed at 17 versus fixed at 19) – Allan Cameron Commented Mar 6 at 15:08