converting dates in two formats in a CSV to one standardised format in R

I have some data from five temperature loggers as exported as csvs. Frustratingly it exports the data into a csv with two different date formats either mm/dd/yyyy hh:mm OR mm/dd/yy hh:mm:ss AM/PM in my DateTime_BST column.

I've tried changing the format of the data directly in in the csv but it does not work, it just reverts back to two different states or doesn't change at all. So I thought I could do it in R but I'm not really sure how.

I have the code below which imports all five files into R and combines them into one dataset

hobo_files <- list.files(path = "HOBO_DAT", pattern = "\\.csv$", full.names = TRUE)

hobo_dat <- rbindlist(lapply(hobo_files, fread), use.names = TRUE, fill = TRUE)

str(hobo_dat)

str(hobo_dat)
Classes ‘data.table’ and 'data.frame':  103375 obs. of  6 variables:
 $ hobo_sn     : int  21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 ...
 $ rec_sn      : int  461703 461703 461703 461703 461703 461703 461703 461703 461703 461703 ...
 $ DateTime_BST: chr  "06/09/2024 14:13" "06/09/2024 14:13" "06/09/2024 14:23" "06/09/2024 14:33" ...
 $ Temp        : num  19.7 NA 20.9 21.3 21.6 ...
 $ DateTime_in : chr  "12/06/2024 09:32" "12/06/2024 09:32" "12/06/2024 09:32" "12/06/2024 09:32" ...
 $ DateTime_out: chr  "06/11/2024 16:05" "06/11/2024 16:05" "06/11/2024 16:05" "06/11/2024 16:05" ...

How do I get a standardised date format for my DateTime_BST column?

I have the code below which imports all five files into R and combines them into one dataset

hobo_files <- list.files(path = "HOBO_DAT", pattern = "\\.csv$", full.names = TRUE)

hobo_dat <- rbindlist(lapply(hobo_files, fread), use.names = TRUE, fill = TRUE)

str(hobo_dat)

str(hobo_dat)
Classes ‘data.table’ and 'data.frame':  103375 obs. of  6 variables:
 $ hobo_sn     : int  21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 21863784 ...
 $ rec_sn      : int  461703 461703 461703 461703 461703 461703 461703 461703 461703 461703 ...
 $ DateTime_BST: chr  "06/09/2024 14:13" "06/09/2024 14:13" "06/09/2024 14:23" "06/09/2024 14:33" ...
 $ Temp        : num  19.7 NA 20.9 21.3 21.6 ...
 $ DateTime_in : chr  "12/06/2024 09:32" "12/06/2024 09:32" "12/06/2024 09:32" "12/06/2024 09:32" ...
 $ DateTime_out: chr  "06/11/2024 16:05" "06/11/2024 16:05" "06/11/2024 16:05" "06/11/2024 16:05" ...

How do I get a standardised date format for my DateTime_BST column?

Share Improve this question asked Mar 6 at 14:42 mikejwilliamson 5192 gold badges9 silver badges22 bronze badges

as.POSIXct(hobo_dat$DateTime_BST, format = "%m/%d/%Y %H:%M") should give you that column as a standard R Date-time object for the example data shown. The ones that are in mm/dd/yy hh:mm:ss am/pm format you can convert with as.POSIXct(hobo_dat$DateTime_BST, format = "%m/%d/%y %H:%M:%S %p") – Allan Cameron Commented Mar 6 at 15:06
See stackoverflow/a/70304571/3358272 for "candidate formats", a loop that uses multiple candidate formats for converting to POSIXt. – r2evans Commented Mar 6 at 15:06
2 @r2evans though we need to be careful with mm/dd/yy formats - some of them might convert "successfully" with the year eg 0024 instead of 2024. If I read the post correctly, we could pick out the two formats by noting that they have a different number of characters (fixed at 17 versus fixed at 19) – Allan Cameron Commented Mar 6 at 15:08

Add a comment |

2 Answers 2

Sorted by: Reset to default 2

Revision of my previous answer that addresses y2k ambiguity.

psxfun <- function(x) {
  candidates <- c("%d/%m/%y %H:%M:%S", "%d/%m/%y %H:%M", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%d/%m/%Y %H:%M:%S", "%d/%m/%Y %H:%M")
  out <- rep(.POSIXct(NA), length = length(x))
  for (fmt in candidates) {
    if (!length(isna <- is.na(out))) break
    out[isna] <- as.POSIXct(x[isna], format = fmt)
  }
  out
}

The order of formats in candidate_formats is important; for instance, %y will not match 2024 but %Y will match 24 and parse not how we intend here. Also, aside from the 2y/4y thing, it is slightly more efficient to have more-likely candidates first, as it will reduce repeats in the loop and calls to as.POSIXct; this is mostly a theoretical preference until you're dealing with millions of rows, where the extra calls might be palpable (though still manageable).

Sample data, where the last column has some 2y/4y ambiguity (mixed, even):

hobo_dat <- data.table::as.data.table(structure(list(hobo_sn = c(21863784L, 21863784L, 21863784L, 21863784L, 21863784L), rec_sn = c(461703L, 461703L, 461703L, 461703L, 461703L), DateTime_BST = c("06/09/2024 14:13", "06/09/2024 14:13", "06/09/2024 14:23", "06/09/2024 14:33", NA), Temp = c(19.7, NA, 20.9, 21.3, 21.6), DateTime_in = c("12/06/2024 09:32", "12/06/2024 09:32", "12/06/2024 09:32", "12/06/2024 09:32", NA), DateTime_out = c("06/11/24 16:05", "06/11/24 16:05", "06/11/2024 16:05", "06/11/2024 16:05", NA)), row.names = c(NA, -5L ), class = c("data.table", "data.frame")))
hobo_dat
#     hobo_sn rec_sn     DateTime_BST  Temp      DateTime_in     DateTime_out
#       <int>  <int>           <char> <num>           <char>           <char>
# 1: 21863784 461703 06/09/2024 14:13  19.7 12/06/2024 09:32   06/11/24 16:05
# 2: 21863784 461703 06/09/2024 14:13    NA 12/06/2024 09:32   06/11/24 16:05
# 3: 21863784 461703 06/09/2024 14:23  20.9 12/06/2024 09:32 06/11/2024 16:05
# 4: 21863784 461703 06/09/2024 14:33  21.3 12/06/2024 09:32 06/11/2024 16:05
# 5: 21863784 461703             <NA>  21.6             <NA>             <NA>

Each column, manually:

psxfun(hobo_dat$DateTime_BST)
# [1] "2024-09-06 14:13:00 EDT" "2024-09-06 14:13:00 EDT" "2024-09-06 14:23:00 EDT" "2024-09-06 14:33:00 EDT" NA                       
psxfun(hobo_dat$DateTime_in)
# [1] "2024-06-12 09:32:00 EDT" "2024-06-12 09:32:00 EDT" "2024-06-12 09:32:00 EDT" "2024-06-12 09:32:00 EDT" NA                       
psxfun(hobo_dat$DateTime_out)
# [1] "2024-11-06 16:05:00 EST" "2024-11-06 16:05:00 EST" "2024-11-06 16:05:00 EST" "2024-11-06 16:05:00 EST" NA

Since this is a data.table, assuming 1.16.0 or newer (for the names(.SD) := part):

hobo_dat[, names(.SD) := lapply(.SD, psxfun), .SDcols = c("DateTime_BST", "DateTime_in", "DateTime_out")]
#     hobo_sn rec_sn        DateTime_BST  Temp         DateTime_in        DateTime_out
#       <int>  <int>              <POSc> <num>              <POSc>              <POSc>
# 1: 21863784 461703 2024-09-06 14:13:00  19.7 2024-06-12 09:32:00 2024-11-06 16:05:00
# 2: 21863784 461703 2024-09-06 14:13:00    NA 2024-06-12 09:32:00 2024-11-06 16:05:00
# 3: 21863784 461703 2024-09-06 14:23:00  20.9 2024-06-12 09:32:00 2024-11-06 16:05:00
# 4: 21863784 461703 2024-09-06 14:33:00  21.3 2024-06-12 09:32:00 2024-11-06 16:05:00
# 5: 21863784 461703                <NA>  21.6                <NA>                <NA>

though it can be done just as easily with either of

## base R, even on a `data.table` object
for (nm in c("DateTime_BST", "DateTime_in", "DateTime_out")) hobo_dat[[nm]] <- psxfun(hobo_dat[[nm]])
## dplyr
library(dplyr)
hobo_dat <- mutate(hobo_dat, across(c(DateTime_BST, DateTime_in, DateTime_out), ~ psxfun(.x)))

I ended using an ifelse function to detect where rows contained an AM/PM and then specifying format using that, which works pretty well I think.

hobo_dat$DateTime_BST_cor <- ifelse(
  grepl("AM|PM", hobo_dat$DateTime_BST),  # If the datetime string contains AM/PM
  as.POSIXct(hobo_dat$DateTime_BST, format = "%m/%d/%y %I:%M:%S %p"),  # Parse with 12-hour format (AM/PM)
  as.POSIXct(hobo_dat$DateTime_BST, format = "%m/%d/%Y %H:%M")          # Parse with 24-hour format
)

发布者：admin，转转请注明出处：http://www.yc00.com/questions/1744968879a4603846.html

converting dates in two formats in a CSV to one standardised format in R - Stack Overflow

2 Answers 2

发表回复

评论列表（0条）

联系我们

400-800-8888

converting dates in two formats in a CSV to one standardised format in R - Stack Overflow

2 Answers 2

相关推荐