Re-visit with_tz in lubridate for different timezones

Intro
In my current job, as a (clinical) biostatistician, I still have tons of opportunities to code in R for modelling. There is a chance that I need to convert the UTC time to local time for meaningful analyses. The with_tz function in lubridate will always be the first choice popping out from Google searching or ChatGPT or Copilot [Yes, the world is dramatically changing since my last post].
When applying with_tz function, I realize there might be some mismatches in the output at the certain timezone, after across checking those with my lovely colleagues. I am trying to figure out how this function could work. Here are some toy examples to help the folks and myself when there is a chance to use with_tz function in the future.
Let’s play with some data
adbp <- data.frame(
subjid = c("001", "001", "002", "002"),
bpdttm = as.POSIXct(c("2020-01-18 07:00:00 UTC", "2020-04-18 07:00:00 UTC", "2020-01-18 14:00:00 UTC", "2020-04-18 22:00:00 UTC"), tz='UTC'),
timezone = c("America/Chicago", "America/Chicago", "America/New_York", "America/New_York")
)
adbp## subjid bpdttm timezone
## 1 001 2020-01-18 07:00:00 America/Chicago
## 2 001 2020-04-18 07:00:00 America/Chicago
## 3 002 2020-01-18 14:00:00 America/New_York
## 4 002 2020-04-18 22:00:00 America/New_YorkFirst Try
With new generation training by hardly 😂, dplyr is now to manipulate the data. Additionally, I assume we already googled that with_tz is not Vectorized in all arguments, so rowwise() should be applied.
adbp %>%
rowwise() %>%
mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)) %>%
ungroup()## # A tibble: 4 × 4
## subjid bpdttm timezone bpdttm_local
## <chr> <dttm> <chr> <dttm>
## 1 001 2020-01-18 07:00:00 America/Chicago 2020-01-18 01:00:00
## 2 001 2020-04-18 07:00:00 America/Chicago 2020-04-18 02:00:00
## 3 002 2020-01-18 14:00:00 America/New_York 2020-01-18 08:00:00
## 4 002 2020-04-18 22:00:00 America/New_York 2020-04-18 17:00:00It seems perfect. Should this be all right? As mentioned, since my recent work needs to merge the datasets from different resources, the inconsistency of the (converted) local time trigger me to check the output again.
adbp %>%
filter(row_number()==1) %>% # lets force to use the dplyr way
mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)) ## subjid bpdttm timezone bpdttm_local
## 1 001 2020-01-18 07:00:00 America/Chicago 2020-01-18 01:00:00adbp %>%
filter(row_number()==2) %>%
mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)) ## subjid bpdttm timezone bpdttm_local
## 1 001 2020-04-18 07:00:00 America/Chicago 2020-04-18 02:00:00adbp %>%
filter(row_number()==3) %>%
mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)) ## subjid bpdttm timezone bpdttm_local
## 1 002 2020-01-18 14:00:00 America/New_York 2020-01-18 09:00:00adbp %>%
filter(row_number()==4) %>%
mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)) ## subjid bpdttm timezone bpdttm_local
## 1 002 2020-04-18 22:00:00 America/New_York 2020-04-18 18:00:00Yes, it is easy to see that the output in the later two rows are not same as the above. There is one hour shift for the New York local time. The later ones are correct.
Try in another direction
It might be my bad habit: my debugging process is always to start from something else. As mentioned, some criticize of the with_tz function is that it is not Vectorized in all arguments. So, I am trying to use the conventional way to convert it to a vectorized function.
Vectorize(anyone can trymapply)
Vwith_tz <- Vectorize(with_tz, vectorize.args = c("time", "tzone"))
adbp %>%
mutate(bpdttm_local = Vwith_tz(bpdttm, tzone = timezone),
bpdttm_local1= as.POSIXct(bpdttm_local)) %>%
ungroup()## subjid bpdttm timezone bpdttm_local bpdttm_local1
## 1 001 2020-01-18 07:00:00 America/Chicago 1579330800 2020-01-18 01:00:00
## 2 001 2020-04-18 07:00:00 America/Chicago 1587193200 2020-04-18 02:00:00
## 3 002 2020-01-18 14:00:00 America/New_York 1579356000 2020-01-18 08:00:00
## 4 002 2020-04-18 22:00:00 America/New_York 1587247200 2020-04-18 17:00:00Vectorize transit the output to a seris of number, and when converted back to POSIXct, it seems to be the same as the first try, meaning that the output is not correct. I am not sure whether it is cuz Vectorize does not work or with_tz has its own magic.
Try divide-and-conquer
I doubt this would be considered as a debugging process. I am thinking and proceeding to borrow the idea of the divide-and-conquer strategy: split the data by timezone and then combine them together. As the first try, if it works for the splitted data, it suppose to be corrected after combining them.
adbp %>%
split(.,.$timezone) %>% # lets force to do the fancy tidyverse way
map(~mutate(.x, bpdttm_local = with_tz(bpdttm, tzone = timezone))) %>%
bind_rows()## subjid bpdttm timezone bpdttm_local
## 1 001 2020-01-18 07:00:00 America/Chicago 2020-01-18 01:00:00
## 2 001 2020-04-18 07:00:00 America/Chicago 2020-04-18 02:00:00
## 3 002 2020-01-18 14:00:00 America/New_York 2020-01-18 08:00:00
## 4 002 2020-04-18 22:00:00 America/New_York 2020-04-18 17:00:00OMG, it is incorrect as well. How dare you doubt the fancy tidyverse way?
Well, let’s try more conventional way: split the data into separate datasets and then combine them
split_dfs <- split(adbp, adbp$timezone)
list2env(split_dfs, envir = .GlobalEnv)## <environment: R_GlobalEnv>`America/Chicago` %>%rowwise()%>% mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone))## # A tibble: 2 × 4
## # Rowwise:
## subjid bpdttm timezone bpdttm_local
## <chr> <dttm> <chr> <dttm>
## 1 001 2020-01-18 07:00:00 America/Chicago 2020-01-18 01:00:00
## 2 001 2020-04-18 07:00:00 America/Chicago 2020-04-18 02:00:00`America/New_York` %>%rowwise()%>% mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone))## # A tibble: 2 × 4
## # Rowwise:
## subjid bpdttm timezone bpdttm_local
## <chr> <dttm> <chr> <dttm>
## 1 002 2020-01-18 14:00:00 America/New_York 2020-01-18 09:00:00
## 2 002 2020-04-18 22:00:00 America/New_York 2020-04-18 18:00:00Oh, do you see any ray of hope? Lets combine them together.
`America/Chicago` %>%rowwise()%>% mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone))%>%
bind_rows(
`America/New_York` %>%rowwise()%>% mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)))## # A tibble: 4 × 4
## # Rowwise:
## subjid bpdttm timezone bpdttm_local
## <chr> <dttm> <chr> <dttm>
## 1 001 2020-01-18 07:00:00 America/Chicago 2020-01-18 01:00:00
## 2 001 2020-04-18 07:00:00 America/Chicago 2020-04-18 02:00:00
## 3 002 2020-01-18 14:00:00 America/New_York 2020-01-18 08:00:00
## 4 002 2020-04-18 22:00:00 America/New_York 2020-04-18 17:00:00Bang! It is still incorrect. However, we may get some clues. Lets freeze the converted local time in the character format and then combine:
`America/Chicago` %>%rowwise()%>%
mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone),
bpdttm_localc = as.character(bpdttm_local))%>%
bind_rows(`America/New_York` %>%rowwise()%>%
mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone),
bpdttm_localc = as.character(bpdttm_local))
)## # A tibble: 4 × 5
## # Rowwise:
## subjid bpdttm timezone bpdttm_local bpdttm_localc
## <chr> <dttm> <chr> <dttm> <chr>
## 1 001 2020-01-18 07:00:00 America/Chicago 2020-01-18 01:00:00 2020-01-18 01…
## 2 001 2020-04-18 07:00:00 America/Chicago 2020-04-18 02:00:00 2020-04-18 02…
## 3 002 2020-01-18 14:00:00 America/New_York 2020-01-18 08:00:00 2020-01-18 09…
## 4 002 2020-04-18 22:00:00 America/New_York 2020-04-18 17:00:00 2020-04-18 18…Yes, the character format output is correct rather than the POSIX format.
Conclusion
According to the experiments above, it seems, if the data include different timezones, with_tz function may not work as expected in the POSIX format. There might be something missing here. The solution so far might be to convert the output to character format or use local_time as others suggested, before we have more chances to dig into the source code of with_tz.