Re-visit with_tz in lubridate for different timezones

Intro

In my current job, as a (clinical) biostatistician, I still have tons of opportunities to code in R for modelling. There is a chance that I need to convert the UTC time to local time for meaningful analyses. The with_tz function in lubridate will always be the first choice popping out from Google searching or ChatGPT or Copilot [Yes, the world is dramatically changing since my last post].

When applying with_tz function, I realize there might be some mismatches in the output at the certain timezone, after across checking those with my lovely colleagues. I am trying to figure out how this function could work. Here are some toy examples to help the folks and myself when there is a chance to use with_tz function in the future.

Let’s play with some data

adbp <- data.frame(
  subjid = c("001", "001", "002", "002"),
  bpdttm = as.POSIXct(c("2020-01-18 07:00:00 UTC", "2020-04-18 07:00:00 UTC", "2020-01-18 14:00:00 UTC", "2020-04-18 22:00:00 UTC"), tz='UTC'),
  timezone = c("America/Chicago", "America/Chicago", "America/New_York", "America/New_York")
)
adbp
##   subjid              bpdttm         timezone
## 1    001 2020-01-18 07:00:00  America/Chicago
## 2    001 2020-04-18 07:00:00  America/Chicago
## 3    002 2020-01-18 14:00:00 America/New_York
## 4    002 2020-04-18 22:00:00 America/New_York

First Try

With new generation training by hardly 😂, dplyr is now to manipulate the data. Additionally, I assume we already googled that with_tz is not Vectorized in all arguments, so rowwise() should be applied.

adbp %>% 
  rowwise() %>%
  mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)) %>%
  ungroup()
## # A tibble: 4 × 4
##   subjid bpdttm              timezone         bpdttm_local       
##   <chr>  <dttm>              <chr>            <dttm>             
## 1 001    2020-01-18 07:00:00 America/Chicago  2020-01-18 01:00:00
## 2 001    2020-04-18 07:00:00 America/Chicago  2020-04-18 02:00:00
## 3 002    2020-01-18 14:00:00 America/New_York 2020-01-18 08:00:00
## 4 002    2020-04-18 22:00:00 America/New_York 2020-04-18 17:00:00

It seems perfect. Should this be all right? As mentioned, since my recent work needs to merge the datasets from different resources, the inconsistency of the (converted) local time trigger me to check the output again.

adbp %>% 
  filter(row_number()==1) %>% # lets force to use the dplyr way
  mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)) 
##   subjid              bpdttm        timezone        bpdttm_local
## 1    001 2020-01-18 07:00:00 America/Chicago 2020-01-18 01:00:00
adbp %>% 
  filter(row_number()==2) %>% 
  mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)) 
##   subjid              bpdttm        timezone        bpdttm_local
## 1    001 2020-04-18 07:00:00 America/Chicago 2020-04-18 02:00:00
adbp %>% 
  filter(row_number()==3) %>%
  mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)) 
##   subjid              bpdttm         timezone        bpdttm_local
## 1    002 2020-01-18 14:00:00 America/New_York 2020-01-18 09:00:00
adbp %>% 
  filter(row_number()==4) %>% 
  mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)) 
##   subjid              bpdttm         timezone        bpdttm_local
## 1    002 2020-04-18 22:00:00 America/New_York 2020-04-18 18:00:00

Yes, it is easy to see that the output in the later two rows are not same as the above. There is one hour shift for the New York local time. The later ones are correct.

Try in another direction

It might be my bad habit: my debugging process is always to start from something else. As mentioned, some criticize of the with_tz function is that it is not Vectorized in all arguments. So, I am trying to use the conventional way to convert it to a vectorized function.

  • Vectorize (anyone can try mapply)
Vwith_tz <- Vectorize(with_tz, vectorize.args = c("time", "tzone"))
adbp %>% 
  mutate(bpdttm_local = Vwith_tz(bpdttm, tzone = timezone),
         bpdttm_local1= as.POSIXct(bpdttm_local)) %>%
  ungroup()
##   subjid              bpdttm         timezone bpdttm_local       bpdttm_local1
## 1    001 2020-01-18 07:00:00  America/Chicago   1579330800 2020-01-18 01:00:00
## 2    001 2020-04-18 07:00:00  America/Chicago   1587193200 2020-04-18 02:00:00
## 3    002 2020-01-18 14:00:00 America/New_York   1579356000 2020-01-18 08:00:00
## 4    002 2020-04-18 22:00:00 America/New_York   1587247200 2020-04-18 17:00:00

Vectorize transit the output to a seris of number, and when converted back to POSIXct, it seems to be the same as the first try, meaning that the output is not correct. I am not sure whether it is cuz Vectorize does not work or with_tz has its own magic.

Try divide-and-conquer

I doubt this would be considered as a debugging process. I am thinking and proceeding to borrow the idea of the divide-and-conquer strategy: split the data by timezone and then combine them together. As the first try, if it works for the splitted data, it suppose to be corrected after combining them.

adbp %>%
  split(.,.$timezone) %>% # lets force to do the fancy tidyverse way
  map(~mutate(.x, bpdttm_local = with_tz(bpdttm, tzone = timezone))) %>%
  bind_rows()
##   subjid              bpdttm         timezone        bpdttm_local
## 1    001 2020-01-18 07:00:00  America/Chicago 2020-01-18 01:00:00
## 2    001 2020-04-18 07:00:00  America/Chicago 2020-04-18 02:00:00
## 3    002 2020-01-18 14:00:00 America/New_York 2020-01-18 08:00:00
## 4    002 2020-04-18 22:00:00 America/New_York 2020-04-18 17:00:00

OMG, it is incorrect as well. How dare you doubt the fancy tidyverse way?

Well, let’s try more conventional way: split the data into separate datasets and then combine them

split_dfs <- split(adbp, adbp$timezone)

list2env(split_dfs, envir = .GlobalEnv)
## <environment: R_GlobalEnv>
`America/Chicago` %>%rowwise()%>% mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone))
## # A tibble: 2 × 4
## # Rowwise: 
##   subjid bpdttm              timezone        bpdttm_local       
##   <chr>  <dttm>              <chr>           <dttm>             
## 1 001    2020-01-18 07:00:00 America/Chicago 2020-01-18 01:00:00
## 2 001    2020-04-18 07:00:00 America/Chicago 2020-04-18 02:00:00
`America/New_York` %>%rowwise()%>% mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone))
## # A tibble: 2 × 4
## # Rowwise: 
##   subjid bpdttm              timezone         bpdttm_local       
##   <chr>  <dttm>              <chr>            <dttm>             
## 1 002    2020-01-18 14:00:00 America/New_York 2020-01-18 09:00:00
## 2 002    2020-04-18 22:00:00 America/New_York 2020-04-18 18:00:00

Oh, do you see any ray of hope? Lets combine them together.

`America/Chicago` %>%rowwise()%>% mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone))%>%
  bind_rows(
`America/New_York` %>%rowwise()%>% mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone)))
## # A tibble: 4 × 4
## # Rowwise: 
##   subjid bpdttm              timezone         bpdttm_local       
##   <chr>  <dttm>              <chr>            <dttm>             
## 1 001    2020-01-18 07:00:00 America/Chicago  2020-01-18 01:00:00
## 2 001    2020-04-18 07:00:00 America/Chicago  2020-04-18 02:00:00
## 3 002    2020-01-18 14:00:00 America/New_York 2020-01-18 08:00:00
## 4 002    2020-04-18 22:00:00 America/New_York 2020-04-18 17:00:00

Bang! It is still incorrect. However, we may get some clues. Lets freeze the converted local time in the character format and then combine:

`America/Chicago` %>%rowwise()%>% 
  mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone),
         bpdttm_localc = as.character(bpdttm_local))%>%
  bind_rows(`America/New_York` %>%rowwise()%>% 
              mutate(bpdttm_local = with_tz(bpdttm, tzone = timezone),
                     bpdttm_localc = as.character(bpdttm_local))
            )
## # A tibble: 4 × 5
## # Rowwise: 
##   subjid bpdttm              timezone         bpdttm_local        bpdttm_localc 
##   <chr>  <dttm>              <chr>            <dttm>              <chr>         
## 1 001    2020-01-18 07:00:00 America/Chicago  2020-01-18 01:00:00 2020-01-18 01…
## 2 001    2020-04-18 07:00:00 America/Chicago  2020-04-18 02:00:00 2020-04-18 02…
## 3 002    2020-01-18 14:00:00 America/New_York 2020-01-18 08:00:00 2020-01-18 09…
## 4 002    2020-04-18 22:00:00 America/New_York 2020-04-18 17:00:00 2020-04-18 18…

Yes, the character format output is correct rather than the POSIX format.

Conclusion

According to the experiments above, it seems, if the data include different timezones, with_tz function may not work as expected in the POSIX format. There might be something missing here. The solution so far might be to convert the output to character format or use local_time as others suggested, before we have more chances to dig into the source code of with_tz.

Meng Xu
Meng Xu
(Bio)Statistician

問渠那得清如許?為有源頭活水來

comments powered by Disqus