Snippets
Misc
Check whether an environment variable is empty
nzchar(Sys.getenv("blopblopblop")) #> [1] FALSE ::with_envvar( withrnew = c("blopblopblop" = "bla"), nzchar(Sys.getenv("blopblopblop")) )
Use a package for a single instance using {withr::with_package}
- Using library() will keep the package loaded during the whole session, with_package() just runs the code snippet with that package temporarily loaded. This can be useful to avoid namespace collisions for example
Read .csv from a zipped file
# long way <- tempfile() tmpf <- tempfile() tmpd download.file('https://website.org/path/to/file.zip', tmpf) unzip(tmpf, exdir = tmpd) <- data.table::fread(file.path(tmpd, y grep('csv$', unzip(tmpf, list = TRUE)$Name, value = TRUE))) unlink(tmpf) unlink(tmpd) # quick way <- data.table::fread('curl https://website.org/path/to/file.zip | funzip') y
Load all R scripts from a directory:
for (file in list.files("R", full.names = TRUE)) source(file)
View dataframe in View as html table using {kableExtra}
<- kableExtra::kbl(rbind(head(df, 5), tail(df, 5)), format = "html") df_html print(df_html)
Options
{readr}
options(readr.show_col_types = FALSE)
Cleaning
Misc
- Dates should follow YYYY-MM-DD format (ISO 8601 standard)
Remove all objects except:
rm(list=setdiff(ls(), c("train", "validate", "test")))
Remove NAs
dataframes
%>% na.omit df %>% filter(complete.cases(.)) df %>% tidyr::drop_na() df
variables
%>% filter(!is.na(x1)) df %>% tidyr::drop_na(x1) df
Find duplicate rows
{datawizard} - Extract all duplicates, for visual inspection. Note that it also contains the first occurrence of future duplicates, unlike
duplicated
ordplyr::distinct
. Also contains an additional column reporting the number of missing values for that row, to help in the decision-making when selecting which duplicates to keep.<- data.frame( df1 id = c(1, 2, 3, 1, 3), year = c(2022, 2022, 2022, 2022, 2000), item1 = c(NA, 1, 1, 2, 3), item2 = c(NA, 1, 1, 2, 3), item3 = c(NA, 1, 1, 2, 3) ) data_duplicated(df1, select = "id") #> Row id year item1 item2 item3 count_na #> 1 1 1 2022 NA NA NA 3 #> 4 4 1 2022 2 2 2 0 #> 3 3 3 2022 1 1 1 0 #> 5 5 3 2000 3 3 3 0 data_duplicated(df1, select = c("id", "year")) #> 1 1 1 2022 NA NA NA 3 #> 4 4 1 2022 2 2 2 0
{dplyr}
<- dat %>% dups group_by(BookingNumber, BookingDate, Charge) %>% filter(n() > 1)
base r
duplicated(df["ID"], fromLast = F) | duplicated(df["ID"], fromLast = T), ] df[ ## ID value_1 value_2 value_1_2 ## 2 ID-003 6 5 6 5 ## 3 ID-006 1 3 1 3 ## 4 ID-003 1 4 1 4 ## 5 ID-005 5 5 5 5 ## 6 ID-003 2 3 2 3 ## 7 ID-005 2 2 2 2 ## 9 ID-006 7 2 7 2 ## 10 ID-006 2 3 2 3
df[duplicated(df["ID"], fromLast = F)
doesn’t include the first occurence, so also counting from the opposite direction will include all occurences of the duplicated rows
{tidydensity}
<- data.frame( data x = c(1, 2, 3, 1), y = c(2, 3, 4, 2), z = c(3, 2, 5, 3) ) check_duplicate_rows(data) #> [1] FALSE TRUE FALSE FALSE
Remove duplicated rows
{datawizard} - From all rows with at least one duplicated ID, keep only one. Methods for selecting the duplicated row are either the first duplicate, the last duplicate, or the “best” duplicate (default), based on the duplicate with the smallest number of NA. In case of ties, it picks the first duplicate, as it is the one most likely to be valid and authentic, given practice effects.
<- data.frame( df1 id = c(1, 2, 3, 1, 3), item1 = c(NA, 1, 1, 2, 3), item2 = c(NA, 1, 1, 2, 3), item3 = c(NA, 1, 1, 2, 3) ) data_unique(df1, select = "id") #> (2 duplicates removed, with method 'best') #> id item1 item2 item3 #> 1 1 2 2 2 #> 2 2 1 1 1 #> 3 3 1 1 1
base R
!duplicated(df[c("col1")]), ] df[
dplyr
distinct(df, col1, .keep_all = TRUE)
Showing all combinations present in the data and creating all possible combinations
Fuzzy Join (alt to case_when)
<- data.frame( ref.df bucket = c(“High”, “Medium-High”, “Medium-Low”, “Low”), value.high = c(max(USArrests$Assault), 249, 199, 149), value.low = c(250, 200, 150, min(USArrests$Assault))) %>% USArrests fuzzy_join(ref.df, by = c("Assault"="value.low", "Assault" = 'value.high'), match_fun = c(`>=`,`<=`)) %>% select(-c(value.high, value.low))
Remove elements of a list by name
::discard_at(my_list, "a") purrr::list_remove listr
Filter row index before/after a condition
Example 1: Find flights that departed after an “AA” flight departed (article)
|> flights_df select(dep_time, flight, carrier) |> arrange(dep_time) |> slice( unique(sort(c( which(carrier == "AA"), which(carrier == "AA") + 1 ))) )
which(carrier == "AA")
isn’t strictly necessary. It also includes AA flight that is proceded by the flight we’re looking for in case that’s something you also want to look at.- Without
sort
, the output row order will have those strictly unnecessary AA flights first, then the flights we’re interested instead of them being in order of departure time.c
will contanenate the row indexes outputted by thewhich
functions and sort will order them. This will result in the order being by departure time. - There’s a duplicate row, so
unique
gets rid of it.
Example 2: Get countries that are immediately before and after Germany in GDP for each year (article)
<- gapminder_df |> gdp_dat group_by(year) |> arrange(gdpPercap) |> slice(as.integer(outer(-1:1, which(country == "Germany"), `+` ))) |> # for ordering columns in plot mutate(grp = forcats::fct_inorder(c("lo", "is", "hi"))) |> # Ungroup and make ggplot ungroup()
* This assumes Germany will always have a country above and below it in GDP. See article for more robust code *
Instead using the syntax in the previous example where 1 is added to the index,
outer
is used so that you don’t have to repeat a bunch ofwhich
statements.- -1:1 gets the rows before, after, and including the Germany row.
outer
with+
stacks the vectors of indexes from each which statment on top of each other into a matrixas.integer
coerces the matrix into a vector soslice
can filter the indexes.- See R, Base R >> Functions >> outer for a details on the
outer
function
Chart code
Code
ggplot(gdp_dat, aes(as.factor(year), gdpPercap, group = grp)) + geom_col(aes(fill = grp == "is"), position = position_dodge()) + geom_text( aes(label = country_code), vjust = 1.3, position = position_dodge(width = .9) + ) scale_fill_manual( values = c("grey75", "steelblue"), guide = guide_none() + ) theme_classic() + labs(x = "Year", y = "GDP per capita")
Create labelled columns (source)
<- penguins |> penguins_labelled ::set_variable_labels( labelledspecies = "Penguin species", island = "Island in Palmer Archipelago, Antarctica", bill_length_mm = "Bill length (mm)", bill_depth_mm = "Bill depth (mm)", flipper_length_mm = "Flipper length (mm)", body_mass_g = "Body mass (g)", sex = "Penguin sex", year = "Study year" )View(penguins_labelled)
Transformations
- Dummy Encode (article)
Some modeling packages don’t accept factor variables and dummies must be explicitly provided.
dplyr
<- penguins_explicit reduce( levels(penguins$species)[-1], ~ mutate(.x, !!paste0("species", .y) := as.integer(species == .y)), .init = penguins )
.x provides the .init tibble and the successively recursed tibbles
To get a feel for what’s happening, here’s a simple illustration of the tidyevall bang-bang syntax plus walrus operator
<- c("a", "b", "c") new_cols # add 3 cols called a,b,c with NAs %>% mtcars head() %>% select(mpg) %>% mutate(!!new_cols[1] := NA) %>% mutate(!!new_cols[2] := NA) %>% mutate(!!new_cols[3] := NA)
data.table
<- as.data.table(penguins) penguins_dt <- levels(penguins_dt$species)[-1] treatment_lvls <- paste0("species", treatment_lvls) treatment_cols := lapply(treatment_lvls, function(x){as.integer(species == x)})][] penguins_dt[, (treatment_cols)
Functions
Create formula from string
<- 'Days_Attended ~ W + School' analysis_formula <- function(data) lm(as.formula(analysis_formula), data = data) estimator_func
Recursive Function
Example
# Replace pkg text with html <- function(dat, patterns) { replace_txt if (length(patterns) == 0) { return(dat) } <- patterns[[1]]$pattern_str pattern_str <- patterns[[1]]$repl_str repl_str <- dat |> replaced_txt str_replace_all(pattern = pattern_str, repl_str) <- patterns[-1] new_patterns replace_txt(replaced_txt, new_patterns) }
- Arguments include the dataset and the iterable
- Tests whether function has iterated through pattern list
- Removes 1st element of the list
replace_text
calls itself within the function with the new list and new dataset
Example: Using
Recall
andtryCatch
<- function(rd) { load_page_completely # load more content even if it throws an error tryCatch({ # call load_more() load_more(rd) # if no error is thrown, call the load_page_completely() function again Recall(rd) error = function(e) { }, # if an error is thrown return nothing / NULL }) }
load_more
is a user defined functionRecall
is a base R function that calls the same function it’s in.
Error Handling for Internal Functions (source)
library(cli) library(rlang) <- assert3 function(x, y, 1arg = caller_arg(x), 2call_expr = caller_call()) { 3if (!inherits(x, y)) { 4abort(format_error("{.strong {arg}} must be of class {y}"), 5call = call_expr) } } <- function(x, class) { some_fun3 assert3(x, class) } some_fun3(5, "character") #> Error in `some_fun3()`: #> ! x must be of class character #> Run `rlang::last_trace()` to see where the error occurred. last_trace() #> <error/rlang_error> #> Error in `some_fun3()`: #> ! x must be of class character #> --- #> Backtrace: #> ▆ #> 1. └─global some_fun3(5, "character") #> 2. └─global assert3(x, class) #> Run rlang::last_trace(drop = FALSE) to see 1 hidden frame.
- 1
-
rlang::caller_arg
formats the value of an argument (e.g. x) into a string so that can be used in error messages - 2
-
rlang::caller_call
(with default being n = 1) goes 1 level up to get the function that called this function. - 3
-
inherit
is a logical that tests whether x has the class indicated by y. Similar to anis.*
function. - 4
-
cli::.strong
adds bold text styling to the value - 5
-
call in
rlang::abort
specifies the environment to mention in the error message.
- By using
rlang::caller_call
in the error function, the user will know which function in their script failed which is 1 level up from the assert (internal function). - Then
rlang::last_trace
is used to drill down and discover that error was triggered in the internal function (e.g.assert3
)
Calculations
Compute the running maximum per group
<- structure(list( df var = c(5L, 2L, 3L, 4L, 0L, 3L, 6L, 4L, 8L, 4L), group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("a", "b"), class = "factor"), time = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L)), .Names = c("var", "group","time"), class = "data.frame", row.names = c(NA, -10L)) order(df$group, df$time),] df[# var group time # 1 5 a 1 # 2 2 a 2 # 3 3 a 3 # 4 4 a 4 # 5 0 a 5 # 6 3 b 1 # 7 6 b 2 # 8 4 b 3 # 9 8 b 4 # 10 4 b 5 $curMax <- (dfave(df$var, $group, dfFUN=cummax)) | group | time | curMax var 5 a 1 5 2 a 2 5 3 a 3 5 4 a 4 5 0 a 5 5 3 b 1 3 6 b 2 6 4 b 3 6 8 b 4 8 4 b 5 8
Time Series
Base-R
The previous month and its year (source)
library(lubridate) <- prev_month add_with_rollback(today(), months(-1)) |> month(label = TRUE, abbr = FALSE) <- prev_months_yr add_with_rollback(today(), months(-1)) |> year() # base r format(seq(Sys.Date(), length = 2, by = "-1 month")[2], "%B %Y")
Intervals
Difference between dates
# Sample dates <- as.Date("2022-01-15") start_date <- as.Date("2023-07-20") end_date # Calculate time difference in days <- end_date - start_date time_diff_days # Convert days to months <- as.numeric(time_diff_days) / 30.44 # average days in a month months_diff_base cat("Number of months using base R:", round(months_diff_base, 2), "\n") #> Number of months using base R: 18.1
Moving Windows
- Example (article)
For each 3 day window:
- Find the minimum value
- Find the time of the minimum value
- Find the value of two other columns at that time of the minimum value
Code
<- tibble( (ts_df time = 1:6, val = sample(1:6 * 10), col1 = rnorm(6), col2 = rnorm(6) ))#> # A tibble: 6 × 4 #> time val col1 col2 #> <int> <dbl> <dbl> <dbl> #> 1 1 10 -0.529 0.928 #> 2 2 30 0.874 -0.967 #> 3 3 60 1.50 1.49 #> 4 4 40 -1.14 0.575 #> 5 5 20 0.114 0.307 #> 6 6 50 -0.712 -0.291 <- ts_df |> moving_mins slice( outer(-2:0, row_number(), "+")[,-(1:2)] |> apply(MARGIN = 2L, \(i) i[which.min(val[i])]) |> ) rename_with(~ paste0("min3val_", .x)) |> mutate(time = ts_df$time[-(1:2)]) left_join(ts_df, moving_mins, by = "time") #> time val col1 col2 min3val_time min3val_val min3val_col1 min3val_col2 #> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> #> 1 1 10 -0.529 0.928 NA NA NA NA #> 2 2 30 0.874 -0.967 NA NA NA NA #> 3 3 60 1.50 1.49 1 10 -0.529 0.928 #> 4 4 40 -1.14 0.575 2 30 0.874 -0.967 #> 5 5 20 0.114 0.307 5 20 0.114 0.307 #> 6 6 50 -0.712 -0.291 5 20 0.114 0.307
- See article for in-depth breakdown of what the code is doing.
outer
creates a column-wise lagged matrix- Also see Cleaning >> Filter row index before/after a condition >> Example 2 which gives more detail on what’s happening with
outer
- Also see Cleaning >> Filter row index before/after a condition >> Example 2 which gives more detail on what’s happening with
apply
loops through each column and selects the index of the minimum valueslice
filters the dateset for those indexes- Everything else is just adding columns back from the original df
- Example (article)
{lubridate}
Intervals
Lubridate’s interval functions
Notes from: Wrangling interval data using lubridate
Difference between dates
# Load the lubridate package library(lubridate) # Sample dates <- ymd("2022-01-15") start_date <- ymd("2023-07-20") end_date # Calculate months difference using lubridate <- interval(start_date, end_date) %/% months(1) months_diff_lubridate cat("Number of months using lubridate:", months_diff_lubridate, "\n") #> Number of months using lubridate: 18
%/%
is used for floor division by months. For decimals, just use/
Data
<- tibble( (house_df person_id = factor(c("A10232", "A10232", "A10232", "A39211", "A39211", "A28183", "A28183", "A10124")), house_id = factor(c("H1200E", "H1243D", "H3432B", "HA7382", "H53621", "HC39EF", "HA3A01", "H222BA")), start_date = ymd(c("20200101", "20200112", "20211120", "19800101", "19900101", "20170303", "20190202", "19931023")), end_date = ymd(c("20200112", "20211120", "20230720", "19891231", "20170102", "20180720", "20230720", "20230720")) )) #> A tibble: 8 × 4 #> person_id house_id start_date end_date #> <fct> <fct> <date> <date> #> 1 A10232 H1200E 2020-01-01 2020-01-12 #> 2 A10232 H1243D 2020-01-12 2021-11-20 #> 3 A10232 H3432B 2021-11-20 2023-07-20 #> 4 A39211 HA7382 1980-01-01 1989-12-31 #> 5 A39211 H53621 1990-01-01 2017-01-02 #> 6 A28183 HC39EF 2017-03-03 2018-07-20 #> 7 A28183 HA3A01 2019-02-02 2023-07-20 #> 8 A10124 H222BA 1993-10-23 2023-07-20
Create interval column
<- house_df |> house_df mutate( # create the interval int = interval(start_date, end_date), # drop the start/end columns .keep = "unused" ) house_df#> A tibble: 8 × 3 #> person_id house_id int #> <fct> <fct> <Interval> #> 1 A10232 H1200E 2020-01-01 UTC--2020-01-12 UTC #> 2 A10232 H1243D 2020-01-12 UTC--2021-11-20 UTC #> 3 A10232 H3432B 2021-11-20 UTC--2023-07-20 UTC #> 4 A39211 HA7382 1980-01-01 UTC--1989-12-31 UTC #> 5 A39211 H53621 1990-01-01 UTC--2017-01-02 UTC #> 6 A28183 HC39EF 2017-03-03 UTC--2018-07-20 UTC #> 7 A28183 HA3A01 2019-02-02 UTC--2023-07-20 UTC #> 8 A10124 H222BA 1993-10-23 UTC--2023-07-20 UTC
-
<- function(int, int_limits) { int_intersect int_start(int) <- pmax(int_start(int), int_start(int_limits)) int_end(int) <- pmin(int_end(int), int_end(int_limits)) return(int) }
- The red dashed line is the reference interval and the blue solid line is the interval of interest
- The function creates an interval thats the intersection of both intervals (segment between black parentheses)
Proportion of the Reference Interval
<- function(dat, reference_interval) { int_proportion # start with the housing data |> dat # only retain overlapping rows, this makes the following # operations more efficient by only computing what we need filter(int_overlaps(int, reference_interval)) |> # then, actually compute the overlap of the intervals mutate( # use our earlier truncate function int_sect = int_intersect(int, reference_interval), # then, it's simple to compute the overlap proportion prop = int_length(int_sect) / int_length(reference_interval) |> ) # combine different intervals per person summarize(prop_in_nl = sum(prop), .by = person_id) }
Example
<- interval(ymd("20170101"), ymd("20171231")) int_2017 <- prop_2017 int_proportion(dat = house_df, reference_interval = int_2017) prop_2017 #> # A tibble: 3 × 2 #> person_id prop_in_nl #> <fct> <dbl> #> 1 A39211 0.00275 #> 2 A28183 0.832 #> 3 A10124 1
Parallelization
- Making a cluster out of SSH connected machines (Thread)
Basic
::p_load(parallely, future, furrr) pacman= c("host1", "host2") nodes plan(cluster, workers = nodes) future_map(...)
With {renv}
::p_load(parallely, future, furrr) pacman= c("host1", "host2") nodes plan(cluster, workers = nodes, rscript_libs = .libPaths()) future_map(...)