Base R

Misc

  • Magrittr + base

    mtcars %>% {plot(.$hp, .$mpg)}
    mtcars %$% plot(hp, mpg)
    • By wrapping the RHS in curly braces, we can override the rule where the LHS is passed to the first argument ## Options {#sec-r-baser-opts .unnumbered}
  • Remove scientific notation

    options(scipen = 999)
  • Wide and long printing tibbles

    # in .Rprofile
    makeActiveBinding(".wide", function() { print(.Last.value, width = Inf) }, .GlobalEnv)
    • After printing a tibble, if you want to see it in wide, then just type .wide + ENTER.
    • Can have similar bindings for `.long` and `.full`.
  • Heredocs - Powerful feature in various programming languages that allow you to define a block of text within the code, preserving line breaks, indentation, and other whitespace.

    text <- r"(
    This is a
    multiline string
    in R)"
    
    cat(text)
  • File Paths

    • basename, dirname, tools::file_path_sans_ext

      basename("~/me/path/to/file.R")
      #> [1] "file.R"
      basename("C:\\Users\\me\\Documents\\R\\Projects\\file.R")
      #> [1] "file.R"
      dirname("~/me/path/to/file.R")
      #> [1] "~me/path/to"
      tools::file_path_sans_ext("~/me/path/to/file.R")
      #> [1] "~/me/path/to/file"
  • Selective function importing (R \(\ge\) 4.0) (source)

    library(stringr, include.only = c("str_extract", "str_remove"))

User Defined Functions

  • Anonymous (aka lambda) functions: \(x) {} (> R 4.1)

    function(x) {
      x[which.max(x$mpg), ]
    }
    # equivalent to the above
    \(x) {
      x[which.max(x$mpg), ]
    }
  • Define and call an anonymous function at the same time

    n <- c(1:10)
    moose <- (\(x) x+1)(n)
    moose
    #> [1]  2  3  4  5  6  7  8  9 10 11
  • Dots (…)

    • Misc

      • {ellipsis}: Functions for testing functions with dots so they fail loudly
      • {rlang} dynamic dots: article
        • Splice arguments saved in a list with the splice operator, !!! .
        • Inject names with glue syntax on the left-hand side of := .
    • User Defined Functions

      moose <- function(...) {
          dots <- list(...)
          dots_names <- names(dots)
          if (is.null(dots_names) || "" %in% dots_names {
              stop("All arguments must be named")
          }
      }
    • Nested Functions

      f02 <- function(...){
        vv <- list(...)
        print(vv)
      }
      f01 <- function(...){
        f02(b = 2,...)
      }
      
      f01(a=1,c=3)
      #> $b
      #> [1] 2
      #> 
      #> $a
      #> [1] 1
      #> 
      #> $c
      #> [1] 3
    • Subset dots values

      add2 <- function(...) {
          ..1 + ..2
      }
      add2(3, 0.14)
      # 3.14
    • Subset dots dynamically: ...elt(n)

      • Set a value to n and get back the value of that argument
    • Number of arguments in … : ...length()

Functions

  • do.call - allows you to call other functions by constructing the function call as a list

    • Args

      • what – Either a function or a non-empty character string naming the function to be called
      • args – A list of arguments to the function call. The names attribute of args gives the argument names
      • quote – A logical value indicating whether to quote the arguments
      • envir – An environment within which to evaluate the call. This will be most useful if what is a character string and the arguments are symbols or quoted expressions
    • Example: Apply function to list of vectors

      vectors <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
      combined_matrix <- do.call(rbind, vectors)
      
      combined_matrix
      ##      [,1] [,2] [,3]
      ## [1,]    1    2    3
      ## [2,]    4    5    6
      ## [3,]    7    8    9
    • Example: Apply multiple functions

      data_frames <- list(
        data.frame(a = 1:3), 
        data.frame(a = 4:6), 
        data.frame(a = 7:9)
        )
      mean_results <- do.call(
        rbind, 
        lapply(data_frames, function(df) mean(df$a))
        )
      
      mean_results
      ##      [,1]
      ## [1,]    2
      ## [2,]    5
      ## [3,]    8
      • First the mean is calculated for column a of each df using lapply
        • lapply is supplying the data for do.call in the required format, which is a list or character vector.
      • Second the results are combined into a matrix with rbind
  • sink - used to divert R output to an external connection.

    • Use Cases: exporting data to a file, logging R output, or debugging R code.

    • Args

      • file: The name of the file to which R output will be diverted. If file is NULL, then R output will be diverted to the console.
      • append: A logical value indicating whether R output should be appended to the file (TRUE) or overwritten (FALSE). The default value is FALSE.
      • type: A character string. Either the output stream or the messages stream. The name will be partially match so can be abbreviated.
      • split: logical: if TRUE, output will be sent to the new sink and the current output stream, like the Unix program tee.
    • Example: Logging output of code to file

      sink("r_output.log")      # Redirect output to this file
      # Your R code goes here
      sink()                    # Turn off redirection
      • output file could also have an extension like “.txt”
    • Example: Debugging

      sink("my_function.log")   # Redirect output to this file
      my_function()
      sink()                    # Turn off redirection
    • Example: Appending output to a file

      sink("output.txt", append = TRUE)  # Append output to the existing file
      cat("Additional text\n")  # Append custom text
      plain text
      sink()  # Turn off redirection
  • pmin and pmax

    • Find the element-wise maximum and minimum values across vectors in R

    • Example

      vec1 <- c(3, 9, 2, 6)
      vec2 <- c(7, 1, 8, 4)
      pmax(vec1, vec2)
      #> [1] 7 9 8 6
      pmin(vec1, vec2)
      #> [1] 3 1 2 4
    • Example: With NAs

      data1 <- c(7, 3, NA, 12)
      data2 <- c(9, NA, 5, 8)
      pmax(data1, data2, na.rm = TRUE)
      #> [1] 9 3 5 12
  • switch

    • Example: (source)

      switch(parallel,
             windows = "snow" -> para_proc,
             other = "multicore" -> para_proc,
             no = "no" -> para_proc,
             stop(sprintf("%s is not one of the 3 possible parallel argument values. See documentation.", parallel)))
      • parallel is the function argument. If it doesn’t match one of the 3 values (windows, other, or no), then an error is thrown.

      • If the argument value is matched, then the quoted value is stored in para_proc

  • dynGet

    • Looks for objects in the environment of a function.

    • When an object from the outer function is an input for a function nested around 3 layers deep or more, it may not be found by that most inner function. dynGet allows that function to find the object in the outer frame

    • Arguments

      • minframe: Integer specifying the minimal frame number to look into (i.e. how far back to look for the object)

      • inherits: Should the enclosing frames of the environment be searched?

    • Example:

      1function(args) {
        if (method == "kj") {
            ncv_list <- purrr::map2(grid$dat, 
                                    grid$repeats, 
                                    function(dat, reps) {
               rsample::nested_cv(dat,
                                  outside = vfold_cv(v = 10, 
                                                     repeats = dynGet("reps")),
                                  inside = bootstraps(times = 25))
            })
        }
      }
      
      2function(data) {
          if (chk::vld_used(...)) {
              dots <- list(...)
              init_boot_args <-
                list(data = dynGet("data"),
                     stat_fun = cles_boot, # internal function
                     group_variables = group_variables,
                     paired = paired)
              get_boot_args <-
                append(init_boot_args,
                       dots)
          }
          cles_booted <-
            do.call(
              get_boot_ci,
              get_boot_args
            )
      }
      1
      Example from Nested Cross-Validation Comparison
      2
      Example from {ebtools::cles}
  • match.arg

    • Partially matches a function’s argument values to list of choices. If the value doesn’t match the choices, then an error is thrown

    • Example:

      keep_input <- "input_le"
      keep_input_val <- 
        match.arg(keep_input,
                  choices = c("input_lags",
                              "input_leads",
                              "both"),
                  several.ok = FALSE)
      keep_input_val
      #> [1] "input_leads"
      • several.ok = FALSE says only 1 match is allowed otherwise an error is thrown.
      • The error message is pretty informative btw.
  • match.fun

    • Example

      f <- function(a,b) {
        a + b
      }
      g <- function(a,b,c) {
        (a + b) * c
      }
      h <- function(d,e) {
        d - e
      }
      yolo <- function(FUN, ...) {
        FUN <- match.fun(FUN)
        params <- list(...)
        FUN_formals <- formals(FUN)
        idx <- names(params) %in% names(FUN)
        do.call(FUN, params[idx])
      }
      yolo(h, d = 2, e = 3)
      #> -1
  • outer

    • Stands for “outer product” which is it’s default, but it doesn’t have to be a product. I can be any operation

    • Basic

      outer(X = c(0, 5), Y = 1:5) # FUN = "*" by default
      #>      [,1] [,2] [,3] [,4] [,5]
      #> [1,]    0    0    0    0    0
      #> [2,]    5   10   15   20   25
      
      outer(c(0, 5), 1:5, FUN = "+")
      #>      [,1] [,2] [,3] [,4] [,5]
      #> [1,]    1    2    3    4    5
      #> [2,]    6    7    8    9   10
      • A row is the result of applying the function to an element of X and each element of Y

        • e.g. The first row the outer product is 0 * 1, 0 * 2, 0 * 3, 0 * 4, 0 * 5.
    • Lags

      dat <- data.frame(x = seq(1,5))
      outer(-1:1, dat[which(dat$x == 3), ], `+`)
      #>      [,1]
      #> [1,]    2
      #> [2,]    3
      #> [3,]    4
      
      outer(-1:1, 
            which(country == "Germany"), 
            `+` ))
      #> [1]  2 14 19 25 26 28 33 38 39 62 65 66 92 93
      #> [1]  3 15 20 26 27 29 34 39 40 63 66 67 93 94
      #> [1]  4 16 21 27 28 30 35 40 41 64 67 68 94 95
      • In the first situation, the first row is -1 plus the index for x = 3 (which is 3), so this give the index before the index of x = 3

      • In the second second situation, the first row is -1 plus the index for each instance of country == “Germany”, so it’s the previous row index.

        • This second situation is something I plucked out of the middle of a chain of code, so it’s doesn’t make exact sense on it’s own. (See R, Snippets >> Cleaning >> Filter row index before/after a condition >> Example 2 for the complete code)
      • Each column of the matrix is the indexes for the before, actual, and after values for each instance of country == “Germany”.

      • This example doesn’t involve time, but you can see how each column could be a window of t-1, t, and t+1. Then window calculations can be done column-wise which is R’s forte. (See R, Snippets >> Calculations >> Time Series >> Base R >>Moving Windows for an example)

  • ave

    • Sort of group_by + mutate (source)

      x <- c(1, 2, 3, 1, 2, 3, 3, 3)
      y <- c(1, 1, 1, 1, 2, 2, 2, 2)
      
      # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # 1. Sum over `x` using `y` as a 'grouping' variable
      # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ave(x, y, FUN = sum)
      #> [1] 7 7 7 7 11 11 11 11
      
      # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # 2. Number of distinct elements in each group
      # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ave(x, y, FUN = dplyr::n_distinct)
      #> [1] 3 3 3 3 2 2 2 2
      
      # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # 3. Number of times each element of `x` appears
      # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ave(x, x, FUN = length)
      #> [1] 2 2 4 2 2 4 4 4
      
      # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      # 4. Indicating how many times each element has occurred so far
      # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      x <- c("a", "a", "b", "a", "b", "c")
      paste0(x, ave(x, x, FUN = seq_along))
      #> [1] "a1" "a2" "b1" "a3" "b2" "c1"
      1. Group by “1” and sum. So all the x values that have the same index as y = 1 sum to: 7 = 1 + 2 + 3 + 1. For each y = 1, that calculation is made, so there are 4 sevens.
      2. Group by “1” and find the number of distinct values. There are 3 distinct values of x when y = 1: 1, 2, and 3.
      3. x is used for each input. It counts how many times the value appears in the vector. 1 occurs twice, 2 occurs twice, and 3 occurs four times.
      4. Instead of a total like in the previous example, seq_along creates a cumulative count for each value.
  • Negate

    • Creates a negation of a given function. Can replace ! for easier recognition.

    • Example: Removing NULLs from a list (source)

      Filter(
            Negate(is.null),
            list(...)
      )

Pipe

  • Benefits of base pipe

    • Magrittr pipe is bloated with special features which may make it slower than the base pipe
    • If not using tidyverse, it’s one less dependency (maybe one day it will be deprecated in tidyverse)
  • Base pipe with base and anonymous functions

    # verbosely
    mtcars |> (function(.) plot(.$hp, .$mpg))()
    # using the anonymous function shortcut, emulating the dot syntax
    mtcars |> (\(.) plot(.$hp, .$mpg))()
    # or if you prefer x to .
    mtcars |> (\(x) plot(x$hp, x$mpg))()
    # or if you prefer to be explicit with argument names
    mtcars |> (\(data) plot(data$hp, data$mpg))()
  • Using “_” placeholder:

    • mtcars |> lm(mpg ~ disp, data = _)
    • mtcars |> lm(mpg ~ disp, data = _) |> _$coef
  • Base pipe .[ ]  hack

    wiki |>
      read_html() |>
      html_nodes("table") |>
      (\(.) .[[2]])() |>
      html_table(fill = TRUE) |>
      clean_names()
    # instead of
    djia <- wiki %>%
      read_html() %>%
      html_nodes("table") %>%
      .[[2]] %>%
      html_table(fill = TRUE) %>%
      clean_names()
  • Magrittr, base pipe differences

    • magrittr: %>% allows you change the placement with a . placeholder.
      • base: R 4.2.0 added a _ placeholder to the base pipe, with one additional restriction: the argument has to be named
    • magrittr: With %>% you can use . on the left-hand side of operators like “\(", \[\[, \[ and use in multiple arguments (e.g. df %>% {split(.\)x, .$y))
      • base: can hack this by using anonymous function
        • see Base pipe with base and anonymous functions above
        • see Base pipe .[ ]  hack above
    • magrittr: %>% allows you to drop the parentheses when calling a function with no other arguments (e.g. dat %>% distinct)
      • base: |> always requires the parentheses. (e.g. dat |> distinct())
    • magrittr: %>% allows you to start a pipe with . to create a function rather than immediately executing the pipe
  • Purrr with base pipe

    data_list |>
      map(\(x) clean_names(x))
    # instead of
    data_list %>%
      map( ~.x %>% clean_names)
    # with split
    star |>
      split(~variable) |>
      map_df(\(.) hedg_g(., reading ~ value), .id = "variable")
    # instead of
    star %>%
      split(.$variable) %>%
      map_df(. %>% hedg_g(., reading ~ value), .id = "variable")

Strings

  • Resources

    • From base R - base R equivalents to {stringr} functions
  • sprintf

    x <- 123.456               # Create example data
    
    sprintf("%f", x)           # sprintf with default specification
    #> [1] "123.456000"
    
    sprintf("%.10f", x)        # sprintf with ten decimal places
    #> [1] "123.4560000000"
    
    sprintf("%.2f", x)         # sprintf with two rounded decimal places
    #> [1] "123.46"
    
    sprintf("%1.0f", x)        # sprintf without decimal places
    #> [1] "123"
    
    sprintf("%10.0f", x)       # sprintf with space before number
    #> [1] "       123"
    
    sprintf("%10.1f", x)       # Space before number & decimal places
    #> [1] "     123.5"
    
    sprintf("%-15f", x)        # Space on right side
    #> [1] "123.456000     "
    
    sprintf("%+f", x)          # Print plus sign before number
    #> [1] "+123.456000"
    
    sprintf("%e", x)           # Exponential notation
    #> [1] "1.234560e+02"
    
    sprintf("%E", x)           # Exponential with upper case E
    #> [1] "1.234560E+02"
    
    sprintf("%g", x)           # sprintf without decimal zeros
    #> [1] "123.456"
    
    sprintf("%g", 1e10 * x)    # Scientific notation
    #> [1] "1.23456e+12"
    
    sprintf("%.13g", 1e10 * x) # Fixed decimal zeros
    #> [1] "1234560000000"
    
    paste0(sprintf("%f", x),   # Print %-sign at the end of number
           "%")
    #> [1] "123.456000%"
    
    sprintf("Let's create %1.0f more complex example %1.0f you.", 1, 4)
    #> [1] "Let's create 1 more complex example 4 you."
    • Example: Add leading zeros (source)

      numbers <- c(1, 23, 456)
      formatted_numbers <- sprintf("%05d", numbers)
      print(formatted_numbers)
      #> [1] "00001" "00023" "00456"
      
      mixed_input <- c(12, "abc", 345)
      formatted_mixed_input <- ifelse(
        is.na(as.numeric(mixed_input)), 
        mixed_input, 
        sprintf("%05d", as.numeric(mixed_input))
        )
      print(formatted_mixed_input)
      #> [1] "00012" "abc"   "00345"
  • str2lang - Allows you to turn plain text into code.

    growth_rate <- "circumference / age"
    class(str2lang(growth_rate))
    #> [1] "call"
    • Example: Basic

      eval(str2lang("2 + 2"))
      #> [1] 4
      
      eval(str2lang("x <- 3"))
      x
      #> [1] 3
    • Example: Run formula against a df

      growth_rate <- "circumference / age"
      with(Orange, eval(str2lang(growth_rate)))
      
      #>   [1] 0.25423729 0.11983471 0.13102410 0.11454183 0.09748172 0.10349854
      #>   [7] 0.09165613 0.27966102 0.14256198 0.16716867 0.15537849 0.13972380
      #>  [13] 0.14795918 0.12831858 0.25423729 0.10537190 0.11295181 0.10756972
      #>  [19] 0.09341998 0.10131195 0.08849558 0.27118644 0.12809917 0.16867470
      #>  [25] 0.16633466 0.14541024 0.15233236 0.13527181 0.25423729 0.10123967
      #>  [31] 0.12198795 0.12450199 0.11535337 0.12682216 0.11188369
  • agrep - Fuzzy Matching (source)

    • Syntax

      agrep(
        pattern, 
        x, 
        max.distance = 0.1, 
        ignore.case = FALSE, 
        value = FALSE, 
        fixed = TRUE
        )
      • pattern: The string pattern you want to match
      • x: The vector of strings to search within
      • max.distance: The maximum allowed distance for a match
        • If integer, then the integer is the number of characters that can be wrong
        • If decimal, (0,1), then it’s the percentage of characters than can be wrong
      • ignore.case: Whether to ignore case when matching
      • value: Whether to return the matched values instead of indices
      • fixed: Whether to treat the pattern as a fixed string or a regular expression
    • Examples

      1agrep("aardvark",
            c(" 1 ardvarc 2", "1 lasy 2"), 
            max.distance = 2) 
      #> [1] 1
      
      2agrep("aardvark",
            c(" 1 arbvarc 2", "1 lasy 2"), 
            max.distance = 2) 
      #> integer(0)
      
      3agrep("lasy",
            c(" 1 lazy 2", "1 lasy 2"), 
            max.distance = 0.25) 
      #> [1] 1 2
      1
      The first string is missing an “a” and has a “c” where the “k” should be. Therefore, it’s within 2 characters of being correct which equals max.distance, so it gets matched. The output is 1 which indicates the first string.
      2
      The first string is missing an “a”, has a “c” where the “k” should be, and has a “b” where a “d” should be. Therefore, it’s 3 characters away from being correct which is outside the max.distance. String 2 isn’t close, so nothing gets matched. The output is 0 which indicates the none of the vector indexes has a match.
      3
      The first string has a “z” where the “s” should be and the second string matches exactly. The max.distance is a quarter of the pattern which is 1 character. Therefore, both strings get matched.

Conditionals

  • && and || are intended for use solely with scalars, they return a single logical value.

    • Since they always return a scalar logical, you should use && and || in your if/while conditional expressions (when needed). If an & or | is used, you may end up with a non-scalar vector inside if (…) {} and R will throw an error.
  • & and | work with multivalued vectors, they return a vector whose length matches their input arguments.

  • Alternative way of negating a condition or set of conditions: if (!(condition))

    • Makes it less readable IMO, but maybe for a complicated set of conditions if makes more sense in your head to do it this way

    • Example

      if (!(nr == nrow(iris) || (nr == nrow(iris) - 2))) {print("moose")}
  • Using else if

    if (condition1) {
      expr1
    } else if (condition2) {
      expr2
    } else {
      expr3
    }
  • stopifnot

    pred_fn <- function(steps_forward, newdata) {
      stopifnot(steps_forward >= 1)
      stopifnot(nrow(newdata) == 1)
      model_f = model_map[[steps_forward]]
      # apply the model to the last "before the test period" row to get
      # the k-steps_forward prediction
      as.numeric(predict(model_f, newdata = newdata))
    }
  • %||%

    • Collapse operator which acts like:

      `%||%` <- function(x, y) {
         if (is_null(x)) y else x
      }
      • Says if the first (left-hand) input x is NULL, return y. If x is not NULL, return the input
    • Use Cases

      • Determine whether a function argument is NULL

        github_remote <- 
          function(repo, username = NULL, ...) {
            meta <- parse_git_repo(repo)
            meta$username <- username %||%
              getOption("github.user") %||%
              stop("Unknown username")
          }
      • Within the print argument collapse

        library(rlang)
        
        add_commas <- function(x) {
          if (length(x) <= 1) {
            collapse_arg <- NULL
          } else {
            collapse_arg <- ", "
          }
          print(paste0(x, collapse = collapse_arg %||% ""))
        }
        
        add_commas(c("apples"))
        [1] "apples"
        add_commas(c("apples", "bananas"))
        [1] "apples, bananas"

Loops

  • Misc

    • next - Proceeds to the next iteration of the loop
    • break - Stops the loop
  • For

    • Example: List or vector (source)

      numbers_list <- list(1, 2, 3, 4, 5)
      squared_numbers <- vector("list", length(numbers_list))
      
      for (i in seq_along(numbers_list)) {
        squared_numbers[[i]] <- numbers_list[[i]]^2
      }
      
      print(squared_numbers)
    • Example: Number Range and Nested

      # 2x3
      x <- matrix(1:6, 2, 3)
      
      for (i in seq_len(nrow(x))) {
        for(j in seq_len(ncol(x))) {
          print(x[i, j])
        }
      }
  • While

    • Example (source)

      numbers_list <- list(2, 4, 6, 8, 10, 12, 14)
      index <- 1
      
      while (numbers_list[[index]] <= 10) {
        index <- index + 1
      }
      
      print(paste("The first number greater than 10 is:", numbers_list[[index]]))
  • tapply

    tapply(mtcars$mpg, 
           mtcars$cyl, 
           FUN = mean)
    
    #>         4        6        8 
    #>  26.66364 19.74286 15.10000

Ordering Columns and Sorting Rows

  • Ascending:

    df[with(df, order(col2)), ]
    # or
    df[order(df$col2), ]
    # or
    sort_by(df, df$col2)
    • col2 is the column used to sort the df by
  • Descending: df[with(df, order(-col2)), ]

  • By Multiple Columns

    • Sequentially: sort_by(df, df$col1, df$col2, df$col3)
      • Says sort by col1 then col2 then col3
    • Descending then Ascending: df[with(df, order(-col2, id)), ]
  • Change position of columns

    # Reorder column by index manually
    df2 <- df[, c(5, 1, 4, 2, 3)]
    df3 <- df[, c(1, 5, 2:4)]
    # Reorder column by name manually
    new_order = c("emp_id","name","superior_emp_id","dept_id","dept_branch_id")
    df2 <- df[, new_order]

Set Operations

  • Unique values in A that are not in B

    a <- c("thing", "object")
    b <- c("thing", "gift")
    
    unique(a[!(a %in% b)])
    #> [1] "object"
    
    setdiff(a, b)
    • setdiff is slower
  • Unique values of the two vectors combined

    unique(c(a, b))
    #> [1] "thing"  "object" "gift"
    
    union(a, b)
    • union is just a wrapper for unique

Subsetting

  • Lists and Vectors

    • Removing Rows

      # Remove specific value from vector
      x[!x == 'A']
      
      # Remove multiple values by list
      x[!x %in% c('A', 'D', 'E')]
      
      # Using setdiff
      setdiff(x, c('A','D','E'))
      
      # Remove elements by index
      x[-c(1,2,5)]
      
      # Using which
      x[-which(x %in% c('D','E') )]
      
      # Remove elements by name
      x <- c(C1='A',C2='B',C3='C',C4='E',C5='G')
      x[!names(x) %in% c('C1','C2')]
  • Dataframes

    • Remove specific Rows

      df <- df[-c(25, 3, 62), ]
    • Remove column by name

      df <- df[, which(names(df) == "col_name")]
      df <- subset(df, select = -c(col_name))
      df <- df[, !names(df) %in% c("col1", "col2"), drop = FALSE]
    • Filter and Select

      df <- subset(df, subset = col1 > 56, select = c(col2, col3))

Joins

  • Inner join: inner <- merge(flights, weather, by = mergeCols)
  • Left (outer) join: left <- merge(flights, weather, by = mergeCols, all.x = TRUE)
  • Right (outer) join: right <- merge(flights, weather, by = mergeCols, all.y = TRUE)
  • Full (outer) join: full <- merge(flights, weather, by = mergeCols, all = TRUE)
  • Cross Join (Cartesian product): cross <- merge(flights, weather, by = NULL)
  • Natural join: natural <- merge(flights, weather)

Error Handling

  • stop

    • Example:

      switch(parallel,
             windows = "snow" -> para_proc,
             other = "multicore" -> para_proc,
             no = "no" -> para_proc,
             stop(sprintf("%s is not one of the 3 possible parallel argument values. See documentation.", parallel)))
      • parallel is the function argument. If it doesn’t match one of the 3 values, then an error is thrown.
  • try

    • If something errors, then do something else

    • Example

      current <- try(remDr$findElement(using = "xpath",
                                      '//*[contains(concat( " ", @class, " " ),
                                      concat( " ", "product-price-value", " " ))]'),
                                      silent = T)
      #If error : current price is NA
      if(class(current) =='try-error'){
          currentp[i] <- NA
      } else {
          # do stuff
      }
  • tryCatch

    • Run the main code, but if it “catches” an error, then the secondary code (the workaround) will run.

    • Example: Basic

      tryCatch(
        # try to evaluate the expression
        {
          log("a")
        },
      
        # what happens if there's a warning?
        warning = function(w) {
          print("There was a warning. Here's the message:")
          print(w)
        },
      
        # what happens if there's an error?
        error = function(e) {
          print("There was an error. Returning `NA`.")
          return(NA)
        }
      )
    • Example: Basic in a for-loop

      for (r in 1:nrow(res)) {
        cat(r, "\n")
      
        tmp_wikitext <- get_wikitext(res$film[r], res$year[r])
      
        # skip if get_wikitext fails
        if (is.na(tmp_wikitext)) next
        if (length(tmp_wikitext) == 0) next
      
        # give the text to openai
        tmp_chat <- tryCatch(
          get_results(client, tmp_wikitext),
          error = \(x) NA
        )
      
        # if openai returned a dict of 2
        if (length(tmp_chat) == 2) {
          res$writer[r] <- tmp_chat$writer
          res$producer[r] <- tmp_chat$producer
        }
      }
      • get_results is called during each iteration, and if there’s an error a NA is returned.
    • Example

      pct_difference_error_handling <- function(n1, n2) {
      # Try the main code
        tryCatch(pct_diff <- (n1-n2)/n1,
              # If you find an error, use this code instead
                error = return(
                  cat( 'The difference between', as.integer(n1), 'and', as.integer(n2), 'is',
                        (as.integer(n1)-as.integer(n2)), 'which is',
                        100*(as.integer(n1)-as.integer(n2))/as.integer(n1),
                        '% of', n1 )#cat
                  ),
                # finally = print('Code ended') # optional
                )#trycatch
        # If no error happens, return this statement
        return ( cat('The difference between', n1, 'and', n2, 'is', n1-n2,
                    ', which is', pct_diff*100, '% of', n1) )
      }
      • Assumes the error will be the user enters a string instead of a numeric. If errors, converts string to numeric and calcs.
      • “finally” - This argument will always run, regardless if the try block raises an error or not. So it could be a completion message or a summary, for example.

Models

  • reformulate - Create formula sytax programmatically

    # Creating a formula using reformulate()
    formula <- reformulate(c("hp", "cyl"), response = "mpg")
    
    # Fitting a linear regression model
    model <- lm(formula, data = mtcars)
    
    formula
    ##> mpg ~ hp + cyl
    • Can also use as.formula
  • DF2formula - Turns the column names from a data frame into a formula. The first column will become the outcome variable, and the rest will be used as predictors

    DF2formula(Orange)
    #> Tree ~ age + circumference
  • formula - Provides a way of extracting formulae which have been included in other objects

    rec_obj |> prep() |> formula()
    • Where “rec_obj” is a tidymodels recipe object
  • Data from Model Object

    • model$model return the data dataframe
    • deparse(model$call$data) gives you that name of your data object as a string.
      • model$call$data gives you the data as an unevaluated symbol;
    • eval(model$call$data) gives you back the original data object, if it is available in the current environment.