data.table

Misc

  • Syntax

    DT[i, j, by]
    
    ##   R:                 i                 j        by
    ## SQL:  where | order by   select | update  group by
    • Take data.table DT, subset rows using i, and manipulate columns with j, grouped according to by.
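The i, j, by reading above can be sketched on a toy table. The table and column names (grp, val) are hypothetical and chosen only for illustration.

```r
library(data.table)

# Hypothetical toy table, not from the notes above
DT <- data.table(grp = c("a", "a", "b", "b"), val = c(1, 2, 3, 4))

# i: keep rows where val > 1
# j: compute the sum of val
# by: do so separately for each value of grp
res <- DT[val > 1, .(total = sum(val)), by = grp]
print(res)
```

In SQL terms this is roughly a select of sum(val) with a where clause on val and a group by on grp.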
  • Packages

    • {data.table.threads} - Finds the optimal/ideal speedup (efficiency factor) and thread count for each parallelizable function for your machine.
  • Resources

    • Docs, though it’s difficult to find anything in them.
      • The package’s philosophy depends heavily on its syntax, so the reference page is less useful for finding out how to perform a particular operation than it usually is with other packages.
      • The site search doesn’t include the articles, which contain a lot of information.
      • It’s also an old package, and every old article, changelog, etc. is in the docs. So if you find something that seems to answer your question, the syntax it shows may be outdated.
    • Introduction to data.table (vignette)
    • Syntax Reference (link)
    • Symbol Reference (link)
  • setDT(df) - Fast conversion of a data frame or list to a data.table without copying

    • Use when working with larger data sets that take up a considerable amount of RAM (several GBs) because the operation will modify each object in place, conserving memory.
    • as.data.table(matrix) should be used for matrices
    • dat <- data.table(df) can be used for small datasets but there’s no reason to.
    • setDT(copy(df)) if you want to work with a copy of the df instead of converting the original object.
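A minimal sketch of the difference between converting in place and converting a copy, using a small hypothetical data frame (the object names are illustrative):

```r
library(data.table)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Convert a copy: df itself stays a plain data.frame
df_copy <- setDT(copy(df))
copy_is_dt <- is.data.table(df_copy)
original_untouched <- !is.data.table(df)

# Convert in place: no new object is created, df itself becomes a data.table
setDT(df)
original_now_dt <- is.data.table(df)
```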
  • Chaining: see Pivoting >> melt >> Multiple variables stored in column names for an example

  • Piping

    dt |> 
       _[, do_stuff(column), by = group] |> 
       _[, do_something_else(othr_col), by = othr_grp]
    • The _ placeholder allows you to use R’s native pipe. Using it at the head of an extraction, as in _[...], requires R >= 4.3.0.

    • Example

      penguins[species == "Chinstrap"] |> 
        _[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
      # or, with the functional form DT(), where your data.table version provides it
      penguins[species == "Chinstrap"] |> 
        DT( , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island))
  • Symbols

    • .SD is a data.table containing the Subset of DT’s Data for each group, excluding any columns used in by (or keyby). Its usage is still confusing to me.

    • := is the walrus operator; let is an alias. It adds or updates columns by reference (in place) and keeps every row, so it acts like dplyr::mutate rather than dplyr::summarize. (Docs)

      DT[i, colC := mean(colB), by = colA]
      DT[i,
         `:=`(colC = sum(colB),
              colD = sum(colE)),
         by = colF]
      DT[i,
         let(colC = sum(colB),
             colD = sum(colE)),
         by = colF] 
    • .I is the row index, an integer vector equal to seq_len(nrow(x)). Using by = .I runs the j expression row by row.

      dt <- data.table(
        a = 1:3,
        b = 4:6
      )
      dt[, .(a, b, rowsum = sum(.SD)), by = .I]
      #>        I     a     b rowsum
      #>    <int> <int> <int>  <int>
      #> 1:     1     1     4      5
      #> 2:     2     2     5      7
      #> 3:     3     3     6      9
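For the .SD and := symbols described above, a small sketch may help. The table and its columns (g, a, b) are hypothetical.

```r
library(data.table)

dt <- data.table(g = c("x", "x", "y"), a = 1:3, b = 4:6)

# .SD is the per-group subset of the data, without the grouping column g,
# so lapply(.SD, mean) gives one mean per remaining column per group
means <- dt[, lapply(.SD, mean), by = g]

# := adds a column to dt by reference: no assignment with <- is needed
dt[, a_plus_b := a + b]
```

Note that means has one row per group, while the := call leaves the number of rows in dt unchanged.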

Basic Usage

Using i

  • We can subset rows much as with a data.frame, except that you don’t have to repeat DT$, since columns within the frame of a data.table are seen as if they were variables.
  • We can also sort a data.table using order(), which internally uses data.table’s fast order for performance.
  • We can do much more in i by keying a data.table, which allows blazing fast subsets and joins. We will see this in the “Keys and fast binary search based subsets” and “Joins and rolling joins” vignettes.
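The first two points can be sketched as follows, assuming a hypothetical table with columns id and score:

```r
library(data.table)

DT <- data.table(id = 1:5, score = c(10, 40, 20, 50, 30))

# Columns are visible as variables inside [ ], so no DT$score is needed
high <- DT[score > 25]

# order() inside i sorts the rows; the minus sign gives descending order
sorted <- DT[order(-score)]

# Row numbers work as well; no trailing comma is required
first_two <- DT[1:2]
```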

Using j

  • Select columns the data.table way: DT[, .(colA, colB)].
  • Select columns the data.frame way: DT[, c("colA", "colB")].
  • Compute on columns: DT[, .(sum(colA), mean(colB))].
  • Provide names if necessary: DT[, .(sA = sum(colA), mB = mean(colB))].
  • Combine with i: DT[colA > value, sum(colB)].
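The patterns listed above, run on a hypothetical table that reuses the placeholder names colA and colB:

```r
library(data.table)

DT <- data.table(colA = c(1, 2, 3, 4), colB = c(10, 20, 30, 40))

sel_dt <- DT[, .(colA, colB)]        # select, data.table way
sel_df <- DT[, c("colA", "colB")]    # select, data.frame way

# Compute on columns and name the results
agg <- DT[, .(sA = sum(colA), mB = mean(colB))]

# Combine with i: j is only evaluated on rows where colA > 2.
# Without .() the result is a plain vector rather than a data.table.
filtered_sum <- DT[colA > 2, sum(colB)]
```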

Using by

  • Using by, we can group by columns by specifying a list of columns, a character vector of column names, or even expressions. The flexibility of j, combined with by and i, makes for a very powerful syntax.
  • by can handle several columns at once, and expressions can be mixed with column names.
  • We can keyby grouping columns to automatically sort the grouped result.
  • We can use .SD and .SDcols in j to operate on multiple columns using already familiar base functions. Here are some examples:
    • DT[, lapply(.SD, fun), by = ..., .SDcols = ...] - applies fun to all columns specified in .SDcols while grouping by the columns specified in by.
    • DT[, head(.SD, 2), by = ...] - return the first two rows for each group.
    • DT[col > val, head(.SD, 1), by = ...] - combine i along with j and by.
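A short sketch of these grouping patterns on a hypothetical table (columns grp, x, y are illustrative):

```r
library(data.table)

DT <- data.table(grp = c("a", "a", "a", "b", "b"),
                 x = c(1, 2, 3, 4, 5),
                 y = c(10, 20, 30, 40, 50))

# Apply mean to the columns named in .SDcols, per group
grp_means <- DT[, lapply(.SD, mean), by = grp, .SDcols = c("x", "y")]

# First two rows of each group
top2 <- DT[, head(.SD, 2), by = grp]

# Group by an expression: rows are split on whether x > 2
by_expr <- DT[, .N, by = .(big = x > 2)]

# keyby groups like by, then sorts the result and sets its key
sorted <- DT[, .(total = sum(y)), keyby = grp]
```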

Columns

  • Rename Columns

    setnames(DT, 
             old = c("SIMD2020v2_Income_Domain_Rank",
                     "SIMD2020_Employment_Domain_Rank",  
                     "SIMD2020_Health_Domain_Rank",
                     "SIMD2020_Education_Domain_Rank", 
                     "SIMD2020_Access_Domain_Rank", 
                     "SIMD2020_Crime_Domain_Rank",    
                     "SIMD2020_Housing_Domain_Rank",
                     "CP_Name"),
    
             new = c("Income", "Employment", 
                     "Health",   "Education",
                     "Access",  "Crime", 
                     "Housing", "areaname"))

Filtering

  • Keys are a fast filtering mechanism: setting a key reorders the rows (increasing) by the values in the key columns, and the sorted rows are then quicker to find and subset (via binary search).

    • All types of columns can be used except list and complex
  • Operations covered in this section

    • Filtering
    • Filter, select
    • Filter, groupby, summarize
    • If-Else
  • Set Keys - Sorts the rows in increasing order by origin and then by dest.

    setkey(flights, origin, dest)
    head(flights)
    #    year month day dep_delay arr_delay carrier origin dest air_time distance hour
    # 1: 2014     1   2        -2       -25      EV    EWR  ALB      30      143    7
    # 2: 2014     1   3        88        79      EV    EWR  ALB      29      143   23
    # 3: 2014     1   4       220       211      EV    EWR  ALB      32      143   15
    # 4: 2014     1   4        35        19      EV    EWR  ALB      32      143    7
    # 5: 2014     1   5        47        42      EV    EWR  ALB      26      143    8
    # 6: 2014     1   5        66        62      EV    EWR  ALB      31      143   23
  • Filter by origin == “JFK” and dest == “MIA”

    flights[.("JFK", "MIA")]
    #      year month day dep_delay arr_delay carrier origin dest air_time distance hour
    #    1: 2014    1   1        -1       -17      AA    JFK  MIA      161    1089   15
    #    2: 2014    1   1         7        -8      AA    JFK  MIA      166    1089    9
    #    3: 2014    1   1         2        -1      AA    JFK  MIA      164    1089   12
    #    4: 2014    1   1         6         3      AA    JFK  MIA      157    1089    5
    #    5: 2014    1   1         6       -12      AA    JFK  MIA      154    1089   17
    #  ---                                                                             
    # 2746: 2014   10  31        -1       -22      AA    JFK  MIA      148    1089   16
    # 2747: 2014   10  31        -3       -20      AA    JFK  MIA      146    1089    8
    # 2748: 2014   10  31         2       -17      AA    JFK  MIA      150    1089    6
    # 2749: 2014   10  31        -3       -12      AA    JFK  MIA      150    1089    5
    # 2750: 2014   10  31        29         4      AA    JFK  MIA      146    1089   19
  • Filter by only the first key column (origin): flights["JFK"]

  • Filter by only the second key column (dest)

    flights[.(unique(origin), "MIA")]
    #      year month day dep_delay arr_delay carrier origin dest air_time distance hour
    #    1: 2014    1   1        -5       -17      AA    EWR  MIA      161    1085   16
    #    2: 2014    1   1        -3       -10      AA    EWR  MIA      154    1085    6
    #    3: 2014    1   1        -5        -8      AA    EWR  MIA      157    1085   11
    #    4: 2014    1   1        43        42      UA    EWR  MIA      155    1085   15
    #    5: 2014    1   1        60        49      UA    EWR  MIA      162    1085   21
    #  ---                                                                             
    # 9924: 2014   10  31       -11        -8      AA    LGA  MIA      157    1096   13
    # 9925: 2014   10  31        -5       -11      AA    LGA  MIA      150    1096    9
    # 9926: 2014   10  31        -2        10      AA    LGA  MIA      156    1096    6
    # 9927: 2014   10  31        -2       -16      AA    LGA  MIA      156    1096   19
    # 9928: 2014   10  31         1       -11      US    LGA  MIA      164    1096   15
  • Filter by origin and dest values, then summarize and pull maximum of arr_delay

    flights[.("LGA", "TPA"), max(arr_delay)]
    # [1] 486
  • Filter by three origin values, one dest value, return the last row for each match

    flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last"]
    #    year month day dep_delay arr_delay carrier origin dest air_time distance hour
    # 1: 2014     5  23       163       148      MQ    LGA  XNA      158    1147  18
    # 2:   NA    NA  NA        NA        NA      NA    JFK  XNA       NA      NA  NA
    # 3: 2014     2   3       231       268      EV    EWR  XNA      184    1131  12
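    • Filtering by more than one value for a key column returns each combination of the first-key and second-key values
    • The JFK row is all NA because no JFK to XNA flight exists in the data; add nomatch = NULL to drop unmatched combinations
    • Remember that setting a key reorders the table itself (increasing), while the rows returned here follow the order of the values given in i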
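Since the flights data is not included in these notes, here is a self-contained sketch of the same keyed subsets on a tiny hypothetical table (flights_toy and its values are made up):

```r
library(data.table)

flights_toy <- data.table(
  origin    = c("LGA", "JFK", "EWR", "JFK", "LGA"),
  dest      = c("MIA", "MIA", "XNA", "LAX", "XNA"),
  arr_delay = c(5, -3, 12, 40, 7)
)

# Setting the key sorts the table by origin, then dest, by reference
setkey(flights_toy, origin, dest)

# Both key columns: origin == "JFK" and dest == "MIA"
jfk_mia <- flights_toy[.("JFK", "MIA")]

# Only the second key column: supply every origin and one dest;
# nomatch = NULL drops the origin/dest combinations that do not exist
to_xna <- flights_toy[.(unique(origin), "XNA"), nomatch = NULL]
```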

Summarize

  • Example: groupby state + min, max, mean

    D[ ,.(mean = mean(measurement),
          min = min(measurement),
          max = max(measurement)),
       by=state]
    
    # Supposedly faster
    rbindlist(lapply(unique(D$state), 
                     \(x) data.table(state = x, 
                                     D[state == x, 
                                       .(mean(measurement), 
                                         min(measurement), 
                                         max(measurement))
                                       ]
                                     )))
  • Filter by origin and dest values, then select the arr_delay column: flights[.("LGA", "TPA"), .(arr_delay)]

  • Filter by origin value, group_by month, summarize( max(dep_delay))

    ans <- flights["JFK", max(dep_delay), keyby = month]
    head(ans)
    #    month  V1
    # 1:    1  881
    # 2:    2 1014
    # 3:    3  920
    # 4:    4 1241
    # 5:    5  853
    # 6:    6  798
    key(ans)
    # [1] "month"
    • keyby groups and sets the key to month
  • Across

    # Across all columns
    DT[, names(.SD) := lapply(.SD, fun)]
    # Across all numeric columns
    DT[, names(.SD) := lapply(.SD, fun), .SDcols = is.numeric]
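A runnable variant of the across pattern, sketched with an explicit vector of column names on the left-hand side of := instead of names(.SD). The table and the choice of round() are assumptions made for illustration.

```r
library(data.table)

DT <- data.table(id = c("a", "b"),
                 x = c(1.234, 2.345),
                 y = c(3.456, 4.567))

# Names of the numeric columns
num_cols <- names(DT)[sapply(DT, is.numeric)]

# Parentheses around num_cols tell := to use the names stored in the
# variable; extra arguments after the function go to lapply (digits = 1)
DT[, (num_cols) := lapply(.SD, round, 1), .SDcols = num_cols]
```

The id column is left untouched because it is not listed in .SDcols.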

Joins

  • Left Equal Join

    DT <- lookup[DT, on = .(DataZone = Data_Zone)]
    DT <- merge(lookup, DT, by.x = "DataZone", by.y = "Data_Zone")
    • DT: A data.table where the id column is Data_Zone
    • lookup: A data.table where the id column is DataZone
    • Rows are matched on equality of the id columns, which makes this an Equal (equi) Join; here both data.tables also have the same number of rows
    • DT is joined to lookup, so the columns of lookup appear first (farthest left) then DT’s columns (farthest right) in the joined data.table.
    • Subset Notation: The output data.table has the id column, DataZone, named as in lookup, but the rows are ordered the same way as the input table, DT.
      • The ordering follows from how lookup[DT] works: each row of DT is looked up in lookup, so every row of DT is kept, in DT’s order
    • merge: The output data.table has the id column, DataZone, named as in lookup. By default merge keeps only matching rows (all = FALSE) and sorts the result by the join column (sort = TRUE)
    • The subset way is the “data.table” way, because you can perform calculations on the output using the j position, whereas with merge it would require a chain or an extra line of code. If the order of rows of the output matters, the merge ordering can be reproduced by chaining a sort, e.g. [order(DataZone)], onto the subset method.
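The row-ordering difference between the two forms can be checked on small hypothetical tables (the council and rank columns and their values are made up):

```r
library(data.table)

lookup <- data.table(DataZone = c("Z1", "Z2", "Z3"),
                     council  = c("North", "South", "East"))
DT <- data.table(Data_Zone = c("Z3", "Z1", "Z2"),
                 rank      = c(30, 10, 20))

# Subset notation: rows come back in the order of DT (the table in i)
joined <- lookup[DT, on = .(DataZone = Data_Zone)]

# merge: rows are sorted by the join column by default
merged <- merge(lookup, DT, by.x = "DataZone", by.y = "Data_Zone")

# Chaining a sort onto the subset form gives the same row order as merge
joined_sorted <- joined[order(DataZone)]
```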

Conditionals

  • Ifelse using hour

    setkey(flights, hour) # hour has values 0-24
    flights[.(24), hour := 0L]
    • Equivalent to hour = ifelse(hour == 24, 0, hour)
    • Consequence: since a key column value has changed, hour is no longer a key
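The same replacement can be written without a key. A minimal sketch on a hypothetical one-column table, showing a conditional update in i and data.table's vectorised fifelse():

```r
library(data.table)

DT <- data.table(hour = c(0L, 12L, 24L, 24L))

# Update only the rows selected in i; other rows are left as they are
DT[hour == 24L, hour := 0L]

# Same result with fifelse(), data.table's fast ifelse;
# the yes and no values must have the same type
DT2 <- data.table(hour = c(0L, 12L, 24L, 24L))
DT2[, hour := fifelse(hour == 24L, 0L, hour)]
```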

Pivoting

pivot_longer and melt

  • Basic

    relig_income |>
      pivot_longer(!religion, # keep religion as a column
                  names_to = "income", # desired name for new column
                  values_to = "count") # what data goes into the new column?
    melt(DT, id.vars = "religion",
        variable.name = "income",
        value.name = "count",
        variable.factor = FALSE) # added to keep output consistent with tidyr
  • Columns have a common prefix and missing values are dropped

    billboard |>
      pivot_longer(
        cols = starts_with("wk"),
        names_to = "week",
        names_prefix = "wk",
        values_to = "rank",
        values_drop_na = TRUE
      )
    melt(DT,
        measure.vars = patterns("^wk"),
        variable.name = "week",
        value.name = "rank",
        na.rm = TRUE)
  • Multiple variables stored in column names

    who <- data.table(id = 1, new_sp_m5564 = 2, newrel_f65 = 3)
    #         id new_sp_m5564 newrel_f65
    #      <num>        <num>      <num>
    #   1:     1            2          3
    
    melt(who,
         measure.vars = measure(diagnosis,
                                gender,
                                ages,
                                pattern = "new_?(.*)_(.)(.*)"))
    #       id diagnosis gender   ages value
    #    <num>    <char> <char> <char> <num>
    # 1:     1        sp      m   5564     2
    # 2:     1       rel      f     65     3
    
    # with tidyr 
    who |> 
      tidyr::pivot_longer(
        cols = !id,
        names_to = c("diagnosis", "gender", "age"),
        names_pattern = "new_?(.*)_(.)(.*)",
        values_to = "count")
    # # A tibble: 2 × 5
    #           id diagnosis gender age   count
    #        <dbl> <chr>     <chr>  <chr> <dbl>
    # 1          1 sp        m      5564      2
    # 2          1 rel       f      65        3
    • tstrsplit is DT’s tidyr::separate
  • Matrix to long

    anscombe |>
      pivot_longer(
        everything(),
        cols_vary = "slowest",
        names_to = c(".value", "set"),
        names_pattern = "(.)(.)" 
      )
    DT[,melt(.SD,
                variable.name = "set",
                value.name = c("x","y"),
                variable.factor = FALSE,
                measure.vars = patterns("^x","^y"))]

pivot_wider and dcast

  • Data in examples

    • fish_encounters

      ## # A tibble: 114 × 3
      ##    fish  station  seen
      ##    <fct> <fct>    <int>
      ##  1 4842  Release     1
      ##  2 4842  I80_1       1
      ##  3 4842  Lisbon      1
      ##  4 4842  Rstr        1
      ##  5 4842  Base_TD     1
      ##  6 4842  BCE         1
      ##  7 4842  BCW         1
      ##  8 4842  BCE2        1
      ##  9 4842  BCW2        1
      ## 10 4842  MAE         1
      ## # … with 104 more rows
  • Basic

    fish_encounters |>
      pivot_wider(names_from = station, values_from = seen)
    
    dcast(DT, fish ~ station, value.var = "seen")
  • Fill in missing values

    fish_encounters |>
      pivot_wider(names_from = station, values_from = seen, values_fill = 0)
    
    dcast(DT, fish ~ station, value.var = "seen", fill = 0)
    # alt
    DT[, dcast(.SD, fish ~ station, value.var = "seen", fill = 0)]
    • Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (See it applied to melt above.)
  • Generate column names from multiple variables

    us_rent_income |>
      pivot_wider(
        names_from = variable,
        values_from = c(estimate, moe)
      )
    
    dcast(DT, GEOID + NAME ~ variable, 
              value.var = c("estimate","moe"))
    # alt
    dcast(DT, ... ~ variable, 
          value.var = c("estimate","moe"))
    • Alternative: pass ... to indicate all other unspecified columns
  • Specify a different names separator

    us_rent_income |>
      pivot_wider(
        names_from = variable,
        names_sep = ".",
        values_from = c(estimate, moe)
      )
    
    dcast(DT, GEOID + NAME ~ variable,
          value.var = c("estimate","moe"), 
          sep = ".")
    # alt
    DT[, dcast(.SD, GEOID + NAME ~ variable,
        value.var = c("estimate","moe"), 
              sep = ".")]
    • Alternative: Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to melt above)
  • Controlling how column names are combined

    us_rent_income |>
      pivot_wider(
        names_from = variable,
        values_from = c(estimate, moe),
        names_vary = "slowest"
      ) |> names()
    
    DT[, dcast(.SD, GEOID + NAME ~ variable,
              value.var = c("estimate","moe"))
      ][,c(1:3,5,4,6)] |> names()
    
    ## [1] "GEOID"          "NAME"            "estimate_income" "moe_income"     
    ## [5] "estimate_rent"  "moe_rent"
    • See {tidyr::pivot_wider} docs and the names_vary arg
  • Aggregation

    warpbreaks |>
      pivot_wider(
        names_from = wool,
        values_from = breaks,
        values_fn = mean
      )
    dcast(DT, tension ~ wool, 
              value.var = "breaks", fun.aggregate = mean)
    # alt
    DT[, dcast(.SD, tension ~ wool, 
          value.var = "breaks", fun.aggregate = mean)]
    
    ## # A tibble: 3 × 3
    ##  tension    A    B
    ##  <fct>  <dbl> <dbl>
    ## 1 L        44.6  28.2
    ## 2 M        24    28.8
    ## 3 H        24.6  18.8
    • Alternative: Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to melt above)
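To show why putting dcast inside DT[...] helps with chaining, here is a sketch on a tiny hypothetical version of the fish data (three made-up rows):

```r
library(data.table)

DT <- data.table(fish    = c("f1", "f1", "f2"),
                 station = c("A", "B", "A"),
                 seen    = c(1L, 1L, 1L))

# With no by, .SD is the whole table, so dcast(.SD, ...) reshapes DT.
# The wide result is an ordinary data.table, so another [ ] can follow.
wide <- DT[, dcast(.SD, fish ~ station, value.var = "seen", fill = 0L)
           ][, .(fish, total = A + B)]
```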

tidyr

  • separate via tstrsplit

    dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))
    #                           x
    #                      <char>
    # 1:   00531725 Male 2021 Neg
    # 2: 07640613 Female 2020 Pos
    
    cols <- c("personID", "gender", "year", "covidTest")
    
    dt[, tstrsplit(x,
                   split = " ",
                   names = cols,
                   type.convert = TRUE)]
    #    personID gender  year covidTest
    #       <int> <char> <int>    <char>
    # 1:   531725   Male  2021       Neg
    # 2:  7640613 Female  2020       Pos
    
    
    dt[, tstrsplit(x,
                   split = " ",
                   names = cols,
                   type.convert = list(as.character = 1,
                                       as.factor = c(2, 4),
                                       as.integer = 3)
                   )]
    #    personID gender   year covidTest
    #      <char> <fctr>  <int>    <fctr>
    # 1: 00531725   Male   2021       Neg
    # 2: 07640613 Female   2020       Pos

User Defined Functions

  • env

    
    iris_dt <- as.data.table(iris)
    square = function(x) x^2
    
    iris_dt[filter_col %in% filter_val,
            .(var1, var2, out = outer(inner(var1) + inner(var2))),
            by = by_col,
            env = list(
              outer = "sqrt",
              inner = "square",
              var1 = "Sepal.Length",
              var2 = "Sepal.Width",
              out = "Sepal.Hypotenuse",
              filter_col = "Species",
              filter_val = I("versicolor"),
              by_col =  "Species"
            )] |> 
      head(n = 3)
    #       Species Sepal.Length Sepal.Width Sepal.Hypotenuse
    #        <fctr>        <num>       <num>            <num>
    # 1: versicolor          7.0         3.2         7.696753
    # 2: versicolor          6.4         3.2         7.155418
    # 3: versicolor          6.9         3.1         7.564390
    • Variables are included in the standard i, j, and by syntax
    • env contains the (quoted) variable values
      • i.e. argument values in the typical R udf syntax (function(x = val1))
      • Can use other UDFs as values which is demonstrated by inner = “square”

Recipes

  • Operations covered in this section

    • group_by, summarize (and arrange)
    • crosstab
  • group_by, summarize (and arrange)

    dt_res <- dtstudy[, .(n = .N, avg = round(mean(y), 1)), keyby = .(male, over65, rx)]
    
    tb_study <- tibble::as_tibble(dtstudy)
    tb_res <- tb_study |>
      summarize(n = n(),
                avg = round(mean(y), 1),
                .by = c(male, over65, rx)) |>
      arrange(male, over65, rx)
    • keyby sorts the data.table result by the grouping variables (by alone would keep the groups in order of first appearance), so to get the exact same output from dplyr, you have to add an arrange
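The ordering difference between by and keyby can be seen on a small hypothetical table:

```r
library(data.table)

DT <- data.table(grp = c("b", "a", "b", "a"), y = c(1, 2, 3, 4))

# by: groups appear in order of first appearance ("b" before "a")
by_res <- DT[, .(n = .N, avg = mean(y)), by = grp]

# keyby: same groups, sorted by grp, and grp becomes the key of the result
keyby_res <- DT[, .(n = .N, avg = mean(y)), keyby = grp]
```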
  • Crosstab using cube (Titanic5 dataset)

    # Note that the mean of a 0/1 variable is the proportion of 1s
    mn <- function(x) mean(x, na.rm=TRUE)
    # Create a function that counts the number of non-NA values
    Nna <- function(x) sum(! is.na(x))
    
    # .q() comes from {Hmisc}; by = c("sex", "class") is the plain data.table equivalent
    cube(d, .(Proportion=mn(survived), N=Nna(survived)), by=.q(sex, class), id=TRUE)
    
    #>     grouping    sex class Proportion    N
    #> 1:         0 female     1  0.9652778  144
    #> 2:         0   male     1  0.3444444  180
    #> 3:         0   male     2  0.1411765  170
    #> 4:         0 female     2  0.8867925  106
    #> 5:         0   male     3  0.1521298  493
    #> 6:         0 female     3  0.4907407  216
    #> 7:         1 female    NA  0.7274678  466
    #> 8:         1   male    NA  0.1909846  843
    #> 9:         2   <NA>     1  0.6203704  324
    #> 10:        2   <NA>     2  0.4275362  276
    #> 11:        2   <NA>     3  0.2552891  709
    #> 12:        3   <NA>    NA  0.3819710 1309
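Because the Titanic5 data is not included here, the following sketch runs cube on a four-row hypothetical table to show how the grouping column identifies each level of aggregation:

```r
library(data.table)

d <- data.table(sex      = c("f", "f", "m", "m"),
                class    = c(1L, 2L, 1L, 2L),
                survived = c(1, 1, 0, 1))

# cube computes j for every combination of the by columns:
# sex x class, sex only, class only, and the grand total
ct <- cube(d, .(Proportion = mean(survived), N = .N),
           by = c("sex", "class"), id = TRUE)

# grouping == 3 marks the row where both sex and class are aggregated away
grand_total <- ct[grouping == 3]
```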