data.table

Misc

Syntax

DT[i, j, by]

##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by

Take data.table DT, subset rows using i, and manipulate columns with j, grouped according to by.

Resources
- Docs but it’s difficult to find anything.
  - The philosophy of the package is highly dependent on syntax, so the reference page is not very useful in finding out how to perform certain operations as it usually is with other packages.
  - The search doesn’t include the articles which contain a lot of information.
  - Also, it’s an old package, and every old article, changelog, etc. is in the docs. So, if you find something you think answers your question, it may be that that syntax is outdated.
- Introduction to data.table (vignette)
- Syntax Reference (link)
- Symbol Reference (link)
setDT(df)- Fast conversion of a data frame or list to a data.table without copying
- Use when working with larger data sets that take up a considerable amount of RAM (several GBs) because the operation will modify each object in place, conserving memory.
- as.data.table(matrix) should be used for matrices
- dat <- data.table(df) can be used for small datasets but there’s no reason to.
- setDT(copy(df)) if you want to work with a copy of the df instead of converting the original object.
Chaining: see Pivoting >> melt >> Multiple variables stored in column names for an example

Piping

dt |> 
   _[, do_stuff(column), by = group] |> 
   _[, do_something_else(othr_col), by = othr_grp]

The _ placeholder allows you to use R’s native pipe.

Example

penguins[species == "Chinstrap"] |> 
  _[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
# or
penguins[species == "Chinstrap"] |> 
  DT( , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island))

Symbols

.SD is a data.table containing the Subset of DT’s Data for each group, excluding any columns used in by (or keyby). Its usage is still confusing to me.

:= is the walrus operator. let is an alias. Think it acts like dplyr::mutate or maybe dplyr::summarize. (Docs)

DT[i, colC := mean(colB), by = colA]
DT[i,
   `:=`(colC = sum(colB),
        colD = sum(colE))
   by = colF]
DT[i,
   let(colC = sum(colB),
       colD = sum(colE)),
   by = colF]

.I is the row index. It’s an integer vector equal to seq_len(nrow(x))

dt <- data.table(
  a = 1:3,
  b = 4:6
)
dt[, .(a, b, rowsum = sum(.SD)), by = .I]
#>        I     a     b rowsum
#>    <int> <int> <int>  <int>
#> 1:     1     1     4      5
#> 2:     2     2     5      7
#> 3:     3     3     6      9

Basic Usage

Using i

We can subset rows similar to a data.frame- except you don’t have to use DT$ repetitively since columns within the frame of a data.table are seen as if they are variables.
We can also sort a data.table using order(), which internally uses data.table’s fast order for performance.
We can do much more in i by keying a data.table, which allows blazing fast subsets and joins. We will see this in the “Keys and fast binary search based subsets” and “Joins and rolling joins” vignette.

Using j

Select columns the data.table way: DT[, .(colA, colB)].
Select columns the data.frame way: DT[, c("colA", "colB")].
Compute on columns: DT[, .(sum(colA), mean(colB))].
Provide names if necessary: DT[, .(sA =sum(colA), mB = mean(colB))].
Combine with i: DT[colA > value, sum(colB)].

Using by

Using by, we can group by columns by specifying a list of columns or a character vector of column names or even expressions. The flexibility of j, combined with by and i makes for a very powerful syntax.
by can handle multiple columns and also expressions.
We can keyby grouping columns to automatically sort the grouped result.
We can use .SD and .SDcols in j to operate on multiple columns using already familiar base functions. Here are some examples:
- DT[, lapply(.SD, fun), by = ..., .SDcols = ...] - applies fun to all columns specified in .SDcols while grouping by the columns specified in by.
- DT[, head(.SD, 2), by = ...] - return the first two rows for each group.
- DT[col > val, head(.SD, 1), by = ...] - combine i along with j and by.

Columns

Rename Columns

setnames(DT, 
         old = c("SIMD2020v2_Income_Domain_Rank",
                 "SIMD2020_Employment_Domain_Rank",  
                 "SIMD2020_Health_Domain_Rank",
                 "SIMD2020_Education_Domain_Rank", 
                 "SIMD2020_Access_Domain_Rank", 
                 "SIMD2020_Crime_Domain_Rank",    
                 "SIMD2020_Housing_Domain_Rank",
                 "CP_Name"),

         new = c("Income", "Employment", 
                 "Health",   "Education",
                 "Access",  "Crime", 
                 "Housing", "areaname"))

Filtering

Fast filtering mechanism; reorders rows (increasing) to group by the values in the key columns. Reordered rows make them easier to find and subset.
- All types of columns can be used except list and complex
Operations covered in this section
- Filtering
- Filter, select
- Filter, groupby, summarize
- If-Else

Set Keys - Says order in the increasing direction according to origin and then dest.

setkey(flights, origin, dest)
head(flights)
#    year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014     1   2        -2       -25      EV    EWR  ALB      30      143    7
# 2: 2014     1   3        88        79      EV    EWR  ALB      29      143   23
# 3: 2014     1   4       220       211      EV    EWR  ALB      32      143   15
# 4: 2014     1   4        35        19      EV    EWR  ALB      32      143    7
# 5: 2014     1   5        47        42      EV    EWR  ALB      26      143    8
# 6: 2014     1   5        66        62      EV    EWR  ALB      31      143   23

Filter by origin == “JFK” and dest == “MIA”

flights[.("JFK", "MIA")]
#      year month day dep_delay arr_delay carrier origin dest air_time distance hour
#    1: 2014    1   1        -1       -17      AA    JFK  MIA      161    1089   15
#    2: 2014    1   1         7        -8      AA    JFK  MIA      166    1089    9
#    3: 2014    1   1         2        -1      AA    JFK  MIA      164    1089   12
#    4: 2014    1   1         6         3      AA    JFK  MIA      157    1089    5
#    5: 2014    1   1         6       -12      AA    JFK  MIA      154    1089   17
#  ---                                                                             
# 2746: 2014   10  31        -1       -22      AA    JFK  MIA      148    1089   16
# 2747: 2014   10  31        -3       -20      AA    JFK  MIA      146    1089    8
# 2748: 2014   10  31         2       -17      AA    JFK  MIA      150    1089    6
# 2749: 2014   10  31        -3       -12      AA    JFK  MIA      150    1089    5
# 2750: 2014   10  31        29         4      AA    JFK  MIA      146    1089   19

Filter by only the first key column (origin): flights["JFK"]

Filter by only the second key column (dest)

flights[.(unique(), "MIA")]
#      year month day dep_delay arr_delay carrier origin dest air_time distance hour
#    1: 2014    1   1        -5       -17      AA    EWR  MIA      161    1085   16
#    2: 2014    1   1        -3       -10      AA    EWR  MIA      154    1085    6
#    3: 2014    1   1        -5        -8      AA    EWR  MIA      157    1085   11
#    4: 2014    1   1        43        42      UA    EWR  MIA      155    1085   15
#    5: 2014    1   1        60        49      UA    EWR  MIA      162    1085   21
#  ---                                                                             
# 9924: 2014   10  31       -11        -8      AA    LGA  MIA      157    1096   13
# 9925: 2014   10  31        -5       -11      AA    LGA  MIA      150    1096    9
# 9926: 2014   10  31        -2        10      AA    LGA  MIA      156    1096    6
# 9927: 2014   10  31        -2       -16      AA    LGA  MIA      156    1096   19
# 9928: 2014   10  31         1       -11      US    LGA  MIA      164    1096   15

Filter by origin and dest values, then summarize and pull maximum of arr_delay
```
flights[.("LGA", "TPA"), max(arr_delay)]
# [1] 486
```

Filter by three origin values, one dest value, return the last row for each match

flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last"]
#    year month day dep_delay arr_delay carrier origin dest air_time distance hour
# 1: 2014     5  23       163       148      MQ    LGA  XNA      158    1147  18
# 2:   NA    NA  NA        NA        NA      NA    JFK  XNA       NA      NA  NA
# 3: 2014     2   3       231       268      EV    EWR  XNA      184    1131  12

Filtering by more than one key value returns combinations of the first key and second key
Remember setting a key reorders (increasing)

Summarize

Example: groupby state + min, max, mean

D[ ,.(mean = mean(measurement),
      min = min(measurement),
      max = max(measurement)),
   by=state]

# Supposedly faster
rbindlist(lapply(unique(D$state), 
                 \(x) data.table(state = x, 
                                 y[state == x, 
                                   .(mean(measurement), 
                                     min(measurement), 
                                     max(measurement))
                                   ]
                                 )))

Filter by origin and dest values, then select a arr.delay column: flights[.("LGA", "TPA"), .(arr_delay)]

Filter by origin value, group_by month, summarize( max(dep_delay))

ans <- flights["JFK", max(dep_delay), keyby = month]
head(ans)
#    month  V1
# 1:    1  881
# 2:    2 1014
# 3:    3  920
# 4:    4 1241
# 5:    5  853
# 6:    6  798
key(ans)
# [1] "month"

keyby groups and sets the key to month

Joins

Left Equal Join
```
DT <- lookup[DT, on = .(DataZone = Data_Zone)]
DT <- merge(lookup, DT, by.x = "DataZone", by.y = "Data_Zone")
```
- DT: A datatable where the id column is Data_Zone
- lookup: A datatable where the id column is DataZone
- Both datatables have the same number of rows so that makes this an Equal Join
- DT is joined to lookup, so the columns of lookup appear first (farthest left) then DT’s columns (farthest right) of the joined datatable.
- Subset Notation: The output datatable has the id column, DataZone, which is from lookup but the rows are ordered the same way as the input table, DT.
  - It’s weird that the output’s rows are ordered according to the input datatable
- merge: The output datatable has the id column, DataZone, which is from lookup, and the rows are ordered according to lookup
- The subset way is the “data.table” way, because you perform calculations on the output using the j position whereas with merge, it would require a chain or an extra line of code. But if the order of rows of the output matters, then I can’t find a way to reproduce the merge ordering using the subset method.

Conditionals

Ifelse using hour
```
setkey(flights, hour) # hour has values 0-24
flights[.(24), hour := 0L]
```
- ifelse(hour == 24, 0, TRUE)
- Consequence: since a key column value has changed, hour is no longer a key

Pivoting

pivot_longer and melt

Basic

relig_income |>
  pivot_longer(!religion, # keep religion as a column
              names_to = "income", # desired name for new column
              values_to = "count") # what data goes into the new column?
melt(DT, id.vars = "religion",
    variable.name = "income",
    value.name = "count",
    variable.factor = FALSE) # added to keep output consistent with tidyr

Columns have a common prefix and missing values are dropped

billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  )
melt(DT,
    measure.vars = patterns("^wk"),
    variable.name = "week",
    value.name = "rank",
    na.rm = TRUE)

Multiple variables stored in column names

who <- data.table(id = 1, new_sp_m5564 = 2, newrel_f65 = 3)
#         id new_sp_m5564 newrel_f65
#      <num>        <num>      <num>
#   1:     1            2          3

melt(who,
     measure.vars = measure(diagnosis,
                            gender,
                            ages,
                            pattern = "new_?(.*)_(.)(.*)"))
#       id diagnosis gender   ages value
#    <num>    <char> <char> <char> <num>
# 1:     1        sp      m   5564     2
# 2:     1       rel      f     65     3

# with tidyr 
who |> 
  tidyr::pivot_longer(
    cols = !id,
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "count")
# # A tibble: 2 × 5
#           id diagnosis gender age   count
#        <dbl> <chr>     <chr>  <chr> <dbl>
# 1          1 sp        m      5564      2
# 2          1 rel       f      65        3

tstrsplit is DT’s tidyr::separate

Matrix to long

anscombe |>
  pivot_longer(
    everything(),
    cols_vary = "slowest",
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)" 
  )
DT[,melt(.SD,
            variable.name = "set",
            value.name = c("x","y"),
            variable.factor = FALSE,
            measure.vars = patterns("^x","^y"))]

pivot_wider and dcast

Data in examples

fish_encounters

## # A tibble: 114 × 3
##    fish  station  seen
##    <fct> <fct>    <int>
##  1 4842  Release     1
##  2 4842  I80_1       1
##  3 4842  Lisbon      1
##  4 4842  Rstr        1
##  5 4842  Base_TD     1
##  6 4842  BCE         1
##  7 4842  BCW         1
##  8 4842  BCE2        1
##  9 4842  BCW2        1
## 10 4842  MAE         1
## # … with 104 more rows

Basic

fish_encounters |>
  pivot_wider(names_from = station, values_from = seen)

dcast(DT, fish ~ station, value.var = "seen")

Fill in missing values

fish_encounters |>
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)

dcast(DT, fish ~ station, value.var = "seen", fill = 0)
# alt
DT[, dcast(.SD, fish ~ station, value.var = "seen", fill = 0)]

Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to melt above)

Generate column names from multiple variables

us_rent_income |>
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )

dcast(DT, GEOID + NAME ~ variable, 
          value.var = c("estimate","moe"))
# alt
dcast(DT, ... ~ variable, 
      value.var = c("estimate","moe"))

Alternative: pass “…” to indicate all other unspecified columns

Specify a different names separator

us_rent_income |>
  pivot_wider(
    names_from = variable,
    names_sep = ".",
    values_from = c(estimate, moe)
  )

dcast(DT, GEOID + NAME ~ variable,
      value.var = c("estimate","moe"), 
      sep = ".")
# alt
DT[, dcast(.SD, GEOID + NAME ~ variable,
    value.var = c("estimate","moe"), 
          sep = ".")]

Alternative: Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to melt above)

Controlling how column names are combined

us_rent_income |>
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_vary = "slowest"
  ) |> names()

DT[, dcast(.SD, GEOID + NAME ~ variable,
          value.var = c("estimate","moe"))
  ][,c(1:3,5,4,6)] |> names()

## [1] "GEOID"          "NAME"            "estimate_income" "moe_income"     
## [5] "estimate_rent"  "moe_rent"

See {tidyr::pivot_wider} docs and the names_vary arg

Aggregation

warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = mean
  )
dcast(DT, tension ~ wool, 
          value.var = "breaks", fun = mean)
# alt
DT[, dcast(.SD, tension ~ wool, 
      value.var = "breaks", fun = mean)]

## # A tibble: 3 × 3
##  tension    A    B
##  <fct>  <dbl> <dbl>
## 1 L        44.6  28.2
## 2 M        24    28.8
## 3 H        24.6  18.8

Alternative: Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to melt above)

tidyr

separate via tstrsplit

dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))
#                           x
#                      <char>
# 1:   00531725 Male 2021 Neg
# 2: 07640613 Female 2020 Pos

cols <- c("personID", "gender", "year", "covidTest")

dt[, tstrsplit(x,
               split = " ",
               names = cols,
               type.convert = TRUE)]
#    personID gender  year covidTest
#       <int> <char> <int>    <char>
# 1:   531725   Male  2021       Neg
# 2:  7640613 Female  2020       Pos


dt[, tstrsplit(x,
               split = " ",
               names = cols,
               type.convert = list(as.character = 1,
                                   as.factor = c(2, 4),
                                   as.integer = 3)
               )]
#    personID gender   year covidTest
#      <char> <fctr>  <int>    <fctr>
# 1: 00531725   Male   2021       Neg
# 2: 07640613 Female   2020       Pos

User Defined Functions

env


iris_dt <- as.data.table(iris)
square = function(x) x^2

iris_dt[filter_col %in% filter_val,
        .(var1, var2, out = outer(inner(var1) + inner(var2))),
        by = by_col,
        env = list(
          outer = "sqrt",
          inner = "square",
          var1 = "Sepal.Length",
          var2 = "Sepal.Width",
          out = "Sepal.Hypotenuse",
          filter_col = "Species",
          filter_val = I("versicolor"),
          by_col =  "Species"
        )] |> 
  head(n = 3)
#       Species Sepal.Length Sepal.Width Sepal.Hypotenuse
#        <fctr>        <num>       <num>            <num>
# 1: versicolor          7.0         3.2         7.696753
# 2: versicolor          6.4         3.2         7.155418
# 3: versicolor          6.9         3.1         7.564390

Variables are included in the standard i, j, and by syntax
env contains the (quoted) variable values
- i.e. argument values in the typical R udf syntax (function(x = val1))
- Can use other UDFs as values which is demonstrated by inner = “square”

Recipes

Operations covered in this section
- group_by, summarize (and arrange)
- crosstab

group_by, summarize (and arrange)

dt_res <- dtstudy[, .(n = .N, avg = round(mean(y), 1)), keyby = .(male, over65, rx)]

tb_study <- tibble::as_tibble(dtstudy)
tb_res <- tb_study |>
  summarize(n = n(),
            avg = round(mean(y), 1),
            .by = c(male, over65, rx)) |>
  arrange(male, over65, rx)

dt automatically orders by the grouping variables, so to get the exact output, you have to add an arrange

Crosstab using cube (Titanic5 dataset)

# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm=TRUE)
# Create a function that counts the number of non-NA values
Nna <- function(x) sum(! is.na(x))

cube(d, .(Proportion=mn(survived), N=Nna(survived)), by=.q(sex, class), id=TRUE)

#>     grouping    sex class Proportion    N
#> 1:         0 female     1  0.9652778  144
#> 2:         0   male     1  0.3444444  180
#> 3:         0   male     2  0.1411765  170
#> 4:         0 female     2  0.8867925  106
#> 5:         0   male     3  0.1521298  493
#> 6:         0 female     3  0.4907407  216
#> 7:         1 female    NA  0.7274678  466
#> 8:         1   male    NA  0.1909846  843
#> 9:         2   <NA>     1  0.6203704  324
#> 10:        2   <NA>     2  0.4275362  276
#> 11:        2   <NA>     3  0.2552891  709
#> 12:        3   <NA>    NA  0.3819710 1309