data.table
Misc
Syntax
DT[i, j, by] ## R: i j by ## SQL: where | order by select | update group by
- Take data.table DT, subset rows using i, and manipulate columns with j, grouped according to by.
Resources
- Docs but it’s difficult to find anything.
- The philosophy of the package is highly dependent on syntax, so the reference page is not very useful in finding out how to perform certain operations as it usually is with other packages.
- The search doesn’t include the articles which contain a lot of information.
- Also, it’s an old package, and every old article, changelog, etc. is in the docs. So, if you find something you think answers your question, it may be that that syntax is outdated.
- Introduction to data.table (vignette)
- Syntax Reference (link)
- Symbol Reference (link)
- Docs but it’s difficult to find anything.
setDT(df)
- Fast conversion of a data frame or list to a data.table without copying- Use when working with larger data sets that take up a considerable amount of RAM (several GBs) because the operation will modify each object in place, conserving memory.
as.data.table(matrix)
should be used for matricesdat <- data.table(df)
can be used for small datasets but there’s no reason to.setDT(copy(df))
if you want to work with a copy of the df instead of converting the original object.
Chaining: see Pivoting >> melt >> Multiple variables stored in column names for an example
Piping
|> dt do_stuff(column), by = group] |> _[, do_something_else(othr_col), by = othr_grp] _[,
The
_
placeholder allows you to use R’s native pipe.Example
== "Chinstrap"] |> penguins[species mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)] _[ , .(# or == "Chinstrap"] |> penguins[species DT( , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island))
Symbols
.SD
is a data.table containing the Subset of DT’s Data for each group, excluding any columns used inby
(orkeyby
). Its usage is still confusing to me.:=
is the walrus operator.let
is an alias. Think it acts likedplyr::mutate
or maybedplyr::summarize
. (Docs):= mean(colB), by = colA] DT[i, colC DT[i,`:=`(colC = sum(colB), colD = sum(colE)) = colF] by DT[i,let(colC = sum(colB), colD = sum(colE)), = colF] by
.I
is the row index. It’s an integer vector equal toseq_len(nrow(x))
<- data.table( dt a = 1:3, b = 4:6 )rowsum = sum(.SD)), by = .I] dt[, .(a, b, #> I a b rowsum #> <int> <int> <int> <int> #> 1: 1 1 4 5 #> 2: 2 2 5 7 #> 3: 3 3 6 9
Basic Usage
Using i
- We can subset rows similar to a
data.frame
- except you don’t have to useDT$
repetitively since columns within the frame of adata.table
are seen as if they are variables. - We can also sort a
data.table
usingorder()
, which internally usesdata.table
’s fast order for performance. - We can do much more in
i
by keying adata.table
, which allows blazing fast subsets and joins. We will see this in the “Keys and fast binary search based subsets” and “Joins and rolling joins” vignette.
Using j
- Select columns the
data.table
way:DT[, .(colA, colB)]
. - Select columns the
data.frame
way:DT[, c("colA", "colB")]
. - Compute on columns:
DT[, .(sum(colA), mean(colB))]
. - Provide names if necessary:
DT[, .(sA =sum(colA), mB = mean(colB))]
. - Combine with
i
:DT[colA > value, sum(colB)]
.
Using by
- Using
by
, we can group by columns by specifying a list of columns or a character vector of column names or even expressions. The flexibility ofj
, combined withby
andi
makes for a very powerful syntax. by
can handle multiple columns and also expressions.- We can
keyby
grouping columns to automatically sort the grouped result. - We can use
.SD
and.SDcols
inj
to operate on multiple columns using already familiar base functions. Here are some examples:DT[, lapply(.SD, fun), by = ..., .SDcols = ...]
- appliesfun
to all columns specified in.SDcols
while grouping by the columns specified inby
.DT[, head(.SD, 2), by = ...]
- return the first two rows for each group.DT[col > val, head(.SD, 1), by = ...]
- combinei
along withj
andby
.
Columns
Rename Columns
setnames(DT, old = c("SIMD2020v2_Income_Domain_Rank", "SIMD2020_Employment_Domain_Rank", "SIMD2020_Health_Domain_Rank", "SIMD2020_Education_Domain_Rank", "SIMD2020_Access_Domain_Rank", "SIMD2020_Crime_Domain_Rank", "SIMD2020_Housing_Domain_Rank", "CP_Name"), new = c("Income", "Employment", "Health", "Education", "Access", "Crime", "Housing", "areaname"))
Filtering
Fast filtering mechanism; reorders rows (increasing) to group by the values in the key columns. Reordered rows make them easier to find and subset.
- All types of columns can be used except list and complex
Operations covered in this section
- Filtering
- Filter, select
- Filter, groupby, summarize
- If-Else
Set Keys - Says order in the increasing direction according to origin and then dest.
setkey(flights, origin, dest) head(flights) # year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 1 2 -2 -25 EV EWR ALB 30 143 7 # 2: 2014 1 3 88 79 EV EWR ALB 29 143 23 # 3: 2014 1 4 220 211 EV EWR ALB 32 143 15 # 4: 2014 1 4 35 19 EV EWR ALB 32 143 7 # 5: 2014 1 5 47 42 EV EWR ALB 26 143 8 # 6: 2014 1 5 66 62 EV EWR ALB 31 143 23
Filter by origin == “JFK” and dest == “MIA”
"JFK", "MIA")] flights[.(# year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 1 1 -1 -17 AA JFK MIA 161 1089 15 # 2: 2014 1 1 7 -8 AA JFK MIA 166 1089 9 # 3: 2014 1 1 2 -1 AA JFK MIA 164 1089 12 # 4: 2014 1 1 6 3 AA JFK MIA 157 1089 5 # 5: 2014 1 1 6 -12 AA JFK MIA 154 1089 17 # --- # 2746: 2014 10 31 -1 -22 AA JFK MIA 148 1089 16 # 2747: 2014 10 31 -3 -20 AA JFK MIA 146 1089 8 # 2748: 2014 10 31 2 -17 AA JFK MIA 150 1089 6 # 2749: 2014 10 31 -3 -12 AA JFK MIA 150 1089 5 # 2750: 2014 10 31 29 4 AA JFK MIA 146 1089 19
Filter by only the first key column (origin):
flights["JFK"]
Filter by only the second key column (dest)
unique(), "MIA")] flights[.(# year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 1 1 -5 -17 AA EWR MIA 161 1085 16 # 2: 2014 1 1 -3 -10 AA EWR MIA 154 1085 6 # 3: 2014 1 1 -5 -8 AA EWR MIA 157 1085 11 # 4: 2014 1 1 43 42 UA EWR MIA 155 1085 15 # 5: 2014 1 1 60 49 UA EWR MIA 162 1085 21 # --- # 9924: 2014 10 31 -11 -8 AA LGA MIA 157 1096 13 # 9925: 2014 10 31 -5 -11 AA LGA MIA 150 1096 9 # 9926: 2014 10 31 -2 10 AA LGA MIA 156 1096 6 # 9927: 2014 10 31 -2 -16 AA LGA MIA 156 1096 19 # 9928: 2014 10 31 1 -11 US LGA MIA 164 1096 15
Filter by origin and dest values, then summarize and pull maximum of arr_delay
"LGA", "TPA"), max(arr_delay)] flights[.(# [1] 486
Filter by three origin values, one dest value, return the last row for each match
c("LGA", "JFK", "EWR"), "XNA"), mult = "last"] flights[.(# year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18 # 2: NA NA NA NA NA NA JFK XNA NA NA NA # 3: 2014 2 3 231 268 EV EWR XNA 184 1131 12
- Filtering by more than one key value returns combinations of the first key and second key
- Remember setting a key reorders (increasing)
Summarize
Example: groupby state + min, max, mean
mean = mean(measurement), D[ ,.(min = min(measurement), max = max(measurement)), =state] by # Supposedly faster rbindlist(lapply(unique(D$state), data.table(state = x, \(x) == x, y[state mean(measurement), .(min(measurement), max(measurement)) ] )))
Filter by origin and dest values, then select a arr.delay column:
flights[.("LGA", "TPA"), .(arr_delay)]
Filter by origin value, group_by month, summarize( max(dep_delay))
<- flights["JFK", max(dep_delay), keyby = month] ans head(ans) # month V1 # 1: 1 881 # 2: 2 1014 # 3: 3 920 # 4: 4 1241 # 5: 5 853 # 6: 6 798 key(ans) # [1] "month"
- keyby groups and sets the key to month
Joins
Left Equal Join
<- lookup[DT, on = .(DataZone = Data_Zone)] DT <- merge(lookup, DT, by.x = "DataZone", by.y = "Data_Zone") DT
- DT: A datatable where the id column is Data_Zone
- lookup: A datatable where the id column is DataZone
- Both datatables have the same number of rows so that makes this an Equal Join
- DT is joined to lookup, so the columns of lookup appear first (farthest left) then DT’s columns (farthest right) of the joined datatable.
- Subset Notation: The output datatable has the id column, DataZone, which is from lookup but the rows are ordered the same way as the input table, DT.
- It’s weird that the output’s rows are ordered according to the input datatable
merge
: The output datatable has the id column, DataZone, which is from lookup, and the rows are ordered according to lookup- The subset way is the “data.table” way, because you perform calculations on the output using the j position whereas with
merge
, it would require a chain or an extra line of code. But if the order of rows of the output matters, then I can’t find a way to reproduce themerge
ordering using the subset method.
Conditionals
Ifelse using hour
setkey(flights, hour) # hour has values 0-24 24), hour := 0L] flights[.(
ifelse(hour == 24, 0, TRUE)
- Consequence: since a key column value has changed, hour is no longer a key
Pivoting
pivot_longer and melt
Basic
|> relig_income pivot_longer(!religion, # keep religion as a column names_to = "income", # desired name for new column values_to = "count") # what data goes into the new column? melt(DT, id.vars = "religion", variable.name = "income", value.name = "count", variable.factor = FALSE) # added to keep output consistent with tidyr
Columns have a common prefix and missing values are dropped
|> billboard pivot_longer( cols = starts_with("wk"), names_to = "week", names_prefix = "wk", values_to = "rank", values_drop_na = TRUE )melt(DT, measure.vars = patterns("^wk"), variable.name = "week", value.name = "rank", na.rm = TRUE)
Multiple variables stored in column names
<- data.table(id = 1, new_sp_m5564 = 2, newrel_f65 = 3) who # id new_sp_m5564 newrel_f65 # <num> <num> <num> # 1: 1 2 3 melt(who, measure.vars = measure(diagnosis, gender, ages,pattern = "new_?(.*)_(.)(.*)")) # id diagnosis gender ages value # <num> <char> <char> <char> <num> # 1: 1 sp m 5564 2 # 2: 1 rel f 65 3 # with tidyr |> who ::pivot_longer( tidyrcols = !id, names_to = c("diagnosis", "gender", "age"), names_pattern = "new_?(.*)_(.)(.*)", values_to = "count") # # A tibble: 2 × 5 # id diagnosis gender age count # <dbl> <chr> <chr> <chr> <dbl> # 1 1 sp m 5564 2 # 2 1 rel f 65 3
tstrsplit
is DT’stidyr::separate
Matrix to long
|> anscombe pivot_longer( everything(), cols_vary = "slowest", names_to = c(".value", "set"), names_pattern = "(.)(.)" )melt(.SD, DT[,variable.name = "set", value.name = c("x","y"), variable.factor = FALSE, measure.vars = patterns("^x","^y"))]
pivot_wider and dcast
Data in examples
fish_encounters
## # A tibble: 114 × 3 ## fish station seen ## <fct> <fct> <int> ## 1 4842 Release 1 ## 2 4842 I80_1 1 ## 3 4842 Lisbon 1 ## 4 4842 Rstr 1 ## 5 4842 Base_TD 1 ## 6 4842 BCE 1 ## 7 4842 BCW 1 ## 8 4842 BCE2 1 ## 9 4842 BCW2 1 ## 10 4842 MAE 1 ## # … with 104 more rows
Basic
|> fish_encounters pivot_wider(names_from = station, values_from = seen) dcast(DT, fish ~ station, value.var = "seen")
Fill in missing values
|> fish_encounters pivot_wider(names_from = station, values_from = seen, values_fill = 0) dcast(DT, fish ~ station, value.var = "seen", fill = 0) # alt dcast(.SD, fish ~ station, value.var = "seen", fill = 0)] DT[,
- Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to
melt
above)
- Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to
Generate column names from multiple variables
|> us_rent_income pivot_wider( names_from = variable, values_from = c(estimate, moe) ) dcast(DT, GEOID + NAME ~ variable, value.var = c("estimate","moe")) # alt dcast(DT, ... ~ variable, value.var = c("estimate","moe"))
- Alternative: pass “…” to indicate all other unspecified columns
Specify a different names separator
|> us_rent_income pivot_wider( names_from = variable, names_sep = ".", values_from = c(estimate, moe) ) dcast(DT, GEOID + NAME ~ variable, value.var = c("estimate","moe"), sep = ".") # alt dcast(.SD, GEOID + NAME ~ variable, DT[, value.var = c("estimate","moe"), sep = ".")]
- Alternative: Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to
melt
above)
- Alternative: Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to
Controlling how column names are combined
|> us_rent_income pivot_wider( names_from = variable, values_from = c(estimate, moe), names_vary = "slowest" |> names() ) dcast(.SD, GEOID + NAME ~ variable, DT[, value.var = c("estimate","moe")) c(1:3,5,4,6)] |> names() ][, ## [1] "GEOID" "NAME" "estimate_income" "moe_income" ## [5] "estimate_rent" "moe_rent"
- See {tidyr::pivot_wider} docs and the names_vary arg
Aggregation
%>% warpbreaks pivot_wider( names_from = wool, values_from = breaks, values_fn = mean )dcast(DT, tension ~ wool, value.var = "breaks", fun = mean) # alt dcast(.SD, tension ~ wool, DT[, value.var = "breaks", fun = mean)] ## # A tibble: 3 × 3 ## tension A B ## <fct> <dbl> <dbl> ## 1 L 44.6 28.2 ## 2 M 24 28.8 ## 3 H 24.6 18.8
- Alternative: Rather than have the DT inside
dcast
, we can use .SD and havedcast
inside DT, which is helpful for further chaining. (see applied tomelt
above)
- Alternative: Rather than have the DT inside
tidyr
separate
viatstrsplit
<- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos")) dt # x # <char> # 1: 00531725 Male 2021 Neg # 2: 07640613 Female 2020 Pos <- c("personID", "gender", "year", "covidTest") cols tstrsplit(x, dt[, split = " ", names = cols, type.convert = TRUE)] # personID gender year covidTest # <int> <char> <int> <char> # 1: 531725 Male 2021 Neg # 2: 7640613 Female 2020 Pos tstrsplit(x, dt[, split = " ", names = cols, type.convert = list(as.character = 1, as.factor = c(2, 4), as.integer = 3) )]# personID gender year covidTest # <char> <fctr> <int> <fctr> # 1: 00531725 Male 2021 Neg # 2: 07640613 Female 2020 Pos
User Defined Functions
env
<- as.data.table(iris) iris_dt = function(x) x^2 square %in% filter_val, iris_dt[filter_col out = outer(inner(var1) + inner(var2))), .(var1, var2, = by_col, by = list( env outer = "sqrt", inner = "square", var1 = "Sepal.Length", var2 = "Sepal.Width", out = "Sepal.Hypotenuse", filter_col = "Species", filter_val = I("versicolor"), by_col = "Species" |> )] head(n = 3) # Species Sepal.Length Sepal.Width Sepal.Hypotenuse # <fctr> <num> <num> <num> # 1: versicolor 7.0 3.2 7.696753 # 2: versicolor 6.4 3.2 7.155418 # 3: versicolor 6.9 3.1 7.564390
- Variables are included in the standard i, j, and by syntax
- env contains the (quoted) variable values
- i.e. argument values in the typical R udf syntax (
function(x = val1)
) - Can use other UDFs as values which is demonstrated by inner = “square”
- i.e. argument values in the typical R udf syntax (
Recipes
Operations covered in this section
- group_by, summarize (and arrange)
- crosstab
group_by
,summarize
(andarrange
)<- dtstudy[, .(n = .N, avg = round(mean(y), 1)), keyby = .(male, over65, rx)] dt_res <- tibble::as_tibble(dtstudy) tb_study <- tb_study |> tb_res summarize(n = n(), avg = round(mean(y), 1), .by = c(male, over65, rx)) |> arrange(male, over65, rx)
- dt automatically orders by the grouping variables, so to get the exact output, you have to add an
arrange
- dt automatically orders by the grouping variables, so to get the exact output, you have to add an
Crosstab using
cube
(Titanic5 dataset)# Note that the mean of a 0/1 variable is the proportion of 1s <- function(x) mean(x, na.rm=TRUE) mn # Create a function that counts the number of non-NA values <- function(x) sum(! is.na(x)) Nna cube(d, .(Proportion=mn(survived), N=Nna(survived)), by=.q(sex, class), id=TRUE) #> grouping sex class Proportion N #> 1: 0 female 1 0.9652778 144 #> 2: 0 male 1 0.3444444 180 #> 3: 0 male 2 0.1411765 170 #> 4: 0 female 2 0.8867925 106 #> 5: 0 male 3 0.1521298 493 #> 6: 0 female 3 0.4907407 216 #> 7: 1 female NA 0.7274678 466 #> 8: 1 male NA 0.1909846 843 #> 9: 2 <NA> 1 0.6203704 324 #> 10: 2 <NA> 2 0.4275362 276 #> 11: 2 <NA> 3 0.2552891 709 #> 12: 3 <NA> NA 0.3819710 1309