data.table
Misc
Syntax
DT[i, j, by] ## R: i j by ## SQL: where | order by select | update group by- Take data.table DT, subset rows using i, and manipulate columns with j, grouped according to by.
Resources
- Docs but it’s difficult to find anything.
- The philosophy of the package is highly dependent on syntax, so the reference page is not very useful in finding out how to perform certain operations as it usually is with other packages.
- The search doesn’t include the articles which contain a lot of information.
- Also, it’s an old package, and every old article, changelog, etc. is in the docs. So, if you find something you think answers your question, it may be that that syntax is outdated.
- Introduction to data.table (vignette)
- Syntax Reference (link)
- Symbol Reference (link)
- Docs but it’s difficult to find anything.
setDT(df)- Fast conversion of a data frame or list to a data.table without copying- Use when working with larger data sets that take up a considerable amount of RAM (several GBs) because the operation will modify each object in place, conserving memory.
as.data.table(matrix)should be used for matricesdat <- data.table(df)can be used for small datasets but there’s no reason to.setDT(copy(df))if you want to work with a copy of the df instead of converting the original object.
Chaining: see Pivoting >> melt >> Multiple variables stored in column names for an example
Piping
dt |> _[, do_stuff(column), by = group] |> _[, do_something_else(othr_col), by = othr_grp]The
_placeholder allows you to use R’s native pipe.Example
penguins[species == "Chinstrap"] |> _[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)] # or penguins[species == "Chinstrap"] |> DT( , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island))
Symbols
.SDis a data.table containing the Subset of DT’s Data for each group, excluding any columns used inby(orkeyby). Its usage is still confusing to me.:=is the walrus operator.letis an alias. Think it acts likedplyr::mutateor maybedplyr::summarize. (Docs)DT[i, colC := mean(colB), by = colA] DT[i, `:=`(colC = sum(colB), colD = sum(colE)) by = colF] DT[i, let(colC = sum(colB), colD = sum(colE)), by = colF].Iis the row index. It’s an integer vector equal toseq_len(nrow(x))dt <- data.table( a = 1:3, b = 4:6 ) dt[, .(a, b, rowsum = sum(.SD)), by = .I] #> I a b rowsum #> <int> <int> <int> <int> #> 1: 1 1 4 5 #> 2: 2 2 5 7 #> 3: 3 3 6 9
Basic Usage
Using i
- We can subset rows similar to a
data.frame- except you don’t have to useDT$repetitively since columns within the frame of adata.tableare seen as if they are variables. - We can also sort a
data.tableusingorder(), which internally usesdata.table’s fast order for performance. - We can do much more in
iby keying adata.table, which allows blazing fast subsets and joins. We will see this in the “Keys and fast binary search based subsets” and “Joins and rolling joins” vignette.
Using j
- Select columns the
data.tableway:DT[, .(colA, colB)]. - Select columns the
data.frameway:DT[, c("colA", "colB")]. - Compute on columns:
DT[, .(sum(colA), mean(colB))]. - Provide names if necessary:
DT[, .(sA =sum(colA), mB = mean(colB))]. - Combine with
i:DT[colA > value, sum(colB)].
Using by
- Using
by, we can group by columns by specifying a list of columns or a character vector of column names or even expressions. The flexibility ofj, combined withbyandimakes for a very powerful syntax. bycan handle multiple columns and also expressions.- We can
keybygrouping columns to automatically sort the grouped result. - We can use
.SDand.SDcolsinjto operate on multiple columns using already familiar base functions. Here are some examples:DT[, lapply(.SD, fun), by = ..., .SDcols = ...]- appliesfunto all columns specified in.SDcolswhile grouping by the columns specified inby.DT[, head(.SD, 2), by = ...]- return the first two rows for each group.DT[col > val, head(.SD, 1), by = ...]- combineialong withjandby.
Columns
Rename Columns
setnames(DT, old = c("SIMD2020v2_Income_Domain_Rank", "SIMD2020_Employment_Domain_Rank", "SIMD2020_Health_Domain_Rank", "SIMD2020_Education_Domain_Rank", "SIMD2020_Access_Domain_Rank", "SIMD2020_Crime_Domain_Rank", "SIMD2020_Housing_Domain_Rank", "CP_Name"), new = c("Income", "Employment", "Health", "Education", "Access", "Crime", "Housing", "areaname"))
Filtering
Fast filtering mechanism; reorders rows (increasing) to group by the values in the key columns. Reordered rows make them easier to find and subset.
- All types of columns can be used except list and complex
Operations covered in this section
- Filtering
- Filter, select
- Filter, groupby, summarize
- If-Else
Set Keys - Says order in the increasing direction according to origin and then dest.
setkey(flights, origin, dest) head(flights) # year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 1 2 -2 -25 EV EWR ALB 30 143 7 # 2: 2014 1 3 88 79 EV EWR ALB 29 143 23 # 3: 2014 1 4 220 211 EV EWR ALB 32 143 15 # 4: 2014 1 4 35 19 EV EWR ALB 32 143 7 # 5: 2014 1 5 47 42 EV EWR ALB 26 143 8 # 6: 2014 1 5 66 62 EV EWR ALB 31 143 23Filter by origin == “JFK” and dest == “MIA”
flights[.("JFK", "MIA")] # year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 1 1 -1 -17 AA JFK MIA 161 1089 15 # 2: 2014 1 1 7 -8 AA JFK MIA 166 1089 9 # 3: 2014 1 1 2 -1 AA JFK MIA 164 1089 12 # 4: 2014 1 1 6 3 AA JFK MIA 157 1089 5 # 5: 2014 1 1 6 -12 AA JFK MIA 154 1089 17 # --- # 2746: 2014 10 31 -1 -22 AA JFK MIA 148 1089 16 # 2747: 2014 10 31 -3 -20 AA JFK MIA 146 1089 8 # 2748: 2014 10 31 2 -17 AA JFK MIA 150 1089 6 # 2749: 2014 10 31 -3 -12 AA JFK MIA 150 1089 5 # 2750: 2014 10 31 29 4 AA JFK MIA 146 1089 19Filter by only the first key column (origin):
flights["JFK"]Filter by only the second key column (dest)
flights[.(unique(), "MIA")] # year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 1 1 -5 -17 AA EWR MIA 161 1085 16 # 2: 2014 1 1 -3 -10 AA EWR MIA 154 1085 6 # 3: 2014 1 1 -5 -8 AA EWR MIA 157 1085 11 # 4: 2014 1 1 43 42 UA EWR MIA 155 1085 15 # 5: 2014 1 1 60 49 UA EWR MIA 162 1085 21 # --- # 9924: 2014 10 31 -11 -8 AA LGA MIA 157 1096 13 # 9925: 2014 10 31 -5 -11 AA LGA MIA 150 1096 9 # 9926: 2014 10 31 -2 10 AA LGA MIA 156 1096 6 # 9927: 2014 10 31 -2 -16 AA LGA MIA 156 1096 19 # 9928: 2014 10 31 1 -11 US LGA MIA 164 1096 15Filter by origin and dest values, then summarize and pull maximum of arr_delay
flights[.("LGA", "TPA"), max(arr_delay)] # [1] 486Filter by three origin values, one dest value, return the last row for each match
flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last"] # year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18 # 2: NA NA NA NA NA NA JFK XNA NA NA NA # 3: 2014 2 3 231 268 EV EWR XNA 184 1131 12- Filtering by more than one key value returns combinations of the first key and second key
- Remember setting a key reorders (increasing)
Summarize
Example: groupby state + min, max, mean
D[ ,.(mean = mean(measurement), min = min(measurement), max = max(measurement)), by=state] # Supposedly faster rbindlist(lapply(unique(D$state), \(x) data.table(state = x, y[state == x, .(mean(measurement), min(measurement), max(measurement)) ] )))Filter by origin and dest values, then select a arr.delay column:
flights[.("LGA", "TPA"), .(arr_delay)]Filter by origin value, group_by month, summarize( max(dep_delay))
ans <- flights["JFK", max(dep_delay), keyby = month] head(ans) # month V1 # 1: 1 881 # 2: 2 1014 # 3: 3 920 # 4: 4 1241 # 5: 5 853 # 6: 6 798 key(ans) # [1] "month"- keyby groups and sets the key to month
Joins
Left Equal Join
DT <- lookup[DT, on = .(DataZone = Data_Zone)] DT <- merge(lookup, DT, by.x = "DataZone", by.y = "Data_Zone")- DT: A datatable where the id column is Data_Zone
- lookup: A datatable where the id column is DataZone
- Both datatables have the same number of rows so that makes this an Equal Join
- DT is joined to lookup, so the columns of lookup appear first (farthest left) then DT’s columns (farthest right) of the joined datatable.
- Subset Notation: The output datatable has the id column, DataZone, which is from lookup but the rows are ordered the same way as the input table, DT.
- It’s weird that the output’s rows are ordered according to the input datatable
merge: The output datatable has the id column, DataZone, which is from lookup, and the rows are ordered according to lookup- The subset way is the “data.table” way, because you perform calculations on the output using the j position whereas with
merge, it would require a chain or an extra line of code. But if the order of rows of the output matters, then I can’t find a way to reproduce themergeordering using the subset method.
Conditionals
Ifelse using hour
setkey(flights, hour) # hour has values 0-24 flights[.(24), hour := 0L]ifelse(hour == 24, 0, TRUE)- Consequence: since a key column value has changed, hour is no longer a key
Pivoting
pivot_longer and melt
Basic
relig_income |> pivot_longer(!religion, # keep religion as a column names_to = "income", # desired name for new column values_to = "count") # what data goes into the new column? melt(DT, id.vars = "religion", variable.name = "income", value.name = "count", variable.factor = FALSE) # added to keep output consistent with tidyrColumns have a common prefix and missing values are dropped
billboard |> pivot_longer( cols = starts_with("wk"), names_to = "week", names_prefix = "wk", values_to = "rank", values_drop_na = TRUE ) melt(DT, measure.vars = patterns("^wk"), variable.name = "week", value.name = "rank", na.rm = TRUE)Multiple variables stored in column names
who <- data.table(id = 1, new_sp_m5564 = 2, newrel_f65 = 3) # id new_sp_m5564 newrel_f65 # <num> <num> <num> # 1: 1 2 3 melt(who, measure.vars = measure(diagnosis, gender, ages, pattern = "new_?(.*)_(.)(.*)")) # id diagnosis gender ages value # <num> <char> <char> <char> <num> # 1: 1 sp m 5564 2 # 2: 1 rel f 65 3 # with tidyr who |> tidyr::pivot_longer( cols = !id, names_to = c("diagnosis", "gender", "age"), names_pattern = "new_?(.*)_(.)(.*)", values_to = "count") # # A tibble: 2 × 5 # id diagnosis gender age count # <dbl> <chr> <chr> <chr> <dbl> # 1 1 sp m 5564 2 # 2 1 rel f 65 3tstrsplitis DT’stidyr::separate
Matrix to long
anscombe |> pivot_longer( everything(), cols_vary = "slowest", names_to = c(".value", "set"), names_pattern = "(.)(.)" ) DT[,melt(.SD, variable.name = "set", value.name = c("x","y"), variable.factor = FALSE, measure.vars = patterns("^x","^y"))]
pivot_wider and dcast
Data in examples
fish_encounters
## # A tibble: 114 × 3 ## fish station seen ## <fct> <fct> <int> ## 1 4842 Release 1 ## 2 4842 I80_1 1 ## 3 4842 Lisbon 1 ## 4 4842 Rstr 1 ## 5 4842 Base_TD 1 ## 6 4842 BCE 1 ## 7 4842 BCW 1 ## 8 4842 BCE2 1 ## 9 4842 BCW2 1 ## 10 4842 MAE 1 ## # … with 104 more rows
Basic
fish_encounters |> pivot_wider(names_from = station, values_from = seen) dcast(DT, fish ~ station, value.var = "seen")Fill in missing values
fish_encounters |> pivot_wider(names_from = station, values_from = seen, values_fill = 0) dcast(DT, fish ~ station, value.var = "seen", fill = 0) # alt DT[, dcast(.SD, fish ~ station, value.var = "seen", fill = 0)]- Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to
meltabove)
- Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to
Generate column names from multiple variables
us_rent_income |> pivot_wider( names_from = variable, values_from = c(estimate, moe) ) dcast(DT, GEOID + NAME ~ variable, value.var = c("estimate","moe")) # alt dcast(DT, ... ~ variable, value.var = c("estimate","moe"))- Alternative: pass “…” to indicate all other unspecified columns
Specify a different names separator
us_rent_income |> pivot_wider( names_from = variable, names_sep = ".", values_from = c(estimate, moe) ) dcast(DT, GEOID + NAME ~ variable, value.var = c("estimate","moe"), sep = ".") # alt DT[, dcast(.SD, GEOID + NAME ~ variable, value.var = c("estimate","moe"), sep = ".")]- Alternative: Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to
meltabove)
- Alternative: Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. (see applied to
Controlling how column names are combined
us_rent_income |> pivot_wider( names_from = variable, values_from = c(estimate, moe), names_vary = "slowest" ) |> names() DT[, dcast(.SD, GEOID + NAME ~ variable, value.var = c("estimate","moe")) ][,c(1:3,5,4,6)] |> names() ## [1] "GEOID" "NAME" "estimate_income" "moe_income" ## [5] "estimate_rent" "moe_rent"- See {tidyr::pivot_wider} docs and the names_vary arg
Aggregation
warpbreaks %>% pivot_wider( names_from = wool, values_from = breaks, values_fn = mean ) dcast(DT, tension ~ wool, value.var = "breaks", fun = mean) # alt DT[, dcast(.SD, tension ~ wool, value.var = "breaks", fun = mean)] ## # A tibble: 3 × 3 ## tension A B ## <fct> <dbl> <dbl> ## 1 L 44.6 28.2 ## 2 M 24 28.8 ## 3 H 24.6 18.8- Alternative: Rather than have the DT inside
dcast, we can use .SD and havedcastinside DT, which is helpful for further chaining. (see applied tomeltabove)
- Alternative: Rather than have the DT inside
tidyr
separateviatstrsplitdt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos")) # x # <char> # 1: 00531725 Male 2021 Neg # 2: 07640613 Female 2020 Pos cols <- c("personID", "gender", "year", "covidTest") dt[, tstrsplit(x, split = " ", names = cols, type.convert = TRUE)] # personID gender year covidTest # <int> <char> <int> <char> # 1: 531725 Male 2021 Neg # 2: 7640613 Female 2020 Pos dt[, tstrsplit(x, split = " ", names = cols, type.convert = list(as.character = 1, as.factor = c(2, 4), as.integer = 3) )] # personID gender year covidTest # <char> <fctr> <int> <fctr> # 1: 00531725 Male 2021 Neg # 2: 07640613 Female 2020 Pos
User Defined Functions
env
iris_dt <- as.data.table(iris) square = function(x) x^2 iris_dt[filter_col %in% filter_val, .(var1, var2, out = outer(inner(var1) + inner(var2))), by = by_col, env = list( outer = "sqrt", inner = "square", var1 = "Sepal.Length", var2 = "Sepal.Width", out = "Sepal.Hypotenuse", filter_col = "Species", filter_val = I("versicolor"), by_col = "Species" )] |> head(n = 3) # Species Sepal.Length Sepal.Width Sepal.Hypotenuse # <fctr> <num> <num> <num> # 1: versicolor 7.0 3.2 7.696753 # 2: versicolor 6.4 3.2 7.155418 # 3: versicolor 6.9 3.1 7.564390- Variables are included in the standard i, j, and by syntax
- env contains the (quoted) variable values
- i.e. argument values in the typical R udf syntax (
function(x = val1)) - Can use other UDFs as values which is demonstrated by inner = “square”
- i.e. argument values in the typical R udf syntax (
Recipes
Operations covered in this section
- group_by, summarize (and arrange)
- crosstab
group_by,summarize(andarrange)dt_res <- dtstudy[, .(n = .N, avg = round(mean(y), 1)), keyby = .(male, over65, rx)] tb_study <- tibble::as_tibble(dtstudy) tb_res <- tb_study |> summarize(n = n(), avg = round(mean(y), 1), .by = c(male, over65, rx)) |> arrange(male, over65, rx)- dt automatically orders by the grouping variables, so to get the exact output, you have to add an
arrange
- dt automatically orders by the grouping variables, so to get the exact output, you have to add an
Crosstab using
cube(Titanic5 dataset)# Note that the mean of a 0/1 variable is the proportion of 1s mn <- function(x) mean(x, na.rm=TRUE) # Create a function that counts the number of non-NA values Nna <- function(x) sum(! is.na(x)) cube(d, .(Proportion=mn(survived), N=Nna(survived)), by=.q(sex, class), id=TRUE) #> grouping sex class Proportion N #> 1: 0 female 1 0.9652778 144 #> 2: 0 male 1 0.3444444 180 #> 3: 0 male 2 0.1411765 170 #> 4: 0 female 2 0.8867925 106 #> 5: 0 male 3 0.1521298 493 #> 6: 0 female 3 0.4907407 216 #> 7: 1 female NA 0.7274678 466 #> 8: 1 male NA 0.1909846 843 #> 9: 2 <NA> 1 0.6203704 324 #> 10: 2 <NA> 2 0.4275362 276 #> 11: 2 <NA> 3 0.2552891 709 #> 12: 3 <NA> NA 0.3819710 1309