data.table
Misc
Syntax
```r
DT[i, j, by]
##   R:                 i                  j           by
## SQL:  where | order by  select | update     group by
```
- Take data.table DT, subset rows using i, and manipulate columns with j, grouped according to by.
Packages
- {data.table.threads} - Finds the optimal/ideal speedup (efficiency factor) and thread count for each parallelizable function for your machine.
Resources
- Docs, but it’s difficult to find anything.
  - The philosophy of the package is highly dependent on syntax, so the reference page is not as useful for finding out how to perform certain operations as it usually is with other packages.
  - The search doesn’t include the articles, which contain a lot of information.
  - Also, it’s an old package, and every old article, changelog, etc. is in the docs. So, if you find something you think answers your question, that syntax may be outdated.
- Introduction to data.table (vignette)
- Function Reference (link)
- Syntax Reference (link)
- Symbol Reference (link)
setDT(df)
- Fast conversion of a data frame or list to a data.table without copying.
- Use when working with larger data sets that take up a considerable amount of RAM (several GBs) because the operation modifies each object in place, conserving memory.
- `as.data.table(matrix)` should be used for matrices.
- `dat <- data.table(df)` can be used for small datasets, but there’s no reason to.
- `setDT(copy(df))` if you want to work with a copy of the df instead of converting the original object.
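A minimal sketch of the in-place vs. copy behavior (toy data frame assumed, not from the source):

```r
library(data.table)

df <- data.frame(a = 1:3, b = letters[1:3])
setDT(df)                # converts df to a data.table in place, no copy
class(df)                # now c("data.table", "data.frame")

df2 <- data.frame(a = 1:3)
dt2 <- setDT(copy(df2))  # df2 itself is untouched; dt2 is the converted copy
class(df2)               # still plain "data.frame"
```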
Chaining:
Syntax
DT[ ... ][ ... ][ ... ]
Piping
```r
dt |>
  _[, do_stuff(column), by = group] |>
  _[, do_something_else(othr_col), by = othr_grp]
```

- The `_` placeholder allows you to use R’s native pipe.

Example

```r
penguins[species == "Chinstrap"] |>
  _[, .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

# or
penguins[species == "Chinstrap"] |>
  DT( , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island))
```
Symbols and Keywords
:=

- `:=` is the walrus operator. `let` is an alias. Think it acts like `dplyr::mutate` or maybe `dplyr::summarize`. (Docs)

```r
DT[i, colC := mean(colB), by = colA]
DT[i, `:=`(colC = sum(colB), colD = sum(colE)), by = colF]
DT[i, let(colC = sum(colB), colD = sum(colE)), by = colF]
```
.EACHI

- See Joins >> Joining and Aggregating for examples
- When i is a list (or data.frame or data.table), `DT[i, j, by = .EACHI]` evaluates j for the groups in `DT` that each row in i joins to.
- That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i.
.I

- `.I` is the row index. It’s an integer vector equal to `seq_len(nrow(x))`.

```r
dt <- data.table(
  a = 1:3,
  b = 4:6
)

dt[, .(a, b, rowsum = sum(.SD)), by = .I]
#>        I     a     b rowsum
#>    <int> <int> <int>  <int>
#> 1:     1     1     4      5
#> 2:     2     2     5      7
#> 3:     3     3     6      9
```
.N

- `.N` holds the number of observations in the current group. It is particularly useful when combined with `by`.
- Seems to act like `nrow` for a df or `length` for a vector and is commonly used for counting the number of rows.
- If `.N` is not wrapped with `list` or `.()` (which returns a df), a vector is returned.
- If a name is not provided for the column returned in `j`, it’s named N automatically by recognizing the special symbol `.N`.

Example: Count filtered rows

```r
ans <- flights[origin == "JFK" & month == 6L, .N]
ans
#> [1] 8422
```
- In the absence of group `by` operations, it simply returns the number of rows in the subset.
Example: Group by + count
```r
ans <- flights[, .(.N), by = .(origin)]
ans
#>    origin     N
#>    <char> <int>
#> 1:    JFK 81483
#> 2:    LGA 84433
#> 3:    EWR 87400

## or equivalently using a character vector in 'by'
# ans <- flights[, .(.N), by = "origin"]

## or equivalently when there's only 1 col in j and by
# ans <- flights[, .N, by = origin]
```
.NATURAL

- `.NATURAL` is a join keyword. It selects the columns to perform the match on based on common column names in both tables.
- Only the keyword has to be specified, not any joining variables.

Example:

```r
Products[ProductReceived, on = .NATURAL]
```
.SD

- `.SD` is a data.table containing the Subset of DT’s Data for each group, excluding any columns used in `by` (or `keyby`).
- If you don’t want to act on every column except the grouping column(s) (`by`), use `.SDcols` to select the set of columns you want.
  - `!` can be used to deselect columns: `.SDcols = !c("unit", "type")`
- See Basic Usage >> Using by for more details
Example: Calculate the mean on multiple columns grouped by multiple columns

```r
flights[carrier == "AA",                        # <1>
        lapply(.SD, mean),                      # <2>
        by = .(origin, dest, month),            # <3>
        .SDcols = c("arr_delay", "dep_delay")]  # <4>
#>      origin   dest month  arr_delay  dep_delay
#>      <char> <char> <int>      <num>      <num>
#>   1:    JFK    LAX     1   6.590361 14.2289157
#>   2:    LGA    PBI     1  -7.758621  0.3103448
#>   3:    EWR    LAX     1   1.366667  7.5000000
#>   4:    JFK    MIA     1  15.720670 18.7430168
#>   5:    JFK    SEA     1  14.357143 30.7500000
#>  ---
#> 196:    LGA    MIA    10  -6.251799 -1.4208633
#> 197:    JFK    MIA    10  -1.880184  6.6774194
#> 198:    EWR    PHX    10  -3.032258 -4.2903226
#> 199:    JFK    MCO    10 -10.048387 -1.6129032
#> 200:    JFK    DCA    10  16.483871 15.5161290
```

1. Filter on AA trips
2. Compute the mean on every column that isn’t a grouping column
3. Group the calculation by origin, dest, month
4. Actually not all the non-grouping columns in `.SD`, just these ones
Basic Usage
Using i

- We can subset rows similar to a `data.frame`, except you don’t have to use `DT$` repetitively since columns within the frame of a `data.table` are seen as if they are variables.
- We can also sort a `data.table` using `order()`, which internally uses `data.table`’s fast order for performance.
- We can do much more in `i` by keying a `data.table`, which allows blazing fast subsets and joins. We will see this in the “Keys and fast binary search based subsets” and “Joins and rolling joins” vignettes.
Using j

- Select columns the `data.table` way: `DT[, .(colA, colB)]`.
- Select columns the `data.frame` way: `DT[, c("colA", "colB")]`.
- Compute on columns: `DT[, .(sum(colA), mean(colB))]`.
- Provide names if necessary: `DT[, .(sA = sum(colA), mB = mean(colB))]`.
- Combine with `i`: `DT[colA > value, sum(colB)]`.
Using by

- Using `by`, we can group by columns by specifying a list of columns, a character vector of column names, or even expressions. The flexibility of `j`, combined with `by` and `i`, makes for a very powerful syntax. `by` can handle multiple columns and also expressions.
- We can `keyby` grouping columns to automatically sort the grouped result.
- We can use `.SD` and `.SDcols` in `j` to operate on multiple columns using already familiar base functions. Here are some examples:
  - `DT[, lapply(.SD, fun), by = ..., .SDcols = ...]` - applies `fun` to all columns specified in `.SDcols` while grouping by the columns specified in `by`.
  - `DT[, head(.SD, 2), by = ...]` - return the first two rows for each group.
  - `DT[col > val, head(.SD, 1), by = ...]` - combine `i` along with `j` and `by`.
Columns
Rename Columns
```r
setnames(DT,
         old = c("SIMD2020v2_Income_Domain_Rank", "SIMD2020_Employment_Domain_Rank",
                 "SIMD2020_Health_Domain_Rank", "SIMD2020_Education_Domain_Rank",
                 "SIMD2020_Access_Domain_Rank", "SIMD2020_Crime_Domain_Rank",
                 "SIMD2020_Housing_Domain_Rank", "CP_Name"),
         new = c("Income", "Employment", "Health", "Education",
                 "Access", "Crime", "Housing", "areaname"))
```
Get the first two rows of a column
```r
ans <- flights[, head(.SD, 2), by = month]
head(ans)
#    month  year   day dep_delay arr_delay carrier origin   dest air_time distance  hour
#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int>
# 1:     1  2014     1        14        13      AA    JFK    LAX      359     2475     9
# 2:     1  2014     1        -3        13      AA    JFK    LAX      363     2475    11
# 3:     2  2014     1        -1         1      AA    JFK    LAX      358     2475     8
# 4:     2  2014     1        -5         3      AA    JFK    LAX      358     2475    11
# 5:     3  2014     1       -11        36      AA    JFK    LAX      375     2475     8
# 6:     3  2014     1        -3        14      AA    JFK    LAX      368     2475    11
```
Filtering
Keys provide a fast filtering mechanism; setting a key reorders rows (increasing) by the values in the key columns. The reordered rows make them easier to find and subset.

- All types of columns can be used except list and complex.
Operations covered in this section
- Filtering
- Filter, select
- Filter, groupby, summarize
- If-Else
Set Keys - orders rows in the increasing direction according to origin and then dest.

```r
setkey(flights, origin, dest)
head(flights)
#    year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour
# 1: 2014     1     2        -2       -25      EV    EWR    ALB       30      143     7
# 2: 2014     1     3        88        79      EV    EWR    ALB       29      143    23
# 3: 2014     1     4       220       211      EV    EWR    ALB       32      143    15
# 4: 2014     1     4        35        19      EV    EWR    ALB       32      143     7
# 5: 2014     1     5        47        42      EV    EWR    ALB       26      143     8
# 6: 2014     1     5        66        62      EV    EWR    ALB       31      143    23
```
Filter by origin == “JFK” and dest == “MIA”
"JFK", "MIA")] flights[.(# year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 1 1 -1 -17 AA JFK MIA 161 1089 15 # 2: 2014 1 1 7 -8 AA JFK MIA 166 1089 9 # 3: 2014 1 1 2 -1 AA JFK MIA 164 1089 12 # 4: 2014 1 1 6 3 AA JFK MIA 157 1089 5 # 5: 2014 1 1 6 -12 AA JFK MIA 154 1089 17 # --- # 2746: 2014 10 31 -1 -22 AA JFK MIA 148 1089 16 # 2747: 2014 10 31 -3 -20 AA JFK MIA 146 1089 8 # 2748: 2014 10 31 2 -17 AA JFK MIA 150 1089 6 # 2749: 2014 10 31 -3 -12 AA JFK MIA 150 1089 5 # 2750: 2014 10 31 29 4 AA JFK MIA 146 1089 19
Filter by only the first key column (origin):
flights["JFK"]
Filter by only the second key column (dest)
```r
flights[.(unique(origin), "MIA")]
#       year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour
#    1: 2014     1     1        -5       -17      AA    EWR    MIA      161     1085    16
#    2: 2014     1     1        -3       -10      AA    EWR    MIA      154     1085     6
#    3: 2014     1     1        -5        -8      AA    EWR    MIA      157     1085    11
#    4: 2014     1     1        43        42      UA    EWR    MIA      155     1085    15
#    5: 2014     1     1        60        49      UA    EWR    MIA      162     1085    21
#   ---
# 9924: 2014    10    31       -11        -8      AA    LGA    MIA      157     1096    13
# 9925: 2014    10    31        -5       -11      AA    LGA    MIA      150     1096     9
# 9926: 2014    10    31        -2        10      AA    LGA    MIA      156     1096     6
# 9927: 2014    10    31        -2       -16      AA    LGA    MIA      156     1096    19
# 9928: 2014    10    31         1       -11      US    LGA    MIA      164     1096    15
```
Filter by origin and dest values, then summarize and pull maximum of arr_delay
"LGA", "TPA"), max(arr_delay)] flights[.(# [1] 486
Filter by three origin values, one dest value, return the last row for each match
c("LGA", "JFK", "EWR"), "XNA"), mult = "last"] flights[.(# year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18 # 2: NA NA NA NA NA NA JFK XNA NA NA NA # 3: 2014 2 3 231 268 EV EWR XNA 184 1131 12
- Filtering by more than one key value returns combinations of the first key and second key
- Remember setting a key reorders (increasing)
Count
- Also see `.N` in Misc >> Symbols for basic examples

Example: Group by multiple variables and count

```r
ans <- flights[carrier == "AA",
               .N,
               by = .(origin, dest)]
head(ans)
#>    origin   dest     N
#>    <char> <char> <int>
#> 1:    JFK    LAX  3387
#> 2:    LGA    PBI   245
#> 3:    EWR    LAX    62
#> 4:    JFK    MIA  1876
#> 5:    JFK    SEA   298
#> 6:    EWR    MIA   848
```
Example: Order the group by + count
```r
ans <- flights[carrier == "AA",
               .N,
               by = .(origin, dest)][order(origin, -dest)]
head(ans, 10)
#>     origin   dest     N
#>     <char> <char> <int>
#>  1:    EWR    PHX   121
#>  2:    EWR    MIA   848
#>  3:    EWR    LAX    62
#>  4:    EWR    DFW  1618
#>  5:    JFK    STT   229
#>  6:    JFK    SJU   690
#>  7:    JFK    SFO  1312
#>  8:    JFK    SEA   298
#>  9:    JFK    SAN   299
#> 10:    JFK    ORD   432
```
- The `-dest` says when ordering by dest, do it in descending order.
Example: Count by condition

```r
ans <- flights[, .N, .(dep_delay > 0, arr_delay > 0)]
ans
#>    dep_delay arr_delay      N
#>       <lgcl>    <lgcl>  <int>
#> 1:      TRUE      TRUE  72836
#> 2:     FALSE      TRUE  34583
#> 3:     FALSE     FALSE 119304
#> 4:      TRUE     FALSE  26593
```

- Other columns can be included in the expression, e.g. `DT[, .N, by = .(a, b > 0)]`
Summarize
Example: groupby state + min, max, mean

```r
D[, .(min = min(measurement),
      max = max(measurement),
      mean = mean(measurement)),
  by = state]

# Supposedly faster
rbindlist(lapply(unique(D$state),
                 \(x) data.table(state = x,
                                 y[state == x,
                                   .(min(measurement),
                                     max(measurement),
                                     mean(measurement))])))
```
Filter by origin and dest values, then select the arr_delay column:

```r
flights[.("LGA", "TPA"), .(arr_delay)]
```
Filter by origin value, group by month, summarize with max(dep_delay)

```r
ans <- flights["JFK", max(dep_delay), keyby = month]
head(ans)
#    month   V1
# 1:     1  881
# 2:     2 1014
# 3:     3  920
# 4:     4 1241
# 5:     5  853
# 6:     6  798

key(ans)
# [1] "month"
```

- keyby groups and sets the key to month
Across
```r
# Across all columns
DT[, names(.SD) := lapply(.SD, fun)]

# Across all numeric columns
DT[, names(.SD) := lapply(.SD, fun), .SDcols = is.numeric]
```
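A concrete sketch of the across-numeric-columns pattern, with a toy table and a rescaling function both assumed for illustration:

```r
library(data.table)
dt <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))

# rescale every numeric column to [0, 1] in place; the character column g is untouched
dt[, names(.SD) := lapply(.SD, \(v) v / max(v)), .SDcols = is.numeric]
```

Note that `names(.SD)` on the left-hand side of `:=` updates exactly the columns selected by `.SDcols`, so the selection only has to be written once.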
Joins
Misc
Syntax

```r
x[i,        # <1>
  on,       # <2>
  nomatch]  # <3>
```

1. x is the secondary data.table; i is the primary data.table, list, or data frame
2. on takes a character vector with the common variables to be joined on (e.g. `c("var1", "var2")`) or a list defining the matching logic, e.g. `list(var1 = var2)`
3. If `nomatch = NULL`, only rows matched in both the x and i tables are returned

- For join operations, it is recommended to pass the on and nomatch arguments by name to avoid using j and by when they are not needed.
- See Misc >> Symbols >> `.NATURAL` for details on a Natural join
- A Keyed Join is a join where both tables have the same columns defined as keys

```r
Products[ProductReceived]
```

- No join variables have to be specified
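A keyed join can be sketched with toy tables (names and columns assumed, not the vignette’s data):

```r
library(data.table)
products <- data.table(id = 1:3, name = c("banana", "carrots", "popcorn"))
received <- data.table(id = c(1L, 3L), count = c(100L, 150L))

# with matching keys set on both tables, x[i] joins on the key columns
setkey(products, id)
setkey(received, id)
joined <- products[received]   # one row per row of received
```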
Left Join
```r
ProductReceived[Products,
                on = list(product_id = id)]
```

- All rows in Products (the i table) are kept
- product_id belongs to ProductReceived and id belongs to Products
- For a Right Join, the order of the tables and join variables is reversed
Inner Join
```r
# First Table
Products[ProductReceived,
         on = c("id" = "product_id"),
         nomatch = NULL]
```

- Keeps only rows matched in both tables
- id belongs to Products and product_id to ProductReceived
Chaining Joins
```r
NewTax[Products,
       on = c("unit", "type")              # <1>
][, ProductReceived[.SD,                   # <2>
                    on = list(product_id = id)],
   .SDcols = !c("unit", "type")]           # <3>
#        id       date product_id count tax_prop       name price
#     <int>     <IDat>      <int> <int>    <num>     <char> <num>
#  1:     2 2024-01-15          1   100       NA     banana  0.63
#  2:     4 2024-01-29          1   150       NA     banana  0.63
#  3:     6 2024-02-12          1   150       NA     banana  0.63
#  4:     9 2024-03-04          1   100       NA     banana  0.63
#  5:     5 2024-02-05          2    50       NA    carrots  0.89
#  6:     7 2024-02-19          2   150       NA    carrots  0.89
#  7:     8 2024-02-26          2   100       NA    carrots  0.89
#  8:    10 2024-03-11          3   150     0.65    popcorn  2.99
#  9:    NA       <NA>          4    NA     0.20       soda  1.49
# 10:     1 2024-01-08         NA   150     0.65 toothpaste  2.99
```

1. First, the NewTax and Products tables are joined on two common variables: unit and type
2. Then, the resulting table is joined to ProductReceived, where product_id belongs to ProductReceived and id belongs to the previously joined table (originally Products)
3. The `.SD` keyword refers to the previously joined table’s columns; `.SDcols` subsets those columns by removing unit and type
Non-Equi Join
Where the condition for matching rows is not based on equality, but on other comparison operators like <, >, <=, or >=

Example: How many of product “2” were received before a certain date

```r
ProductSalesProd2 = ProductSales[product_id == 2L]
#>       id       date product_id count
#>    <int>     <IDat>      <int> <int>
#> 2:     2 2024-01-11          2   150
#> 7:     7 2024-02-06          2   150

ProductReceivedProd2 = ProductReceived[product_id == 2L]
#>       id       date product_id count
#>    <int>     <IDat>      <int> <int>
#> 5:     5 2024-02-05          2    50
#> 7:     7 2024-02-19          2   150
#> 8:     8 2024-02-26          2   100

# ProductReceivedProd2[ProductSalesProd2,
#                      on = "product_id",
#                      allow.cartesian = TRUE]
#>       id       date product_id count  i.id     i.date i.count
#>    <int>     <IDat>      <int> <int> <int>     <IDat>   <int>
#> 1:     5 2024-02-05          2    50     2 2024-01-11     150
#> 2:     7 2024-02-19          2   150     2 2024-01-11     150
#> 3:     8 2024-02-26          2   100     2 2024-01-11     150
#> 4:     5 2024-02-05          2    50     7 2024-02-06     150
#> 5:     7 2024-02-19          2   150     7 2024-02-06     150
#> 6:     8 2024-02-26          2   100     7 2024-02-06     150

ProductReceivedProd2[ProductSalesProd2,
                     on = "product_id",
                     allow.cartesian = TRUE
][date < i.date]
#>       id       date product_id count  i.id     i.date i.count
#>    <int>     <IDat>      <int> <int> <int>     <IDat>   <int>
#> 1:     5 2024-02-05          2    50     7 2024-02-06     150

# or equivalently
ProductReceivedProd2[ProductSalesProd2,
                     on = list(product_id, date < date),
                     nomatch = NULL]
#>       id       date product_id count  i.id i.count
#>    <int>     <IDat>      <int> <int> <int>   <int>
#> 1:     5 2024-02-06          2    50     7     150
```

- To show the process of how the join is made, `allow.cartesian` is set to TRUE, which allows the combining of each row from one table with every row from the other table, which is called a Cross Join.
  - Only use with subsetted tables, as it necessarily results in duplicated values and thus a larger table than the original.
  - The middle table with the code that I’ve commented out shows the results of this.
- It allows us to compare variable values of both tables given a common variable value.
  - In this example, product sales dates (date) and product received dates (i.date) are compared given `product_id == 2`.
- `nomatch = NULL` is specified because this is a right-join and some rows with NA values can otherwise be returned.
Joining and Aggregating with .EACHI

- See Misc >> Symbols and Keywords >> `.EACHI` for a description
- `.EACHI` requires j to be specified, which is where the aggregating takes place

Example: Group by and summarize

```r
dt1 = ProductReceived[
  Products,
  on = c("product_id" = "id"),
  by = .EACHI,
  j = .(total_value_received = sum(price * count))
]

# or equivalently
dt2 = ProductReceived[
  Products,
  on = c("product_id" = "id")
][, .(total_value_received = sum(price * count)),
  by = "product_id"
]
```

- Joins the two tables, groups by the joining variable, and summarizes with a calculation.
Example: Non-Equi (source)

```r
tab1[tab2,
     on = .(cat_var,
            start_date_tab1 >= start_date_tab2,
            start_date_tab1 < fin_date_tab2),
     by = .EACHI,
     j = .N]
```

- Joins tables tab1 and tab2
- Filters the data according to date variables from both tables
- Counts each category in cat_var in the filtered data
- cat_var should be present in both tables (I think)
Rolling Join
Allows you to match rows based on the nearest value in a sorted column, typically a date or time column.
This is valuable when you need to align data from different sources that may not have exactly matching timestamps, or when you want to carry forward the most recent value.
Example: In financial data, you might use a rolling join to assign the most recent stock price to each transaction, even if the price updates and transactions don’t occur at the exact same times.
Example: Match sales with the most recent product information.

```r
ProductPriceHistory[ProductSales,
                    on = .(product_id, date),
                    roll = TRUE,
                    j = .(product_id, date, count, price)]
#     product_id       date count price
#          <int>     <IDat> <int> <num>
#  1:          7 2024-01-08    50    NA
#  2:          2 2024-01-11   150  0.79
#  3:          1 2024-01-18    50  0.59
#  4:          1 2024-01-25   100  0.59
#  5:          3 2024-01-26   100    NA
#  6:          3 2024-02-02   150    NA
#  7:          2 2024-02-06   150  0.89
#  8:          7 2024-02-15   150    NA
#  9:          1 2024-02-27   150  0.63
# 10:          1 2024-03-08    50  0.65
```

- product_id and date are common to both tables
  - `.()` is used instead of `c()` so the column names can be given unquoted
- j is explicitly used to select specific columns
- To see only the matching cases, we would need to add the argument `nomatch = NULL` to perform an inner rolling join.
Conditionals
Ifelse using hour

```r
setkey(flights, hour)  # hour has values 0-24
flights[.(24), hour := 0L]
```

- Equivalent to `hour <- ifelse(hour == 24, 0L, hour)`
- Consequence: since a key column value has changed, hour is no longer a key
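A toy check of this key-dropping consequence (data assumed):

```r
library(data.table)
dt <- data.table(hour = c(23L, 24L, 1L))
setkey(dt, hour)
key(dt)                 # "hour"

dt[.(24L), hour := 0L]  # keyed subset + update by reference
key(dt)                 # NULL: updating a key column drops the key
```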
Pivoting
pivot_longer and melt
Basic
```r
relig_income |>
  pivot_longer(!religion,            # keep religion as a column
               names_to = "income",  # desired name for new column
               values_to = "count")  # what data goes into the new column?

melt(DT,
     id.vars = "religion",
     variable.name = "income",
     value.name = "count",
     variable.factor = FALSE)  # added to keep output consistent with tidyr
```
Columns have a common prefix and missing values are dropped
```r
billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  )

melt(DT,
     measure.vars = patterns("^wk"),
     variable.name = "week",
     value.name = "rank",
     na.rm = TRUE)
```
Multiple variables stored in column names
```r
who <- data.table(id = 1, new_sp_m5564 = 2, newrel_f65 = 3)
#       id new_sp_m5564 newrel_f65
#    <num>        <num>      <num>
# 1:     1            2          3

melt(who,
     measure.vars = measure(diagnosis, gender, ages,
                            pattern = "new_?(.*)_(.)(.*)"))
#       id diagnosis gender   ages value
#    <num>    <char> <char> <char> <num>
# 1:     1        sp      m   5564     2
# 2:     1       rel      f     65     3

# with tidyr
who |>
  tidyr::pivot_longer(
    cols = !id,
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "count")
# # A tibble: 2 × 5
#      id diagnosis gender age   count
#   <dbl> <chr>     <chr>  <chr> <dbl>
# 1     1 sp        m      5564      2
# 2     1 rel       f      65        3
```

- `tstrsplit` is DT’s `tidyr::separate`
Matrix to long
```r
anscombe |>
  pivot_longer(
    everything(),
    cols_vary = "slowest",
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )

DT[, melt(.SD,
          variable.name = "set",
          value.name = c("x", "y"),
          variable.factor = FALSE,
          measure.vars = patterns("^x", "^y"))]
```
pivot_wider and dcast
Data in examples
fish_encounters
## # A tibble: 114 × 3 ## fish station seen ## <fct> <fct> <int> ## 1 4842 Release 1 ## 2 4842 I80_1 1 ## 3 4842 Lisbon 1 ## 4 4842 Rstr 1 ## 5 4842 Base_TD 1 ## 6 4842 BCE 1 ## 7 4842 BCW 1 ## 8 4842 BCE2 1 ## 9 4842 BCW2 1 ## 10 4842 MAE 1 ## # … with 104 more rows
Basic
```r
fish_encounters |>
  pivot_wider(names_from = station,
              values_from = seen)

dcast(DT, fish ~ station, value.var = "seen")
```
Fill in missing values
```r
fish_encounters |>
  pivot_wider(names_from = station,
              values_from = seen,
              values_fill = 0)

dcast(DT, fish ~ station, value.var = "seen", fill = 0)

# alt
DT[, dcast(.SD, fish ~ station, value.var = "seen", fill = 0)]
```

- Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining (see it applied to `melt` above).
Generate column names from multiple variables
```r
us_rent_income |>
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )

dcast(DT, GEOID + NAME ~ variable, value.var = c("estimate", "moe"))

# alt
dcast(DT, ... ~ variable, value.var = c("estimate", "moe"))
```

- Alternative: pass `...` to indicate all other unspecified columns
Specify a different names separator
```r
us_rent_income |>
  pivot_wider(
    names_from = variable,
    names_sep = ".",
    values_from = c(estimate, moe)
  )

dcast(DT, GEOID + NAME ~ variable,
      value.var = c("estimate", "moe"), sep = ".")

# alt
DT[, dcast(.SD, GEOID + NAME ~ variable,
           value.var = c("estimate", "moe"), sep = ".")]
```

- Alternative: Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining (see it applied to `melt` above).
Controlling how column names are combined
```r
us_rent_income |>
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_vary = "slowest"
  ) |> names()

DT[, dcast(.SD, GEOID + NAME ~ variable,
           value.var = c("estimate", "moe"))
][, c(1:3, 5, 4, 6)] |> names()
## [1] "GEOID"           "NAME"            "estimate_income" "moe_income"
## [5] "estimate_rent"   "moe_rent"
```

- See the {tidyr::pivot_wider} docs and the names_vary arg
Aggregation
```r
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = mean
  )
## # A tibble: 3 × 3
##   tension     A     B
##   <fct>   <dbl> <dbl>
## 1 L        44.6  28.2
## 2 M        24    28.8
## 3 H        24.6  18.8

dcast(DT, tension ~ wool, value.var = "breaks", fun = mean)

# alt
DT[, dcast(.SD, tension ~ wool, value.var = "breaks", fun = mean)]
```

- Alternative: Rather than have the DT inside `dcast`, we can use .SD and have `dcast` inside DT, which is helpful for further chaining (see it applied to `melt` above).
tidyr::separate via tstrsplit

```r
dt <- data.table(x = c("00531725 Male 2021 Neg", "07640613 Female 2020 Pos"))
dt
#                           x
#                      <char>
# 1:   00531725 Male 2021 Neg
# 2: 07640613 Female 2020 Pos

cols <- c("personID", "gender", "year", "covidTest")

dt[, tstrsplit(x, split = " ", names = cols, type.convert = TRUE)]
#    personID gender  year covidTest
#       <int> <char> <int>    <char>
# 1:   531725   Male  2021       Neg
# 2:  7640613 Female  2020       Pos

dt[, tstrsplit(x, split = " ", names = cols,
               type.convert = list(as.character = 1,
                                   as.factor = c(2, 4),
                                   as.integer = 3))]
#    personID gender  year covidTest
#      <char> <fctr> <int>    <fctr>
# 1: 00531725   Male  2021       Neg
# 2: 07640613 Female  2020       Pos
```
User Defined Functions
env
```r
iris_dt <- as.data.table(iris)
square = function(x) x^2

iris_dt[filter_col %in% filter_val,
        .(var1, var2, out = outer(inner(var1) + inner(var2))),
        by = by_col,
        env = list(
          outer = "sqrt",
          inner = "square",
          var1 = "Sepal.Length",
          var2 = "Sepal.Width",
          out = "Sepal.Hypotenuse",
          filter_col = "Species",
          filter_val = I("versicolor"),
          by_col = "Species"
        )] |>
  head(n = 3)
#       Species Sepal.Length Sepal.Width Sepal.Hypotenuse
#        <fctr>        <num>       <num>            <num>
# 1: versicolor          7.0         3.2         7.696753
# 2: versicolor          6.4         3.2         7.155418
# 3: versicolor          6.9         3.1         7.564390
```
- Variables are included in the standard i, j, and by syntax
- env contains the (quoted) variable values
  - i.e., argument values in the typical R UDF syntax (`function(x = val1)`)
  - Can use other UDFs as values, which is demonstrated by `inner = "square"`
Recipes
Operations covered in this section
- group_by, summarize (and arrange)
- crosstab
group_by, summarize (and arrange)

```r
dt_res <- dtstudy[, .(n = .N,
                      avg = round(mean(y), 1)),
                  keyby = .(male, over65, rx)]

tb_study <- tibble::as_tibble(dtstudy)
tb_res <- tb_study |>
  summarize(n = n(),
            avg = round(mean(y), 1),
            .by = c(male, over65, rx)) |>
  arrange(male, over65, rx)
```

- dt automatically orders by the grouping variables (via keyby), so to get the exact output, you have to add an `arrange`
Crosstab using cube (Titanic5 dataset)

```r
# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm = TRUE)

# Create a function that counts the number of non-NA values
Nna <- function(x) sum(!is.na(x))

cube(d,
     .(Proportion = mn(survived), N = Nna(survived)),
     by = .q(sex, class),
     id = TRUE)
#>     grouping    sex class Proportion     N
#>  1:        0 female     1  0.9652778   144
#>  2:        0   male     1  0.3444444   180
#>  3:        0   male     2  0.1411765   170
#>  4:        0 female     2  0.8867925   106
#>  5:        0   male     3  0.1521298   493
#>  6:        0 female     3  0.4907407   216
#>  7:        1 female    NA  0.7274678   466
#>  8:        1   male    NA  0.1909846   843
#>  9:        2   <NA>     1  0.6203704   324
#> 10:        2   <NA>     2  0.4275362   276
#> 11:        2   <NA>     3  0.2552891   709
#> 12:        3   <NA>    NA  0.3819710  1309
```