DuckDB
Misc
- Also see
- High-performance embedded database for analytics that provides a few enhancements over SQLite, such as increased speed and support for a larger number of columns
- Column storage is optimal for analytics since only certain columns are typically required in a query.
- Faster than SQLite for most analytics queries (sums, aggregates, etc.)
- Vectorizes query execution (columnar-oriented), while other DBMSs (SQLite, PostgreSQL, …) process each row sequentially
- DuckDB’s memory usage is more optimal for analytics
- DuckDB only loads vectorized columns of data it needs to process at a given moment (i.e. lazy loading), which helps manage memory efficiently.
- If your data exceeds available RAM, DuckDB will intelligently manage memory and use disk as needed, though this might result in slower performance compared to fully in-memory operations
- Relational DBs (e.g. Postgres) bring row-based data pages into memory using a caching mechanism.
- When a query accesses data, it retrieves entire data pages (which may include rows you don’t need) into the cache. These pages are stored in memory to avoid repeated disk access.
- While this method can handle large datasets, it may involve more disk I/O and memory consumption due to the nature of page-level data loading.
- Unlike some other big data tools, it is entirely self-contained (aka embedded, in-process)
- No external dependencies, or server software to install, update, or maintain
- Can directly run queries on Parquet files, CSV files, SQLite files, PostgreSQL databases, Pandas, R, and Julia data frames, as well as Apache Arrow sources (see the sketch at the end of this Misc list)
- Resources
- Awesome DuckDB - Curated list of libraries, tools, and resources.
- Cooking with DuckDB - hrbrmstr’s recipes
- Extensions
- Packages
- Tools
- SQL Workbench - Query parquet files locally or remotely. Can also produce charts of results. Uses DuckDB-WASM so browser based.
- Tutorial - Along with explaining the features of the tool, it has complete normalization example and analysis.
- For visualizations, click the configure button on the right side of the Results sections (bottom main), click Data Grid, choose a chart type, drag column names from the bottom to various areas (similar to Tableau). Click the Reset button in the toolbar close to the configure button to return to Table mode.
- For tables, if you right-click their name in the Schema pane (far-left), you get a list of options including Summarize which gives summary stats along with uniques and null % for missing data.
- If tables have foreign keys, data models can be visualized in a mermaid diagram by clicking Data Modes in the bottom-left of the schema panel
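A minimal sketch from R of the points above: querying a registered data frame and a Parquet file in place, and capping memory so larger-than-RAM work spills to disk. The file name "flights.parquet" and the 1GB limit are placeholders, not from the original notes.
library(duckdb)
con <- dbConnect(duckdb())
# cap memory so larger-than-RAM work spills to disk instead of failing
dbExecute(con, "SET memory_limit = '1GB';")
# register an R data frame as a virtual table (no copy) and query it
duckdb_register(con, "mtcars_tbl", mtcars)
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars_tbl GROUP BY cyl")
# query a Parquet file directly without importing it first
dbGetQuery(con, "SELECT COUNT(*) AS n FROM 'flights.parquet'")
dbDisconnect(con, shutdown = TRUE)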
Set-up
- Installation:
install.packages("duckdb")
- Check version:
duckdb:::sql("SELECT version()")
- Create db and populate table from csv
Example
# includes filename/id
withr::with_dir("data-raw/files/", {
  dbSendQuery(
    con,
    "
    CREATE TABLE files AS
      SELECT
        *,
        regexp_extract(filename, '\\d{7}') AS file_number
      FROM read_csv_auto('*Control*File-*.txt', FILENAME = TRUE);"
  )
})
CLI
Start interactive shell:
duckdb
Start interactive shell on database file:
duckdb path\to\file
Query csv, json, or parquet file directly
duckdb -c "SELECT * FROM 'data_source.[csv|csv.gz|json|json.gz|parqet]'"
Run SQL script:
duckdb -c ".read path\to\script.sql"
d/dbplyr
Example: Connect, Read in Parallel, and Summarize
con <- dbConnect(duckdb(), ":memory:")
df <- dplyr::tbl(
  con,
  paste0("read_csv('", file_name, "',
          parallel = true,
          delim = ',',
          header = true,
          columns = {'measurement': 'DOUBLE', 'state': 'VARCHAR'})"),
  check_from = FALSE
)
df <- df |>
  summarize(
    .by = state,
    mean = mean(measurement),
    min = min(measurement),
    max = max(measurement)
  ) |>
  collect()
df <- NULL
dbDisconnect(con, shutdown = TRUE)
gc()
- Competitive with running the operation in SQL
Example: Connect to db; Write a df to table; Query it
library(dbplyr)
duck = DBI::dbConnect(duckdb::duckdb(), dbdir = "duck.db", read_only = FALSE)
DBI::dbWriteTable(duck, name = "sales", value = sales)
sales_duck <- tbl(duck, "sales")
sales_duck %>%
  group_by(year, SKU) %>%
  mutate(pos_sales = case_when(
    sales_units > 0 ~ sales_units,
    TRUE ~ 0)) %>%
  summarize(total_revenue = sum(sales_units * item_price_eur),
            max_order_price = max(pos_sales * item_price_eur),
            avg_price_SKU = mean(item_price_eur),
            items_sold = n())
DBI::dbDisconnect(duck)
duckplyr
Misc
- Currently can't connect to a duckdb database directly (see the dbplyr section). Only uses duckdb as a backend to dplyr for files (e.g. parquet, csv) that have been read and virtualized, or for an in-memory data frame.
- Using library(duckplyr) will overwrite all {dplyr} verbs with the {duckplyr} versions
- So any {dplyr} features not covered by {duckplyr} won't work
- methods_restore() will return the verbs to the standard {dplyr} functions
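A quick sketch of toggling that behavior, assuming a {duckplyr} version that exposes the methods_overwrite()/methods_restore() pair:
library(duckplyr)    # attaching overwrites the {dplyr} verbs
methods_restore()    # go back to the standard {dplyr} implementations
methods_overwrite()  # opt back in to the {duckplyr} versions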
Read files (not in memory)
duckplyr_df <- duckplyr_df_from_parquet("flights.parquet")
dat <- duckplyr_df_from_csv("Data8277.csv")
- Detects the schemas
- Creates placeholder objects
Collect Results
With library(duckplyr):
library(duckplyr)
dat <- palmerpenguins::penguins |>
  mutate(bill_area = bill_length_mm * bill_depth_mm) |>
  as_duckplyr_tibble()
- as_duckplyr_tibble() triggers the usage of DuckDB for the calculation
- If the DuckDB calculation is not possible, then the original dplyr implementation is used
- Or use glimpse(), print(), or nrow() if you don't want to save the results to an object
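For instance, a sketch mirroring the example above, where nrow() forces the DuckDB computation without assigning the result to an object:
library(duckplyr)
palmerpenguins::penguins |>
  mutate(bill_area = bill_length_mm * bill_depth_mm) |>
  as_duckplyr_tibble() |>
  nrow()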
Without library(duckplyr), you can still use duckdb as a backend.
out <- palmerpenguins::penguins %>%
  # CAVEAT: factor columns are not supported yet
  mutate(across(where(is.factor), as.character)) %>%
  duckplyr::as_duckplyr_tibble() %>%
  mutate(bill_area = bill_length_mm * bill_depth_mm) %>%
  summarize(.by = c(species, sex),
            mean_bill_area = mean(bill_area)) %>%
  filter(species != "Gentoo")
- as_duckplyr_tibble() triggers the usage of DuckDB for the calculation, but verbs after that line function like normal {dplyr} verbs.
Show query plan with explain()
library("duckplyr") as_duckplyr_df(data.frame(n=1:10)) |> mutate(m = n + 1) |> filter (m > 5) |> count() |> explain() #> ┌───────────────────────────┐ #> │ PROJECTION │ #> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ #> │ n │ #> └─────────────┬─────────────┘ #> ┌─────────────┴─────────────┐ #> │ UNGROUPED_AGGREGATE │ #> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ #> │ count_star() │ #> └─────────────┬─────────────┘ #> ┌─────────────┴─────────────┐ #> │ FILTER │ #> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ #> │(+(CAST(n AS DOUBLE), 1.0) │ #> │ > 5.0) │ #> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ #> │ EC: 10 │ #> └─────────────┬─────────────┘ #> ┌─────────────┴─────────────┐ #> │ R_DATAFRAME_SCAN │ #> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ #> │ data.frame │ #> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ #> │ n │ #> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ #> │ EC: 10 │ #> └───────────────────────────┘
- Query plan consists of a scan, a filter, projections and an aggregate.
Expressions that aren't supported won't have "PROJECTION" in the query plan
as_duckplyr_df(data.frame(n = 1:10)) |>
  mutate(m = (\(x) x + 1)(n)) |>
  explain()
#> ┌───────────────────────────┐
#> │      R_DATAFRAME_SCAN     │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │         data.frame        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             n             │
#> │             m             │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │           EC: 10          │
#> └───────────────────────────┘
- The translation of the anonymous function failed, which caused it to fall back to computing in the R engine.
- There will usually be a performance hit from the fallback due to, for example, the lack of automatic parallelization
Apache Arrow
- to_duckdb() and to_arrow(): Convert between the {arrow} engine and the {duckdb} engine in a workflow without paying any cost to (re)serialize the data when you pass it back and forth
- Useful in cases where something is supported in one of Arrow or DuckDB but not the other
Benefits
- Utilization of a parallel vectorized execution engine without requiring any extra data copying
- Larger Than Memory Analysis: Since both libraries support streaming query results, we are capable of executing on data without fully loading it from disk. Instead, we can execute one batch at a time. This allows us to execute queries on data that is bigger than memory.
- Complex Data Types: DuckDB can efficiently process complex data types that can be stored in Arrow vectors, including arbitrarily nested structs, lists, and maps.
- Advanced Optimizer: DuckDB’s state-of-the-art optimizer can push down filters and projections directly into Arrow scans. As a result, only relevant columns and partitions will be read, allowing the system to e.g., take advantage of partition elimination in Parquet files. This significantly accelerates query execution.
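A minimal round-trip sketch of handing work back and forth between the two engines; the in-memory Arrow table built from mtcars is just a stand-in for a real arrow::open_dataset() source.
library(arrow)
library(dplyr)
library(duckdb)
tbl_arrow <- arrow::arrow_table(mtcars)   # stand-in for a real Arrow dataset
tbl_arrow %>%
  to_duckdb() %>%                         # DuckDB engine takes over without copying the data
  filter(cyl > 4) %>%
  to_arrow() %>%                          # hand the lazy result back to the Arrow engine
  summarise(mean_mpg = mean(mpg)) %>%
  collect()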
Example (using a SQL Query; method 1)
# open dataset
ds <- arrow::open_dataset(dir_out, partitioning = "species")
# open connection to DuckDB
con <- dbConnect(duckdb::duckdb())
# register the dataset as a DuckDB table, and give it a name
duckdb::duckdb_register_arrow(con, "my_table", ds)
# query
dbGetQuery(con, "
  SELECT sepal_length, COUNT(*) AS n
  FROM my_table
  WHERE species = 'species=setosa'
  GROUP BY sepal_length
")
# clean up
duckdb_unregister(con, "my_table")
dbDisconnect(con)
- When filtering on a partition, the WHERE format is '<partition_variable>=<partition_value>'
Example (using SQL Query; method 2)
library(duckdb)
library(arrow)
library(dplyr)
# Reads Parquet File to an Arrow Table
arrow_table <- arrow::read_parquet("integers.parquet", as_data_frame = FALSE)
# Gets Database Connection
con <- dbConnect(duckdb::duckdb())
# Registers arrow table as a DuckDB view
arrow::to_duckdb(arrow_table, table_name = "arrow_table", con = con)
# we can run a SQL query on this and print the result
print(dbGetQuery(con, "SELECT SUM(data) FROM arrow_table WHERE data > 50"))
# Transforms Query Result from DuckDB to Arrow Table
result <- dbSendQuery(con, "SELECT * FROM arrow_table")
Example (using dplyr)
library(duckdb)
library(arrow)
library(dplyr)
# Open dataset using year,month folder partition
ds <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
ds %>%
  # Look only at 2015 on, where the number of passengers is positive, the trip distance is
  # greater than a quarter mile, and where the fare amount is positive
  filter(year > 2014 & passenger_count > 0 & trip_distance > 0.25 & fare_amount > 0) %>%
  # Pass off to DuckDB
  to_duckdb() %>%
  group_by(passenger_count) %>%
  mutate(tip_pct = tip_amount / fare_amount) %>%
  summarize(
    fare_amount = mean(fare_amount, na.rm = TRUE),
    tip_amount = mean(tip_amount, na.rm = TRUE),
    tip_pct = mean(tip_pct, na.rm = TRUE)
  ) %>%
  arrange(passenger_count) %>%
  collect()
- In the docs, the example has to_duckdb() after the group_by(). Not sure if that makes a difference in speed.
Example (Streaming Data)
# Reads dataset partitioning it in year/month folder
nyc_dataset = open_dataset("nyc-taxi/", partitioning = c("year", "month"))
# Gets Database Connection
con <- dbConnect(duckdb::duckdb())
# We can use the same function as before to register our arrow dataset
duckdb::duckdb_register_arrow(con, "nyc", nyc_dataset)
res <- dbSendQuery(con, "SELECT * FROM nyc", arrow = TRUE)
# DuckDB's queries can now produce a Record Batch Reader
record_batch_reader <- duckdb::duckdb_fetch_record_batch(res)
# Which means we can stream the whole query per batch.
# This retrieves the first batch
cur_batch <- record_batch_reader$read_next_batch()
SQL
CSV Parser
Read first 10 lines
:::sql("FROM Data8277.csv LIMIT 10") duckdb
Print schema
:::sql("DESCRIBE FROM Data8277.csv") duckdb#> column_name column_type ... #> 1 Year BIGINT ... #> 2 Age VARCHAR ... #> 3 Ethnic BIGINT ... #> 4 Sex BIGINT ... #> 5 Area VARCHAR ... #> 6 count VARCHAR ...
Summary Stats
:::sql("SUMMARIZE FROM Data8277.csv") duckdb#> # A tibble: 6 × 12 #> column_name column_type min max approx_unique avg std q25 q50 #> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> #> 1 Year BIGINT 2006 2018 3 2012.33… 4.92… 2006 2013 #> 2 Age VARCHAR 000 999999 149 NA NA NA NA #> 3 Ethnic BIGINT 1 9999 11 930.545… 2867… 3 6 #> 4 Sex BIGINT 1 9 3 4.0 3.55… 1 2 #> 5 Area VARCHAR 001 DHB9999 2048 NA NA NA NA #> 6 count VARCHAR ..C 9999 16825 NA NA NA NA # ℹ 3 more variables: q75 <chr>, count <dbl>, null_percentage <dbl>
Connect to a duckdb file and read out of memory
con_ro <- dbConnect(duckdb::duckdb(), dbdir = "pixar.duckdb", read_only = TRUE)
my_pixar_films <- tbl(con_ro, "pixar_films")
Example: Connect, Read in Parallel, and Query
sqltxt <- paste0(
  "select state,
          min(measurement) as min_m,
          max(measurement) as max_m,
          avg(measurement) as mean_m
   from read_csv('", file_name, "',
        parallel = true,
        delim = ',',
        header = true,
        columns = {'measurement': 'DOUBLE', 'state': 'VARCHAR'})
   group by state"
)
con <- dbConnect(duckdb(), dbdir = ":memory:")
dbGetQuery(con, sqltxt)
dbDisconnect(con, shutdown = TRUE)
gc()
- Fastest method besides polars for running this operation in this benchmark
Star Expressions
Allows you to dynamically select columns
-- select all columns present in the FROM clause
SELECT * FROM table_name;
-- select all columns from the table called "table_name"
SELECT table_name.* FROM table_name JOIN other_table_name USING (id);
-- select all columns except the city column from the addresses table
SELECT * EXCLUDE (city) FROM addresses;
-- select all columns from the addresses table, but replace city with LOWER(city)
SELECT * REPLACE (LOWER(city) AS city) FROM addresses;
-- select all columns matching the given expression
SELECT COLUMNS(c -> c LIKE '%num%') FROM addresses;
-- select all columns matching the given regex from the table
SELECT COLUMNS('number\d+') FROM addresses;
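A quick usage sketch from R; star expressions work on any relation, so a registered data frame stands in here for the addresses table used above.
con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_register(con, "cars", mtcars)
# keep only columns whose names contain a "p"
DBI::dbGetQuery(con, "SELECT COLUMNS(c -> c LIKE '%p%') FROM cars LIMIT 3")
# everything except the cyl column
DBI::dbGetQuery(con, "SELECT * EXCLUDE (cyl) FROM cars LIMIT 3")
DBI::dbDisconnect(con, shutdown = TRUE)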
Remote Connections
- Misc
- Notes from
- httpfs Extension
Create a db in memory since the data is stored remotely.
conn <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = ":memory:"
)
Install and Load httpfs extension
::dbExecute(conn, "INSTALL httpfs;") DBI::dbExecute(conn, "LOAD httpfs;") DBI
- Currently not available for Windows
Query
<- "url_to_parquet_files" parquet_url <- DBI::dbGetQuery( res conn, ::glue("SELECT carrier, flight, tailnum, year FROM '{parquet_url}' WHERE year = 2013 LIMIT 100") glue )
- Queries that need more data and return more rows take longer to run, especially when transmitting data over the Internet. Craft your queries carefully with this in mind.
To use {dplyr}, a View must first be created
DBI::dbExecute(conn,
  glue::glue("CREATE VIEW flights AS SELECT * FROM PARQUET_SCAN('{parquet_url}')"))
DBI::dbListTables(conn)
#> [1] "flights"
tbl(conn, "flights") %>%
  group_by(month) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  collect()
Close connection:
DBI::dbDisconnect(conn, shutdown = TRUE)
- {duckdbfs}
Create dataset object
<- "url_to_parquet_files" #e.g. AWS S3 parquet_url <- duckdbfs::open_dataset(parquet_url) ds
Query
ds %>%
  group_by(month) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  collect()
Extensions
- Community Extensions Repository (Intro)
Install
INSTALL extension_name FROM community;
Use
LOAD extension_name;
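The same thing from R via {DBI}; extension_name is a placeholder, as above.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "INSTALL extension_name FROM community;")
DBI::dbExecute(con, "LOAD extension_name;")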
- VS Code extension
- Connect to a local DuckDB instance
- Create new in-memory DuckDB instance
- View DuckDB tables, columns, and views
- Run SQL queries on open DuckDB connections
- Attach SQLite database files to in-memory DuckDB instances
- Query remote CSV and Parquet data files with DuckDB HTTPFS extension
- Create in-memory DuckDB tables from remote data sources and query results
- Manage DuckDB connections in SQLTools Database Explorer
- Autocomplete SQL keywords, table names, column names, and view names on open database connections in VSCode SQL editor
- Save named SQL query Bookmarks
- Use SQL Query History
- Export SQL query results in CSV and JSON data formats
- Integrate with the equally spiffy SQL Tools extension
- JSON extension
Example: From hrbrmstr drop
INSTALL 'json';
LOAD 'json';

COPY (
  SELECT * FROM (
    SELECT DISTINCT
      cve_id,
      unnest(
        regexp_split_to_array(
          concat_ws(
            ',',
            regexp_extract(case when cweId1 IS NOT NULL THEN cweId1 ELSE regexp_replace(json_extract_string(problem1, '$.description'), '[: ].*$', '') END, '^(CWE-[0-9]+)', 0),
            regexp_extract(case when cweId2 IS NOT NULL THEN cweId2 ELSE regexp_replace(json_extract_string(problem2, '$.description'), '[: ].*$', '') END, '^(CWE-[0-9]+)', 0)
          ),
          ','
        )
      ) AS cwe_id
    FROM (
      SELECT
        json_extract_string(cveMetadata, '$.cveId') AS cve_id,
        json_extract(containers, '$.cna.problemTypes[0].descriptions[0]') AS problem1,
        json_extract(containers, '$.cna.problemTypes[0].descriptions[1]') AS problem2,
        json_extract_string(containers, '$.cna.problemTypes[0].cweId[0]') AS cweId1,
        json_extract_string(containers, '$.cna.problemTypes[0].cweId[1]') AS cweId2
      FROM read_json_auto("/data/cvelistV5/cves/*/*/*.json", ignore_errors = true)
    )
    WHERE json_extract_string(problem2, '$.type') = 'CWE' OR (json_extract_string(problem1, '$.type') = 'CWE')
  )
  WHERE cwe_id LIKE 'CWE-%'
) TO '/data/summaries/cve-to-cwe.csv' (HEADER, DELIMITER ',')
- Processes nested JSON
- Clone the CVE list repo, modify the directory paths, and run it. It burns through nearly 220K hideous JSON files in mere seconds, even with some complex JSON operations.
DBs
-
- Might need to use FORCE INSTALL postgres
Allows DuckDB to connect to these other database systems (SQLite, PostgreSQL, MySQL) and operate on them in the same way that it operates on its own native storage engine.
Use Cases
- Export data from SQLite to JSON
- Read data from Parquet into Postgres
- Move data from MySQL to Postgres
- Deleting rows, updating values, or altering the schema of a table in another DB
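A sketch of the first use case from R, assuming a sakila.db SQLite file exists and the sqlite and json extensions are available.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "ATTACH 'sqlite:sakila.db' AS sakila;")
# export a SQLite table straight to JSON; the output format is inferred from the file extension
DBI::dbExecute(con, "COPY sakila.film TO 'film.json';")
DBI::dbDisconnect(con, shutdown = TRUE)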
- Notes from Multi-Database Support in DuckDB
- Has other examples including transaction operations
Example: Open SQLite db file
ATTACH 'sakila.db' AS sakila (TYPE sqlite);
SELECT title, release_year, length FROM sakila.film LIMIT 5;
ATTACH opens the db file and TYPE says that it's a SQLite db file
Multiple dbs without using TYPE:
ATTACH 'sqlite:sakila.db' AS sqlite;
ATTACH 'postgres:dbname=postgresscanner' AS postgres;
ATTACH 'mysql:user=root database=mysqlscanner' AS mysql;
In python
import duckdb
con = duckdb.connect('sqlite:file.db')
Example: Switch between attached dbs
USE sakila;
SELECT first_name, last_name FROM actor LIMIT 5;
USE switches from the previous db to the "sakila" db
Example: View all attached dbs
SELECT database_name, path, type FROM duckdb_databases;
Example: Copy table from one db type to another
CREATE TABLE mysql.film AS FROM sqlite.film;
CREATE TABLE postgres.actor AS FROM sqlite.actor;
Example: Joins
SELECT first_name, last_name
FROM mysql.film
JOIN sqlite.film_actor ON (film.film_id = film_actor.film_id)
JOIN postgres.actor ON (actor.actor_id = film_actor.actor_id)
WHERE title = 'ACE GOLDFINGER';
-
- GSheets
- Repo
- Extension for reading and writing Google Sheets with SQL