DuckDB

Misc

  • Also see
  • High-performance embedded database for analytics that provides a few enhancements over SQLite, such as increased speed and support for a larger number of columns
    • Column storage is optimal for analytics since only certain columns are typically required in a query.
    • Faster than SQLite for most analytics queries (sums, aggregates, etc.)
      • Vectorizes query execution (columnar-oriented), while other DBMSs (SQLite, PostgreSQL, etc.) process each row sequentially
    • DuckDB’s memory usage is more optimal for analytics
      • DuckDB only loads vectorized columns of data it needs to process at a given moment (i.e. lazy loading), which helps manage memory efficiently.
        • If your data exceeds available RAM, DuckDB will intelligently manage memory and use disk as needed, though this might result in slower performance compared to fully in-memory operations
      • Relational DBs (e.g. Postgres) bring row-based data pages into memory using a caching mechanism.
        • When a query accesses data, it retrieves entire data pages (which may include rows you don’t need) into the cache. These pages are stored in memory to avoid repeated disk access.
        • While this method can handle large datasets, it may involve more disk I/O and memory consumption due to the page-level data loading.
  • Unlike some other big data tools it is entirely self-contained. (aka embedded, in-process)
    • No external dependencies, or server software to install, update, or maintain
  • Can directly run queries on Parquet files, CSV files, SQLite files, Postgres databases, Pandas, R, and Julia data frames, as well as Apache Arrow sources (see the example at the end of this section)
  • Resources
  • Extensions
  • Packages
    • {duckdb} - DuckDB R API
    • {duckplyr} - {dplyr}-compatible API
    • {duckdbfs} - A simple wrapper around the duckdb package to facilitate working with the construction of a single lazy table (SQL connection) from a set of file paths, URLs, or S3 URIs.
  • Tools
    • SQL Workbench - Query parquet files locally or remotely. Can also produce charts of results. Uses DuckDB-WASM so browser based.
      • Tutorial - Along with explaining the features of the tool, it has complete normalization example and analysis.
      • For visualizations, click the configure button on the right side of the Results sections (bottom main), click Data Grid, choose a chart type, drag column names from the bottom to various areas (similar to Tableau). Click the Reset button in the toolbar close to the configure button to return to Table mode.
      • For tables, if you right-click their name in the Schema pane (far-left), you get a list of options including Summarize which gives summary stats along with uniques and null % for missing data.
      • If tables have foreign keys, data models can be visualized in a mermaid diagram by clicking Data Modes in the bottom-left of the schema panel
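  • Example (a minimal sketch; the Parquet path is hypothetical): querying an R data frame and a Parquet file in place, without importing either into the database

    library(DBI)

    con <- dbConnect(duckdb::duckdb())

    # query an R data frame by registering it as a virtual table (no copy is made)
    duckdb::duckdb_register(con, "mtcars_tbl", mtcars)
    dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars_tbl GROUP BY cyl")

    # query a Parquet file in place by referencing its path
    dbGetQuery(con, "SELECT COUNT(*) AS n FROM 'data/flights.parquet'")

    dbDisconnect(con, shutdown = TRUE)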

Set-up

  • Installation: install.packages("duckdb")
  • Check version: duckdb:::sql("SELECT version()")
  • Create db and populate table from csv
    • Example
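
      • A minimal sketch (the file path and table name are hypothetical): create a file-backed database and let DuckDB's CSV reader infer the schema

      library(DBI)
      library(duckdb)

      con <- dbConnect(duckdb(), dbdir = "my_db.duckdb")
      dbExecute(con, "CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales.csv')")
      dbGetQuery(con, "SELECT COUNT(*) AS n FROM sales")
      dbDisconnect(con, shutdown = TRUE)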

    • Example

      # includes filename/id
      withr::with_dir("data-raw/files/", {
        dbExecute(
          con, "
          CREATE TABLE files AS
          SELECT *, regexp_extract(filename, '\\d{7}') AS file_number
          FROM read_csv_auto('*Control*File-*.txt', FILENAME = TRUE);"
        )
      })

CLI

  • Start interactive shell: duckdb

  • Start interactive shell on database file: duckdb path\to\file

  • Query csv, json, or parquet file directly

    duckdb -c "SELECT * FROM 'data_source.[csv|csv.gz|json|json.gz|parquet]'"
  • Run SQL script: duckdb -c ".read path\to\script.sql"

dbplyr

  • Example: Connect, Read in Parallel, and Summarize

    library(DBI)
    library(duckdb)
    library(dplyr)

    con <- 
      dbConnect(duckdb(), 
                dbdir = ":memory:")
    # lazy table built on DuckDB's parallel CSV reader
    df <- 
      dplyr::tbl(con, 
                 paste0("read_csv('",
                        file_name,
                        "',
                        parallel = true,
                        delim = ',',
                        header = true,
                        columns = {
                            'measurement': 'DOUBLE',
                            'state': 'VARCHAR'
                        })"), 
                 check_from = FALSE)
    df <- df |>
      summarize(
        .by = state,
        mean = mean(measurement),
        min = min(measurement),
        max = max(measurement)
      ) |>
      collect()
    df <- NULL
    dbDisconnect(con, shutdown = TRUE)
    gc()
    • Competitive with running the operation in SQL
  • Example: Connect to db; Write a df to table; Query it

    library(dplyr)
    library(dbplyr)
    
    duck <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duck.db", read_only = FALSE)
    DBI::dbWriteTable(duck, name = "sales", value = sales)
    sales_duck <- tbl(duck, "sales")
    
    sales_duck %>%
      group_by(year, SKU) %>%
      mutate(pos_sales = case_when(
              sales_units > 0 ~ sales_units,
              TRUE ~ 0)) %>%
      summarize(total_revenue = sum(sales_units * item_price_eur),
                max_order_price = max(pos_sales * item_price_eur),
                avg_price_SKU = mean(item_price_eur),
                items_sold = n())
    
    DBI::dbDisconnect(duck, shutdown = TRUE)

duckplyr

  • Misc

    • Currently can't connect to a duckdb database directly (see the dbplyr section). Only uses duckdb as a backend to dplyr for files (e.g., parquet, csv) that have been read and virtualized, or for an in-memory data frame.
    • Using library(duckplyr) will overwrite all {dplyr} verbs to the duckplyr versions
      • So any {dplyr} features not covered by {duckplyr} won’t work
      • methods_restore() returns the verbs to the standard dplyr functions
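        • A quick sketch of toggling this behavior (both functions are from {duckplyr}):

          library(duckplyr)    # attaching overwrites the dplyr verbs
          methods_restore()    # return to the standard dplyr implementations
          methods_overwrite()  # switch back to the duckplyr versions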
  • Read files (not in memory)

    duckplyr_df <- duckplyr_df_from_parquet("flights.parquet")
    dat   <- duckplyr_df_from_csv("Data8277.csv")
    • Detects the schemas
    • Creates placeholder objects
  • Collect Results

    • With library(duckplyr)

      library(duckplyr)
      dat <- 
        palmerpenguins::penguins |>
        mutate(bill_area = bill_length_mm * bill_depth_mm) |>
        as_duckplyr_tibble()
      • as_duckplyr_tibble triggers the usage of DuckDB for the calculation

        • If the DuckDB calculation is not possible, then the original dplyr implementation is used
      • Or use glimpse, print or nrow if you don’t want to save the results to an object

    • Without library(duckplyr), you can still use duckdb as a backend.

      out <-
        palmerpenguins::penguins %>%
        # CAVEAT: factor columns are not supported yet
        mutate(across(where(is.factor), as.character)) %>%
        duckplyr::as_duckplyr_tibble() %>%
        mutate(bill_area = bill_length_mm * bill_depth_mm) %>%
        summarize(.by = c(species, sex), mean_bill_area = mean(bill_area)) %>%
        filter(species != "Gentoo")
      • as_duckplyr_tibble triggers the usage of DuckDB for the calculation, but verbs after that line function like normal {dplyr} verbs.
  • Show query plan with explain

    library("duckplyr")
    as_duckplyr_df(data.frame(n=1:10)) |> 
      mutate(m = n + 1) |> 
      filter (m > 5) |> 
      count() |> 
      explain()
    
    #> ┌───────────────────────────┐
    #> │         PROJECTION        │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │             n             │
    #> └─────────────┬─────────────┘                             
    #> ┌─────────────┴─────────────┐
    #> │    UNGROUPED_AGGREGATE    │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │        count_star()       │
    #> └─────────────┬─────────────┘                            
    #> ┌─────────────┴─────────────┐
    #> │           FILTER          │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │(+(CAST(n AS DOUBLE), 1.0) │
    #> │           > 5.0)          │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │           EC: 10          │
    #> └─────────────┬─────────────┘                             
    #> ┌─────────────┴─────────────┐
    #> │     R_DATAFRAME_SCAN      │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │         data.frame        │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │             n             │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │           EC: 10          │
    #> └───────────────────────────┘
    • The query plan consists of a scan, a filter, a projection, and an aggregate.
  • Expressions that aren't supported won't have "Projection" in the query plan

    as_duckplyr_df(data.frame(n=1:10)) |> 
      mutate(m=(\(x) x+1)(n)) |> 
      explain()
    
    #> ┌───────────────────────────┐
    #> │     R_DATAFRAME_SCAN      │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │         data.frame        │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │             n             │
    #> │             m             │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │           EC: 10          │
    #> └───────────────────────────┘
    • The translation of the anonymous function failed, which caused a fallback to computing in the R engine.
    • There is usually a performance hit from the fallback due to, for example, the loss of automatic parallelization.

Apache Arrow

  • to_duckdb() and to_arrow(): Convert between the {arrow} engine and the {duckdb} engine in a workflow without paying any cost to (re)serialize the data when passing it back and forth (see the sketch below)

    • Useful in cases where something is supported in one of Arrow or DuckDB but not the other
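    • A rough sketch of a round trip (assumes the nyc-taxi dataset used in the examples below): start with the arrow engine, hand the computation to DuckDB, then hand the result back to arrow

      library(arrow)
      library(dplyr)

      ds <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))

      ds %>%
        to_duckdb() %>%                                 # hand off to DuckDB (no copy)
        mutate(tip_pct = tip_amount / fare_amount) %>%  # computed by DuckDB
        to_arrow() %>%                                  # hand back to the arrow engine (no copy)
        filter(tip_pct > 0.2) %>%
        collect()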
  • Benefits

    • Utilization of a parallel vectorized execution engine without requiring any extra data copying
    • Larger Than Memory Analysis: Since both libraries support streaming query results, we are capable of executing on data without fully loading it from disk. Instead, we can execute one batch at a time. This allows us to execute queries on data that is bigger than memory.
    • Complex Data Types: DuckDB can efficiently process complex data types that can be stored in Arrow vectors, including arbitrarily nested structs, lists, and maps.
    • Advanced Optimizer: DuckDB’s state-of-the-art optimizer can push down filters and projections directly into Arrow scans. As a result, only relevant columns and partitions will be read, allowing the system to e.g., take advantage of partition elimination in Parquet files. This significantly accelerates query execution.
  • Example (using a SQL Query; method 1)

    # open dataset
    ds <- arrow::open_dataset(dir_out, partitioning = "species")
    # open connection to DuckDB
    con <- dbConnect(duckdb::duckdb())
    # register the dataset as a DuckDB table, and give it a name
    duckdb::duckdb_register_arrow(con, "my_table", ds)
    # query
    dbGetQuery(con, "
      SELECT sepal_length, COUNT(*) AS n
      FROM my_table
      WHERE species = 'species=setosa'
      GROUP BY sepal_length
    ")
    
    # clean up
    duckdb_unregister(con, "my_table")
    dbDisconnect(con)
    • When filtering on a partition column, the WHERE value takes the form '<partition_variable>=<partition_value>' (i.e., the raw partition directory name)
  • Example (using SQL Query; method 2)

    library(duckdb)
    library(arrow)
    library(dplyr)
    
    # Reads Parquet File to an Arrow Table
    arrow_table <- arrow::read_parquet("integers.parquet", as_data_frame = FALSE)
    
    # Gets Database Connection
    con <- dbConnect(duckdb::duckdb())
    
    # Registers arrow table as a DuckDB view
    arrow::to_duckdb(arrow_table, table_name = "arrow_table", con = con)
    
    # we can run a SQL query on this and print the result
    print(dbGetQuery(con, "SELECT SUM(data) FROM arrow_table WHERE data > 50"))
    
    # Transform the query result from DuckDB into an Arrow Table
    res <- dbSendQuery(con, "SELECT * FROM arrow_table", arrow = TRUE)
    result <- duckdb::duckdb_fetch_arrow(res)
  • Example (using dplyr)

    library(duckdb)
    library(arrow)
    library(dplyr)
    
    # Open dataset using year,month folder partition
    ds <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
    
    ds %>%
      # Look only at 2015 on, where the number of passenger is positive, the trip distance is
      # greater than a quarter mile, and where the fare amount is positive
      filter(year > 2014 & passenger_count > 0 & trip_distance > 0.25 & fare_amount > 0) %>%
      # Pass off to DuckDB
      to_duckdb() %>%
      group_by(passenger_count) %>%
      mutate(tip_pct = tip_amount / fare_amount) %>%
      summarize(
        fare_amount = mean(fare_amount, na.rm = TRUE),
        tip_amount = mean(tip_amount, na.rm = TRUE),
        tip_pct = mean(tip_pct, na.rm = TRUE)
      ) %>%
      arrange(passenger_count) %>%
      collect()
    • In the docs, the example has to_duckdb after the group_by. Not sure if that makes a difference in speed. 
  • Example (Streaming Data)

    # Reads dataset partitioning it in year/month folder
    nyc_dataset <- arrow::open_dataset("nyc-taxi/", partitioning = c("year", "month"))
    
    # Gets Database Connection
    con <- dbConnect(duckdb::duckdb())
    
    # We can use the same function as before to register our arrow dataset
    duckdb::duckdb_register_arrow(con, "nyc", nyc_dataset)
    
    res <- dbSendQuery(con, "SELECT * FROM nyc", arrow = TRUE)
    # DuckDB's queries can now produce a Record Batch Reader
    record_batch_reader <- duckdb::duckdb_fetch_record_batch(res)
    
    # Which means we can stream the whole query per batch.
    # This retrieves the first batch
    cur_batch <- record_batch_reader$read_next_batch()

SQL

  • CSV Parser

    • Read first 10 lines

      duckdb:::sql("FROM Data8277.csv LIMIT 10")
    • Print schema

      duckdb:::sql("DESCRIBE FROM Data8277.csv")
      #>   column_name column_type ...
      #> 1        Year      BIGINT ...
      #> 2         Age     VARCHAR ...
      #> 3      Ethnic      BIGINT ...
      #> 4         Sex      BIGINT ...
      #> 5        Area     VARCHAR ...
      #> 6       count     VARCHAR ...
    • Summary Stats

      duckdb:::sql("SUMMARIZE FROM Data8277.csv")
      #> # A tibble: 6 × 12
      #>   column_name column_type min   max     approx_unique avg      std   q25   q50  
      #>   <chr>       <chr>       <chr> <chr>           <dbl> <chr>    <chr> <chr> <chr>
      #> 1 Year        BIGINT      2006  2018                3 2012.33… 4.92… 2006  2013 
      #> 2 Age         VARCHAR     000   999999            149 NA       NA    NA    NA   
      #> 3 Ethnic      BIGINT      1     9999               11 930.545… 2867… 3     6    
      #> 4 Sex         BIGINT      1     9                   3 4.0      3.55… 1     2    
      #> 5 Area        VARCHAR     001   DHB9999          2048 NA       NA    NA    NA   
      #> 6 count       VARCHAR     ..C   9999            16825 NA       NA    NA    NA   
      # ℹ 3 more variables: q75 <chr>, count <dbl>, null_percentage <dbl>
  • Connect to a duckdb database file and query it lazily (without loading it into memory)

    con_ro <- 
      dbConnect(duckdb::duckdb(), 
                dbdir = "pixar.duckdb", 
                read_only = TRUE)
    my_pixar_films <- tbl(con_ro, "pixar_films")
  • Example: Connect, Read in Parallel, and Query

    sqltxt <- paste0(
      "select
            state, min(measurement) as min_m,
            max(measurement) as max_m,
            avg(measurement) as mean_m
      from read_csv('", file_name, "',
            parallel = true,
            delim = ',',
            header = true,
            columns = {
                'measurement': 'DOUBLE',
                'state': 'VARCHAR'
            }
      )
      group by state"
    )
    con <- 
      dbConnect(duckdb(), 
                dbdir = ":memory:")
    dbGetQuery(con, 
               sqltxt)
    dbDisconnect(con, 
                 shutdown = TRUE)
    gc()
    • Fastest method besides polars for running this operation in this benchmark
  • Star Expressions

    • Allow you to dynamically select columns

      -- select all columns present in the FROM clause
      SELECT * FROM table_name;
      -- select all columns from the table called "table_name"
      SELECT table_name.* FROM table_name JOIN other_table_name USING (id);
      -- select all columns except the city column from the addresses table
      SELECT * EXCLUDE (city) FROM addresses;
      -- select all columns from the addresses table, but replace city with LOWER(city)
      SELECT * REPLACE (LOWER(city) AS city) FROM addresses;
      -- select all columns matching the given expression
      SELECT COLUMNS(c -> c LIKE '%num%') FROM addresses;
      -- select all columns matching the given regex from the table
      SELECT COLUMNS('number\d+') FROM addresses;

Remote Connections

  • Misc
  • httpfs Extension
    • Create a db in memory since the data is stored remotely.

      conn <- 
        DBI::dbConnect(
          duckdb::duckdb(),
          dbdir = ":memory:"
        )
    • Install and Load httpfs extension

      DBI::dbExecute(conn, "INSTALL httpfs;")
      DBI::dbExecute(conn, "LOAD httpfs;")
      • Currently not available for Windows
    • Query

      parquet_url <- "url_to_parquet_files"
      res <- DBI::dbGetQuery(
        conn, 
        glue::glue("SELECT carrier, flight, tailnum, year FROM '{parquet_url}' WHERE year = 2013 LIMIT 100")
      )
      • Queries that need more data and return more rows take longer to run, especially when transmitting data over the Internet. Craft your queries carefully with this in mind.
    • To use {dplyr}, a View must first be created

      DBI::dbExecute(conn, 
                     glue::glue("CREATE VIEW flights AS SELECT * FROM PARQUET_SCAN('{parquet_url}')"))
      DBI::dbListTables(conn)
      #> [1] "flights"
      
      tbl(conn, "flights") %>%
        group_by(month) %>%
        summarise(freq = n()) %>%
        ungroup() %>%
        collect()
    • Close connection: DBI::dbDisconnect(conn, shutdown = TRUE)

  • {duckdbfs}
    • Create dataset object

      parquet_url <- "url_to_parquet_files" #e.g. AWS S3
      ds <- duckdbfs::open_dataset(parquet_url)
    • Query

      ds %>%
        group_by(month) %>%
        summarise(freq = n()) %>%
        ungroup() %>%
        collect()

Extensions

  • Community Extensions Repository (Intro)
    • Install

      INSTALL extension_name FROM community;
    • Use

      LOAD extension_name;
  • VS Code extension
    • Connect to a local DuckDB instance
    • Create new in-memory DuckDB instance
    • View DuckDB tables, columns, and views
    • Run SQL queries on open DuckDB connections
    • Attach SQLite database files to in-memory DuckDB instances
    • Query remote CSV and Parquet data files with DuckDB HTTPFS extension
    • Create in-memory DuckDB tables from remote data sources and query results
    • Manage DuckDB connections in SQLTools Database Explorer
    • Autocomplete SQL keywords, table names, column names, and view names on open database connections in VSCode SQL editor
    • Save named SQL query Bookmarks
    • Use SQL Query History
    • Export SQL query results in CSV and JSON data formats
    • Integrates with the equally spiffy SQL Tools extension
  • JSON extension
    • Example: From hrbrmstr drop

      INSTALL 'json';
      LOAD 'json';
      
      COPY (
        SELECT * FROM (
          SELECT DISTINCT
            cve_id,
            unnest(
              regexp_split_to_array(
                concat_ws(
                  ',',
                  regexp_extract(case when cweId1 IS NOT NULL THEN cweId1 ELSE regexp_replace(json_extract_string(problem1, '$.description'), '[: ].*$', '') END, '^(CWE-[0-9]+)', 0),
                  regexp_extract(case when cweId2 IS NOT NULL THEN cweId2 ELSE regexp_replace(json_extract_string(problem2, '$.description'), '[: ].*$', '') END, '^(CWE-[0-9]+)', 0)
                ),
                ','
              )
            ) AS cwe_id
          FROM (
            SELECT 
              json_extract_string(cveMetadata, '$.cveId') AS cve_id, 
              json_extract(containers, '$.cna.problemTypes[0].descriptions[0]') AS problem1,
              json_extract(containers, '$.cna.problemTypes[0].descriptions[1]') AS problem2,
              json_extract_string(containers, '$.cna.problemTypes[0].cweId[0]') AS cweId1,
              json_extract_string(containers, '$.cna.problemTypes[0].cweId[1]') AS cweId2
            FROM 
              read_json_auto('/data/cvelistV5/cves/*/*/*.json', ignore_errors = true)
          )
          WHERE 
            (json_extract_string(problem1, '$.type') = 'CWE' OR
             json_extract_string(problem2, '$.type') = 'CWE')
          )
        WHERE cwe_id LIKE 'CWE-%'
      ) TO '/data/summaries/cve-to-cwe.csv' (HEADER, DELIMITER ',')
      • Processes a nested json
      • Clone the CVE list repo, modify the directory paths, and run it. It burns through nearly 220K hideous JSON files in mere seconds, even with some complex JSON operations.
  • Database Extensions (MySQL, Postgres, SQLite)

      • Might need to use FORCE INSTALL postgres
      • Allows DuckDB to connect to those systems and operate on them in the same way that it operates on its own native storage engine.

      • Use Cases

        • Export data from SQLite to JSON
        • Read data from Parquet into Postgres
        • Move data from MySQL to Postgres
        • Deleting rows, updating values, or altering the schema of a table in another DB
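        • A sketch of the Parquet-into-Postgres case (file and table names are hypothetical); it assumes the postgres extension is installed and loaded, and combines the ATTACH and CREATE TABLE ... AS FROM patterns shown below

          ATTACH 'postgres:dbname=postgresscanner' AS postgres;
          CREATE TABLE postgres.trips AS FROM 'trips.parquet';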
      • Notes from

      • Example: Open SQLite db file

        ATTACH 'sakila.db' AS sakila (TYPE sqlite);
        SELECT title, release_year, length FROM sakila.film LIMIT 5;
        • ATTACH opens the db file and TYPE says that it’s a SQLite db file

        • Multiple dbs without using TYPE

          ATTACH 'sqlite:sakila.db' AS sqlite;
          ATTACH 'postgres:dbname=postgresscanner' AS postgres;
          ATTACH 'mysql:user=root database=mysqlscanner' AS mysql;
        • In python

          import duckdb
          con = duckdb.connect('sqlite:file.db')
      • Example: Switch between attached dbs

        USE sakila;
        SELECT first_name, last_name FROM actor LIMIT 5;
        • USE switches from the previous db to the “sakila” db
      • Example: View all attached dbs

        SELECT database_name, path, type FROM duckdb_databases;
      • Example: Copy table from one db type to another

        CREATE TABLE mysql.film AS FROM sqlite.film;
        CREATE TABLE postgres.actor AS FROM sqlite.actor;
      • Example: Joins

        SELECT first_name, last_name
        FROM mysql.film
        JOIN sqlite.film_actor ON (film.film_id = film_actor.film_id)
        JOIN postgres.actor ON (actor.actor_id = film_actor.actor_id)
        WHERE title = 'ACE GOLDFINGER';
  • GSheets
    • Repo
    • Extension for reading and writing Google Sheets with SQL
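    • A hedged sketch of typical usage; the install/load lines follow the standard community-extension pattern, but the read function name and any authentication setup are assumptions (check the repo README)

      INSTALL gsheets FROM community;
      LOAD gsheets;
      -- function name assumed; see the repo for the exact API and authentication
      SELECT * FROM read_gsheet('https://docs.google.com/spreadsheets/d/...');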