DuckDB

Misc

  • Also see
  • High-performance embedded database for analytics that provides several enhancements over SQLite, such as increased speed and support for a larger number of columns
    • Faster than SQLite for most analytics queries (sums, aggregates, etc.)
      • Vectorizes query execution (columnar-oriented), while other DBMSs (SQLite, PostgreSQL, …) process each row sequentially
  • Unlike some other big data tools, it is entirely self-contained (a.k.a. embedded, in-process)
    • No external dependencies and no server software to install, update, or maintain
  • Can directly run queries on Parquet files, CSV files, SQLite files, PostgreSQL databases, Pandas, R, and Julia data frames, as well as Apache Arrow sources (see the sketch at the end of this list)
  • Resources
  • Extensions
  • Tools
    • SQL Workbench - Query parquet files locally or remotely. Can also produce charts of results. Uses DuckDB-WASM so browser based.
      • Tutorial - Along with explaining the features of the tool, it has a complete normalization example and analysis.
      • For visualizations, click the configure button on the right side of the Results section (bottom main panel), click Data Grid, choose a chart type, and drag column names from the bottom to the various areas (similar to Tableau). Click the Reset button in the toolbar close to the configure button to return to Table mode.
      • For tables, if you right-click their name in the Schema pane (far left), you get a list of options including Summarize, which gives summary stats along with unique counts and null % for missing data.
      • If tables have foreign keys, data models can be visualized in a Mermaid diagram by clicking Data Modes in the bottom-left of the schema panel.
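  • Example: querying files directly - a minimal sketch of the point above (the file names are placeholders)

    library(DBI)
    library(duckdb)

    con <- dbConnect(duckdb())  # in-memory database
    # query a Parquet file and a CSV file in place, without importing them first
    dbGetQuery(con, "SELECT count(*) FROM 'flights.parquet'")
    dbGetQuery(con, "SELECT * FROM read_csv_auto('flights.csv') LIMIT 5")
    dbDisconnect(con, shutdown = TRUE)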

Set-up

  • Installation: install.packages("duckdb")
  • Create db and populate table from csv
    • Example

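      A minimal sketch, assuming a hypothetical CSV ('my_file.csv') and an on-disk database file ('my_db.duckdb'):

      library(DBI)
      library(duckdb)

      # creates the database file if it doesn't already exist
      con <- dbConnect(duckdb(), dbdir = "my_db.duckdb")
      # read_csv_auto() infers the schema and populates the table in one step
      dbExecute(con, "CREATE TABLE my_table AS SELECT * FROM read_csv_auto('my_file.csv');")
      dbDisconnect(con, shutdown = TRUE)
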
    • Example

      # con is an existing DuckDB connection; reads every file matching the glob,
      # keeps the source filename (FILENAME = TRUE), and extracts a 7-digit id from it
      withr::with_dir("data-raw/files/", {
        dbExecute(
          con, "
          CREATE TABLE files AS
          SELECT *, regexp_extract(filename, '\\d{7}') AS file_number
          FROM read_csv_auto('*Control*File-*.txt', FILENAME = TRUE);"
        )
      })

CLI

  • Start interactive shell: duckdb

  • Start interactive shell on database file: duckdb path\to\file

  • Query csv, json, or parquet file directly

    duckdb -c "SELECT * FROM 'data_source.[csv|csv.gz|json|json.gz|parqet]'"
  • Run SQL script: duckdb -c ".read path\to\script.sql"
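
  • Convert a CSV to Parquet from the shell - a sketch (file names are placeholders); COPY ... TO writes query results to a file

    duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT parquet)"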

d/dbplyr

  • Example: Connect, Read in Parallel, and Summarize

    library(DBI)
    library(duckdb)
    library(dplyr)

    # file_name points to the CSV to be read; the database itself is in memory
    con <- 
      dbConnect(duckdb(), 
                ":memory:")
    df <- 
      dplyr::tbl(con, 
                 paste0("read_csv('",
                        file_name,
                        "',
                        parallel = true,
                        delim = ',',
                        header = true,
                        columns = {
                            'measurement': 'DOUBLE',
                            'state': 'VARCHAR'
                        })"), 
                 check_from = FALSE)
    df <- df |>
      summarize(
        .by = state,
        mean = mean(measurement),
        min = min(measurement),
        max = max(measurement)
      ) |>
      collect()
    df <- NULL
    dbDisconnect(con, shutdown = TRUE)
    gc()
    • Competitive with running the operation in SQL
  • Example Connect to db; Write a df to table; Query it

    library(dbplyr)
    library(dplyr)
    
    duck <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duck.db", read_only = FALSE)
    DBI::dbWriteTable(duck, name = "sales", value = sales)
    sales_duck <- tbl(duck, "sales")
    
    sales_duck %>%
      group_by(year, SKU) %>%
      mutate(pos_sales = case_when(
              sales_units > 0 ~ sales_units,
              TRUE ~ 0)) %>%
      summarize(total_revenue = sum(sales_units * item_price_eur),
                max_order_price = max(pos_sales * item_price_eur),
                avg_price_SKU = mean(item_price_eur),
                items_sold = n())
    
    DBI::dbDisconnect(duck, shutdown = TRUE)
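    • dbplyr is lazy: the pipeline above is translated to SQL and only run in DuckDB when the result is printed or collect()ed. show_query() previews the generated SQL (a sketch using the same sales_duck table, run before disconnecting)

      sales_duck %>%
        group_by(year, SKU) %>%
        summarize(items_sold = n()) %>%
        show_query()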

duckplyr

  • Misc

  • Read parquet file into memory

    duckplyr_df <- duckplyr_df_from_parquet("flights.parquet")
  • Show query plan with explain

    library("duckplyr")
    as_duckplyr_df(data.frame(n=1:10)) |> 
      mutate(m = n + 1) |> 
      filter(m > 5) |> 
      count() |> 
      explain()
    
    #> ┌───────────────────────────┐
    #> │         PROJECTION        │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │             n             │
    #> └─────────────┬─────────────┘                             
    #> ┌─────────────┴─────────────┐
    #> │    UNGROUPED_AGGREGATE    │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │        count_star()       │
    #> └─────────────┬─────────────┘                            
    #> ┌─────────────┴─────────────┐
    #> │           FILTER          │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │(+(CAST(n AS DOUBLE), 1.0) │
    #> │           > 5.0)          │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │           EC: 10          │
    #> └─────────────┬─────────────┘                             
    #> ┌─────────────┴─────────────┐
    #> │     R_DATAFRAME_SCAN      │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │         data.frame        │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │             n             │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │           EC: 10          │
    #> └───────────────────────────┘
    • The query plan consists of a scan, a filter, a projection, and an aggregate.
  • Expressions that aren’t supported won’t have a “PROJECTION” node in the query plan

    as_duckplyr_df(data.frame(n=1:10)) |> 
      mutate(m=(\(x) x+1)(n)) |> 
      explain()
    
    #> ┌───────────────────────────┐
    #> │     R_DATAFRAME_SCAN      │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │         data.frame        │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │             n             │
    #> │             m             │
    #> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
    #> │           EC: 10          │
    #> └───────────────────────────┘
    • The translation of the anonymous function failed, which caused the computation to fall back to the R engine.
    • There is usually a performance hit from the fallback, for example due to the lack of automatic parallelization.
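
  • Drop-in use (a sketch, assuming duckplyr's methods_overwrite()/methods_restore() helpers): instead of wrapping individual data frames with as_duckplyr_df(), duckplyr can take over the dplyr verbs for all data frames in the session

    library(duckplyr)
    library(dplyr)

    # route dplyr verbs on plain data frames through DuckDB
    methods_overwrite()

    mtcars |>
      count(cyl)

    # return to standard dplyr behavior
    methods_restore()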

Apache Arrow

  • to_duckdb() and to_arrow(): Convert between the {arrow} engine and the {duckdb} engine within a workflow without paying any cost to (re)serialize the data when passing it back and forth (see the to_arrow() example further below)

    • Useful in cases where something is supported in one of Arrow or DuckDB but not the other
  • Benefits

    • Utilization of a parallel vectorized execution engine without requiring any extra data copying
    • Larger-Than-Memory Analysis: Since both libraries support streaming query results, queries can be executed one batch at a time instead of fully loading the data from disk, which makes it possible to query data that is bigger than memory.
    • Complex Data Types: DuckDB can efficiently process complex data types that can be stored in Arrow vectors, including arbitrarily nested structs, lists, and maps.
    • Advanced Optimizer: DuckDB’s state-of-the-art optimizer can push down filters and projections directly into Arrow scans. As a result, only relevant columns and partitions will be read, allowing the system to e.g., take advantage of partition elimination in Parquet files. This significantly accelerates query execution.
  • Example (using a SQL Query; method 1)

    # open dataset
    ds <- arrow::open_dataset(dir_out, partitioning = "species")
    # open connection to DuckDB
    con <- dbConnect(duckdb::duckdb())
    # register the dataset as a DuckDB table, and give it a name
    duckdb::duckdb_register_arrow(con, "my_table", ds)
    # query
    dbGetQuery(con, "
      SELECT sepal_length, COUNT(*) AS n
      FROM my_table
      WHERE species = 'species=setosa'
      GROUP BY sepal_length
    ")
    
    # clean up
    duckdb::duckdb_unregister_arrow(con, "my_table")
    dbDisconnect(con, shutdown = TRUE)
    • When filtering on a partition column, the WHERE value takes the form ‘<partition_variable>=<partition_value>’ (hence 'species=setosa' above)
  • Example (using SQL Query; method 2)

    library(duckdb)
    library(arrow)
    library(dplyr)
    
    # Reads Parquet File to an Arrow Table
    arrow_table <- arrow::read_parquet("integers.parquet", as_data_frame = FALSE)
    
    # Gets Database Connection
    con <- dbConnect(duckdb::duckdb())
    
    # Registers arrow table as a DuckDB view
    arrow::to_duckdb(arrow_table, table_name = "arrow_table", con = con)
    
    # we can run a SQL query on this and print the result
    print(dbGetQuery(con, "SELECT SUM(data) FROM arrow_table WHERE data > 50"))
    
    # Transforms Query Result from DuckDB to an Arrow Table
    res <- dbSendQuery(con, "SELECT * FROM arrow_table", arrow = TRUE)
    result <- duckdb::duckdb_fetch_arrow(res)
  • Example (using dplyr)

    library(duckdb)
    library(arrow)
    library(dplyr)
    
    # Open dataset using year,month folder partition
    ds <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
    
    ds %>%
      # Look only at 2015 on, where the number of passengers is positive, the trip distance is
      # greater than a quarter mile, and where the fare amount is positive
      filter(year > 2014 & passenger_count > 0 & trip_distance > 0.25 & fare_amount > 0) %>%
      # Pass off to DuckDB
      to_duckdb() %>%
      group_by(passenger_count) %>%
      mutate(tip_pct = tip_amount / fare_amount) %>%
      summarize(
        fare_amount = mean(fare_amount, na.rm = TRUE),
        tip_amount = mean(tip_amount, na.rm = TRUE),
        tip_pct = mean(tip_pct, na.rm = TRUE)
      ) %>%
      arrange(passenger_count) %>%
      collect()
    • In the docs, the example has to_duckdb() after the group_by(). Not sure whether that makes a difference in speed.
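  • Example (Passing Results Back to Arrow with to_arrow()): a sketch based on the pipeline above; to_arrow() returns the not-yet-evaluated DuckDB result to the {arrow} engine without copying the data, so the remaining steps run in Arrow (filters and column names here are illustrative)

    ds %>%
      filter(year > 2014 & fare_amount > 0) %>%
      # hand the work to DuckDB
      to_duckdb() %>%
      mutate(tip_pct = tip_amount / fare_amount) %>%
      # hand the result back to Arrow without copying
      to_arrow() %>%
      summarize(mean_tip_pct = mean(tip_pct, na.rm = TRUE)) %>%
      collect()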
  • Example (Streaming Data)

    # Reads dataset partitioning it in year/month folder
    nyc_dataset <- open_dataset("nyc-taxi/", partitioning = c("year", "month"))
    
    # Gets Database Connection
    con <- dbConnect(duckdb::duckdb())
    
    # We can use the same function as before to register our arrow dataset
    duckdb::duckdb_register_arrow(con, "nyc", nyc_dataset)
    
    res <- dbSendQuery(con, "SELECT * FROM nyc", arrow = TRUE)
    # DuckDB's queries can now produce a Record Batch Reader
    record_batch_reader <- duckdb::duckdb_fetch_record_batch(res)
    
    # Which means we can stream the whole query per batch.
    # This retrieves the first batch
    cur_batch <- record_batch_reader$read_next_batch()
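    • A sketch of consuming the rest of the stream: read_next_batch() returns NULL once the reader is exhausted, so the remaining batches can be processed one at a time in a loop

      # count rows across all batches, starting with the batch read above
      total_rows <- cur_batch$num_rows
      while (!is.null(cur_batch <- record_batch_reader$read_next_batch())) {
        total_rows <- total_rows + cur_batch$num_rows
      }
      total_rows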

SQL

  • Misc

    • Docs
  • Connect to a DuckDB file (read-only) and query it from disk without loading it into memory

    con_ro <- 
      dbConnect(duckdb::duckdb(), 
                dbdir = "pixar.duckdb", 
                read_only = TRUE)
    my_pixar_films <- tbl(con_ro, "pixar_films")
  • Example: Connect, Read in Parallel, and Query

    sqltxt <- paste0(
      "select
            state, min(measurement) as min_m,
            max(measurement) as max_m,
            avg(measurement) as mean_m
      from read_csv('", file_name, "',
            parallel = true,
            delim = ',',
            header = true,
            columns = {
                'measurement': 'DOUBLE',
                'state': 'VARCHAR'
            }
      )
      group by state"
    )
    con <- 
      dbConnect(duckdb(), 
                dbdir = ":memory:")
    dbGetQuery(con, 
               sqltxt)
    dbDisconnect(con, 
                 shutdown = TRUE)
    gc()
    • Fastest method besides polars for running this operation in this benchmark
  • Star Expressions

    • Allow you to dynamically select columns

      -- select all columns present in the FROM clause
      SELECT * FROM table_name;
      -- select all columns from the table called "table_name"
      SELECT table_name.* FROM table_name JOIN other_table_name USING (id);
      -- select all columns except the city column from the addresses table
      SELECT * EXCLUDE (city) FROM addresses;
      -- select all columns from the addresses table, but replace city with LOWER(city)
      SELECT * REPLACE (LOWER(city) AS city) FROM addresses;
      -- select all columns matching the given expression
      SELECT COLUMNS(c -> c LIKE '%num%') FROM addresses;
      -- select all columns matching the given regex from the table
      SELECT COLUMNS('number\d+') FROM addresses;
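    • COLUMNS can also be wrapped in functions, which applies the same expression to every matching column (a sketch against the same addresses table)

      -- compute the minimum of every column in the table
      SELECT min(COLUMNS(*)) FROM addresses;
      -- count the non-NULL values in every column whose name contains "num"
      SELECT count(COLUMNS(c -> c LIKE '%num%')) FROM addresses;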

Remote Connections

  • Misc
  • httpfs Extension
    • Create a db in memory since the data is stored remotely.

      conn <- 
        DBI::dbConnect(
          duckdb::duckdb(),
          dbdir = ":memory:"
        )
    • Install and Load httpfs extension

      DBI::dbExecute(conn, "INSTALL httpfs;")
      DBI::dbExecute(conn, "LOAD httpfs;")
      • Was not available for Windows at the time of writing
    • Query

      parquet_url <- "url_to_parquet_files"
      res <- DBI::dbGetQuery(
        conn, 
        glue::glue("SELECT carrier, flight, tailnum, year FROM '{parquet_url}' WHERE year = 2013 LIMIT 100")
      )
      • Queries that need more data and return more rows take longer to run, especially when transmitting data over the internet. Craft your queries with this in mind.
    • To use {dplyr}, a View must first be created

      DBI::dbExecute(conn, 
                     glue::glue("CREATE VIEW flights AS SELECT * FROM PARQUET_SCAN('{parquet_url}')"))
      DBI::dbListTables(conn)
      #> [1] "flights"
      
      tbl(conn, "flights") %>%
        group_by(month) %>%
        summarise(freq = n()) %>%
        ungroup() %>%
        collect()
    • Close connection: DBI::dbDisconnect(conn, shutdown = TRUE)

  • {duckdbfs}
    • Create dataset object

      parquet_url <- "url_to_parquet_files" #e.g. AWS S3
      ds <- duckdbfs::open_dataset(parquet_url)
    • Query

      ds %>%
        group_by(month) %>%
        summarise(freq = n()) %>%
        ungroup() %>%
        collect()

Extensions

  • Community Extensions Repository (Intro)
    • Install

      INSTALL extension_name FROM community;
    • Use

      LOAD extension_name;
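    • For example (h3 is used here only as an illustration of a community extension name)

      INSTALL h3 FROM community;
      LOAD h3;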
  • VS Code extension
    • Connect to a local DuckDB instance
    • Create new in-memory DuckDB instance
    • View DuckDB tables, columns, and views
    • Run SQL queries on open DuckDB connections
    • Attach SQLite database files to in-memory DuckDB instances
    • Query remote CSV and Parquet data files with DuckDB HTTPFS extension
    • Create in-memory DuckDB tables from remote data sources and query results
    • Manage DuckDB connections in SQLTools Database Explorer
    • Autocomplete SQL keywords, table names, column names, and view names on open database connections in VSCode SQL editor
    • Save named SQL query Bookmarks
    • Use SQL Query History
    • Export SQL query results in CSV and JSON data formats
    • Integrates with the equally spiffy SQLTools extension
  • JSON extension
    • Example: From hrbrmstr drop

      INSTALL 'json';
      LOAD 'json';
      
      COPY (
        SELECT * FROM (
          SELECT DISTINCT
            cve_id,
            unnest(
              regexp_split_to_array(
                concat_ws(
                  ',',
                  regexp_extract(case when cweId1 IS NOT NULL THEN cweId1 ELSE regexp_replace(json_extract_string(problem1, '$.description'), '[: ].*$', '') END, '^(CWE-[0-9]+)', 0),
                  regexp_extract(case when cweId2 IS NOT NULL THEN cweId2 ELSE regexp_replace(json_extract_string(problem2, '$.description'), '[: ].*$', '') END, '^(CWE-[0-9]+)', 0)
                ),
                ','
              )
            ) AS cwe_id
          FROM (
            SELECT 
              json_extract_string(cveMetadata, '$.cveId') AS cve_id, 
              json_extract(containers, '$.cna.problemTypes[0].descriptions[0]') AS problem1,
              json_extract(containers, '$.cna.problemTypes[0].descriptions[1]') AS problem2,
              json_extract_string(containers, '$.cna.problemTypes[0].cweId[0]') AS cweId1,
              json_extract_string(containers, '$.cna.problemTypes[0].cweId[1]') AS cweId2
            FROM 
              read_json_auto("/data/cvelistV5/cves/*/*/*.json", ignore_errors = true) 
          )
          WHERE 
            (json_extract_string(problem1, '$.type') = 'CWE' OR
             json_extract_string(problem2, '$.type') = 'CWE')
          )
        WHERE cwe_id LIKE 'CWE-%'
      ) TO '/data/summaries/cve-to-cwe.csv' (HEADER, DELIMITER ',');
      • Processes nested JSON
      • Clone the CVE list repo, modify the directory paths, and run it. It burns through nearly 220K hideous JSON files in mere seconds, even with some complex JSON operations.
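    • Example: json_extract_string() on an inline JSON literal - a minimal sketch of the function used heavily above

      SELECT json_extract_string('{"name": "Ada", "address": {"city": "London"}}',
                                 '$.address.city') AS city;
      -- returns 'London'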
  • DBs

      • MySQL, Postgres, SQLite

        • Might need to use FORCE INSTALL postgres
      • Allows DuckDB to connect to those systems and operate on them in the same way that it operates on its own native storage engine.

      • Use Cases

        • Export data from SQLite to JSON
        • Read data from Parquet into Postgres
        • Move data from MySQL to Postgres
        • Deleting rows, updating values, or altering the schema of a table in another DB
      • Notes from

      • Example: Open SQLite db file

        ATTACH 'sakila.db' AS sakila (TYPE sqlite);
        SELECT title, release_year, length FROM sakila.film LIMIT 5;
        • ATTACH opens the db file and TYPE says that it’s a SQLite db file

        • Multiple dbs without using TYPE

          ATTACH 'sqlite:sakila.db' AS sqlite;
          ATTACH 'postgres:dbname=postgresscanner' AS postgres;
          ATTACH 'mysql:user=root database=mysqlscanner' AS mysql;
        • In python

          import duckdb
          con = duckdb.connect('sqlite:file.db')
      • Example: Switch between attached dbs

        USE sakila;
        SELECT first_name, last_name FROM actor LIMIT 5;
        • USE switches from the previous db to the “sakila” db
      • Example: View all attached dbs

        SELECT database_name, path, type FROM duckdb_databases;
      • Example: Copy table from one db type to another

        CREATE TABLE mysql.film AS FROM sqlite.film;
        CREATE TABLE postgres.actor AS FROM sqlite.actor;
      • Example: Joins

        SELECT first_name, last_name
        FROM mysql.film
        JOIN sqlite.film_actor ON (film.film_id = film_actor.film_id)
        JOIN postgres.actor ON (actor.actor_id = film_actor.actor_id)
        WHERE title = 'ACE GOLDFINGER';
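      • Example: Detach a db when finished (DETACH removes a previously attached database from the catalog)

        DETACH sqlite;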