Relational

Misc

  • Also see SQL >> R

  • Packages

    • {dbplyr}
      • compute stores results in a remote temporary table
      • collect retrieves data into a local tibble.
      • collapse doesn’t force computation, but instead forces generation of the SQL query.
        • sometimes needed to work around bugs in dplyr’s SQL generation.
    • {dm}
      • Can join multiple tables from a db, but keeps the meta info such as table names, primary and foreign keys, size of original tables etc.
    • {dbx} - Convenience functions for insert, update, upsert, and delete
      • Easy parameterization
      • High performance batch operations
      • Date and time support
      • Works well with auto-incrementing primary keys
      • Connection Pooling
      • Connects with DBI, so can also use with {dbplyr}
      • Supports Postgres, MySQL, MariaDB, SQLite, SQL Server, and more
  • Relational databases do not keep all data together but split it into multiple smaller tables. That separation into sub-tables has several advantages:

    • All information is stored only once, avoiding repetition and conserving memory
    • All information is updated only once and in one place, improving consistency and avoiding errors that may result from updating the same value in multiple locations
    • All information is organized by topic and segmented into smaller tables that are easier to handle
  • Optimized for a mix of read and write queries that insert/select a small number of rows at a time and can handle up to 1TB of data reasonably well.

  • The main difference between a “relational database” and a “data warehouse” is that the former is created and optimized to “record” data, whilst the latter is created and built to “react to analytics”.

  • Types

    • Embedded aka In-Process (see Databases, Engineering >> Terms): DuckDB (analytics) and SQLite (transactional)
    • Server-based: postgres, mysql, SQL Server
      • Mix of transactional and analytical
      • Distributed SQL (database replicants across regions or hybrid (on-prem + cloud)
        • mysql, postgres available for both in AWS Aurora (See below)
        • postgres available using yugabytedb
        • SQL Server on Azure SQL Database
        • Cloud Spanner on GCP
  • Apache Avro

    • Row storage file format unlike parquet
    • A single Avro file contains a JSON-like schema for data types and the data itself in binary format
    • 4x slower reading than csv but 1.5x faster writing than csv
    • 1.7x smaller file size than csv
  • Wrapper for db connections (e.g. con_depA <- connect_databaseA(username = ..., password = ...) )

    # ... other stuff including code for "connect_odbc" function
    
    # connection attempt loop
    while(try < retries) {
        con <- connect_odbc(source_db = "<database name>"
                            username = username,
                            password = password)
        if(class(con) == "NetexxaSQL") {
            try <- retries + 1
        } else if (!"NetezzaSQL" %in% class(con) & try < retries {
            warning("<database name> connection failed. Retrying...")
            try <- try + 1
            Sys.sleep(retry_wait)
        } else {
            try <- try + 1
            warning("<database name> connection failed")
        }
    }
    • Guessing “NetezzaSQL” is some kind of error code for a failed connection to the db

Comparisons

  • SQLite vs DuckDB

    • SQLite
      • Row data storage
      • Good for transactional processing
      • Each row represented in a table is stored as an array on disk
      • Easy to update individual rows
    • DuckDB
      • Columnar data storage
      • Good for analytics
      • Each column is stored as a chunked array on disk and in memory
        • Memory locality

        • Column compression

      • Faster queries for reading
  • SQLite vs MySQL as transactional dbs (article)

    • SQLite:
      • Embedded, size ~600KB
      • Limited data types
      • Being self-contained, other clients on a network would not have access to the database (no multi-users) unlike with MySQL
      • No built-in authentication that is supported
      • Multiple processes are able to access the database at the same time, but making changes at the same time is not something supported
      • Use Cases
        • Data being confined in the files of the device is not a problem
        • Network access to the db is not needed
        • Applications that will minimally access the database and not require heavy calculations
    • MySQL:
      • Opposites of the sqlite stuff
      • Size ~600MB
      • Supports replication and scalability
      • Security is a large; built-in features to keep unwanted people from easily accessing data
      • Use cases
        • Transactions are more frequent like on web or desktop applications
        • If network capabilities are a must
        • Multi-User access and therefore security and authentication
        • Large amounts of data
  • Benchmarks

    • Example
      • Data

        • ~54,000,000 rows and 6 columns
        • 10 .rds files with gz compression is 220MB total,
          • If they were .csv, 1.5 GB
        • SQLite file is 3 GB
        • DuckDB file is 2.5 GB
        • Arrow creates a structure of directories, 477 MB total
      • Operation: read, filter, group_by, summarize

      • Results

        ##  format          median_time mem_alloc
        ##  <chr>              <bch:tm> <bch:byt>
        ## 1 R (RDS)              1.34m    4.08GB
        ## 2 SQL (SQLite)          5.48s    6.17MB
        ## 3 SQL (DuckDB)          1.76s  104.66KB
        ## 4 Arrow (Parquet)      1.36s  453.89MB
    • Tradional relational db solutions balloon up the file size
      • SQLite 2x, DuckDB 1.66x (using csv size)
    • ClickBench — a Benchmark For Analytical DBMS

Brands

  • MySQL

    • Installation docs
    • Basic intro
    • See SQL notebook
  • SQLite

    • Packages
      • {RSQLite}
    • Resources
    • Tools
      • datasette.io: See Big Data >> Larger Than Memory >> Programs
      • sqlime: Online SQLite playground
      • SQL Studio - Explorer with dashboard-like UI
        • SQL Database Explorer [SQLite, libSQL, PostgreSQL, MySQL/MariaDB, DuckDB, ClickHouse]
        • Overview page with common metadata.
        • Tables page with each table’s metadata, including the disk size being used by each table.
        • Infinite scroll rows view.
        • A custom query page that gives you more access to your db.
      • SQLite Browser
        • DB browser with spreadsheet-like interface that’s designed for people who want to create, search, and edit SQLite database files
        • All OSs supported
      • litecli - CLI tool for SQLite databases that has auto-completion and syntax highlighting
      • Litestream - Enables streaming backups
      • LiteFS - Provides distributed access
      • CR-SQLite - Allows the use of CRDTs to avoid needing conflict resolution when merging changesets
        • CRDT (wiki) - In distributed computing, a conflict-free replicated data type (CRDT) is a data structure that is replicated across multiple computers in a network, with the following features:
          1. The application can update any replica independently, concurrently and without coordinating with other replicas.
          2. An algorithm (itself part of the data type) automatically resolves any inconsistencies that might occur.
          3. Although replicas may have different state at any particular point in time, they are guaranteed to eventually converge.
    • Extensions
      • sqlite-vec (intro) - A vector similarity search extension written entirely in C with no dependencies
      • sqlite-rembed: A SQLite extension for generating text embeddings from remote APIs (OpenAI, Nomic, Ollama, llamafile…)
  • Cloud SQL - Google service to provide hosting services for relational dbs (see Google, BigQuery >> Misc). Can use postgres, mysql, etc. on their machines.

  • AWS RDS for db instances (see Database, postgres >> AWS RDS)

    • Available: Amazon Aurora, MySQL, MariaDB, postgres, Oracle, Microsoft SQL Server
    • RDS (Relational Database Service)
      • Benefits over hosting db on EC2: AWS handles scaling, availability, backups, and software and operating system updates
  • AWS Aurora - MySQL- and PostgreSQL-compatible enterprise-class database