Relational

Misc

  • Also see SQL >> R

  • Packages

    • {dbplyr}
      • compute stores results in a remote temporary table
      • collect retrieves data into a local tibble.
      • collapse doesn’t force computation, but instead forces generation of the SQL query.
        • sometimes needed to work around bugs in dplyr’s SQL generation.
    • {dm}
      • Can join multiple tables from a db, but keeps the meta info such as table names, primary and foreign keys, size of original tables etc.
    • {dbx} - Convenience functions for insert, update, upsert, and delete
      • Easy parameterization
      • High performance batch operations
      • Date and time support
      • Works well with auto-incrementing primary keys
      • Connection Pooling
      • Connects with DBI, so can also use with {dbplyr}
      • Supports Postgres, MySQL, MariaDB, SQLite, SQL Server, and more
  • Relational databases do not keep all data together but split it into multiple smaller tables. That separation into sub-tables has several advantages:

    • All information is stored only once, avoiding repetition and conserving memory
    • All information is updated only once and in one place, improving consistency and avoiding errors that may result from updating the same value in multiple locations
    • All information is organized by topic and segmented into smaller tables that are easier to handle
  • Optimized for a mix of read and write queries that insert/select a small number of rows at a time and can handle up to 1TB of data reasonably well.

  • The main difference between a “relational database” and a “data warehouse” is that the former is created and optimized to “record” data, whilst the latter is created and built to “react to analytics”.

  • Types

    • Embedded aka In-Process (see Databases, Engineering >> Terms): DuckDB (analytics) and SQLite (transactional)
    • Server-based: postgres, mysql, SQL Server
      • Mix of transactional and analytical
      • Distributed SQL (database replicants across regions or hybrid (on-prem + cloud)
        • mysql, postgres available for both in AWS Aurora (See below)
        • postgres available using yugabytedb
        • SQL Server on Azure SQL Database
        • Cloud Spanner on GCP
  • Apache Avro

    • Row storage file format unlike parquet
    • A single Avro file contains a JSON-like schema for data types and the data itself in binary format
    • 4x slower reading than csv but 1.5x faster writing than csv
    • 1.7x smaller file size than csv
  • Wrapper for db connections (e.g. con_depA <- connect_databaseA(username = ..., password = ...) )

    # ... other stuff including code for "connect_odbc" function
    
    # connection attempt loop
    while(try < retries) {
        con <- connect_odbc(source_db = "<database name>"
                            username = username,
                            password = password)
        if(class(con) == "NetexxaSQL") {
            try <- retries + 1
        } else if (!"NetezzaSQL" %in% class(con) & try < retries {
            warning("<database name> connection failed. Retrying...")
            try <- try + 1
            Sys.sleep(retry_wait)
        } else {
            try <- try + 1
            warning("<database name> connection failed")
        }
    }
    • Guessing “NetezzaSQL” is some kind of error code for a failed connection to the db

Comparisons

  • SQLite vs DuckDB

    • SQLite
      • Row data storage
      • Good for transactional processing
      • Each row represented in a table is stored as an array on disk
      • Easy to update individual rows
    • DuckDB
      • Columnar data storage
      • Good for analytics
      • Each column is stored as a chunked array on disk and in memory
        • Memory locality

        • Column compression

      • Faster queries for reading
  • SQLite vs MySQL as transactional dbs (article)

    • SQLite:
      • Embedded, size ~600KB
      • Limited data types
      • Being self-contained, other clients on a network would not have access to the database (no multi-users) unlike with MySQL
      • No built-in authentication that is supported
      • Multiple processes are able to access the database at the same time, but making changes at the same time is not something supported
      • Use Cases
        • Data being confined in the files of the device is not a problem
        • Network access to the db is not needed
        • Applications that will minimally access the database and not require heavy calculations
    • MySQL:
      • Opposites of the sqlite stuff
      • Size ~600MB
      • Supports replication and scalability
      • Security is a large; built-in features to keep unwanted people from easily accessing data
      • Use cases
        • Transactions are more frequent like on web or desktop applications
        • If network capabilities are a must
        • Multi-User access and therefore security and authentication
        • Large amounts of data
  • Benchmarks

    • Example
      • Data

        • ~54,000,000 rows and 6 columns
        • 10 .rds files with gz compression is 220MB total,
          • If they were .csv, 1.5 GB
        • SQLite file is 3 GB
        • DuckDB file is 2.5 GB
        • Arrow creates a structure of directories, 477 MB total
      • Operation: read, filter, group_by, summarize

      • Results

        ##  format          median_time mem_alloc
        ##  <chr>              <bch:tm> <bch:byt>
        ## 1 R (RDS)              1.34m    4.08GB
        ## 2 SQL (SQLite)          5.48s    6.17MB
        ## 3 SQL (DuckDB)          1.76s  104.66KB
        ## 4 Arrow (Parquet)      1.36s  453.89MB
    • Tradional relational db solutions balloon up the file size
      • SQLite 2x, DuckDB 1.66x (using csv size)
    • ClickBench — a Benchmark For Analytical DBMS

Brands

  • MySQL

    • Installation docs
    • Basic intro
    • See SQL notebook
  • SQLite

    • Packages
      • {RSQLite}
    • Resources
    • Tools
      • datasette.io: See Big Data >> Larger Than Memory >> Programs
      • sqlime: Online SQLite playground
      • SQL Studio - Explorer with dashboard-like UI
        • SQL Database Explorer [SQLite, libSQL, PostgreSQL, MySQL/MariaDB, DuckDB, ClickHouse]
        • Overview page with common metadata.
        • Tables page with each table’s metadata, including the disk size being used by each table.
        • Infinite scroll rows view.
        • A custom query page that gives you more access to your db.
      • SQLite Browser
        • DB browser with spreadsheet-like interface that’s designed for people who want to create, search, and edit SQLite database files
        • All OSs supported
      • litecli
        • CLI tool for SQLite databases that has auto-completion and syntax highlighting
    • Extensions
      • sqlite-vec (intro) - A vector search extension written entirely in C with no dependencies
  • Cloud SQL - Google service to provide hosting services for relational dbs (see Google, BigQuery >> Misc). Can use postgres, mysql, etc. on their machines.

  • AWS RDS for db instances (see Database, postgres >> AWS RDS)

    • Available: Amazon Aurora, MySQL, MariaDB, postgres, Oracle, Microsoft SQL Server
    • RDS (Relational Database Service)
      • Benefits over hosting db on EC2: AWS handles scaling, availability, backups, and software and operating system updates
  • AWS Aurora - MySQL- and PostgreSQL-compatible enterprise-class database