Relational
Misc
Also see SQL >> R
Packages
- {dbplyr}
compute
stores results in a remote temporary tablecollect
retrieves data into a local tibble.collapse
doesn’t force computation, but instead forces generation of the SQL query.- sometimes needed to work around bugs in dplyr’s SQL generation.
- {dm}
- Can join multiple tables from a db, but keeps the meta info such as table names, primary and foreign keys, size of original tables etc.
- {dbx} - Convenience functions for insert, update, upsert, and delete
- Easy parameterization
- High performance batch operations
- Date and time support
- Works well with auto-incrementing primary keys
- Connection Pooling
- Connects with DBI, so can also use with {dbplyr}
- Supports Postgres, MySQL, MariaDB, SQLite, SQL Server, and more
- {dbplyr}
Relational databases do not keep all data together but split it into multiple smaller tables. That separation into sub-tables has several advantages:
- All information is stored only once, avoiding repetition and conserving memory
- All information is updated only once and in one place, improving consistency and avoiding errors that may result from updating the same value in multiple locations
- All information is organized by topic and segmented into smaller tables that are easier to handle
Optimized for a mix of read and write queries that insert/select a small number of rows at a time and can handle up to 1TB of data reasonably well.
The main difference between a “relational database” and a “data warehouse” is that the former is created and optimized to “record” data, whilst the latter is created and built to “react to analytics”.
Types
- Embedded aka In-Process (see Databases, Engineering >> Terms): DuckDB (analytics) and SQLite (transactional)
- Server-based: postgres, mysql, SQL Server
- Mix of transactional and analytical
- Distributed SQL (database replicants across regions or hybrid (on-prem + cloud)
- mysql, postgres available for both in AWS Aurora (See below)
- postgres available using yugabytedb
- SQL Server on Azure SQL Database
- Cloud Spanner on GCP
Apache Avro
- Row storage file format unlike parquet
- A single Avro file contains a JSON-like schema for data types and the data itself in binary format
- 4x slower reading than csv but 1.5x faster writing than csv
- 1.7x smaller file size than csv
Wrapper for db connections (e.g.
con_depA <- connect_databaseA(username = ..., password = ...)
)# ... other stuff including code for "connect_odbc" function # connection attempt loop while(try < retries) { <- connect_odbc(source_db = "<database name>" con username = username, password = password) if(class(con) == "NetexxaSQL") { <- retries + 1 try else if (!"NetezzaSQL" %in% class(con) & try < retries { } warning("<database name> connection failed. Retrying...") <- try + 1 try Sys.sleep(retry_wait) else { } <- try + 1 try warning("<database name> connection failed") }}
- Guessing “NetezzaSQL” is some kind of error code for a failed connection to the db
Comparisons
SQLite vs DuckDB
- SQLite
- Row data storage
- Good for transactional processing
- Each row represented in a table is stored as an array on disk
- Easy to update individual rows
- DuckDB
- Columnar data storage
- Good for analytics
- Each column is stored as a chunked array on disk and in memory
Memory locality
Column compression
- Faster queries for reading
- SQLite
SQLite vs MySQL as transactional dbs (article)
- SQLite:
- Embedded, size ~600KB
- Limited data types
- Being self-contained, other clients on a network would not have access to the database (no multi-users) unlike with MySQL
- No built-in authentication that is supported
- Multiple processes are able to access the database at the same time, but making changes at the same time is not something supported
- Use Cases
- Data being confined in the files of the device is not a problem
- Network access to the db is not needed
- Applications that will minimally access the database and not require heavy calculations
- MySQL:
- Opposites of the sqlite stuff
- Size ~600MB
- Supports replication and scalability
- Security is a large; built-in features to keep unwanted people from easily accessing data
- Use cases
- Transactions are more frequent like on web or desktop applications
- If network capabilities are a must
- Multi-User access and therefore security and authentication
- Large amounts of data
- SQLite:
Benchmarks
- Example
Data
- ~54,000,000 rows and 6 columns
- 10 .rds files with gz compression is 220MB total,
- If they were .csv, 1.5 GB
- SQLite file is 3 GB
- DuckDB file is 2.5 GB
- Arrow creates a structure of directories, 477 MB total
Operation: read, filter, group_by, summarize
Results
## format median_time mem_alloc ## <chr> <bch:tm> <bch:byt> ## 1 R (RDS) 1.34m 4.08GB ## 2 SQL (SQLite) 5.48s 6.17MB ## 3 SQL (DuckDB) 1.76s 104.66KB ## 4 Arrow (Parquet) 1.36s 453.89MB
- Tradional relational db solutions balloon up the file size
- SQLite 2x, DuckDB 1.66x (using csv size)
- ClickBench — a Benchmark For Analytical DBMS
- Example
Brands
MySQL
SQLite
- Packages
- {RSQLite}
- Resources
- Tools
- datasette.io: See Big Data >> Larger Than Memory >> Programs
- sqlime: Online SQLite playground
- SQL Studio - Explorer with dashboard-like UI
- SQL Database Explorer [SQLite, libSQL, PostgreSQL, MySQL/MariaDB, DuckDB, ClickHouse]
- Overview page with common metadata.
- Tables page with each table’s metadata, including the disk size being used by each table.
- Infinite scroll rows view.
- A custom query page that gives you more access to your db.
- SQLite Browser
- DB browser with spreadsheet-like interface that’s designed for people who want to create, search, and edit SQLite database files
- All OSs supported
- litecli - CLI tool for SQLite databases that has auto-completion and syntax highlighting
- Litestream - Enables streaming backups
- LiteFS - Provides distributed access
- CR-SQLite - Allows the use of CRDTs to avoid needing conflict resolution when merging changesets
- CRDT (wiki) - In distributed computing, a conflict-free replicated data type (CRDT) is a data structure that is replicated across multiple computers in a network, with the following features:
- The application can update any replica independently, concurrently and without coordinating with other replicas.
- An algorithm (itself part of the data type) automatically resolves any inconsistencies that might occur.
- Although replicas may have different state at any particular point in time, they are guaranteed to eventually converge.
- CRDT (wiki) - In distributed computing, a conflict-free replicated data type (CRDT) is a data structure that is replicated across multiple computers in a network, with the following features:
- Extensions
- sqlite-vec (intro) - A vector similarity search extension written entirely in C with no dependencies
- sqlite-rembed: A SQLite extension for generating text embeddings from remote APIs (OpenAI, Nomic, Ollama, llamafile…)
- Packages
Cloud SQL - Google service to provide hosting services for relational dbs (see Google, BigQuery >> Misc). Can use postgres, mysql, etc. on their machines.
- Cloud SQL Insights - good query optimization tool
AWS RDS for db instances (see Database, postgres >> AWS RDS)
- Available: Amazon Aurora, MySQL, MariaDB, postgres, Oracle, Microsoft SQL Server
- RDS (Relational Database Service)
- Benefits over hosting db on EC2: AWS handles scaling, availability, backups, and software and operating system updates
AWS Aurora - MySQL- and PostgreSQL-compatible enterprise-class database
- Starting at <$1/day.
- Supports up to 64TB of auto-scaling storage capacity, 6-way replication across three availability zones, and 15 low-latency read replicas.
- Create MySQL and Postgres instances using AWS Cloudformation