SQL

Misc

  • Resources

  • dplyr::show_query() can convert a dplyr expression to SQL for a db object (e.g. dbplyr, duckdb, arrow)

  • Queries in examples

    • Window Functions
      • Average Salary by Job Title
      • Average Unit Price for each CustomerId
      • Rank customers by amount spent
      • Create a new column that ranks Unit Price in descending order for each CustomerId
      • Create a new column that provides the previous order date’s Quantity for each ProductId
      • Create a new column that provides the very first Quantity ever ordered for each ProductId
      • Calculate a cumulative moving average UnitPrice for each CustomerId
      • Rank customers for each department by amount spent
      • Find the model and year of the car that has been on the lot the longest
      • Create a subset (CTE)
        • Calculate a running monthly total (aka cumsum)
          • Also running average
        • Calculate a running monthly total for each account id
        • Calculate a 3 months rolling running total using a window that includes the current month.
        • Calculate a 7 months rolling running total using a window where the current month is always the middle month
        • Calculate the number of consecutive days spent in each country
    • CTE
      • Average monthly cost per campaign for the company’s marketing efforts
      • Count the number of interactions of new users
      • The average top Math test score for students in California
    • Business Queries
      • 7-day Simple Moving Average (SMA)
      • Rank product categories by shipping cost for each shipping address
      • Total order quantity for each month
      • Monthly and yearly total orders and average shipping costs
      • Daily counts of open jobs (where “open” is an untracked daily status)
      • Get the latest order from each customer
      • Overall median price
      • Median price for each product
      • Overall median price and quantity
    • Processing Expressions
      • Provide subtotals for a hierarchical group of fields (e.g. family, category, subcategory)
        • See NULLs >> COALESCE
  • Order of Operations

    • Higher ranked functions can be inserted inside lower ranked functions
      • e.g. a window function can be inside a SELECT clause but not inside a WHERE clause
      • There are exceptions and hacks around this in some cases
  • Types of Commands

    • Data Query Language (DQL) - used to find and view data without making any permanent changes to the database.
    • Data Manipulation Language (DML) - used to make permanent changes to the data, such as updating values or deleting them.
    • Data Definition Language (DDL) - used to make permanent changes to the database structure, such as creating or deleting a table.
    • Data Control Language (DCL) - used for administrative commands, such as adding or removing users of different tables and databases.
    • Transaction Control Language (TCL) - advanced SQL that deals with transaction-level statements.
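    • Illustrative statements for each category (a sketch; table and user names are hypothetical):

      SELECT * FROM t;                 -- DQL
      INSERT INTO t VALUES (1);        -- DML
      CREATE TABLE t (id INT);         -- DDL
      GRANT SELECT ON t TO some_user;  -- DCL
      BEGIN TRANSACTION; COMMIT;       -- TCL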
  • Microsoft SQL Server format for referencing a table:

    • [database].[schema].[tablename]

    • Alternative

      USE my_data_base
      GO

Set-up

  • postgres
    • Download postgres
    • pgAdmin is an IDE commonly used with postgres
      • Open pgAdmin and click on “Add new server.”
        • Sets up connection to existing server so make sure postgres is installed beforehand
      • Create Tables
        • home >> Databases (1) >> postgres >> Query Tool
      • If needed, give permission to pgAdmin to access data from a folder
        • Might be necessary to upload csv files
      • Import csv file
        • right-click the table name >> Import/Export
        • Options tab
          • Select import, add file path to File Name, choose csv for format, select Yes for Header, add , for Delimiter
        • Columns tab
          • uncheck columns not in the csv (probably the primary key)
            • Wonder if NULLs will be automatically inserted for columns in the table that aren’t in the file.
    • psql - PostgreSQL interactive terminal. Allows you to script and automate tasks, e.g. tests, check errors, and do data entry on the command line.
      • Cheatsheet
      • Requires
        1. The user name and password for your PostgreSQL database
        2. The IP address of your remote instance if working remotely
      • The db=# prompt is the command line interface (CLI) prompt. Here’s a breakdown of what each part means:
        • db: This is the name of the current database to which you are connected.
        • =#: This is the prompt symbol that appears when you are logged in as a superuser (like postgres). If you were logged in as a regular user, the prompt would be =>
        • Example:
          • Input

            db=# CREATE TABLE tags (
              id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
              name VARCHAR(50) NOT NULL
            );
          • Output

            CREATE TABLE
            • psql indicates the command was successful by, somewhat confusingly, echoing the command.

Terms

  • Batch - a set of sql statements e.g. statements between BEGIN and END
  • Compiled object - you can create a function written in C/C++ and load it into the database (at least in postgres) to achieve high performance.
  • Correlated Columns - a measure of how well the logical ordering of a column’s values matches the physical ordering of the rows on disk.
  • Correlated/Uncorrelated Subqueries
    • correlated - a type of subquery where the inner query depends on the outcome of the outer query in order to execute

      • A correlated subquery can be thought of as a filter on the table that it refers to, as if the subquery were evaluated on each row of the table in the outer query
    • uncorrelated - a type of subquery where the inner query doesn’t depend on the outer query for its execution.

      • It is an independent query, the results of which are returned to and used by the outer query once (not per row).
      -- Uncorrelated subquery:
      -- the inner query only references table2, so it runs once
      select c1, c2
        from table1 where c1 = (select max(x) from table2);
      
      -- Correlated subquery:
      -- the inner query references table1 (table1.c2), so it is
      -- evaluated for each row of the outer query
      select c1, c2
        from table1 where c1 = (select x from table2 where y = table1.c2);
  • Functions - Execute at a different level of priority and are handled differently than Views. You will likely see better performance.
  • Idempotent - Providing the same input should always produce the same output, e.g. a plain INSERT is not idempotent because executing the command with the same input a second time will trigger an error.
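    • Example (a sketch, postgres syntax, reusing the tags table from the psql example above):

      -- Not idempotent: running this twice raises a duplicate-key error
      INSERT INTO tags (id, name) VALUES (1, 'sql');
      
      -- Idempotent: the second run is a no-op
      INSERT INTO tags (id, name) VALUES (1, 'sql')
      ON CONFLICT (id) DO NOTHING;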
  • Index - A quick lookup table (e.g. field or set of fields) for finding records users need to search frequently. An index is small, fast, and optimized for quick lookups. It is very useful for connecting the relational tables and searching large tables. (also see DB, Engineering >> Cost Optimizations)
  • Migrations (schema) - A version control system for your database schema. Management of incremental, reversible changes and version control to relational database schemas. A schema migration is performed on a database whenever it is necessary to update or revert that database’s schema to some newer or older version.
  • Physical Ordering - A PostgreSQL table consists of one or more files of 8KB blocks (or “pages”). The order in which the rows are stored in the file is the physical ordering.
  • Predicate - Defines a logical condition being applied to rows in a table. (e.g. IN, EXISTS, BETWEEN, LIKE, ALL, ANY)
  • Scalar/Non-Scalar Subqueries
    • A scalar subquery returns a single value (one column of one row). If no rows qualify to be returned, the subquery returns NULL.
    • A non-scalar subquery returns 0, 1, or multiple rows, each of which may contain 1 or multiple columns. For each column, if there is no value to return, the subquery returns NULL. If no rows qualify to be returned, the subquery returns 0 rows (not NULLs).
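    • Example (a sketch; employees(name, salary) is a hypothetical table):

      -- Scalar subquery: returns exactly one value, so it can be used like a constant
      SELECT name,
             salary - (SELECT AVG(salary) FROM employees) AS diff_from_avg
      FROM employees;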
  • Selectivity - The fraction of rows in a table or partition that is chosen by the predicate
    • Refers to the quality of a filter in its ability to reduce the number of rows that will need to be examined and ultimately returned
      • With a high selectivity, using the primary key or indexes to get right to the rows of interest
      • With a low selectivity, a full table scan would likely be needed to get the rows of interest.
    • Higher selectivity means: more unique data; fewer duplicates; fewer rows per key value
    • Used to estimate the cost of a particular access method; it is also used to determine the optimal join order. A poor choice of join order by the optimizer could result in a very expensive execution plan.
  • Soft-deleted - An operation in which a flag is used to mark data as unusable, without erasing the data itself from the database
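    • Example (a sketch; the users table and is_deleted flag are hypothetical):

      -- Soft delete: flag the row instead of removing it
      UPDATE users SET is_deleted = TRUE WHERE id = 42;
      -- Subsequent queries filter on the flag
      SELECT * FROM users WHERE NOT is_deleted;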
  • Surrogate key - Very similar to a primary key in that it is a unique value for an object in a table. However, rather than being derived from actual data present in the table, it is a generated field. It has no business meaning like a natural key does and is only used for data analysis purposes. Can be generated using different columns that already exist in your table or, more often, from two or more tables. dbt function definition
    • Examples: PostgreSQL serial column, Oracle sequence column, or MySQL auto_increment column, Snowflake _file + _line columns
    • Example: Each employee id is concatenated with a department id (e.g. marketing or finance)
  • Transaction - A set of queries tied together such that if one query fails, the entire set of queries are rolled back to a pre-query state if the situation dictates.
    • A database transaction, by definition, must be atomic, consistent, isolated and durable. These are popularly known as ACID properties.  These properties can ensure the concurrent execution of multiple transactions without conflict.
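    • Example (a sketch of an atomic transfer; the accounts table is hypothetical):

      BEGIN;
      UPDATE accounts SET balance = balance - 100 WHERE id = 1;
      UPDATE accounts SET balance = balance + 100 WHERE id = 2;
      COMMIT; -- or ROLLBACK; to undo both updates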
  • Views - Database objects that represent saved SELECT queries in “virtual” tables.
    • Contains a query plan.  For each query executed, the planner has to evaluate what’s being asked and calculate an optimal path. Views already have this plan calculated so it allows subsequent queries to be returned with almost no friction in processing aside from data retrieval.
    • Some views are updateable but under certain conditions (1-1 mapping of rows in view to underlying table, no group_by, etc.)

Basics

Misc

  • DROP TABLE <table name> - Deletes table from database. Used to remove obsolete or redundant tables from the database.
  • Comments
    • Single Line

      -- Select all:
      SELECT * FROM Customers;
      
      SELECT * FROM Customers -- WHERE City='Berlin';
      • Any text between -- and the end of the line will be ignored (will not be executed)
    • Multi-Line

      /*Select all the columns
      of all the records
      in the Customers table:*/
      SELECT * FROM Customers;

Create Tables

  • If you don’t include the schema as part of the table name (e.g. schema_name.table_name), pgAdmin automatically places it into the “public” schema directory

  • Field Syntax: name, data type, constraints

  • Example: Create table as select (CTAS)

    CREATE TABLE new_table AS 
    SELECT * 
    FROM old_table 
    WHERE condition;
  • Example: Table 1 (w/primary key)

    DROP TABLE IF EXISTS classrooms CASCADE;
    CREATE TABLE classrooms (
        id INT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
        teacher VARCHAR(100)
        );
     -- OR
    CREATE TABLE classrooms ( 
        id INT GENERATED ALWAYS AS IDENTITY, 
        teacher VARCHAR(100),
        PRIMARY KEY(id)
        );
    • “classrooms” is the name of the table; “id” and “teacher” are the fields

    • CASCADE - postgres won’t delete the table if other tables point to it; CASCADE overrides that safeguard.

    • GENERATED ALWAYS AS IDENTITY - makes it so you don’t have to keep track of which “id” values have been used when adding rows. You can omit the value for “id” and just add the values for the other fields

      • Also seen id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
    • See tutorial for options, usage, removing, adding, etc. this constraint

      INSERT INTO classrooms
          (teacher)
      VALUES
          ('Mary'),
          ('Jonah');
      • Also see Add Data >> Example: chatGPT
  • Example: Table 2 (w/foreign key)

    DROP TABLE IF EXISTS students CASCADE;
    CREATE TABLE students (
        id INT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
        name VARCHAR(100),
        classroom_id INT,
        CONSTRAINT fk_classrooms
            FOREIGN KEY(classroom_id)
            REFERENCES classrooms(id)
    );
    • “students” is the name of the table; “id”, “name”, and “classroom_id” are the fields
    • Create a foreign key that points to the “classrooms” table
      • Expression

        • fk_classrooms is the name of the CONSTRAINT
        • “classroom_id” is the field that will be the FOREIGN KEY
        • REFERENCES points the foreign key to the classrooms table’s primary key, “id”
          • foreign keys can point to any table
      • Postgres won’t allow you to insert a row into students with a “classroom_id” that doesn’t exist in the “id” field of classrooms but will allow you to use a NULL placeholder

        -- Explicitly specify NULL
        INSERT INTO students
            (name, classroom_id)
        VALUES
            ('Dina', NULL); 
        
        -- Implicitly specify NULL
        INSERT INTO students
            (name)
        VALUES
            ('Evan');
        • Also see Add Data >> Example: chatGPT
  • Example

    CREATE TABLE members (
        id serial primary key,
        second_name character varying(200) NOT NULL,
        date_joined date NOT NULL DEFAULT current_date,
        member_id integer references members(id),
        booking_start_time timestamp without time zone NOT NULL
    );
    • The “serial” data type does the same thing as GENERATED ALWAYS AS IDENTITY (see first example), but is NOT compliant with the SQL standard. Use GENERATED ALWAYS AS IDENTITY
    • “references” seems to be another old way to create foreign keys (see 2nd example for proper way)
    • “character varying” - variable-length with limit (e.g. limit of 200 characters)
      • character(n), char(n) are for fixed character lengths; text is for unlimited character lengths
    • “current_date” is a function that will insert the current date as a value
    • “timestamp without time zone” is literally that
      • also available: time with/without time zone, date, interval (see Docs for details)
  • Example: MySQL

    CREATE DATABASE products;
    
    CREATE TABLE `products`.`prices` (
      `pid` int(11) NOT NULL AUTO_INCREMENT,
      `category` varchar(100) NOT NULL,
      `price` float NOT NULL,
      PRIMARY KEY (`pid`)
    );
    
    INSERT INTO products.prices
        (pid, category, price)
    VALUES
        (1, 'A', 2),
        (2, 'A', 1),
        (3, 'A', 5),
        (4, 'A', 4),
        (5, 'A', 3),
        (6, 'B', 6),
        (7, 'B', 4),
        (8, 'B', 3),
        (9, 'B', 5),
        (10, 'B', 2),
        (11, 'B', 1)
    ;

Add Data

  • Example: Add data via .csv

    COPY assignments(category, name, due_date, weight)
    FROM 'C:/Users/mgsosna/Desktop/db_data/assignments.csv'
    DELIMITER ','
    CSV HEADER;
    • assignments is the table; category, name, due_date, and weight are the fields that you want to import from the csv file
    • ** The order of the columns must be the same as the ones in the CSV file **
    • HEADER keyword to indicate that the CSV file contains a header
    • Might need to have superuser access in order to execute the COPY statement successfully

Update Table

  • Resources

  • Check if a table is updatable

    SELECT table_name, is_updatable
    FROM information_schema.views
    • Useful if some of the tables you are working with are missing values that you need to add
    • If not updatable, then you’ll need to contact the database administrator to request permission to update that specific table
  • Update target table by transaction id (BQ)(link)

    insert target_table (transaction_id)
      select transaction_id 
      from source_table 
      where transaction_id > (select max(transaction_id) from target_table)
    ;
    • Might not be possible with denormalized star-schema datasets in modern data warehouses.
  • Update Target Table by transaction date (BQ) (link)

    merge last_online t
    using (
      select
          user_id,
          last_online
      from
        (
            select
                user_id,
                max(timestamp) as last_online
    
            from 
                connection_data
            where
                date(_partitiontime) >= date_sub(current_date(), 
                                                 interval 1 day)
            group by
                user_id
    
        ) y
    
    ) s
    on t.user_id = s.user_id
    when matched then
      update set last_online = s.last_online, 
                 user_id = s.user_id
    when not matched then
      insert (last_online, user_id) 
        values (last_online, user_id)
    ;
    select * from last_online
    ;
    • MERGE performs UPDATE, DELETE, and INSERT
      • The UPDATE or DELETE clause is applied when a source row matches a target row.
      • The INSERT clause is applied when a source row has no match in the target.
      • An UPDATE or DELETE clause can also be applied when a target row has no match in the source.
    • _partitiontime is a field BQ creates to record the row’s ingestion time (See Google, Big Query >> Optimization >> Partitions)
  • Update Target Table with Source Data (link)

    MERGE INTO target_table tgt
    USING source_table src 
     ON tgt.customer_id = src.customer_id
    WHEN MATCHED THEN
     UPDATE SET
       tgt.is_active = src.is_active,
       tgt.updated_date = '2024-04-01'::DATE
    WHEN NOT MATCHED THEN
     INSERT
       (customer_id, is_active, updated_date)
     VALUES
     (src.customer_id, src.is_active, '2024-04-01'::DATE)
    ; 
    • The statement uses the MERGE keyword to conditionally update or insert rows into a target table based on a source table.
    • It matches rows between the tables using the ON clause and the customer_id column.
    • The WHEN MATCHED THEN clause specifies the update actions for matching rows.
    • The WHEN NOT MATCHED THEN clause specifies the insert actions for rows that don’t have a match in the target table.
    • The ::DATE cast ensures that the updated_date value is treated as a date.

Delete Rows

  • DELETE CASCADE
    • Deletes related rows in child tables when a parent row is deleted from the parent table

    • Apply cascade to foreign key in the child table

      CREATE TABLE parent_table(
          id SERIAL PRIMARY KEY,
          ...
      );
      
      CREATE TABLE child_table(
          id SERIAL PRIMARY KEY,
          parent_id INT,
          FOREIGN KEY(parent_id) 
             REFERENCES parent_table(id)
             ON DELETE CASCADE
      );
      • In the child table, the parent_id is a foreign key that references the id column of the parent_table.

      • The ON DELETE CASCADE is the action on the foreign key that will automatically delete the rows from the child_table whenever corresponding rows from the parent_table are deleted.

      • Example

        DELETE FROM parent_table 
        WHERE id = 1;
        • id = 1 in parent_table corresponds to parent_id = 1 in child_table. Therefore, all rows matching those conditions in both tables will be deleted.

Subqueries

  • **Using CTEs instead of subqueries makes code more readable**

    • Subqueries make it difficult to understand their context in the larger query
    • The only way to debug a subquery is by turning it into a CTE or pulling it out of the query entirely.
    • CTEs and subqueries have a similar runtime, but subqueries make your code more complex for no reason.
  • Notes from How to Use SubQueries in SQL

    • Also shows the alt method of creating a temporary table to compute the queries
  • Use cases

    • Filtering rows from a table with the context of another.
    • Performing double-layer aggregations such as average of averages or an average of sums.
    • Accessing aggregations with a subquery.
  • Tables used in examples

    • Store A (store_a)
      • Store B is similar
  • Example: Filtering rows

    select * 
    from sandbox.store_b
    where product_id IN (
        select product_id
        from sandbox.store_b
        group by product_id 
        having count(product_id) >= 3
    );
    • filters the rows with products that have been bought at least three times in store_b
  • Example: Multi-Layer Aggregation

    select avg(average_price.total_value) as average_transaction from (
      select transaction_id, sum(price_paid) as total_value
      from sandbox.store_a
      group by transaction_id
      ) as average_price
    ;
    • computes the average of all transactions
    • can’t apply an average directly, as our table is oriented to product_ids and not to transaction_ids
  • Example: Filtering the table based on an Aggregation

    select @avg_transaction:= avg(agg_table.total_value)
    from (
      select transaction_id, sum(price_paid) as total_value
      from sandbox.store_a
      group by transaction_id
    ) as agg_table;
    
    select * 
    from sandbox.store_a
    where transaction_id in (
      select transaction_id
      from sandbox.store_a
      group by transaction_id
      having sum(price_paid) > @avg_transaction
    )
    • filters transactions that have a value higher than the average (where the output must retain the original product-oriented row)

Joins


  • Cross Join - produces the Cartesian product of the two tables, i.e. every row of one table is paired with every row of the other (acts like tidyr::expand_grid)

    • Efficient join

    • When you add the where clause, the cross join acts similarly to an inner join, except you aren’t joining it on any specified column

    • Example:

      SELECT
        schedule.event,
        calendar.number_of_days
      FROM schedule
      CROSS JOIN calendar
      WHERE schedule.total_number < calendar.number_of_days
      • Only join the row in the “schedule” table with the rows in the “calendar” table that meet the specified condition
  • Natural Join - you don’t need to specify the join columns; the two tables must share columns with the same name, which are used as the join keys

    • Use cases
      • There are a lot of common columns with the same name across multiple tables
        • They will all be used as joining keys.
      • You don’t want to type out all of the common columns in select just to avoid outputting the same columns multiple times.
    select *
    from table_a
    natural join table_b
    ;
    
    -- natural + outer (OUTER requires LEFT/RIGHT/FULL)
    select *
    from table_a
    natural full outer join table_b
    ;

R

Misc

  • Thread regarding querying when permissions/privileges only allow partial access to a table (e.g. specific columns)

{dbplyr}

  • Docs

  • Queries will not be run and data will not be retrieved unless you ask for it: they all return a new tbl_dbi object.

  • Commands

    • Use compute() to run the query and save the results in a temporary table in the database
    • Use collect() to retrieve the results into R.
    • Use show_query() to see the query.
    • Use explain() to describe the query plan
  • Syntax for accessing tables

    con |> 
      tbl(I("catalog_name.schema_name.table_name"))
    con |> 
      tbl(I("schema_name.table_name"))
  • Queries

    • Query without bringing output into memory

      pixar_films <- tbl(con, "pixar_films")
      pixar_films |>
        summarize(.by = film_rating, n = n())
      
      #> # Source:   SQL [4 x 2]
      #> # Database: DuckDB v1.0.0
      #>   film_rating     n
      #>   <chr>       <dbl>
      #> 1 Not Rated       1
      #> 2 PG             10
      #> 3 G              13
      #> 4 N/A             3
    • Query and bring output into memory

      df_pixar_films_by_rating <-
        pixar_films |>
        count(film_rating) |>
        collect()
  • Joins

    • Two tables with the result outside of memory

      academy <- tbl(con, "academy")
      academy_won <-
        academy |>
        filter(status == "Won") |>
        count(film, name = "n_won")
      academy_won |>
        right_join(pixar_films, join_by(film)) |>
        mutate(n_won = coalesce(n_won, 0L)) |>
        arrange(release_date)
      
      #> # Source:     SQL [?? x 6]
      #> # Database:   DuckDB v1.0.0
      #> # Ordered by: release_date
      #>   film         n_won number release_…¹ run_t…²
      #>   <chr>        <dbl> <chr>  <date>       <dbl>
      #> 1 Toy Story        0 1      1995-11-22      81
      #> 2 A Bug's Life     0 2      1998-11-25      95
      #> 3 Toy Story 2      0 3      1999-11-24      92
      #> 4 Monsters, I…     1 4      2001-11-02      92
      #> 5 Finding Nemo     1 5      2003-05-30     100
      #> 6 The Incredi…     2 6      2004-11-05     115
      #> # ℹ more rows
      #> # ℹ abbreviated names: ¹release_date, …
    • Join a table and a tibble

      academy |>
        left_join(
          pixarfilms::pixar_films, 
          join_by(film), copy = TRUE)
      
      #> # Source:   SQL [?? x 7]
      #> # Database: DuckDB v1.0.0
      #>   film      award_t…¹ status number release_…²
      #>   <chr>     <chr>     <chr>  <chr>  <date>    
      #> 1 Toy Story Animated… Award… 1      1995-11-22
      #> 2 Toy Story Original… Nomin… 1      1995-11-22
      #> 3 Toy Story Adapted … Ineli… 1      1995-11-22
      #> 4 Toy Story Original… Nomin… 1      1995-11-22
      #> 5 Toy Story Original… Nomin… 1      1995-11-22
      #> 6 Toy Story Other     Won S… 1      1995-11-22
      #> # ℹ more rows
      #> # ℹ abbreviated names: ¹award_type, …
      • A temporary table is created on the LHS database.

      • If the RHS comes from a different database, results are temporarily loaded into the local session

      • If the LHS is an in-memory tibble and the RHS is a lazy table, then the result is a tibble.

  • Get query from dplyr code

    tbl_to_sql <- function(tbl) {
      dplyr::show_query(tbl) |> 
        capture.output() |> 
        purrr::discard_at(1) |> 
        paste(collapse = " ")
    }

{DBI}

  • Other drivers: RPostgres::Postgres(), RPostgres::Redshift()

  • Connect to a generic database

    get_database_conn <- function() {
    
      conn <-
        DBI::dbConnect(
          drv = odbc::odbc(),
          driver = "driver name here",
          server = "server string here",
          UID = Sys.getenv("DB_USER"),
          PWD = Sys.getenv("DB_PASS"),
          port = "port number here"
        )
    
      return(conn)
    }
  • Connect to or Create a SQLite database

    con <- DBI::dbConnect(drv = RSQLite::SQLite(),
                          here::here("db_name.db"),
                          timeout = 10)
  • Connect to Microsoft SQL Server

    con <- DBI::dbConnect(odbc::odbc(), 
                          Driver = "SQL Server",
                          Server = "SERVER",
                          Database = "DB_NAME",
                          Trusted_Connection = "True",
                          Port = 1433)
  • Connect to MariaDB

    con <- dbConnect(
      RMariaDB::MariaDB(),
      dbname = "CORA",
      username = "guest",
      password = "ctu-relational",
      host = "relational.fel.cvut.cz"
    )
  • Close connection: dbDisconnect(con)

  • List Tables and Fields

    dbListTables(con)
    #> [1] "academy"         "box_office"     
    #> [3] "genres"          "pixar_films"    
    #> [5] "pixar_people"    "public_response"
    
    dbListFields(con, "box_office")
    #> [1] "film"                
    #> [2] "budget"              
    #> [3] "box_office_us_canada"
    #> [4] "box_office_other"    
    #> [5] "box_office_worldwide"
  • Cancel a running query (postgres)

    # Store PID
    pid <- DBI::dbGetInfo(conn)$pid
    
    # Cancel query and get control of IDE back
    # SQL command
    SELECT pg_cancel_backend(<PID>)
    • Useful if query is running too long and you want control of your IDE back
  • Create single tables from a list of tibbles to a database

    purrr::map2(table_names, list_of_tbls, ~ dbWriteTable(con, .x, .y))
  • Load all tables from a database into a list

    tables <- dbListTables(con)
    all_data <- map(tables, dbReadTable, conn = con)
    • Can use map_dfr if all the tables have the same columns
  • Queries

    sql <- "
    SELECT * FROM pixar_films
    WHERE release_date >= '2020-01-01'
    "
    # or with > R 4.0
    sql <- r"(
    SELECT * FROM "pixar_films"
    WHERE "release_date" >= '2020-01-01'
    )"
    
    dbGetQuery(con, sql)
    • dbExecute is for queries/procedures that do not return a result. It does return a numeric that specifies the number of rows affected by the statement.
  • Dynamic queries

    • Resources

      • Converting a SQL script into R, splitting the queries, and parameterizing them (article)
    • Example: MS SQL Server with {glue}

      vars <- c("columns", "you", "want", "to", "select")
      date_var <- 'date_col'
      start_date <- as.Date('2022-01-01')
      today <- Sys.Date()
      tablename <- "yourtablename"
      schema_name <- "yourschema"
      query <- glue_sql(.con = con, "SELECT TOP(10) {`vars`*} FROM {`schema_name`}.{`tablename`} ")
      DBI::dbGetQuery(con, query)
      • The {`vars`*} syntax collapses the vars vector, separated by commas, so that it resembles a SELECT column list
    • Example: Using params

      quoted_column <- "release_date"
      
      sql <- paste0("SELECT * FROM academy WHERE ", quoted_column, " = ?")
      
      dbGetQuery(con, sql,
        params = list(as.Date("2020-01-01"))
      )
      • ? acts as a placeholder that gets filled by the params argument

PRQL

  • Docs

  • {prqlr}

  • A dplyr + SQL hybrid language

  • Using with DuckDB

    library(prqlr); library(duckdb)
    con <- dbConnect(duckdb(), dbdir = ":memory:")
    dbWriteTable(con, "mtcars", mtcars)
    "from mtcars | filter cyl > 6 | select {cyl, mpg}" |> 
      prql_compile() |> 
      dbGetQuery(conn = con)

{dm}

  • See Databases, Normalization >> Misc >> Packages >> {dm}

  • Connect to all tables and get the number of rows

    con <- dbConnect(
      RMariaDB::MariaDB(),
      dbname = "CORA",
      username = "guest",
      password = keyring::key_get(...),
      host = "relational.fel.cvut.cz"
    )
    # 3 tables: paper, content, cites
    dm <- dm_from_con(con)
    dm |>
      dm_nrow()
    
    #>  paper content   cites 
    #>   2708   49216    5429
  • Subset one table and do stuff to it

    pixar_dm |>
      dm_zoom_to(academy) |>
      left_join(
        pixar_films, select = release_date)
  • Join all related tables into one wide table

    pixar_dm |>
      dm_flatten_to_tbl(academy)

Best Practices

  • Misc
    • Resources
    • Use aliases only when table names are long enough so that using them improves readability (but choose meaningful aliases)
    • Do not use SELECT *. Explicitly list columns instead
    • Use comments to document business logic
    • A comment at the top should provide a high-level description
    • Use an auto-formatter
    • General Optimizations
      • Removing duplicates or filtering out null values at the beginning of your model will speed up queries
      • Replace complex code with window functions
        • Example: Replace GROUP_BY + TOP with a partition + FIRST_VALUE()

          FIRST_VALUE(test_score) OVER(PARTITION BY student_name ORDER BY test_score DESC)
        • Example: AVG(test_score) OVER(PARTITION BY student_name)

  • CTEs
    • Break down logic in CTEs using WITH … AS
    • The SELECT statement inside each CTE must do a single thing (join, filter or aggregate)
    • The CTE name should provide a high-level explanation
    • The last statement should be a SELECT statement querying the last CTE
  • Joins
    • Use WHERE for filtering, not for joining
    • Favor LEFT JOIN over INNER JOIN; in most cases, it’s essential to know the distribution of NULLs
    • Avoid using “self-joins.” Use window functions instead (see Google, BigQuery >> Optimization for details on self-joins)
    • When doing equijoins (i.e., joins where all conditions have the something=another form), use the USING keyword
    • Break up joins using OR into a UNION, because SQL uses nested operations for JOIN + OR queries, which slows things down (see the sketch below)
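      • A sketch of the rewrite (tables a and b and their join columns are hypothetical):

        -- Bad: JOIN + OR forces nested evaluation
        SELECT a.id, b.val
        FROM a
        JOIN b ON a.id = b.a_id OR a.alt_id = b.a_id;
        
        -- Good: split the conditions and UNION the results
        SELECT a.id, b.val FROM a JOIN b ON a.id = b.a_id
        UNION
        SELECT a.id, b.val FROM a JOIN b ON a.alt_id = b.a_id;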
  • Style Guide

Indexes

  • Misc

    • An index may consist of up to 16 columns
    • The first column of the index must always be present in the query’s filter, order, join, or group operations for the index to be used
  • Create Index on an existing table (postgres)

    CREATE INDEX
        score_index ON grades(score, student);
    • “score_index” is the name of the index
    • “grades” is the name of the table
    • “score” and “student” are fields to be used as the indexes
  • Create Index that only uses a specific character length

    /* mysql */
    CREATE TABLE test (blob_col BLOB, INDEX(blob_col(10)));
    • index only uses the first 10 characters of the column value of a BLOB column type
  • Create index with multiple columns

    /* mysql */
    CREATE TABLE test (
        id        INT NOT NULL,
        last_name  CHAR(30) NOT NULL,
        first_name CHAR(30) NOT NULL,
        PRIMARY KEY (id),
        INDEX name (last_name,first_name)
    )
  • Usage of multiple column index (** order of columns is important **)

    SELECT * FROM test WHERE last_name='Jones';
    SELECT * FROM test
      WHERE last_name='Jones' AND first_name='John';
    SELECT * FROM test
      WHERE last_name='Jones'
      AND (first_name='John' OR first_name='Jon');
    SELECT * FROM test
      WHERE last_name='Jones'
      AND first_name >='M' AND first_name < 'N';
    • Index is used when both columns are used as part of filtering criteria or when only the left-most column is used

    • if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).

  • Invalid usage of multiple column index

    SELECT * FROM test WHERE first_name='John';
    SELECT * FROM test
      WHERE last_name='Jones' OR first_name='John';
    • The “name” index won’t be used in these queries since
      • first_name is NOT the left-most column specified in the index
      • OR is used instead of AND
  • Create index with DESC, ASC

    /* mysql */
    CREATE TABLE t (
      c1 INT, c2 INT,
      INDEX idx1 (c1 ASC, c2 ASC),
      INDEX idx2 (c1 ASC, c2 DESC),
      INDEX idx3 (c1 DESC, c2 ASC),
      INDEX idx4 (c1 DESC, c2 DESC)
    );
    • Used by ORDER BY
      • See Docs to see what operations and index types support Descending Indexes
    • Note: an index idx_a on (column_p, column_q DESC) is not the same as an index on (column_q DESC, column_p) or an index idx_b on (column_p DESC, column_q)
  • Usage of Descending Indexes

    ORDER BY c1 ASC, c2 ASC    -- optimizer can use idx1
    ORDER BY c1 DESC, c2 DESC  -- optimizer can use idx4
    ORDER BY c1 ASC, c2 DESC  -- optimizer can use idx2
    ORDER BY c1 DESC, c2 ASC  -- optimizer can use idx3
    • See previous example for definition of idx* names
    • See Docs to see what operations and index types support Descending Indexes

Partitioning

Views

  • A smaller data object that contains the subset of data resulting from a specific query

  • If a table structure changes, the database can flag affected views for review.

  • Types

    • Materialized Views
      • Stores data and require refreshing
      • Whereas a query happens after data is loaded, a materialized view is a precomputation (Basically, it’s a cached result)
      • Can offer better performance for complex queries
    • Standard Views
      • Essentially saved queries that run each time you access the view. This means that when you query a standard view, you’re always seeing the current data in the underlying tables.
      • Not separate storage
  • For materialized views in postgres, you just need to replace VIEW with MATERIALIZED VIEW in the commands

  • Create a View

    • Example: Create view as select (CVAS)

      CREATE VIEW high_earner AS
      SELECT p.id AS person_id, j.salary
      FROM People p
      JOIN Job j
      ON p.job = j.title
      WHERE j.salary >= 200000;
  • Query a view (same as a table): SELECT * FROM high_earner

  • Update view

    CREATE OR REPLACE VIEW high_earner AS
    SELECT p.id AS person_id, j.salary
    FROM People p
    JOIN Job j
    ON p.job = j.title
    WHERE j.salary >= 150000;
    • Expects the query output to retain the same number of columns, column names, and column data types. Thus, any modification that results in a change in the data structure will raise an error.
  • Change view name or a column name

    ALTER VIEW foo RENAME TO bar;
    ALTER VIEW foo RENAME COLUMN moose TO squirrel;
  • Refresh Materialized View

    • Non-Concurrently

      REFRESH MATERIALIZED VIEW view_name;
      • Might block users while view is being refreshed
      • Best for cases in which a lot of rows are affected.
    • Concurrently

      CREATE UNIQUE INDEX idx_view_name ON view_name (column1, column2);
      REFRESH MATERIALIZED VIEW CONCURRENTLY view_name;
      • Refreshes while not blocking other people using this view (i.e. concurrent access to view)
      • Best for cases in which only a small number of rows are affected
      • Requires index
  • List views

    SELECT * 
    FROM information_schema.views
    WHERE table_schema NOT IN ('pg_catalog', 'information_schema');
    • “table_name” has the names of the views
    • “view_definition” shows the query stored in the view
    • WHERE command is included to omit built-in views from PostgreSQL.
  • Delete view: DROP VIEW high_earner;

Variables

  • Misc
  • User-defined
    • DECLARE and SET

      -- Declare your variables
      DECLARE @start date
      DECLARE @stop date
      -- SET the relevant values for each variable
      SET @start = '2021-06-01'
      SET @stop = GETDATE()
      • DECLARE sets the variable type (e.g. date)
      • SET assigns a value
    • Or just use DECLARE

      DECLARE @Iteration Integer = 0;
    • Examples

      • Example: Exclude 3 months of data from the query

        SELECT t1.[DATETIME], COUNT(*) AS vol
        FROM Medium.dbo.Earthquakes t1
        WHERE t1.[DATETIME] BETWEEN @start AND DATEADD(MONTH, -3, @stop)
        GROUP BY t1.[DATETIME]
        ORDER BY t1.[DATETIME] DESC;
        • See above for the definitions of @start and @stop
      • Example: Apply a counter

        -- Declare the variable (a SQL Command, the var name, the datatype)
        DECLARE @counter INT;
        -- Set the counter to 20
        SET @counter = 20;
        -- Print the initial value
        SELECT @counter AS _COUNT;
        -- Select and increment the counter by one
        SELECT @counter = @counter + 1;
        -- Print variable
        SELECT @counter AS _COUNT;
        -- Select and increment the counter by one
        SELECT @counter += 1;
        -- Print the variable
        SELECT @counter AS _COUNT;
  • System
    • @@ROWCOUNT - returns the number of rows affected by the most recently executed statement
      • Example

        BEGIN
            SELECT
                product_id,
                product_name
            FROM
                production.products
            WHERE
                list_price > 100000;
            IF @@ROWCOUNT = 0
                PRINT 'No product with price greater than 100000 found';
        END

Functions

  • “||” - Concatenates strings, e.g. 'Post' || 'greSQL' -> PostgreSQL

  • BEGIN...END - defines a compound statement or statement block. A compound statement consists of a set of SQL statements that execute together. A statement block is also known as a batch

    • A compound statement can have a local declaration for a variable, a cursor, a temporary table, or an exception
      • Local declarations can be referenced by any statement in that compound statement, or in any compound statement nested within it.
      • Local declarations are invisible to other procedures that are called from within a compound statement
  • COMMIT - a transaction control language statement used to permanently save the changes done in the transaction to tables/databases. The database cannot regain its previous state after a commit is executed.

  • DATEADD - adds units of time to a variable or value

    • e.g. DATEADD(month, -3, '2021-06-01')
      • subtracts 3 months from 2021-06-01
  • DATE_TRUNC - truncates a date/timestamp to the specified precision (e.g. month), zeroing out the finer-grained components.

    • e.g. date_trunc('month', date_var) as month
  • DENSE_RANK - similar to RANK, but it does not skip any numbers even if there is a tie between the rows.

    • Values are ranked by the column specified in ORDER BY expression of the window function
  • EXPLAIN - a means of running your query as a what-if to see what the planner thinks about it. It will show the process the system goes through to get to the data and return it.

    EXPLAIN
    SELECT
        s.id AS student_id,
        g.score
    FROM
        students AS s
    LEFT JOIN
        grades AS g
        ON s.id = g.student_id
    WHERE
        g.score > 90
    ORDER BY
        g.score DESC;
    /*
    QUERY PLAN
    ----------
    Sort (cost=80.34..81.88 rows=617 width=8)
    [...] Sort Key: g.score DESC
    [...] -> Hash Join (cost=16.98..51.74 rows=617 width=8)
    [...] Hash Cond: (g.student_id = s.id)
    [...] -> Seq Scan on grades g (cost=0.00..33.13 rows=617 width=8)
    [...] Filter: (score > 90)
    [...] -> Hash (cost=13.10..13.10 rows=310 width=4)
    [...] -> Seq Scan on students s (cost=0.00..13.20 rows=320 width=4)
    */
    • Sequentially scanning (“Seq Scan”) the grades and students tables because the tables aren’t indexed
      • Any Seq Scan, parallel or not, is sub-optimal
    • EXPLAIN (BUFFERS) also shows how many data pages the database had to fetch using slow disk read operations (“read”), and how many of them were cached in memory (“shared hit”)
    • Also see Databases, Engineering >> Cost Optimizations
  • EXPLAIN ANALYZE - tells the planner to not only hypothesize on what it would do, but actually run the query and show the results.

  • GETDATE() - Gets the current date and time (SQL Server)

  • GO - Not a SQL function. Used by some interpreters as a batch separator/reset.

    • i.e. any variables set before the GO statement will now not be recognized by the interpreter.
    • Helps to separate code into different sections
  • ISDATE - boolean - checks that a variable is a date type

  • QUALIFY - clause filters the results of window functions.

    • useful when answering questions like fetching the most XXX value of each category
    • QUALIFY does for window functions what HAVING does for GROUP BY. As a result, in the order of execution, QUALIFY is evaluated after window functions.
    • See Business Queries >> Get the latest order from each customer
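    • Example (a sketch; orders(customer_id, order_id, order_date) is a hypothetical table; QUALIFY is supported by e.g. Snowflake, BigQuery, DuckDB, Teradata):

      -- Latest order per customer
      SELECT customer_id, order_id, order_date
      FROM orders
      QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id
                                 ORDER BY order_date DESC) = 1;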
  • UNION - Joins the outputs of two separate SELECT statements or tables, keeping the rows from both. It retains only one occurrence of duplicated rows if there are any.

    • i.e. It’s like if you row_bind two dataframes and then only keep the unique rows.
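    • Example (a sketch; customers and suppliers are hypothetical tables):

      -- One deduplicated list of cities from both tables
      SELECT city FROM customers
      UNION
      SELECT city FROM suppliers;
      -- Use UNION ALL instead to keep duplicate rows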
  • UNNEST - BigQuery - takes an ARRAY and returns a table with a row for each element in the ARRAY (docs)

User Defined Functions (UDF)

  • Misc

    • Available in SQL Server (Docs1, Docs2), Postgres (Docs), BigQuery (docs), etc.
    • Keep a dictionary with the UDFs you’ve created and make sure to share it with any collaborators.
    • Can be persistent or temporary
      • Persistent UDFs can be used across multiple queries, while temporary UDFs only exist in the scope of a single query
  • Create

    • Example: temporary udf (BQ)

      CREATE TEMP FUNCTION AddFourAndDivide(x INT64, y INT64)
      RETURNS FLOAT64
      AS (
        (x + 4) / y
      );
      SELECT
        val, AddFourAndDivide(val, 2)
      FROM
        UNNEST([2,3,5,8]) AS val;
    • Example: persistent udf (BQ)

      CREATE FUNCTION mydataset.AddFourAndDivide(x INT64, y INT64)
      RETURNS FLOAT64
      AS (
        (x + 4) / y
      );
      
      SELECT
        val, mydataset.AddFourAndDivide(val, 2)
      FROM
        UNNEST([2,3,5,8,12]) AS val;
  • Delete persistent udf: DROP FUNCTION <udf_name>

  • With Scalar subquery (BQ)

    CREATE TEMP FUNCTION countUserByAge(userAge INT64)
    AS (
      (SELECT COUNT(1) FROM users WHERE age = userAge)
    );
    SELECT
      countUserByAge(10) AS count_user_age_10,
      countUserByAge(20) AS count_user_age_20,
      countUserByAge(30) AS count_user_age_30;

Loops

  • Misc
    • postgres docs for loops
  • WHILE
    • Example: Incrementally add to a counter variable

      -- Declare the initial value
      DECLARE @counter INT;
      SET @counter = 20;
      -- Print initial value
      SELECT @counter AS _COUNT;
      -- Create a loop
      WHILE @counter < 30
      -- Loop code starting point
      BEGIN
          -- Increment the counter by one each iteration
          SELECT @counter = @counter + 1;
      -- Loop finish
      END;
      -- Check the value of the variable
      SELECT @counter AS _COUNT;
  • Cursors (Docs)
    • Rather than executing a whole query at once, it is possible to set up a cursor that encapsulates the query, and then read the query result a few rows at a time.

      • One reason for doing this is to avoid memory overrun when the result contains a large number of rows. (However, PL/pgSQL users do not normally need to worry about that, since FOR loops automatically use a cursor internally to avoid memory problems.)
      • A more interesting usage is to return a reference to a cursor that a function has created, allowing the caller to read the rows. This provides an efficient way to return large row sets from functions.
    • Example (article (do not pay attention dynamic sql. it’s for embedding sql in C programs))

      DECLARE
          cur_orders CURSOR FOR 
              SELECT order_id, product_id, quantity
              FROM order_details
              WHERE product_id = 456;
          -- variables to FETCH the cursor's columns into
          v_order_id INTEGER;
          v_product_id INTEGER;
          v_quantity INTEGER;
          product_inventory INTEGER;
      BEGIN
          OPEN cur_orders;
          LOOP
              FETCH cur_orders INTO v_order_id, v_product_id, v_quantity;
              EXIT WHEN NOT FOUND;
              SELECT inventory INTO product_inventory FROM products WHERE product_id = 456;
              product_inventory := product_inventory - v_quantity;
              UPDATE products SET inventory = product_inventory WHERE product_id = 456;
          END LOOP;
          CLOSE cur_orders;
          -- do something after updating the inventory, such as logging the changes
      END;
      • A table called “products” contains information about all products, including the product ID, product name, and current inventory. A cursor can be used to iterate through all orders that contain a specific product and update its inventory.
      • The cursor “cur_orders” selects all order details that contain a specific product ID. The variable “product_inventory” stores the current inventory of the product.
      • Inside the loop, we fetch each order ID, product ID, and quantity from the cursor, subtract the quantity from the current inventory and update the products table with the new inventory value.
      • Finally, we close the cursor and do something after updating the inventory, such as logging the changes.

Window Functions

  • Unlike GROUP BY, keeps original columns after an aggregation

    • Allows you to work with both aggregate and non-aggregate values all at once
      • Better performance than using GROUP BY + JOIN to get the same result
  • Despite the order of operations, if you really need to have a window function inside a WHERE clause or GROUP BY clause, you may get around this limitation by using a subquery or a WITH query

    • Example: Remove duplicate rows

      WITH temporary_employees as
      (SELECT 
        employee_id,
        employee_name,
        department,
        ROW_NUMBER() OVER(PARTITION BY employee_name,
                                      department,
                                      employee_id) as row_count
      FROM Dummy_employees)
      
      SELECT *
      FROM temporary_employees
      WHERE row_count = 1
  • 3 Positional Window Functions

    • LEAD() will give you the row AFTER the row you are finding a value for.
    • LAG() will give you the row BEFORE the row you are finding a value for.
    • FIRST_VALUE() returns the first value in an ordered, partitioned data output.
  • General Syntax
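
    • A generic template (reconstructed from the definitions below; ORDER BY and ROWS are optional):

      window_function(expression) OVER (
          PARTITION BY partition_list
          ORDER BY order_list
          ROWS frame_clause
      )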

    • window_function is the name of the window function we want to use (e.g. see above)
      • expression is the name of the column that we want the window function operated on.
        • May not be necessary depending on what window_function is used
      • OVER signifies that this is a window function
    • PARTITION BY divides the rows into partitions so we can specify which rows to use to compute the window function
      • partition_list is the name of the column(s) we want to partition by (i.e. group_by)
    • ORDER BY is used so that we can order the rows within each partition. This is optional and does not have to be specified
      • order_list is the name of the column(s) we want to order by
    • ROWS (optional; typically not used) used to subset the rows within each partition.
      • frame_clause defines how much to offset from our current row
      • Syntax: ROWS BETWEEN <starting_row> AND <ending_row>
        • Options for starting and ending row
          • UNBOUNDED PRECEDING — all rows before the current row in the partition, i.e. the first row of the partition
          • [some #] PRECEDING — # of rows before the current row
          • CURRENT ROW — the current row
          • [some #] FOLLOWING — # of rows after the current row
          • UNBOUNDED FOLLOWING — all rows after the current row in the partition, i.e. the last row of the partition
        • Examples
          • ROWS BETWEEN 3 PRECEDING AND CURRENT ROW — this means look back the previous 3 rows up to the current row.
          • ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING — this means look from the first row of the partition to 1 row after the current row
          • ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING — this means look back the previous 5 rows up to 1 row before the current row
          • ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING — this means look from the first row of the partition to the last row of the partition
  • Example: Average Salary by Job Title
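
    • A sketch (employees(job_title, salary) is a hypothetical table):

      SELECT job_title,
             salary,
             AVG(salary) OVER (PARTITION BY job_title) AS avg_salary
      FROM employees;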

  • Tables for Examples

  • Example: Average Unit Price for each CustomerId

    SELECT CustomerId, 
          UnitPrice, 
          AVG(UnitPrice) OVER (PARTITION BY CustomerId) AS "AvgUnitPrice"
    FROM [Order]
    INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
  • Example: Average Unit Price for each group of CustomerId AND EmployeeId

    SELECT CustomerId, 
          EmployeeId, 
          AVG(UnitPrice) OVER (PARTITION BY CustomerId, EmployeeId) AS "AvgUnitPrice"
    FROM [Order]
    INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
  • Example: Create a new column that ranks Unit Price in descending order for each CustomerId

    SELECT CustomerId, 
          OrderDate, 
          UnitPrice, 
          ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY UnitPrice DESC) AS "UnitRank"
    FROM [Order]
    INNER JOIN OrderDetail 
    ON [Order].Id = OrderDetail.OrderId
    • Substituting RANK in place of ROW_NUMBER should produce the same results
    • Note that ranks are skipped (e.g. rank 3 for ALFKI) when there are rows with the same rank
      • If you don’t want ranks skipped, use DENSE_RANK for the window function
  • Example: Create a new column that provides the previous order date’s Quantity for each ProductId

    SELECT ProductId, 
          OrderDate, 
          Quantity, 
          LAG(Quantity) OVER (PARTITION BY ProductId ORDER BY OrderDate) AS "LAG"
    FROM [Order]
    INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
    • Use LEAD for the following quantity
  • Example: Create a new column that provides the very first Quantity ever ordered for each ProductId

    SELECT ProductId, 
          OrderDate, 
          Quantity, 
          FIRST_VALUE(Quantity) OVER (PARTITION BY ProductId ORDER BY OrderDate) AS "FirstValue"
    FROM [Order]
    INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
  • Example: Calculate a cumulative moving average UnitPrice for each CustomerId

    SELECT CustomerId, 
          UnitPrice, 
          AVG(UnitPrice) OVER (PARTITION BY CustomerId 
          ORDER BY CustomerId 
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "CumAvg"
    FROM [Order]
    INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
  • Example: Rank customers for each department by amount spent

    SELECT
        customer_name,
        customer_id,
        amount_spent,
        department_id,
        RANK() OVER(PARTITION BY department_id ORDER BY amount_spent DESC) AS spend_rank
    FROM employees
  • Example: Find the model and year of the car that has been on the lot the longest

    SELECT 
        FIRST_VALUE(name) OVER(PARTITION BY model, year ORDER BY date_at_lot ASC) AS oldest_car_name,
        model,
        year
    FROM cars
  • Running Totals/Averages (Cumulative Sums)

    • Uses SUM as the window function

      • Just replace SUM with AVG to get running averages
    • Example

      • Generate a new dataset grouped by month, instead of timestamp. (CTE)
        • Only include three fields: account_id, occurred_month and total_amount_usd
        • Only computed for the following accounts: 1041 , 1051, 1061, 10141.
      • Compute a running total ordered by occurred_month, without collapsing the rows in the result set.
        • Display 2 columns: occurred_month and cum_amnt_usd_by_month
      • Because no partition was specified, the running total is applied on the full dataset and ordered by (ascending) occurred_month
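      • A sketch of both steps (the source table orders(account_id, occurred_at, amount_usd) is assumed):

        WITH monthly AS (
          SELECT account_id,
                 DATE_TRUNC('month', occurred_at) AS occurred_month,
                 SUM(amount_usd) AS total_amount_usd
          FROM orders
          WHERE account_id IN (1041, 1051, 1061, 10141)
          GROUP BY 1, 2
        )
        SELECT occurred_month,
               SUM(total_amount_usd) OVER (ORDER BY occurred_month)
                 AS cum_amnt_usd_by_month
        FROM monthly;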
    • Example running total by grouping variable

      • Using previous CTE
        • Compute a running total by account_id, ordered by occurred_month, and account_id (i.e. a separate running total for each account_id.)
          • Display 3 columns: account_id, occurred_month, and cum_mon_amnt_usd_by_account
      • Same as previous example except a partition column (account_id) is added
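      • A sketch, reusing the monthly CTE from the sketch above:

        SELECT account_id,
               occurred_month,
               SUM(total_amount_usd)
                 OVER (PARTITION BY account_id ORDER BY occurred_month)
                 AS cum_mon_amnt_usd_by_account
        FROM monthly;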
    • Example Running total over various window lengths

      • Using previous CTE
        • Compute a 3 months rolling running total using a window that includes the current month.
        • Compute a 7 months rolling running total using a window where the current month is always the middle month.
      • First case uses 2 PRECEDING rows and the CURRENT ROW
      • Second case uses 3 PRECEDING rows, 3 FOLLOWING rows, and the CURRENT ROW (see the sketch below)
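      • A sketch of both window frames, reusing the monthly CTE:

        SELECT account_id,
               occurred_month,
               SUM(total_amount_usd) OVER (
                 PARTITION BY account_id ORDER BY occurred_month
                 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_3mo_total,
               SUM(total_amount_usd) OVER (
                 PARTITION BY account_id ORDER BY occurred_month
                 ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS rolling_7mo_total
        FROM monthly;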
    • Example: Calculate the number of consecutive days spent in each country (sqlite)

      with ordered as (
        select 
          created,
          country,
          lag(country) over (order by created desc)
            as previous_country
        from 
          raw
      ),
      grouped as (
        select 
          country, 
          created, 
          count(*) filter (
            where previous_country is null
            or previous_country != country
          ) over (
            order by created desc
            rows between unbounded preceding
            and current row
          ) as grp
        from 
          ordered
      )
      select
        country,
        date(min(created)) as start,
        date(max(created)) as end,
        cast(
          julianday(date(max(created))) -
          julianday(date(min(created))) as integer
        ) as days
      from 
        grouped
      group by
        country, grp
      order by
        start desc;
      • Post

        • Goes over the code and thought process step-by-step with shows original data and results during intermediate steps
      • Thread

        • Evidently only sqlite and postgres support filter. Someone in the thread suggests an alternate method.
      • Output:

        country         start         end           days
        United Kingdom  2023-06-08  2023-06-08  0
        United States   2019-09-02  2023-05-11  1347
        France          2019-08-25  2019-08-31  6
        Madagascar      2019-07-31  2019-08-07  7
        France          2019-07-25  2019-07-25  0
        United States   2019-05-04  2019-06-30  57
        United Kingdom  2018-08-29  2018-09-10  12
        United States   2018-08-05  2018-08-10  5

Common Table Expressions (CTE)

  • The result set of a query which exists temporarily and for use only within the context of a larger query. Much like a derived table, the result of a CTE is not stored and exists only for the duration of the query.

  • Use Cases

    • Needing to reference a derived table multiple times in a single query
    • An alternative to creating a view in the database
    • Performing the same calculation multiple times across multiple query components
  • Improves readability, usually with no performance difference

  • Steps

    • Initiate a CTE using “WITH”
    • Provide a name for the soon-to-be-defined result set
    • After assigning a name, follow with “AS”
    • Specify column names (optional step)
    • Define the query to produce the desired result set
    • If multiple CTEs are required, initiate each subsequent expression with a comma and repeat steps 2-5.
    • Reference the above-defined CTE(s) in a subsequent query
  • Syntax

    WITH
    expression_name_1 AS
    (CTE query definition 1)
    [, expression_name_X AS
      (CTE query definition X)
    , etc ]
    SELECT expression_A, expression_B, ...
    FROM expression_name_1
  • Example

    • Comparison with a “derived” query

    • “What is the average monthly cost per campaign for the company’s marketing efforts?”

    • Using CTE workflow

      -- define CTE:
      WITH Cost_by_Month AS
      (SELECT campaign_id AS campaign,
            TO_CHAR(created_date, 'YYYY-MM') AS month,
            SUM(cost) AS monthly_cost
      FROM marketing
      WHERE created_date BETWEEN NOW() - INTERVAL '3 MONTH' AND NOW()
      GROUP BY 1, 2
      ORDER BY 1, 2)
      
      -- use CTE in subsequent query:
      SELECT campaign, avg(monthly_cost) as "Avg Monthly Cost"
      FROM Cost_by_Month
      GROUP BY campaign
      ORDER BY campaign
    • Using a derived query

      -- Derived
      SELECT campaign, avg(monthly_cost) as "Avg Monthly Cost"
      FROM
          -- this is where the derived query is used
          (SELECT campaign_id AS campaign,
            TO_CHAR(created_date, 'YYYY-MM') AS month,
            SUM(cost) AS monthly_cost
          FROM marketing
          WHERE created_date BETWEEN NOW() - INTERVAL '3 MONTH' AND NOW()
          GROUP BY 1, 2
          ORDER BY 1, 2) as Cost_By_Month
      GROUP BY campaign
      ORDER BY campaign
  • Example

    • Count the number of interactions of new users
    • Steps
      • Get new users
      • Count interactions
      • Get interactions of new users
    WITH new_users AS (
        SELECT id
        FROM users
        WHERE created >= '2021-01-01'
    ),
    count_interactions AS (
        SELECT id,
            COUNT(*) n_interactions
        FROM interactions
        GROUP BY id
    ),
    interactions_by_new_users AS (
        SELECT id,
            n_interactions
        FROM new_users
            LEFT JOIN count_interactions USING (id)
    )
    
    SELECT *
    FROM interactions_by_new_users
  • Example

    • Find the average top Math test score for students in California

    • Steps

      1. Get a subset of students (California)
      2. Get a subset of test scores (Math)
      3. Join them together to get all Math test scores from California students
      4. Get the top score per student
      5. Take the overall average
    • Derived Query (i.e. w/o CTE)

      SELECT AVG(score)
      FROM 
        (SELECT students.id, MAX(test_results.score) as score
        FROM students 
        JOIN schools ON (
          students.school_id = schools.id AND schools.state = 'CA'
        )
        JOIN test_results ON (
          students.id = test_results.student_id
          AND test_results.subject = 'math'
        )
        GROUP BY students.id) as tmp
    • Using CTE

      WITH
        student_subset as (
          SELECT students.id 
          FROM students 
          JOIN schools ON (
            students.school_id = schools.id AND schools.state = 'CA'
          )
        ),
        score_subset as (
          SELECT student_id, score 
          FROM test_results 
          WHERE subject = 'math'
        ),
        student_scores as (
          SELECT student_subset.id, score_subset.score
          FROM student_subset 
          JOIN score_subset ON (
              student_subset.id = score_subset.student_id
          )
        ),
        top_score_per_student as (
          SELECT id, MAX(score) as score 
          FROM student_scores 
          GROUP BY id
        )
      
      SELECT AVG(score) 
      FROM top_score_per_student

Strings

  • Concatenate

    • Also see Processing Expressions >> NULLs

    • “||”

      SELECT 'PostgreSQL' || ' ' || 'Databases' AS result;
      
          result
      --------------
      PostgreSQL Databases
    • CONCAT

      SELECT CONCAT('PostgreSQL', ' ', 'Databases') AS result;
      
          result
      --------------
      PostgreSQL Databases
    • With NULL values

      SELECT CONCAT('Harry', NULL, 'Peter');
      
      --------------
      HarryPeter
      • “||” won’t work with NULLs: any NULL operand makes the whole result NULL (quick illustration below)
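        • A quick illustration (postgres):

          SELECT 'Harry' || NULL || 'Peter' AS result;  -- result is NULL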
    • Columns

      SELECT first_name, last_name, 
      CONCAT(first_name,' ' , last_name) "Full Name" 
      FROM candidates;
      • New column, “Full Name”, is created with concatenated columns
  • Splitting (BQ)

    SELECT
    *,
    CASE WHEN ARRAY_LENGTH(SPLIT(page_location, '/')) >= 5 
              AND
              CONTAINS_SUBSTR(ARRAY_REVERSE(SPLIT(page_location, '/'))[SAFE_OFFSET(0)], '+')
              AND (LOWER(SPLIT(page_location, '/')[SAFE_OFFSET(4)]) IN 
                                          ('accessories','apparel','brands','campus+collection','drinkware',
                                            'electronics','google+redesign',
                                            'lifestyle','nest','new+2015+logo','notebooks+journals',
                                            'office','shop+by+brand','small+goods','stationery','wearables'
                                            )
                    OR
                    LOWER(SPLIT(page_location, '/')[SAFE_OFFSET(3)]) IN 
                                          ('accessories','apparel','brands','campus+collection','drinkware',
                                            'electronics','google+redesign',
                                            'lifestyle','nest','new+2015+logo','notebooks+journals',
                                            'office','shop+by+brand','small+goods','stationery','wearables'
                                            )
              )
              THEN 'PDP'
              WHEN NOT(CONTAINS_SUBSTR(ARRAY_REVERSE(SPLIT(page_location, '/'))[SAFE_OFFSET(0)], '+'))
              AND (LOWER(SPLIT(page_location, '/')[SAFE_OFFSET(4)]) IN 
                                            ('accessories','apparel','brands','campus+collection','drinkware',
                                            'electronics','google+redesign',
                                            'lifestyle','nest','new+2015+logo','notebooks+journals',
                                            'office','shop+by+brand','small+goods','stationery','wearables'
                                            )
                    OR 
                    LOWER(SPLIT(page_location, '/')[SAFE_OFFSET(3)]) IN 
                                            ('accessories','apparel','brands','campus+collection','drinkware',
                                              'electronics','google+redesign',
                                              'lifestyle','nest','new+2015+logo','notebooks+journals',
                                              'office','shop+by+brand','small+goods','stationery','wearables'
                                              )
              )
              THEN 'PLP'
          ELSE page_title
          END AS page_title_adjusted 
    FROM 
      unnested_events
    • From article, gist
    • The query creates a new categorical column, “page_title_adjusted”: ‘PDP’ when the URL splits into at least 5 segments, its last segment contains a “+”, and the 4th or 5th segment is one of a set of category names; ‘PLP’ when the last segment does not contain a “+” but the 4th or 5th segment is in that set; otherwise the value of page_title.
    • SPLIT splits the string by separator, ‘/’
    • CONTAINS_SUBSTR is looking for substring with a “+”
    • [SAFE_OFFSET(3)] pulls the 4th substring (OFFSET indexes from 0; the SAFE_ prefix returns NULL instead of erroring when the index is out of range)
    • ARRAY_REVERSE(SPLIT(...))[SAFE_OFFSET(0)] takes the last substring: reverse the array, then take its first element
    • ELSE falls back to the value of page_title when neither condition matches
    • “unnested_events” is a CTE

Arrays

  • Misc

    • Postgres
      • Array indexing starts at 1, not 0
  • Create Array (BQ)

    SELECT ARRAY
      (SELECT 1 UNION ALL
      SELECT 2 UNION ALL
      SELECT 3) AS new_array;
    +-----------+
    | new_array |
    +-----------+
    | [1, 2, 3] |
    +-----------+
    
    SELECT
      ARRAY
        (SELECT AS STRUCT 1, 2, 3
        UNION ALL SELECT AS STRUCT 4, 5, 6) AS new_array;
    +------------------------+
    | new_array              |
    +------------------------+
    | [{1, 2, 3}, {4, 5, 6}] |
    +------------------------+
    
    SELECT ARRAY
      (SELECT AS STRUCT [1, 2, 3] UNION ALL
      SELECT AS STRUCT [4, 5, 6]) AS new_array;
    +----------------------------+
    | new_array                  |
    +----------------------------+
    | [{[1, 2, 3]}, {[4, 5, 6]}] |
    +----------------------------+
  • Create a table with Arrays (Postgres)

    CREATE TEMP TABLE shopping_cart (
      cart_id serial PRIMARY KEY,
      products text ARRAY
      );
    INSERT INTO
      shopping_cart(products)
    VALUES
      (ARRAY['product_a', 'product_b']),
      (ARRAY['product_c', 'product_d']),
      (ARRAY['product_a', 'product_b', 'product_c']),
      (ARRAY['product_a', 'product_b', 'product_d']),
      (ARRAY['product_b', 'product_d']);
    
    -- alt syntax w/o ARRAY
    INSERT INTO
      shopping_cart(products)
    VALUES
      ('{"product_a", "product_d"}');
    • Also see Basics >> Add Data >> Example: chatGPT
  • Subset an array (postgres)

    SELECT
      cart_id,
      products[1] AS first_product -- indexing starts at 1
    FROM
      shopping_cart;
  • Slice an array (postgres)

    SELECT
      cart_id,
      products[1:2] AS first_two_products
    FROM
      shopping_cart
    WHERE
      CARDINALITY(products) > 2;
  • Unnest an array (postgres)

    SELECT
      cart_id,
      UNNEST(products) AS products
    FROM
      shopping_cart
    WHERE
      cart_id IN (3, 4);
    • Useful if you want to perform a join
  • Filter according to items in arrays (postgres)

    SELECT
      cart_id,
      products
    FROM
      shopping_cart
    WHERE
      'product_c' = ANY (products);
    • Only rows with arrays that have “product_c” will be returned
  • Change array values using UPDATE, SET

    -- update arrays 
    UPDATE
      shopping_cart
    SET
      products = ARRAY['product_a','product_b','product_e']
    WHERE
      cart_id = 1;
    
    UPDATE 
      shopping_cart
    SET
      products[1] = 'product_f'
    WHERE
      cart_id = 2;
    SELECT
      *
    FROM
      shopping_cart
    ORDER BY cart_id;
    • First update: all arrays where cart_id == 1 are set to [‘product_a’,‘product_b’,‘product_e’]

    • Second update: all array first values where cart_id == 2 are set to ‘product_f’

  • Insert array values

    • ARRAY_APPEND - puts value at the end of the array

      UPDATE
        shopping_cart
      SET
        products = ARRAY_APPEND(products, 'product_x')
      WHERE
        cart_id = 1;
      • arrays in product column where cart_id == 1 get “product_x” appended to the end of their arrays
    • ARRAY_PREPEND - puts value at the beginning of the array

      UPDATE 
        shopping_cart
      SET
        products = ARRAY_PREPEND('product_x', products)
      WHERE
        cart_id = 2;
      • arrays in product column where cart_id == 2 get “product_x” prepended to the beginning of their arrays
  • ARRAY_REMOVE - remove array item

    UPDATE
      shopping_cart
    SET
      products = array_remove(products, 'product_e')
    WHERE cart_id = 1;
    • arrays in product column where cart_id == 1 get “product_e” removed from their arrays
  • ARRAY_CONCAT (BQ), ARRAY_CAT (postgres) - Concatenate

    SELECT ARRAY_CONCAT([1, 2], [3, 4], [5, 6]) as count_to_six;
    +--------------------+
    | count_to_six       |
    +--------------------+
    | [1, 2, 3, 4, 5, 6] |
    +--------------------+
    
    -- postgres
    SELECT
      cart_id,
      ARRAY_CAT(products, ARRAY['promo_product_1', 'promo_product_2'])
    FROM shopping_cart
    ORDER BY cart_id;
  • ARRAY_TO_STRING - Coerce to string (BQ)

    WITH items AS
      (SELECT ['coffee', 'tea', 'milk' ] as list
      UNION ALL
      SELECT ['cake', 'pie', NULL] as list)
    SELECT ARRAY_TO_STRING(list, '--') AS text
    FROM items;
    +-------------------+
    | text              |
    +-------------------+
    | coffee--tea--milk |
    | cake--pie         |
    +-------------------+
    
    WITH items AS
      (SELECT ['coffee', 'tea', 'milk' ] as list
      UNION ALL
      SELECT ['cake', 'pie', NULL] as list)
    SELECT ARRAY_TO_STRING(list, '--', 'MISSING') AS text
    FROM items;
    +--------------------+
    | text               |
    +--------------------+
    | coffee--tea--milk  |
    | cake--pie--MISSING |
    +--------------------+
  • ARRAY_AGG - gather values of a group by variable into an array (doc)

    • Makes the output more readable

    • Example: Get categories for each brand

      -- without array_agg
      select
          brand,
          category
      from order_item
      group by brand, category
      order by brand, category
      ;
      Results:
      | brand  | category |
      | ------ | -------- |
      | Arket  | jacket   |
      | COS    | shirts   |
      | COS    | trousers |
      | COS    | vest     |
      | Levi's | jacket   |
      | Levi's | jeans    |
      
      -- with array_agg
      select
        brand,
        array_agg(distinct category) as all_categories
      from order_item
      group by brand
      order by brand
      ;
      Results:
      | brand  | all_categories               |
      | ------ | ---------------------------- |
      | Arket  | ['jacket']                   |
      | COS    | ['shirts','trousers','vest'] |
      | Levi's | ['jacket','jeans']           |
      | Uniqlo | ['shirts','t-shirts','vest'] |
  • ARRAY_SIZE (Snowflake) - takes an array or a variant as input and returns the number of items within the array/variant (doc)

    • Example: How many categories does each brand have?

      select
        brand,
        array_agg(distinct category) as all_categories,
        array_size(all_categories) as no_of_cat
      from order_item
      group by brand
      order by brand
      ;
      Results:
      | brand  | all_categories               | no_of_cat |
      | ------ | ---------------------------- | --------- |
      | Arket  | ['jacket']                   | 1         |
      | COS    | ['shirts','trousers','vest'] | 3         |
      | Levi's | ['jacket','jeans']           | 2         |
      | Uniqlo | ['shirts','t-shirts','vest'] | 3         |
      
      -- postgres using CARDINALITY to get array_size
      SELECT
        cart_id,
        CARDINALITY(products) AS num_products
      FROM
        shopping_cart;
  • ARRAY_CONTAINS (Snowflake) - checks whether a value is included in an array and returns a boolean (doc)

    • VARIANT is Snowflake’s semi-structured data type; array elements are stored as VARIANTs

    • Need to cast the item you’d like to check to VARIANT first (e.g. 'jacket'::variant)

    • Syntax: ARRAY_CONTAINS(variant, array)

    • Example: What brands have jackets?

      select
        brand,
        array_agg(distinct category) as all_categories,
        array_size(all_categories) as no_of_cat,
        array_contains('jacket'::variant,all_categories) as has_jacket
      from order_item
      group by brand
      order by brand
      ;
      Results:
      | brand  | all_categories               | no_of_cat | has_jacket |
      | ------ | ---------------------------- | --------- | ---------- |
      | Arket  | ['jacket']                   | 1         | true       |
      | COS    | ['shirts','trousers','vest'] | 3         | false      |
      | Levi's | ['jacket','jeans']           | 2         | true       |
      | Uniqlo | ['shirts','t-shirts','vest'] | 3         | false      |
      
      -- postgres contains_operator, @>
      SELECT
        cart_id,
        products
      FROM
        shopping_cart
      WHERE
        products  @> ARRAY['product_a', 'product_b'];
      • “@>” example returns all rows with arrays containing product_a and product_b

Business Queries

  • Simple Moving Average (SMA)

    • Example: 7-day SMA including today

      SELECT
        Date, Conversions,
        AVG(Conversions) OVER (ORDER BY Date ROWS BETWEEN 6 PRECEDING AND
        CURRENT ROW) as SMA
      FROM daily_sales
  • Example: 3-day SMA not including today

    select
      date,
      sales,
      avg(sales) over (order by date
            rows between 3 preceding and 1 preceding) as moving_avg
    from table_daily_sales
  • Example: Rank product categories by shipping cost for each shipping address

    SELECT Product_Category,
      Shipping_Address,
      Shipping_Cost,
      ROW_NUMBER() OVER
                  (PARTITION BY Product_Category,
                                Shipping_Address
                  ORDER BY Shipping_Cost DESC) as RowNumber,
      RANK() OVER 
            (PARTITION BY Product_Category,
                          Shipping_Address
            ORDER BY Shipping_Cost DESC) as RankValues,
      DENSE_RANK() OVER 
                  (PARTITION BY Product_Category,
                                Shipping_Address 
                  ORDER BY Shipping_Cost DESC) as DenseRankValues
    FROM Dummy_Sales_Data_v1
    WHERE Product_Category IS NOT NULL
    AND Shipping_Address IN ('Germany','India')
    AND Status IN ('Delivered')
    • RANK() ranks rows according to the ORDER BY clause; rows that tie (here, the first two rows share the same Shipping_Cost value) receive the same rank, and the following rank is skipped.
    • DENSE_RANK is similar to RANK, but it does not skip any numbers when there is a tie between rows.
    • All three ranks reset to 1 whenever the Product_Category / Shipping_Address partition changes.
  • Example: Total order quantity for each month

    SELECT strftime('%m', OrderDate) as Month,
          SUM(Quantity) as Total_Quantity
    from Dummy_Sales_Data_v1
    GROUP BY strftime('%m', OrderDate)
    • strftime extracts the month (%m) from the datetime column, “OrderDate”
  • Example: Monthly and yearly number of orders and average shipping costs

    SELECT EXTRACT(MONTH FROM order_date) AS month
        , EXTRACT(YEAR FROM order_date) AS year
        , COUNT(orderid) AS number_of_orders
        , AVG(shipping_cost) AS avg_shipping_cost
    FROM alldata.salesdata
    GROUP BY EXTRACT(MONTH FROM order_date)
           , EXTRACT(YEAR FROM order_date) WITH ROLLUP;
    • The last row in the above result, where both month and year are NULL, is the grand total: the total number of orders and average shipping cost over all months in all years (here, just 2021).

    • WITH ROLLUP at the end of the GROUP BY clause rolls the grouping list up from right to left, producing subtotals for each prefix of the list plus a grand total. With GROUP BY month, year the subtotal rows (year = NULL) aggregate each month across all years; to get per-year subtotals instead, list year before month (see the sketch below).
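
    • A sketch of that variation: the same query with year listed first, so ROLLUP yields per-year subtotals plus the grand total

      SELECT EXTRACT(YEAR FROM order_date) AS year
          , EXTRACT(MONTH FROM order_date) AS month
          , COUNT(orderid) AS number_of_orders
          , AVG(shipping_cost) AS avg_shipping_cost
      FROM alldata.salesdata
      GROUP BY EXTRACT(YEAR FROM order_date)
             , EXTRACT(MONTH FROM order_date) WITH ROLLUP;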

  • Example: Daily counts of open jobs

    • The issue is that there aren’t rows for transactions that remain in a type of holding status
      • e.g. Job Postings website has date columns for the date the job posting was created, the date the job posting went live on the website, and the date the job posting was taken down (action based timestamps), but no dates for the status between “went live” and “taken down”.
    -- create a calendar column
    SELECT parse_datetime('2020-01-01 08:00:00', 'yyyy-MM-dd H:m:s') + (interval '1' day * d) as cal_date
    FROM ( SELECT
    ROW_NUMBER() OVER () - 1 as d
    FROM
    (SELECT 0 as n UNION SELECT 1) p0,
    (SELECT 0 as n UNION SELECT 1) p1,
    (SELECT 0 as n UNION SELECT 1) p2,
    (SELECT 0 as n UNION SELECT 1) p3,
    (SELECT 0 as n UNION SELECT 1) p4,
    (SELECT 0 as n UNION SELECT 1) p5,
    (SELECT 0 as n UNION SELECT 1) p6,
    (SELECT 0 as n UNION SELECT 1) p7,
    (SELECT 0 as n UNION SELECT 1) p8,
    (SELECT 0 as n UNION SELECT 1) p9,
    (SELECT 0 as n UNION SELECT 1) p10
    )
    
    -- left-join your table to the calendar column
    Select
        c.cal_date,
        count(distinct opp_id) as "historical_prospects"
    From calendar c
    Left Join
        opportunities o
        on
            o.stage_entered <= c.cal_date
            and (o.stage_exited is null or o.stage_exited > c.cal_date)
    Group By c.cal_date
    • Calendar column should probably be a CTE
    • Notes from Using SQL to calculate trends based on historical status
    • Some flavours of SQL have a generate_series function, which will create this calendar column for you (see the sketch below)
    • For one particular month, create an indicator column with “if posting_publish_date <= 2022-01-01 and (posting_closed_date is null or posting_closed_date > 2022-01-31) then True” and then filter for True and count.
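    • A sketch of the generate_series approach (postgres; the start date and the opportunities columns are carried over from the example above):

      WITH calendar AS (
        SELECT GENERATE_SERIES(
          DATE '2020-01-01', CURRENT_DATE, INTERVAL '1 day'
        )::date AS cal_date
      )

      SELECT
        c.cal_date,
        COUNT(DISTINCT o.opp_id) AS historical_prospects
      FROM calendar c
      LEFT JOIN opportunities o
        ON o.stage_entered <= c.cal_date
        AND (o.stage_exited IS NULL OR o.stage_exited > c.cal_date)
      GROUP BY c.cal_date
      ORDER BY c.cal_date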
  • Example: Get the latest order from each customer

    -- Using QUALIFY
    select
        date,
        customer_id,
        order_id,
        price
    from customer_order_table
    qualify row_number() over (partition by customer_id order by date desc) = 1
    ;
    
    -- CTE w/window function
    with order_order as
    (
    select
        date,
        customer_id,
        order_id,
        price,
        row_number() over (partition by customer_id order by date desc)   
        as order_of_orders
    from customer_order_table 
    )
    
    select
        *
    from order_order
    where order_of_orders = 1
    ;
    Results:
    | date       | customer_id | order_id | price |
    |------------|-------------|----------|-------|
    | 2022-01-03 | 002         | 212      | 350   |
    | 2022-01-06 | 005         | 982      | 300   |
    | 2022-01-07 | 001         | 109      | 120   |
  • Medians

    • Notes from How to Calculate Medians with Grouping in MySQL

      • Variables:

        • pid: unique id variable

        • category: A or B

        • price: random value between 1 and 6

    • Example: Overall median price

      SELECT AVG(sub.price) AS median
      FROM ( 
          SELECT @row_index := @row_index + 1 AS row_index, p.price
          FROM products.prices p, (SELECT @row_index := -1) r
          WHERE p.category = 'A'
          ORDER BY p.price 
      ) AS sub
      WHERE sub.row_index IN (FLOOR(@row_index / 2), CEIL(@row_index / 2))
      ;
      
      median|
      ------+
         3.0|
      • @row_index is a SQL variable that is initiated in the FROM statement and updated for each row in the SELECT statement.

      • The column whose median will be calculated (the price column in this example) should be sorted. It doesn’t matter if it’s sorted in ascending or descending order.

      • According to the definition of median, the median is the value of the middle element (total count is odd) or the average value of the two middle elements (total count is even). In this example, category A has 5 rows and thus the median is the value of the third row after sorting. The values of both FLOOR(@row_index / 2) and CEIL(@row_index / 2) are 2 which is the third row. On the other hand, for category B which has 6 rows, the median is the average value of the third and fourth rows.
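
    • Alternative: databases with ordered-set aggregates (e.g. postgres) can compute the same median directly with PERCENTILE_CONT; a sketch against the same table:

      SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median
      FROM products.prices
      WHERE category = 'A';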

    • Example: Median price for each product

      SELECT
          sub2.category,
          CASE WHEN MOD(sub2.total, 2) = 1 THEN sub2.mid_prices
               WHEN MOD(sub2.total, 2) = 0 THEN (SUBSTRING_INDEX(sub2.mid_prices, ',', 1) + SUBSTRING_INDEX(sub2.mid_prices, ',', -1)) / 2
          END AS median    
      FROM 
          (
              SELECT 
                  sub1.category,
                  sub1.total,
                  CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', CEIL(sub1.total/2)), ',', '-1')
                       WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', sub1.total/2 + 1), ',', '-2')
                  END AS mid_prices
              FROM 
                  (
                      SELECT
                          p.category,
                          GROUP_CONCAT(p.price ORDER BY p.price) AS prices,
                          COUNT(*) AS total
                      FROM products.prices p
                      GROUP BY p.category
                  ) sub1
          ) sub2
      ;
      
      category|median|
      --------+------+
      A       |3     |
      B       |3.5   |
      • Breaking down the subqueries

        • Sort prices per category

          SELECT
              category,
              GROUP_CONCAT(price ORDER BY p.price) AS prices,
              COUNT(*) AS total
          FROM products.prices p
          GROUP BY p.category
          ;
          
          category|prices     |total|
          --------+-----------+-----+
          A       |1,2,3,4,5  |    5|
          B       |1,2,3,4,5,6|    6|
          • If your table has a lot of data, GROUP_CONCAT would not contain all the data. In this case, you increase the limit for GROUP_CONCAT by: SET GROUP_CONCAT_MAX_LEN = 100000;
        • Get middle prices according to whether the total count is an odd or even number

          SELECT 
              sub1.category,
              sub1.total,
              CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', CEIL(sub1.total/2)), ',', '-1')
                   WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', sub1.total/2 + 1), ',', '-2')
              END AS mid_prices
          FROM 
              (
                  SELECT
                      p.category,
                      GROUP_CONCAT(p.price ORDER BY p.price) AS prices,
                      COUNT(*) AS total
                  FROM products.prices p
                  GROUP BY p.category
              ) sub1
          ;
          
          category|total|mid_prices|
          --------+-----+----------+
          A       |    5|3         |
          B       |    6|3,4       |
          • We use the MOD function (modulo) to check if the total count is an odd or even number.

          • The SUBSTRING_INDEX function is used twice to extract the middle elements.

    • Example: Overall median of price and quantity

      SELECT
          CASE WHEN MOD(sub2.total, 2) = 1 THEN sub2.mid_prices
               WHEN MOD(sub2.total, 2) = 0 THEN (SUBSTRING_INDEX(sub2.mid_prices, ',', 1) + SUBSTRING_INDEX(sub2.mid_prices, ',', -1)) / 2
          END AS median_of_price,
          CASE WHEN MOD(sub2.total, 2) = 1 THEN sub2.mid_quantities
               WHEN MOD(sub2.total, 2) = 0 THEN (SUBSTRING_INDEX(sub2.mid_quantities, ',', 1) + SUBSTRING_INDEX(sub2.mid_quantities, ',', -1)) / 2
          END AS median_of_quantity
      FROM 
          (
              SELECT 
                  sub1.total,
                  CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', CEIL(sub1.total/2)), ',', '-1')
                       WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.prices, ',', sub1.total/2 + 1), ',', '-2')
                  END AS mid_prices,
                  CASE WHEN MOD(sub1.total, 2) = 1 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.quantities, ',', CEIL(sub1.total/2)), ',', '-1')
                       WHEN MOD(sub1.total, 2) = 0 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(sub1.quantities, ',', sub1.total/2 + 1), ',', '-2')                 
                  END AS mid_quantities
              FROM 
                  (
                      SELECT
                          COUNT(*) AS total,
                          GROUP_CONCAT(o.price ORDER BY o.price) AS prices,
                          GROUP_CONCAT(o.quantity ORDER BY o.quantity) AS quantities
                      FROM products.orders o
                  ) sub1
          ) sub2
      ;
      
      
      median_of_price|median_of_quantity|
      ---------------+------------------+
      3              |30                |
      • Similar to previous example

      • Variables: order_id, price, quantity

Transactions

  • Misc

    • When the transaction is successful, COMMIT is applied. When the transaction is aborted (e.g. incorrect execution or a system failure), ROLLBACK occurs.
  • Only used with INSERT, UPDATE and DELETE

  • BEGIN TRANSACTION: It indicates the start point of an explicit or local transaction.

    • Represents a point at which the data referenced by a connection is logically and physically consistent.
    • If errors are encountered, all data modifications made after the BEGIN TRANSACTION can be rolled back to return the data to this known state of consistency
    • Syntax: BEGIN TRANSACTION transaction_name ;
  • SET TRANSACTION: Sets the characteristics of the transaction, such as its access mode.

    • Syntax: SET TRANSACTION [ READ WRITE | READ ONLY ];
  • COMMIT: used to permanently save the changes done in the transaction in tables/databases. The database cannot regain its previous state after its execution of commit.

    • If everything is in order with all statements within a single transaction, all changes are recorded together in the database; this is called committing. The COMMIT command saves all the transactions to the database since the last COMMIT or ROLLBACK command

    • Example: Delete records

      DELETE FROM Student WHERE AGE = 20;
      COMMIT;
      • Deletes those records from the table which have age = 20 and then commits the changes in the database.
  • ROLLBACK: used to undo the transactions that have not been saved in the database. The command can only be used to undo changes since the last commit

    • If any error occurs with any of the SQL grouped statements, all changes need to be aborted. The process of reversing changes is called rollback. This command can only be used to undo transactions since the last COMMIT or ROLLBACK command was issued.
    • Syntax: ROLLBACK;
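    • Example: a minimal sketch wrapping the earlier DELETE so it can be undone (same Student table):

      BEGIN TRANSACTION;
      DELETE FROM Student WHERE AGE = 20;
      -- inspect the result; if it is wrong:
      ROLLBACK;
      -- otherwise, make the change permanent with COMMIT instead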
  • SAVEPOINT: creates points within the groups of transactions in which to ROLLBACK.

    • Syntax: SAVEPOINT <savepoint_name>;

    • A savepoint is a point in a transaction in which you can roll the transaction back to a certain point without rolling back the entire transaction.

    • Remove a savepoint: RELEASE SAVEPOINT <savepoint_name>

    • Example: Rollback a deletion

      SAVEPOINT SP1;
      -- savepoint created
      DELETE FROM Student WHERE AGE = 20;
      -- rows deleted
      SAVEPOINT SP2;
      -- savepoint created
      ROLLBACK TO SP1;
      -- rollback completed; the delete is undone

Processing Expressions

  • Use multiple conditions in a WHERE expression

    select
        *
    from XXX_table
    where 1=1
        (if condition A) and clause 1 
        (if condition B) and clause 2 
        (if condition C) and clause 3
    ;
    • The “1=1” is a trivially true placeholder that lets every optional condition be appended with a leading AND, so the query stays valid no matter which conditions are included (a sketch follows below).
      • Can also use “true”
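    • e.g. a sketch of how the template might look with two of the optional conditions included at build time (the column names are made up):

      select
          *
      from XXX_table
      where 1=1
          and region = 'EMEA'   -- clause 1, included because condition A held
          and status = 'open'   -- clause 3, included because condition C held
      ;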
  • Select unique rows without using DISTINCT

    • Using UNION

      SELECT employee_id,
            employee_name,
            department
      FROM Dummy_employees
      UNION
      SELECT employee_id,
            employee_name,
            department
      FROM Dummy_employees
      • There must be the same number and order of columns in both SELECT statements
    • Using INTERSECT

      SELECT employee_id,
            employee_name,
            department
      FROM Dummy_employees
      INTERSECT
      SELECT employee_id,
            employee_name,
            department
      FROM Dummy_employees
      • There must be the same number and order of columns in both SELECT statements
    • Using ROW_NUMBER

      WITH temporary_employees as (
        SELECT
          employee_id,
          employee_name,
          department,
          ROW_NUMBER() OVER(PARTITION BY employee_name,
                                        department,
                                        employee_id) as row_count
        FROM Dummy_employees
      )
      
      SELECT *
      FROM temporary_employees
      WHERE row_count = 1
    • Using GROUP BY

      SELECT employee_id,
            employee_name,
            department
      FROM Dummy_employees
      GROUP BY employee_id,
              employee_name,
              department
      • Just need to group by all the columns. Useful to use in conjunction with aggregate functions.
  • CASE WHEN

    SELECT OrderID,
          OrderDate,
          Sales_Manager,
          Quantity,
          CASE WHEN Quantity > 51 THEN 'High'
                WHEN Quantity < 51 THEN 'Low'
                ELSE 'Medium' 
          END AS OrderVolume
    FROM Dummy_Sales_Data_v1
    • END AS specifies the name of the new column, “OrderVolume”
    • ELSE specifies the value when none of the conditions are met
      • If you do not include an ELSE clause and no condition is satisfied, the query returns NULL for that record
  • Pivot Wider

    SELECT Sales_Manager,
          COUNT(CASE WHEN Shipping_Address = 'Singapore' THEN OrderID
                END) AS Singapore_Orders,
    
          COUNT(CASE WHEN Shipping_Address = 'UK' THEN OrderID
                END) AS UK_Orders,
    
          COUNT(CASE WHEN Shipping_Address = 'Kenya' THEN OrderID
                END) AS Kenya_Orders,
    
          COUNT(CASE WHEN Shipping_Address = 'India' THEN OrderID
                END) AS India_Orders
    FROM Dummy_Sales_Data_v1
    GROUP BY Sales_Manager
    • Depending on your use-case, you can also use other aggregations such as SUM, AVG, MAX, or MIN with the CASE statement.

NULLs

  • Division and NULLS

    • Any division involving NULL values will have a result of NULL.

    • ISNULL lets you substitute a different value for NULLs

      SELECT IsNull(<column>, 0) / 45
      • All NULL values in the column will be replaced with 0s during the division operation.
  • COALESCE

    • Substitute a default value in place of NULLs

      SELECT COALESCE(column_name, 'Default Value') AS processed_column
      FROM table_name;
      
      SELECT COALESCE(order_date, current_date) AS processed_date
      FROM orders;
      
      SELECT
        product ||' - '||
        COALESCE(subcategory, category, family, 'no product description ')
          AS product_and_subcategory
      FROM stock
      • 3rd Expression: If there is a NULL in subcategory, then it looks in category, then into family, and finally if all those fields have NULLs, it uses “no product description” as the value.
    • Concatenating Strings where NULLs are present

      SELECT COALESCE(first_name, '') || ' ' || COALESCE(last_name, '') AS full_name
      FROM employees;
      • NULLs are replaced with an empty string so transformation doesn’t break
    • Performing calculations involving numeric columns where there are NULLs

      SELECT COALESCE(quantity, 0) * COALESCE(unit_price, 0) AS total_cost
      FROM products;
      
      SELECT product,
        quantity_available,
        minimum_to_have,
        COALESCE(minimum_to_have, quantity_available * 0.5) AS threshold
      FROM stock
      • NULLs are substituted with 0s so the calculation doesn’t break
    • As part of a join in case keys have missing values

      SELECT *
      FROM employees e
      LEFT JOIN departments d ON COALESCE(e.department_id, 0) = COALESCE(d.id, 0);
    • With Aggregate Functions

      SELECT department_id, COALESCE(SUM(salary), 0) AS total_salary
      FROM employees
      GROUP BY department_id;
    • Make hierarchical subtotals output more readable

      SELECT COALESCE(family,'All Families') AS family,
       COALESCE(category,'All Categories') AS category,
       COALESCE(subcategory,'All Subcategories') AS subcategory,
       SUM(quantity_available) as quantity_in_stock
      FROM stock
      GROUP BY ROLLUP(family, category, subcategory)
      ORDER BY family, category, subcategory
      • The ROLLUP clause assumes a hierarchy among the columns family, category, and subcategory. Thus, it generates all the grouping sets that make sense given the hierarchy: GROUP BY family; GROUP BY family, category; GROUP BY family, category, subcategory; plus the grand total.

        • This is the reason why ROLLUP is often used to generate subtotals and grand totals for reports.
      • Without COALESCE , the text in the unused columns for the subtotals would be NULLs.

Duplicated Rows

  • Use QUALIFY with a window function to filter out duplicates

    /* removes duplicate rows at the order_id level */
    SELECT * FROM orders
    QUALIFY row_number() over (partition by order_id order by created_at) = 1
    • More verbose example of what’s happening

      WITH temporary_employees as 
      (SELECT 
        employee_id, 
        employee_name, 
        department, 
        ROW_NUMBER() OVER(PARTITION BY employee_name, 
                                      department, 
                                      employee_id) as row_count 
      FROM Dummy_employees)
      
      SELECT * 
      FROM temporary_employees 
      WHERE row_count = 1
  • Use a hash column as an id column, then test for duplicates, and remove or investigate them (BigQuery)

    WITH
        inbound_zoo_elephants AS (
            SELECT *
            FROM flowfunctions.examples.zoo_elephants
        ),
        add_row_hash AS (
            SELECT
                *,
                TO_HEX(MD5(TO_JSON_STRING(inbound_zoo_elephants))) AS hex_row_hash
            FROM inbound_zoo_elephants
        )
    
    SELECT
        COUNT(*) AS records,
        COUNT(DISTINCT hex_row_hash) AS unique_records
    FROM add_row_hash
    • No duplicate records found, since “records” = 9 and “unique_records” = 9
      • if records > unique_records, duplicates exist
    • Can select distinct hex_row_hash if you want to remove duplicates
    • Can count hex_row_hash then filter where hex_row_hash > 1 to find which rows are duplicates
    • Notes from link
    • Description
      • flowfunctions is the project name
      • examples is the dataset
      • zoo_elephants is the table
    • Steps
      • TO_JSON_STRING - creates column with json string for each row
      • MD5 hashes that string
      • TO_HEX makes it alpha-numeric and gets rid of the symbols in the hash
        • Easier to deal with in BigQuery
        • The hex encoding is lossless, so the hex string is exactly as unique as the hash itself
    • Note: By adding a second argument of true, TO_JSON_STRING(inbound_zoo_elephants, true), TO_JSON_STRING adds line breaks to the json string for easier readability.
    • Hashing function options
      • MD5 - shortest one (16 bytes, 32 hex characters), fine for this use case
      • FARM_FINGERPRINT - returns a signed 64-bit integer
      • SHA1, SHA256 and SHA512, which return 20, 32 and 64 bytes respectively and are more secure for cryptographic use cases.

Nested Data

  • Recursive CTE
    • Recursive CTEs are used primarily when you want to query hierarchical data or graphs. This could be a company’s organizational structure, a family tree, a restaurant menu, or various routes between cities

    • Syntax

      WITH RECURSIVE cte_name AS (
          cte_query_definition (the anchor member)
          UNION ALL
          cte_query_definition (the recursive member)
      )
      
      SELECT *
      FROM  cte_name;
    • Example: postgres

      WITH RECURSIVE category_tree(id, name, parent_id, depth, path) AS (
        SELECT id, name, parent_id, 1, ARRAY[id]
        FROM categories
        WHERE parent_id IS NULL
        UNION ALL
        SELECT categories.id, categories.name, categories.parent_id, category_tree.depth + 1, path || categories.id
        FROM categories
        JOIN category_tree ON categories.parent_id = category_tree.id
      )
      
      SELECT id, name, parent_id, depth, path
      FROM category_tree;
      • CTE (WITH) + RECURSIVE says it’s a recursive query.
      • UNION ALL combines the results of both statements.
        • Example is defined by 2 Select statements
          • Anchor Member: First SELECT statement selects the root nodes of the category tree (nodes with no parent)
            • Root node is indicated by “parent_id” = NULL
          • Recursive member: Second SELECT statement selects the child nodes recursively
        • Also see Arrays for further examples of the use of UNION ALL
      • The “depth” column is used to keep track of the depth of each category node in the tree.
        • “1” in the first statement
        • “category_tree.depth + 1” in the second statement
          • With every recursion, the CTE will add 1 to the previous depth level, and it will do that until it reaches the end of the hierarchy
      • The “path” column is an array that stores the path from the root to the current node.
        • “ARRAY[id]” in the first statement
        • “path || categories.id” in the second statement
          • “||” concatenates “path” and “id” columns (See Strings)

Binning

  • CASE WHEN

    SELECT
     Name, 
     Grade,
     CASE
      WHEN Grade < 10 THEN '0-9'
      WHEN Grade BETWEEN 10 and 19 THEN '10-19'
      WHEN Grade BETWEEN 20 and 29 THEN '20-29'
      WHEN Grade BETWEEN 30 and 39 THEN '30-39'
      WHEN Grade BETWEEN 40 and 49 THEN '40-49'
      WHEN Grade BETWEEN 50 and 59 THEN '50-59'
      WHEN Grade BETWEEN 60 and 69 THEN '60-69'
      WHEN Grade BETWEEN 70 and 79 THEN '70-79'
      WHEN Grade BETWEEN 80 and 89 THEN '80-89'
      WHEN Grade BETWEEN 90 and 99 THEN '90-99'
      END AS Grade_Bucket
     FROM students
    • BETWEEN is inclusive of the end points
    • Flexible for any size of bin you need
  • FLOOR

    SELECT
     Name,
     Grade,
     FLOOR(Grade / 10) * 10 AS Grade_Bucket
    FROM students
    • Can easily scale up the number of bins without having to increase the lines of code
    • Only useful for evenly spaced bins
  • LEFT JOIN on preformatted table

    CREATE OR REPLACE TABLE bins (
        Lower_Bound INT64,
        Upper_Bound INT64,
        Grade_Bucket STRING
    );
    
    INSERT bins (Lower_Bound, Upper_Bound, Grade_Bucket)
    VALUES
     (0, 9, '0-9'),
     (10, 19, '10-19'),
     (20, 29, '20-29'),
     (30, 39, '30-39'),
     (40, 49, '40-49'),
     (50, 59, '50-59'),
     (60, 69, '60-69'),
     (70, 79, '70-79'),
     (80, 89, '80-89'),
     (90, 99, '90-99');
    
    SELECT
     A.Name, 
     A.Grade,
     B.Grade_Bucket
    FROM students AS A
    LEFT JOIN bins AS B
    ON A.Grade BETWEEN B.Lower_Bound AND B.Upper_Bound
    • “bins” table acts a template that funnels the values from your table into the correct bins

Time Series

  • All the records between two dates

    SELECT *
    FROM alldata.salesdata
    WHERE 1=1
    AND order_date BETWEEN '2021-08-01' AND '2021-08-31';
    
    /* without BETWEEN */
    SELECT *
    FROM alldata.salesdata
    WHERE 1=1
    AND order_date >= '2021-08-01' 
    AND order_date <= '2021-08-31';
    • The version without BETWEEN is not optimized and is therefore not recommended
  • Extract components from date-time columns

    /* MySQL */
    SELECT EXTRACT(YEAR FROM order_date) AS year
        , EXTRACT(MONTH FROM order_date) AS month
        , EXTRACT(QUARTER FROM order_date) AS quarter
        , EXTRACT(WEEK FROM order_date) AS week_of_year
        , EXTRACT(DAY FROM order_date) AS day_of_month
        , EXTRACT(HOUR FROM order_date) AS hour_of_day
        , EXTRACT(MINUTE FROM order_date) AS minute_of_hour
        , EXTRACT(SECOND FROM order_date) AS second_of_minute
        , order_date
    FROM alldata.salesdata;
    
    /* SQLite */
    SELECT strftime('%m', OrderDate) as Month
    • strftime codes
  • Preprocess Time Series with 4 Lags (article)

    WITH top_customers as (
        --- select the customter ids you want to track
    ),
    transactions as (
        SELECT 
          cust_id, 
          dt, 
          date_trunc('hour', cast(event_time as timestamp)) as event_hour, 
          count(*) as transactions
        FROM ourTable
        WHERE
            dt between cast(date_add('day', -7, current_date) as varchar)
            and cast(current_date as varchar)
        GROUP BY 1,2,3 Order By event_hour asc
    )
    
    SELECT transactions.cust_id,
          transactions.event_hour,
          day_of_week(transactions.event_hour) day_of_week,
            hour(transactions.event_hour) hour_of_day,
            transactions.transactions as transactions,
            LAG(transactions,1) OVER 
              (PARTITION BY transactions.cust_id ORDER BY event_hour) AS lag1,
            LAG(transactions,2) OVER 
              (PARTITION BY transactions.cust_id ORDER BY event_hour) AS lag2,
            LAG(transactions,3) OVER 
              (PARTITION BY transactions.cust_id ORDER BY event_hour) AS lag3,
            LAG(transactions,4) OVER 
              (PARTITION BY transactions.cust_id ORDER BY event_hour) AS lag4
    FROM transactions 
        join top_customers 
          on transactions.cust_id = top_customers.cust_id
    
    /* output */
    "cust_id", "event_hour", "day_of_week", "hour_of_day", "transactions", "lag1", "lag2", "lag3", "lag4"
    "Customer-123","2023-01-14 00:00:00.000","6","0","4093",,,,,,
    "Customer-123","2023-01-14 01:00:00.000","6","1","4628","4093",,,,,
    "Customer-123","2023-01-14 02:00:00.000","6","2","5138","4628","4093",,,,
    "Customer-123","2023-01-14 03:00:00.000","6","3","5412","5138","4628","4093",,,
    "Customer-123","2023-01-14 04:00:00.000","6","4","5645","5412","5138","4628","4093",
    "Customer-123","2023-01-14 05:00:00.000","6","5","5676","5645","5412","5138","4628",
    "Customer-123","2023-01-14 06:00:00.000","6","6","6045","5676","5645","5412","5138",
    "Customer-123","2023-01-14 07:00:00.000","6","7","6558","6045","5676","5645","5412",
    • Dataset contains number of transactions made per customer per hour.
    • 2 WITH clauses: the first just extracts a list of customers we are interested in. Here you can add any condition that filters specific customers in or out (perhaps you want to filter out new customers or only include customers with sufficient traffic). The second WITH clause creates the first dataset, Dataset A, which pulls a week of data for these customers and selects the customer id, date, hour, and number of transactions.
    • Finally, the last and most important SELECT clause generates Dataset B, by using SQL lag() function on each row in order to capture the number of transactions in each of the hours that preceded the hour in the row.

Tools

  • ChatSQL: Convert plain text to MySQL query by ChatGPT
  • {{sqlglot}} - no dependency Python SQL parser, transpiler, optimizer, and engine
    • Format SQL or translate between nearly twenty different SQL dialects.

      • It doesn’t just transpile active SQL code; it also moves comments from one dialect to another.
    • The parser itself can be customized

    • Can also help you analyze queries, traverse parsed expression trees, and incrementally (and, programmatically) build SQL queries.

    • Supports optimizing SQL queries and performing semantic diffs.

    • Can be used to unit test queries through mocks based on Python dictionaries.

    • Example: translate duckdb to hive

      import sqlglot
      sqlglot.transpile(
          "SELECT EPOCH_MS(1618088028295)",
          read="duckdb",
          write="hive",
      )[0]
      ---
      'SELECT FROM_UNIXTIME(1618088028295 / 1000)'
  • SQL Studio - SQL Database Explorer [SQLite, libSQL, PostgreSQL, MySQL/MariaDB, DuckDB]