dbt

Overview

  • Built for data modeling
    • Models are like sql queries
  • Modularizes SQL code and makes it reusable across “models”
    • Running the orders model also runs the base_orders model and base_payments model (i.e. dependencies for orders)
      • Base models are in the staging layer and output views which means they’ll have the most up-to-date data for models in the intermediate and mart layers (final output).
    • base_orders and base_payments are independent in that they can also be used in other models
    • Creates more dependable code because you’re using the same logic in all your models
  • You can schedule running sets of models by tagging them (e.g. #daily, #weekly)
  • Version Control
    • Snapshots provide a mechanism for versioning datasets
    • Within every yaml file is an option to include the version
  • Package add-ons that allow you to interact with spark, snowflake, duckdb, redshift, etc.
  • Documentation for every step of the way
    • .yml files can be used to generate a website (localhost:8080) around all of your dbt documentation.

      dbt docs generate
      dbt docs serve

Misc

  • List of available DB Adaptors

    • Runs on Python, so adaptors are installed via pip
  • Notes from

  • Resources

    • Overview of all DBT Products Docs
    • DBT-Core Quickstart Docs
      • Shows you how to set up a project locally, connect to a BigQuery warehouse, build and document models
    • Create a Local dbt Project
      • Uses docker containers to set up a local dbt project and a local postgres db to play around with
  • Development Workflow

    • Mock out a design in a spreadsheet so you know you’re aligned with stakeholders on output goals.
    • Write the SQL to generate that final output, and identify what tables are involved.
    • Write SQL code for basic transformations (staging/base models) in the staging layer which directly source the tables
    • Write models in the intermediate layer and/or mart layer using the more complex SQL code which source the models in the staging layer
    • The final output (materializations) will be the output of the models in the mart layer
    • Finally, with a functioning model running in dbt:
      • Start refactoring and optimizing the mart
      • Add tests and data validation
  • Model Running Workflow

    dbt deps
    dbt seed
    dbt snapshot
    dbt run
    dbt run-operation {{ macro_name }}
    dbt test
  • Commands

    • Reference
    • dbt run - Runs the models you defined in your project
    • dbt build - Builds and tests your selected resources such as models, seeds, snapshots, and tests
    • dbt test - Executes the tests you defined for your project
  • Style Guide Components

    • Naming conventions (the case to use and tense of the column names)
    • SQL best practices (commenting code, CTEs, subqueries, etc.)
    • Documentation standards for your models
    • Data types of date, timestamp, and currency columns
    • Timezone standards for all dates
  • Castor - tool that takes your project and autofills much of the documentation

    • Has a free tier
    • Very helpful if you have the same column name in multiple datasets, you don’t have to keep defining it
    • Tribal Knowledge
      • When a dataset is discussed in a team slack channel, Castor pulls the comments and adds them to the documentation of the dataset
  • Lightdash - BI tool for dbt projects - free tier for self hosting

  • DBT-Cloud

    • Thread from experienced DBT-Core user
      • Likes the extra features if you have the money. Nice for teams with a mix of technical and not-so technical people — enhances collaboration.

Set-Up

  • Docs
  • Within a python virtual environment
    • Create: python3 -m venv dbt-venv

    • Activate: source dbt-venv/bin/activate

      • Should be able to see a (dbt-venv) prefix in every line on the terminal
    • Install dbt-core: pip install dbt-core

      • Specific version: pip install dbt-core==1.3.0
      • Confirm installation by checking version: dbt --version
    • Install plugins

      pip install dbt-bigquery
      pip install dbt-spark
      # etc...
  • Basic set-up: Article
    • Example uses postgres adaptor
  • Connections
    • Docs
    • Example: DBT-Core Quickstart Guide (See Misc >> Resources)
      • Connection to BigQuery in profiles.yml (See Project Files)

        jaffle_shop: # this needs to match the profile in your dbt_project.yml file
            target: dev
            outputs:
                dev:
                    type: bigquery
                    method: service-account
                    keyfile: /Users/BBaggins/.dbt/dbt-tutorial-project-331118.json # replace this with the full path to your keyfile
                    project: grand-highway-265418 # Replace this with your project id
                    dataset: dbt_bbagins # Replace this with dbt_your_name, e.g. dbt_bilbo
                    threads: 1
                    timeout_seconds: 300
                    location: US
                    priority: interactive
      • Check connection

        dbt debug
        #> Connection test: OK connection ok
  • Environments
    • dev - Your local development environment; configured in a profiles.yml file on your computer.
      • While making changes with target: dev, your objects will be built in your development target without affecting production queries made by your end users
    • prod - The production deployment of your dbt project, like in dbt Cloud, Airflow, etc.
      • Once you are confident in your changes and set target: dev, you can deploy the code to production. Running models in this enviroment will affect end user queries.
    • See continuous integration for details on using Github, Gitlab with dbt
  • Schemas
    • Use different schemas within one database to separate your environments
    • If you have multiple dbt users writing code, it often makes sense for each user to have their own development environment.
      • Set your dev target schema to be dbt_<username>. User credentials should also differ across targets so that each dbt user is using their own data warehouse user.
    • Custom Schemas (Docs)
      • Allows you to build models across multiple schemas and group similar models together
      • Avoids dbt users creating models in the same schema and overwriting each other’s work
      • Use Cases
        • Group models based on the business unit using the model, creating schemas such as core, marketing, finance and support.
        • Hide intermediate models in a staging (database) schema, and only present models that should be queried by an end user in an analytics (database) schema (aka production).
      • By default, dbt generates the schema name for a model by appending the custom schema to the target schema.
        • Ensures that objects created in your data warehouse don’t collide with one another.
        • To create a custom macro that overrides this default, see docs
      • Set-Up
        • Inside dbt_project.yml

          # models in `models/marketing/ will be built in the "*_marketing" schema
          models:
            my_project:
              marketing:
                +schema: marketing
        • Inside Model

          {{ config(schema='marketing') }}
          
          select ...

Project Files

  • Project Templates

  • profiles.yml

    • Not included in project directory
    • Only have to worry about this file if you set up dbt locally.
    • Docs
    • Created by dbt init in ~/.dbt/
    • Contents
      • database connection, database credentials that dbt will use to connect to the data warehouse
      • If you work on multiple projects locally, the different project names (configured in the dbt_project.yml file) will allow you to set up various profiles for other projects.
      • Where you set the target (i.e. branch) (e.g. dev or prod)
    • When you use dbt from the CLI, it looks in dbt_project.yml for a name in profile: ‘name’. Then, it looks for that name in profiles.yml in order to find all the information it needs to connect to the database/warehouse and do work.
      • You’ll have one profile name/info for each warehouse that you use
    • Target
      • dev target: An analyst using dbt locally will have this set as the default.
      • prod target: Creates the objects in your production schema.
        • However, since it’s often desirable to perform production runs on a schedule, it is recommended that you deploy your dbt project to a separate machine other than your local machine.
        • Most dbt users only have a dev target in their profile on their local machine.
      • Components
        • type: The type of data warehouse you are connecting to
        • Warehouse credentials: Get these from your database administrator if you don’t already have them. Remember that user credentials are very sensitive information that should not be shared.
        • schema: The default schema that dbt will build objects in.
        • threads: The number of threads the dbt project will run on.
  • dbt_project.yml

    • Docs
    • Main configuration file for your project
    • Basic Set-Up
      • Fill in your project name and profile name
      • Add variables and models
    • All available configurations that can be included in this file
  • packages.yml

    • Docs
    • List of external dbt packages you want to use in your project
    • After adding the packages to packages.yml, run dbt deps to install the packages.
    • Packages get installed in the dbt_packages directory — by default this directory is ignored by git, to avoid duplicating the source code for the package.
    • When you update a version or revision in your packages.yml file, it isn’t automatically updated in your dbt project. You should run dbt deps to update the package. You may also need to run a full refresh of the models in this package.
    • Examples
      • Specific Version

        packages:
            - package: dbt-labs/dbt_utils
              version: 0.7.3
      • Range of Versions

        packages:
          - package: calogica/dbt_expectations
            version: [">=0.7.0", "<0.8.0"]
  • dependencies.yml

    • Docs
    • Designed for the dbt Mesh and cross-project reference workflow
    • Can contain both types of dependencies: “package” and “project” dependencies
    • If your dbt project doesn’t require the use of Jinja within the package specifications, you can simply rename your existing packages.yml to dependencies.yml.
    • However, if your project’s package specifications use Jinja, particularly for scenarios like adding an environment variable or a Git token method in a private Git package specification, you should continue using the packages.yml file name.

Directories

  • Example: Source

  • Models

    • Sources (i.e. data sources) are defined in src_<source>.yml files in your models directory
      • .yml files contain definitions and tests (Docs)
      • .doc files contain source documentation
    • Models (i.e. sql queries) are defined stg_<source>.yml
      • .yml files contain definitions and tests (Docs)
        • Example:

          models:
            - name: your_model_name
              config:
                materialized: incremental
                unique_key: primary_key_name
      • .doc files contain source documentation
      • The actual models are the .sql files
    • Example
      • Staging:
        • Different data sources will have separate folders underneath staging (e.g. stripe).
      • Marts:
        • Use cases or departments have different folders underneath marts (e.g. core or marketing)
  • Macros

    • Docs
    • Similar to functions in excel
    • Define custom functions in the macros folder or override default macros and macros from a package
    • See bkmks for tutorials on writing custom macros with jinja
    • {dbtplyr} macros
      • dplyr tidy selectors, across, etc.
  • Seeds

    • Docs
    • Seeds are csv files that you add to your dbt project to be uploaded to your data warehouse.
      • Uploaded into your data warehouse using the dbt seed command
    • Best suited to static data which changes infrequently.
      • Use Cases:
        • A list of unique codes or employee ids that you may need in your analysis but is not present in your current data.
        • A list of mappings of country codes to country names
        • A list of test emails to exclude from analysis
    • Referenced in downstream models the same way as referencing models — by using the ref function
  • Snapshots

    • Captures of the state of a table at a particular time
    • Docs, More Docs
    • build a slowly changing dimension (SCD) table for sources that do not support change data capture (CDC)
    • Example
      • Every time the status of an order change, your system overrides it with the new information. In this case, there we cannot know what historical statuses that an order had.
      • Daily snapshots of this table builds a history and allows you to track order statuses

Components

Materializations

  • Misc
    • The manner in which the data is represented, i.e. the outputs of the models
    • Docs
    • Start with models as views, when they take too long to query, make them tables, when the tables take too long to build, make them incremental.
  • View
    • Default materialization when there is no specification in the configuration
    • Just like Standard Views, i.e. a saved SQL query (See SQL >> Views)
    • Always reflects the most up-to-date version of the input data
    • Cost nothing to build since they’re just queries, but have to be processed every time they’re ran.
      • Slower to return results than a table of the same data. 
      • Can cost more over time, especially if they contain intensive transformations and are queried often.
    • Best for
      • Building a model that stitches lots of other models together (Don’t want to worry about freshness of ever table involved).
      • Small datasets with minimally intensive logic that you want near realtime access to
  • Table
    • Stores the literal rows and columns on disk (i.e. materialized view)
    • Ideal for models that get queried regularly
    • Reflects the most recent run on the source data that was available
    • Best for sources for BI products and other frontend apps (e.g. like a mart that services a popular dashboard)
  • Incremental
    • Docs
    • Builds a table in pieces over time — only adding and updating new or changed records which makes them efficient.
    • Builds more quickly than a regular table of the same logic.
    • Typically used on very large datasets, so building the initial table on the full dataset is time consuming and equivalent to the table materialization.
    • Adds complexity and requires deeper consideration of layering and timing.
      • See “Late arriving facts” section of the Docs
    • Can drift from source data over time since you’re not processing all of the source data. Extra effort is required to capture changes to historical data.
      • See “Long-term considerations” section of the Docs
    • Best for — same as tables, but for larger datasets.
    • Requirements
      • A filter to select just the new or updated records
      • A conditional block that wraps the filter and only applies it when we want it
      • A configuration that tells dbt to build incrementally and helps apply the conditional filter when needed
    • What happens underneath inside the data warehouse
      • When there are records that already exist within the table but have a new field value:
        • Option 1: delete + insert deletes these records and replaces them with an updated version of that record.
        • Option 2: merge combines the insert, update, and delete statements into one. Instead of deleting changed records, it will replace the old field values with the new ones.
        • Option 3: insert + overwrite applies to the same records as the other strategies, but overwrites all fields of an existing record rather than just the ones that have changed
  • Ephemeral
    • Creates a CTE
    • More difficult to troubleshoot, as they’re interpolated into the models that ref them, rather than existing on their own in a way that you can view the output of.
    • Use Cases
      • Very light-weight transformations that are early on in your DAG
      • Should only used in one or two downstream models
      • For a materialization that does not need to be queried directly
        • Macros called using dbt run-operation cannot ref() ephemeral materializations
    • Doesn’t support model contracts
  • Set-Up
    • In practice, you want to set materializations at the folder level, and use individual model configs to override those as needed.

    • In dbt_project.yml

      • A + is used to indicate a materialization

      • Example: models

        models:
          jaffle_shop:
            marketing:
              +materialized: view
              paid_ads:
                google:
                  +materialized: table
        • All models in the marketing and paid_ads folders are views while models in the google folder are tables
    • In Model files

      • Example: Table (or View)

        {{
            config(
                materialized='table'
            )
        }}
        
        select ...
      • Example: Incremental

        {{
            config(
                materialized='incremental',
        1        unique_key='order_id'
            )
        }}
        
        select * from orders
        
        2{% if is_incremental() %}
        
        where
        3  updated_at > (select max(updated_at) from {{ this }})
        
        {% endif %}
        1
        unique_key specifies the field that allows dbt to find a record from the previous run (old data) with the same unique id as the new data in order to update that record instead of adding it as a separate row.
        2
        An if-else conditional that uses is_incremental (macro) to check if the systematic conditions necessary for an incremental model are met (See below)
        3
        The conditional block containing the filter that finds the latest updated_at date, so dbt knows whether the source data has new records that need to be added or not.
        • { this } means filter from “this” materialization
        • is_incremental (macro) checks:
          • materialized =‘incremental’
          • There is an existing table for this model in the warehouse to add to/update
          • --full-refresh flag was not passed (docs)
            • Overrides the incremental materialization and builds the table from scratch again
            • Useful if the table drifts from the source table over time.

Variables

  • Defined in the project.yml and used in models

    • Example: Assigning States to Regions

      vars:
        state_lookup:
          Northeast:
            - CT
            - ME
          Midwest:
            - IL
            - IN
  • Using the variables in a model

    {# Option 1 #}
    SELECT state,
          CASE {% for k, v in var("state_lookup").items() %}
                WHEN state in ({% for t in v %}'{{ t }}'{% if not loop.last %}, {% endif %}{% endfor %}) THEN {{ k }}{% endfor %}
                ELSE NULL END AS region
      FROM {{ ref('my_table') }}
    
    {# Option 2 #}
    SELECT state,
          CASE {% for k, v in var("state_lookup").items() %}
                WHEN state in ({{ t|csl }}) THEN {{ k }}{% endfor %}
                ELSE NULL END AS region
      FROM {{ ref('my_table') }}
    • Variables are accessed using var
    • This is a complicated example, see docs for something simpler
    • {% ... %} are used to encapsulate for-loops and if-then conditions, see docs
      • {# ... #} is for comments
    • Option 2 uses a csl filter (comma-separated-list)

Models

Misc

  • Tagging
    • Allows you to run groups of models
      • Example:dbt run --models tag:daily

YAML Options

  • Contracts
    • Enforces constraints on the materialization output from a model (emphemeral not supported)
      • Data Tests differ from contracts since they are a more flexible mechanism for validating the content of your model after it’s built (contracts execute at build time)
      • Contracts probably requires less compute (cost) in your data platform
    • Docs
    • Incremental materializations requires setting: on_schema_change: append_new_columns
    • Set-Up
      • When enforced: true, name and data_type are required to be set for every column

      • Set alias_types: false to opt-out of letting dbt try to convert a specified type to one that conforms to a specific platform

        • e.g. Specify string in your contract while on Postgres/Redshift, and dbt will convert it to text.
      • If a varchar size or numeric scale is not specified, then dbt relies on default values

      • Example

        models:
          - name: dim_customers
            config:
              materialized: table
              contract:
                enforced: true
            columns:
              - name: customer_id
                data_type: int
                constraints:
                  - type: not_null
              - name: customer_name
                data_type: string
              - name: non_integer
                data_type: numeric(38,3)
        • numeric(38,3) says non-integer is allowed to have 38 (precision) total digits (whole + decimal) and 3 decimal places (scale).
  • Constraints
    • When specified, the platform (e.g. postgres, BigQuery, etc.) will perform additional validation on data as it is being populated in a new table or inserted into a preexisting table. If the validation fails, the table creation or update fails (or there’s a warning), the operation is rolled back, and you will see a clear error message.

    • Only table and incremental materializations supported

    • Can be set at the column level (recommended) or model level

    • Keys and Values

      • type (required): one of not_null, unique, primary_key, foreign_key, check, custom
        • postgres supports and enforces all types, but Redshift, Snowflake, BigQuery, Databricks, Spark do not. See docs, docs for details.
      • expression: Free text input to qualify the constraint.
        • Required for certain constraint types and optional for others.
      • name (optional): Human-friendly name for this constraint. Supported by some data platforms.
      • columns (model-level only): List of column names to apply the constraint over
    • Example

      models:
        - name: <model_name>
      
          # required
          config:
            contract:
              enforced: true
      
          # model-level constraints
          constraints:
            - type: primary_key
              columns: [FIRST_COLUMN, SECOND_COLUMN, ...]
            - type: FOREIGN_KEY # multi_column
              columns: [FIRST_COLUMN, SECOND_COLUMN, ...]
              expression: "OTHER_MODEL_SCHEMA.OTHER_MODEL_NAME (OTHER_MODEL_FIRST_COLUMN, OTHER_MODEL_SECOND_COLUMN, ...)"
            - type: check
              columns: [FIRST_COLUMN, SECOND_COLUMN, ...]
              expression: "FIRST_COLUMN != SECOND_COLUMN"
              name: HUMAN_FRIENDLY_NAME
            - type: ...
      
          columns:
            - name: FIRST_COLUMN
              data_type: DATA_TYPE
      
              # column-level constraints
              constraints:
                - type: not_null
                - type: unique
                - type: foreign_key
                  expression: OTHER_MODEL_SCHEMA.OTHER_MODEL_NAME (OTHER_MODEL_COLUMN)
                - type: ...

Best Practices

  • Modularity where possible

    • Same as the functional mindset: “if there’s any code that’s continually repeated, then it should be a function(i.e. its own separate model in dbt).”
  • Readability

    • Comment
    • Use CTEs instead of subqueries
    • Use descriptive names
      • Example: if you are joining the tables “users” and “addresses” in a CTE, you would want to name it “users_joined_addresses” instead of “user_addresses”
  • Example: Comments, CTE, Descriptive Naming

    WITH
    Active_users AS (
      SELECT
        Name AS user_name,
        Email AS user_email,
        Phone AS user_phone,
        Subscription_id
      FROM users
      --- status of 1 means a subscription is active
      WHERE subscription_status = 1
    ),
    Active_users_joined_subscriptions AS (
      SELECT
        Active_users.user_name,
        active_users.user_email,
        Subscriptions.subscription_id,
        subscriptions.start_date ,
        subscriptions.subscription_length
      FROM active_users
      LEFT JOIN subscriptions
        ON active_users.subscription_id = subscriptions.subscription_id
    )
    SELECT * FROM Active_users_joined_subscriptions

Layers

  • Staging

    • Subdirectories based on the data sources (not business entities)
    • Naming Convention
      • Syntax: stg_[source]__[entity]s.sql
        • The double underscore between source system and entity helps visually distinguish the separate parts in the case of a source name having multiple words.
    • Contains all the individual components of your project that the other layers will use in order to craft more complex data models.
    • Once you get a feel for these models, use the codegen package to automatically generate them. (See Packages)
    • Data Sources
      • Only layer where the source macro is used
      • Each model bears a one-to-one relationship with the source data table it represents (i.e. 1 staging model per source)
    • Transformations
      • Only transformations that are needed for every downstream model should be applied.
      • Typical Transformations: recasting, column/view renaming, basic computations (e.g. cents to dollars), binning continuous fields, collapsing categoricals fields.
        • Unions may also be applied if you have two sources with the same fields
          • e.g. two tables that represent the same store except that perhaps they are in different regions or selling on two different platforms, and your project only wants to use aggregated calculations from both.
      • Aggregations and joins should be avoided. But, if joins are necessary, then create a subdirectory called base within the source directory.
        1. Within that subdirectory, create base_ models that perform the previously mentioned “Typical Transformations.”
        2. Then, at the source directory level, create staging, stg_, models that join those base models that are in the subdirectory. (See docs for examples)
      • Base Models
        • Basic transformations (e.g. cleaning up the names of the columns, casting to different data types)
        • Other models use these models as data sources
          • Prevents errors like accidentally casting your dates to two different types of timestamps, or giving the same column two different names.
            • Two different timestamp castings can cause all of the dates to be improperly joined downstream, turning the model into a huge disaster
        • Usually occuring in staging
        • Read directly from a source, which is typically a schema in your data warehouse
          • Source object: { source('campaigns', 'channel') }
            • campaigns is the name of the source in the .yml file
            • channel is the name of a table from that source
    • Materialized as views.
      • Allows any intermediate or mart models referencing the staging layer to get access to fresh data and at the same time it saves us space and reduces costs.

      • In dbt_project.yml

        models:
          jaffle_shop:
            staging:
              +materialized: view
        • View is the default materialization, but still good to set manually just for clarification purposes.
  • Intermediate

    • Uses views from the staging layer to build more complex models
    • Subdirectories based on business entities (e.g. marketing, finance, etc.)
      • If you have fewer than 10 models and aren’t having problems developing and using them, then feel free to forego subdirectories completely.
    • Naming Convention
      • Syntax: int_[entity]s_[verb]s.sql
      • Verb Examples: pivoted, aggregated_to_user, joined, fanned_out_by_quantity, funnel_created, etc.)
      • No double underscores
        • If for some reason you had to use a source macro in this layer, then you could use double underscores between entity(s) and verb(s)
      • Example: int_payments_pivoted_to_orders
    • Other models should reference the models within this stage as Common Table Expressions although there may be cases where it makes sense to materialize them as Views
      • Macros called via run-operation cannot reference ephemeral materizations (i.e. CTEs)
      • Recommended to start with ephemeral materializations unless this doesn’t work for the specific use case
    • Whenever you decide to materialize them as Views, it may be easier to to do so in a custom schema, that is a schema outside of the main schema defined in your dbt profile.
    • If the same intermediate model is referenced by more than one model then it means your design has probably gone wrong.
      • Usually indicates that you should consider turning your intermediate model into a macro. reference the base models rather than from a source
      • Reference object
        • { ref('base_campaign_types') }
        • base_campaign_types is a base model
  • Marts

    • Where everything comes together in a way that business-defined entities and processes are constructed and made readily available to end users via dashboards or applications.
    • Since this layer contains models that are being accessed by end users it means that performance matters. Therefore, it makes sense to materialize them as tables.
      • If a table takes too much time to be created (or perhaps it costs too much), then you may also need to consider configuring it as an incremental model.
    • A mart model should be relatively simple and therefore, too many joins should be avoided
    • Example
      • A monthly recurring revenue (MRR) model that classifies revenue per customer per month as new revenue, upgrades, downgrades, and churn, to understand how a business is performing over time.
        • It may be useful to note whether the revenue was collected via Stripe or Braintree, but they are not fundamentally separate models.
  • Utilities

    • Directory for any general purpose models that are generated from macros or based on seeds that provide tools to help modeling, rather than data to model itself.
    • The most common use case is a date spine generated with the dbt utils package.

Optimizations

  • Runs parallelized
    • Models that have dependencies aren’t run until their upstream models are completed but models that don’t depend on one another are run at the same time.
    • The thread parameter in your dbt_project.yml specifies how many models are permitted to run in parallel
  • Incremental Materializations
    • See Components >> Materializations
    • Saves money by not rebuilding the entire table from scratch when records only need to be updated or added to a table (plus transformations to the new records).
    • Useful for big data tables and expensive transformations

Packages

  • Available Packages: link
  • See Project Files >> packages.yml
  • Packages
    • dbt-athena - Adaptor for AWS Athena
    • audit-helper
      • Compares columns, queries; useful if refactoring code or migrating db
    • codegen
      • Generate base model, barebones model and source .ymls
    • dbtplyr - Macros
      • dplyr tidy selectors, across, etc.
    • dbt-duckdb - Adapter for duckdb
    • elementary - Includes schema change alerts and anomaly detection tests. It is great for tracking trends in your data to see if anything falters from the usual
    • dbt-expectations
      • Data validation based on great expectations py lib
    • external-tables
      • Create or replace or refresh external tables
      • Guessing this means any data source (e.g. s3, spark, google, another db like snowflake, etc.) that isn’t the primary db connected to the dbt project
    • logging
      • Provides out-of-the-box functionality to log events for all dbt invocations, including run start, run end, model start, and model end.
      • Can slow down runs substantially
    • profiler
      • Implements dbt macros for profiling database relations and creating doc blocks and table schemas (schema.yml) containing said profiles
    • re_data
      • Dashboard for monitoring, macros, models
    • dbt-score
      • Linting of models and sources which have metadata associated with them: documentation, tests, types, etc.
    • dbt_set_similarity
    • spark-utils
      • Enables use of (most of) the {dbt-utils} macros on spark
    • dbt-utils
      • Ton of stuff for tests, queries, etc.

Data Validation and Unit Tests

Misc

  • Test both your raw data (sources) and transformed data (models). It’s important to test both to help you pinpoint where issues originate, helping in the debugging process.
    • Source Tests: Confirms that you are transforming quality data, allowing you to be confident in the input of your data models. When you are confident in the input, data quality issues can then be attributed to transformation code.
    • Model Tests: They come after source testing. When adding a test to a model, you are setting expectations on how the code you’ve written should transform your data.
      • If the data produced by a model passed a test when it was first written, but now fails, its most likely due to something in the data changing that your code can no longer handle.
  • Built-in support for CI/CD pipelines to test your “models” and stage them before committing to production
  • Run test - dbt test
  • See also
  • Most of the tests are defined in a models-type .yml file in the models directory
  • Can be applied to a model or a column
  • Test that the primary key column is unique and not null
  • Differences from Contracts

Macros

  • Custom (aka Singular) tests should be located in a tests folder.
    • dbt will evaluate the SQL statement.

    • The test will pass if no row is returned and failed if at least one or more rows are returned.

    • Useful for testing for some obscurity in the data

    • Example: Check for duplicate rows when joining two tables

      select
      a.id
      from {{ ref(‘table_a’) }} a
      left join {{ ref(‘table_b’) }} b
      on a.b_id = b.id
      group by a.id
      having count(b.id)>1
      • i.e. If I join table a with table b, there should only be one record for each unique id in table a
      • Process
        • Join the tables on their common field
        • Group them by the id that should be distinct
        • Count the number of duplicates created from the join.
          • This tells me that something is wrong with the data.
          • Add a having clause to filter out the non-dups

Mock Data

  • Data used for unit testing SQL code
  • To ensure completeness, it’s best if analysts or business stakeholders are the ones provide test cases or test data
  • Store in the “data” folder (typically .csv files)
    • each CSV file represents one source table
    • should be stored in a separate schema (e.g. unit_testing) from production data
    • dbt seed (see below, Other >> seeds) command is used to load mock data into the data warehouse

Tests

  • Misc
    • Notes from How to Test Your Data Models with dbt
    • Packages
      • dbt-expectations, dbt-utils, elementary
    • Besides Freshness, Uniqueness, and Not NULL, only add more tests if there’s an actionable next step when they fail.
    • Considerations
      • Which source fields are used in downstream data models? Are there qualities you should be testing to ensure the transformations downstream turn out as expected?
        • e.g. testing the datatype of an amount field that’s used in a SUM calculation downstream
      • Are there id fields used in JOINs downstream? Would adding a relationships test prevent joining records that don’t exist?
      • Are CASE statements used in your model code? Should there be an accepted_values test at the source to ensure all possible values are captured?
  • Freshness (docs) - Used to define the acceptable amount of time between the most recent record, and now, for a table
    • The built-in test only available for sources, but dbt_expectations has one for models

    • Freshness tests on sources should match the frequency of your ingestion syncs

      • See Production, Tools >> Ingestion >> Concepts >> Sync Methods
      • e.g. If you sync your data every hour because you expect new records in a source every hour, your freshness tests should reflect this.
    • Should only be configured for critical data sources. If you aren’t using a data source despite having it documented, freshness tests may only contribute to test bloat.

    • Identify the timestamp field that represents when your data was created or ingested.

      • Examples
        • A created_at field directly from the source itself
        • An autogenerated field by your ingestion tool
          • e.g. Airbyte’s _airbyte_extracted_at field which is automatically created in each data source
    • Example

      sources:
        - name: sales_pipeline
          freshness:
             warn_after:
                count: 2
                period: hour
             error_after:
                count: 2
                period: hour
          loaded_at_field: _airbyte_extracted_at
    • Example: dbt_expectations

      models:
        - name: sales_mrr
            - name: mrr_computed_at 
              tests:
                 - dbt_expectations.expect_row_values_to_have_recent_data:
                     datepart: week
                     interval: 1
  • Uniqueness - Tests that the value in a field is unique across all records.
    • Important to apply to the primary key of every source and model to ensure this field accurately represents a unique record

    • Example

      sources:
        - name: sales_pipeline
          columns:
            - name: opportunity_id
              tests:
                - unique
  • Not NULL - Should be applied to every field where you expect there to be a value, particularly on the primary key.
    • Example

      sources:
        - name: sales_pipeline
          columns:
            - name: opportunity_id
              tests:
                - unique
                - not_null
  • Recency - Asserts that a timestamp column in the reference model contains data that is at least as recent as the defined date interval.
    • Can be useful for incremental models

    • Example: dbt-utils

      version: 2
      
      models:
        - name: model_name
          tests:
            - dbt_utils.recency:
                datepart: day
                field: created_at
                interval: 1

Examples

  • Example: project.yml
  • Example: Unit Test
    • Add test to dbt_project.yml

      seeds:
        unit_testing:
          revenue:
            schema: unit_testing
            +tags:
              - unit_testing
      • Every file in the unit_testing/revenue folder will be loaded into unit_testing
      • Executing dbt build -s +tag:unit_testing will run all the seeds/models/tests/snapshots with tag unit_testing and their upstreams
    • Create macro that switches the source data in the model being tested from production data (i.e. using { source() } ) to mock data (i.e. using ref ) when a unit test is being run

      {% macro select_table(source_table, test_table) %}
            {% if var('unit_testing', false) == true %}
      
                  {{ return(test_table) }}
            {% else %}
                  {{ return(source_table) }}
            {% endif %}
      {% endmacro %}
      • Article calls this file “select_table.sql”
      • 2 inputs: “source_table” (production data) and “test_table” (mock data)
      • macro returns the appropriate table based on the variable in the dbt command
        • If the command doesn’t provide unit_testing variable or the value is false , then it returns source_table , otherwise it returns test_table.
    • Add macro code chunk to model

      {{ config
          (
              materialized='table',
              tags=['revenue']
          )
      }}
      {% set import_transaction = select_table(source('user_xiaoxu','transaction'), ref('revenue_transaction')) %}
      {% set import_vat = select_table(source('user_xiaoxu','vat'), ref('revenue_vat')) %}
      SELECT
          date
          , city_name
          , SUM(amount_net_booking) AS amount_net_booking
          , SUM(amount_net_booking * (1 - 1/(1 + vat_rate)))  AS amount_vat
      FROM {{ import_transaction }}
      LEFT JOIN {{ import_vat }} USING (city_name)
      GROUP BY 1,2
      • Inside the {%...%} , the macro “select_table” is called to set the local variables, “import_transaction” and “import_vat” which are later used in the model query
      • Model file is named “revenue2.sql”
    • Run model and test using mock data: dbt build -s +tag:unit_testing --vars 'unit_testing: true'

      • Run model with production data (aka source data): dbt build -s +tag:revenue --exclude tag:unit_testing
    • Compare output

      version: 2
      models:
        - name: revenue
          meta:
            owner: "@xiaoxu"
          tests:
            - dbt_utils.equality:
                compare_model: ref('revenue_expected')
                tags: ['unit_testing']
      • Model properties file that’s named revenue.yml in the models directory
      • By including tags: [‘unit_testing’] we can insure that we don’t run this test in production (see build code above with --exclude tag:unit_testing
    • Macro for comparing numeric output

      {% test advanced_equality(model, compare_model, round_columns=None) %}
      {% set compare_columns = adapter.get_columns_in_relation(model) | map(attribute='quoted') %}
      {% set compare_cols_csv = compare_columns | join(', ') %}
      {% if round_columns %}
          {% set round_columns_enriched = [] %}
          {% for col in round_columns %}
              {% do round_columns_enriched.append('round('+col+')') %}
          {% endfor %}
          {% set selected_columns = '* except(' + round_columns|join(', ') + "), " + round_columns_enriched|join(', ') %}
      {% else %}
          {% set round_columns_csv = None %}
          {% set selected_columns = '*' %}
      {% endif %}
      with a as (
          select {{compare_cols_csv}} from {{ model }}
      ),
      b as (
          select {{compare_cols_csv}} from {{ compare_model }}
      ),
      a_minus_b as (
          select {{ selected_columns }} from a
          {{ dbt_utils.except() }}
          select {{ selected_columns }} from b
      ),
      b_minus_a as (
          select {{ selected_columns }} from b
          {{ dbt_utils.except() }}
          select {{ selected_columns }} from a
      ),
      unioned as (
          select 'in_actual_not_in_expected' as which_diff, a_minus_b.* from a_minus_b
          union all
          select 'in_expected_not_in_actual' as which_diff, b_minus_a.* from b_minus_a
      )
      select * from unioned
      {% endtest %}
      • File called “advanced_equality.sql”