Normalization

Misc

  • Used to organize relational databases
  • Process
    • Start with an Entity-Relationship (ER) Model
    • Then apply normalization rules to refine the model and create a normalized database structure.
  • Benefits
    • Organizing according to data attributes to reduce or eliminate data redundancy (i.e. having the same data in multiple places).
      • It gives you a set of rules to be able to start categorizing your data and forming a layout
      • Creates a single point of truth
        • Example: If you want to change the name of one airport in a transportation database, you will only need to update a single data value.
    • By establishing structure in a database, you are able to help establish a couple of important things: data integrity and scalability.
      • Integrity ensures that data is entered correctly and accurately. (e.g. field types, tests)
      • Scalability ensures you have organized the data in a way that it is more computationally efficient when you start to run SQL queries.
    • Enhanced user adoption and self-service analytics
    • Flexibility and scalability to adapt to changing business needs
  • Notes from
  • Packages
    • {{autonormalize}} - Analyzes transaction df and creates relational tables - python library for automated dataset normalization
    • {dm} - A grammar of joined tables that provides a consistent set of verbs for consuming, creating, and deploying relational data models
      • dm objects encapsulate relational data models constructed from local data frames or lazy tables connected to an RDBMS.

      • dm objects support the full suite of dplyr data manipulation verbs along with additional methods for constructing and verifying relational data models, including key selection, key creation, and rigorous constraint checking.

      • Once a data model is complete, dm provides methods for deploying it to an RDBMS. This allows it to scale from datasets that fit in memory to databases with billions of rows.

      • It’s a big package, but for ER and normalization, these functions would seem to be useful

        • enum_pk_candidates and dm_enum_fk_candidates - Examines table(s) or df(s) for primary/foreign key candidates
        • decompose_table - Eextracts two new tables and creates a new key model_id that links both tables.
        • dm_examine_constraints - Checks referential integrity, i.e. every foreign key holds a primary key that is present in the parent table.
          • If a foreign key contains a reference where the corresponding row in the parent table is not available, that row is an orphan row and the database no longer has referential integrity.
        • dm_examine_cardinalities - Compares number rows between tables using keys
        • dm_draw - Draws a diagram, a visual representation of the data model with edges linking primary and foreign keys.
      • Example: Check if a column has unique values (i.e. potential primary key)

        any(duplicated(pixar_dm$academy[c("film", "award_type")]))
        #> [1] FALSE
        
        # or with {dm}
        check_key(pixar_dm$academy, film, award_type)
        check_key(pixar_dm$academy, film)
        #> Error in `abort_not_unique_key()`:
        #> ! (`film`) not a unique key of `pixar_dm$academy`
        • film and award_type together can create a unique key (i.e. surrogate key) while just film cannot.
      • Example: Check for column(s) that point to a primary key in another table (i.e. potential foreign keys

        all(pixar_dm$academy$film %in% pixar_dm$pixar_films$film)
        #> [1] TRUE
        
        # or with {dm}
        check_subset(
          pixar_dm$academy, film, 
          pixar_dm$pixar_films, film)
        
        check_subset(
          pixar_dm$pixar_films, 
          film, pixar_dm$academy, film)
        #> # A tibble: 4 × 1
        #>   film       
        #>   <chr>      
        #> 1 Luca       
        #> 2 Turning Red
        #> 3 Lightyear  
        #> 4 NA         
        #> Error in `abort_not_subset_of()`:
        #> ! Column (`film`) of table `pixar_dm$pixar_films` contains values (see examples above) that are not present in column (`film`) of table `pixar_dm$academy`.
  • Tools
    • ChartDB - Free and Open-source database diagrams editor, visualize, and design your DB with a query.
  • Factors that influence normalizing dimension tables
    • Data redundancy concerns: If minimizing redundancy is crucial, normalization might be preferred.
    • Query performance priorities: If query performance is paramount, denormalization often offers advantages.
    • Data consistency requirements: High consistency needs might favor normalization.
    • Maintenance complexity: Denormalized dimensions can be simpler to maintain in some cases.
  • Don’t use external IDs as primary keys
    • Since you don’t control those IDs, they can change the format and break your queries.

Entity-Relationship (ER) Modeling

  • Also see Databases, Modeling >> Design
  • Components
    • Entity: A real-world object or concept, like a customer or product.
    • Attribute: A property of an entity, such as a name or price.
    • Relationship: An association between entities, like a customer placing an order.
  • Process
    • Identify Entities
      • Determine the main objects or concepts in the system. - Involves understanding the business requirements, goals, and processes. Engaging with domain experts, stakeholders, or end-users is crucial to gather accurate information
      • These are typically nouns in the problem description. - e.g. An online shopping system, nouns like “Customer,” “Order,” “Product,” and “Payment”
      • Types
        • Temporal and Event-Driven Entities
          • e.g. “Order” in an online shopping system is an event-driven entity because it represents the transaction of purchasing products.
        • Group Related Nouns
          • Related nouns might form a single entity with multiple attributes or multiple related entities.
          • e.g. “Address” might be a part of the “Customer” entity, or it could be a separate entity linked to “Customer” if there’s a need to manage multiple addresses
        • Composite Entities
          • e.g. an “Order” entity may be composed of multiple “OrderItem” entities. In this case, “OrderItem” might be identified as a separate entity because it has its own attributes, like quantity and price, and is part of the overall “Order.”
      • Questions
        • Does this noun represent a real-world object, concept, or event that is crucial to the business?
        • Does it have attributes or properties that need to be stored?
        • Is it something about which data will be recorded over time?
      • Eliminate Redundant or Irrelevant Nouns
        • e.g. Both “Customer” and “User,” you may decide that only one of them is necessary, depending on the context.
        • e.g. “Information” might be too general and could be split into more specific entities
      • Validate the Entities
        • Create use cases, scenarios, or user stories that involve these entities. This helps to ensure that all necessary entities are included and that they accurately represent the data and processes of the system.
    • Identify Attributes
      • Determine the properties or characteristics of each entity.
      • These are often adjectives or descriptive phrases.
      • Define Boundaries and Scope
        • Involves deciding what attributes belong to which entity and what relationships exist between entities.
        • e.g. With an “Employee” entity, you might need to decide whether “Salary” should be an attribute of “Employee” or a separate “Salary” entity linked to “Employee.
    • Determine Relationships
      • Identify how entities interact or relate to each other.
      • Usually represented by verbs in the problem description
    • Determine Cardinality
      • Specify the number of instances of one entity that can be associated with the other entity in the relationship.
      • Common types: one-to-one, one-to-many, many-to-many.
      • Examples:
        • Customer-Order: A customer can place multiple orders, and an order is placed by one customer (One-to-Many).
        • Order-Product: An order can include multiple products, and a product can be included in multiple orders (Many-to-Many).
        • Product-Supplier: A product can be supplied by multiple suppliers, and a supplier can supply multiple products (Many-to-Many)
    • Determine Primary Keys
      • A primary key is a unique identifier for each entity. For example, “CustomerID” might be the primary key for the “Customer” entity.
    • Create ER Diagram
      • Draw the diagram using standard notations.
      • Entities are usually represented as rectangles, attributes as ovals, and relationships as diamonds.
    • Review and Refine
      • Validate the model with stakeholders.
      • Refine as necessary based on feedback and additional requirements.

Forms

  • Databases are often considered as “normalized” if they meet the third normal form
  • See
  • First Normal Form (1NF)
    • Every value in each column of a table must be reduced to its most simple value, also known as atomic.
      • An atomic value is one where there are no sets of values within a column. (i.e. 1 value per cell)
    • There are no repeating columns or rows within the database.
    • Each table should have a primary key which can be defined as a non-null, unique value that identifies each row insertion.
  • Second Normal Form (2NF)
    • Conforms to first normal form rules.
    • Adjust columns so that each table only contains data relating to the primary key.
    • Foreign keys are used to establish relationships between tables.
  • Third Normal Form (3NF)
    • Conforms to both first and second normal form rules.
    • Necessary to shift or remove columns (attributes) that are transitively dependent, which means they rely on other columns that aren’t foreign or primary keys.