Normalization
Misc
- Used to organize relational databases and optimize storage space
- Normalization is best for OLTP systems (faster writes) while Denormalization is best for OLAP systems (faster reads)
- Normalization helps prevent these issues: (source)
- Also see Forms >> 2NF
- Update Anomaly
- When you update a record and unintentionally affect the validity of other records in your table.
- Example:
- You have a table with order_id, product_id, and product_name. You have two records that have a product_id of 11 and a product_name of “serenity crayons”.
- The company decides to rename the product to “calm crayons”. You only update the one record’s product_name but not the other. Now you have two records with conflicting data, making it unclear which one has the correct value.
- Deletion Anomaly
- When you delete a record in one place but not another. This causes inconsistencies across tables, making it hard to know which records are accurate.
- Example:
- Same duplicated record situation as the previous example, except now you decide to delete product_id 11.
- The issue is whether the deletion procedure you used deleted both records or only one of them.
- Insert Anomaly
- Insert anomalies prevent you from adding important data to your tables because of the lack of that data in another spot.
- Example: You need to add a product to an order record but can’t add the product in the system until someone orders it.
- Process
- Start with an Entity-Relationship (ER) Model
- Then apply normalization rules to refine the model and create a normalized database structure.
- Benefits
- Organizing according to data attributes to reduce or eliminate data redundancy (i.e. having the same data in multiple places).
- It gives you a set of rules to be able to start categorizing your data and forming a layout
- Creates a single point of truth
- Example: If you want to change the name of one airport in a transportation database, you will only need to update a single data value.
- By establishing structure in a database, you are able to help establish a couple of important things: data integrity and scalability.
- Integrity ensures that data is entered correctly and accurately. (e.g. field types, tests)
- Scalability ensures you have organized the data in a way that it is more computationally efficient when you start to run SQL queries.
- Enhanced user adoption and self-service analytics
- Flexibility and scalability to adapt to changing business needs
- Organizing according to data attributes to reduce or eliminate data redundancy (i.e. having the same data in multiple places).
- Notes from
- When Spreadsheets Aren’t Good Enough: A Lesson in Relational Databases
- Gives an example of normalizing a dataset through a MySQL analysis
- When Spreadsheets Aren’t Good Enough: A Lesson in Relational Databases
- Packages
- {{autonormalize}} - Analyzes transaction df and creates relational tables - python library for automated dataset normalization
- {dm} - A grammar of joined tables that provides a consistent set of verbs for consuming, creating, and deploying relational data models
dm objects encapsulate relational data models constructed from local data frames or lazy tables connected to an RDBMS.
dm objects support the full suite of dplyr data manipulation verbs along with additional methods for constructing and verifying relational data models, including key selection, key creation, and rigorous constraint checking.
Once a data model is complete, dm provides methods for deploying it to an RDBMS. This allows it to scale from datasets that fit in memory to databases with billions of rows.
It’s a big package, but for ER and normalization, these functions would seem to be useful
enum_pk_candidates
anddm_enum_fk_candidates
- Examines table(s) or df(s) for primary/foreign key candidatesdecompose_table
- Eextracts two new tables and creates a new key model_id that links both tables.dm_examine_constraints
- Checks referential integrity, i.e. every foreign key holds a primary key that is present in the parent table.- If a foreign key contains a reference where the corresponding row in the parent table is not available, that row is an orphan row and the database no longer has referential integrity.
dm_examine_cardinalities
- Compares number rows between tables using keysdm_draw
- Draws a diagram, a visual representation of the data model with edges linking primary and foreign keys.
Example: Check if a column has unique values (i.e. potential primary key)
any(duplicated(pixar_dm$academy[c("film", "award_type")])) #> [1] FALSE # or with {dm} check_key(pixar_dm$academy, film, award_type) check_key(pixar_dm$academy, film) #> Error in `abort_not_unique_key()`: #> ! (`film`) not a unique key of `pixar_dm$academy`
- film and award_type together can create a unique key (i.e. surrogate key) while just film cannot.
Example: Check for column(s) that point to a primary key in another table (i.e. potential foreign keys
all(pixar_dm$academy$film %in% pixar_dm$pixar_films$film) #> [1] TRUE # or with {dm} check_subset( $academy, film, pixar_dm$pixar_films, film) pixar_dm check_subset( $pixar_films, pixar_dm$academy, film) film, pixar_dm#> # A tibble: 4 × 1 #> film #> <chr> #> 1 Luca #> 2 Turning Red #> 3 Lightyear #> 4 NA #> Error in `abort_not_subset_of()`: #> ! Column (`film`) of table `pixar_dm$pixar_films` contains values (see examples above) that are not present in column (`film`) of table `pixar_dm$academy`.
- Tools
- ChartDB - Free and Open-source database diagrams editor, visualize, and design your DB with a query.
- Factors that influence normalizing dimension tables
- Data redundancy concerns: If minimizing redundancy is crucial, normalization might be preferred.
- Query performance priorities: If query performance is paramount, denormalization often offers advantages.
- Data consistency requirements: High consistency needs might favor normalization.
- Maintenance complexity: Denormalized dimensions can be simpler to maintain in some cases.
- Don’t use external IDs as primary keys
- Since you don’t control those IDs, they can change the format and break your queries.
Entity-Relationship (ER) Modeling
- Also see Databases, Modeling >> Design
- Components
- Entity: A real-world object or concept, like a customer or product.
- Attribute: A property of an entity, such as a name or price.
- Relationship: An association between entities, like a customer placing an order.
- Process
- Identify Entities
- Determine the main objects or concepts in the system. - Involves understanding the business requirements, goals, and processes. Engaging with domain experts, stakeholders, or end-users is crucial to gather accurate information
- These are typically nouns in the problem description. - e.g. An online shopping system, nouns like “Customer,” “Order,” “Product,” and “Payment”
- Types
- Temporal and Event-Driven Entities
- e.g. “Order” in an online shopping system is an event-driven entity because it represents the transaction of purchasing products.
- Group Related Nouns
- Related nouns might form a single entity with multiple attributes or multiple related entities.
- e.g. “Address” might be a part of the “Customer” entity, or it could be a separate entity linked to “Customer” if there’s a need to manage multiple addresses
- Composite Entities
- e.g. an “Order” entity may be composed of multiple “OrderItem” entities. In this case, “OrderItem” might be identified as a separate entity because it has its own attributes, like quantity and price, and is part of the overall “Order.”
- Temporal and Event-Driven Entities
- Questions
- Does this noun represent a real-world object, concept, or event that is crucial to the business?
- Does it have attributes or properties that need to be stored?
- Is it something about which data will be recorded over time?
- Eliminate Redundant or Irrelevant Nouns
- e.g. Both “Customer” and “User,” you may decide that only one of them is necessary, depending on the context.
- e.g. “Information” might be too general and could be split into more specific entities
- Validate the Entities
- Create use cases, scenarios, or user stories that involve these entities. This helps to ensure that all necessary entities are included and that they accurately represent the data and processes of the system.
- Identify Attributes
- Determine the properties or characteristics of each entity.
- These are often adjectives or descriptive phrases.
- Define Boundaries and Scope
- Involves deciding what attributes belong to which entity and what relationships exist between entities.
- e.g. With an “Employee” entity, you might need to decide whether “Salary” should be an attribute of “Employee” or a separate “Salary” entity linked to “Employee.
- Determine Relationships
- Identify how entities interact or relate to each other.
- Usually represented by verbs in the problem description
- Determine Cardinality
- Specify the number of instances of one entity that can be associated with the other entity in the relationship.
- Common types: one-to-one, one-to-many, many-to-many.
- Examples:
- Customer-Order: A customer can place multiple orders, and an order is placed by one customer (One-to-Many).
- Order-Product: An order can include multiple products, and a product can be included in multiple orders (Many-to-Many).
- Product-Supplier: A product can be supplied by multiple suppliers, and a supplier can supply multiple products (Many-to-Many)
- Determine Primary Keys
- A primary key is a unique identifier for each entity. For example, “CustomerID” might be the primary key for the “Customer” entity.
- Create ER Diagram
- Draw the diagram using standard notations.
- Entities are usually represented as rectangles, attributes as ovals, and relationships as diamonds.
- Review and Refine
- Validate the model with stakeholders.
- Refine as necessary based on feedback and additional requirements.
- Identify Entities
Forms
- Databases are often considered as “normalized” if they meet the Third Normal Form (3NF)
- Notes from
- A Complete Guide to Database Normalization in SQL
- Also gives an example of normalizing a dataset through a posgresSQL analysis
- SQL Explained: Normal Forms
- More examples of applying the first 3 forms
- The 5 Forms of Normalization
- A Complete Guide to Database Normalization in SQL
- First Normal Form (1NF)
- No redundant data and there’s a primary key
- Every value in each column of a table must be reduced to its most simple value, also known as atomic.
- An atomic value is one where there are no sets of values within a column. (i.e. 1 value per cell)
- There are no repeating columns or rows within the database.
- Each table should have a primary key which can be defined as a non-null, unique value that identifies each row insertion.
- Second Normal Form (2NF)
- Conforms to first normal form rules.
- Adjust columns so that each table only contains data relating to the primary key.
- Any column that does not depend on the primary key should be split into its own table.
- Foreign keys are used to establish relationships between tables.
- Prevents the possibility of update, deletion, and insertion anomalies. (See Misc)
- Example:
- You have a table with a compount primary key, product_id and order_id, along with another field called order_type.
- Since order_type depends on order_id but not product_id, it doesn’t relate to the entire primary key, therefore it does not follow 2NF.
- order_type should be in a table where order_id is the only primary id
- Third Normal Form (3NF)
- Conforms to both first and second normal form rules.
- Necessary to shift or remove columns (aka attributes) that are transitively dependent, which means they rely on other columns that aren’t foreign or primary keys.
- Example:
- In an orders table, a field called product_name should only depend on the related key, product_id, and nothing else. Since it also depends on order_id in this table, this does not follow 3NF.
- Not 3NF: It’s possible for a product_name to be updated for order_id 123 but not for order_id 111, despite both orders having the same product_id.
- Fourth Normal Form (4NF)
- Conforms to 1NF, 2NF, and 3NF
- Every value in the table must depend on the primary key.
- Example:
- Not 4NF: A table with different combinations of custom t-shirts that can be ordered. Customers can choose from a v-neck or crew-neck, black or white, and logo or no logo.
- If a new style of t-shirt is added, it would be possible to create data inconsistencies in the table. This is because the color and design of the t-shirt do not depend on the style
- 4NF: Different tables for color and design are created, and each table’s values are dependent on the style, which ensures the data is consistent across all combinations.
- Not 4NF: A table with different combinations of custom t-shirts that can be ordered. Customers can choose from a v-neck or crew-neck, black or white, and logo or no logo.
- Fifth Normal Form (5NF)
- Conforms to previous NFs
- A table can be decomposed into smaller tables. Without losing any information when those tables are rejoined.