Other

Misc

Non-relational databases

  • More scalable than relational databases, but lack SQL support
  • Example: MongoDB (see the sketch below)
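  • A minimal sketch of the document model in Python with pymongo (the connection string, database, and collection names are assumptions for illustration, not from these notes; it assumes a local MongoDB instance):

        from pymongo import MongoClient

        client = MongoClient("mongodb://localhost:27017")
        users = client["example_db"]["users"]

        # Documents in the same collection do not need a fixed schema,
        # which is part of what makes scaling out easier than with
        # relational tables.
        users.insert_one({"user_id": 1, "name": "Ada", "logins": ["2024-01-01"]})
        users.insert_one({"user_id": 2, "name": "Grace", "plan": "premium"})

        # Queries use filter documents rather than SQL.
        print(users.find_one({"user_id": 2}))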

Data Mart

  • Required for real-time artificial intelligence and for data analysis on different subsets of business operations.
  • Scaled down version of a data warehouse with a more limited scope for groups of end users in different business units or departments
    • Gives groups more control to customize, but also silos the data off from the rest of the company
    • Improves query response times because of the reduced volume of data
  • Data is partitioned, which allows very granular access control privileges.
  • Data can be segmented and stored on different hardware/software platforms.
  • Schemas are the same as those used in data warehouses
  • Data Marts should be for unique purposes, but if you anticipate use cases where multiple marts are used simultaneously for a project, then the marts should share a common dimensional table.
  • Types
    • Dependent - Gets populated from a single centralized Data Warehouse
      • Usually built using the same architecture as the data warehouse and may use the same data modelling techniques, such as star schema or snowflake schema.
      • Can be built quickly and at a lower cost than other types of data marts.
      • Ensures consistency in data definitions and reduces the risk of data discrepancies.
      • Less flexible and not optimized for an individual group's needs
    • Independent - Built from various data sources, including transactional databases, external APIs, and flat files. Operates independently of a data warehouse.
      • Frequently constructed utilizing dimensional modelling techniques.
      • Can be faster than dependent data marts because they are optimized for a specific department or business unit.
      • May duplicate data that is already in the data warehouse, which can lead to inconsistencies and data quality issues
      • Require more resources to build and maintain compared to dependent data marts.
      • Suited for short-term projects or small groups.
    • Hybrid - This type of data mart can take data from data warehouses or operational systems.
      • Suitable for businesses with multiple databases and a need for quick turnaround.
      • Complex to manage due to the integration of data from various sources. Requires careful planning to ensure data consistency and integrity.
  • Process
    • Create an initial list of questions the data mart will be used to answer
    • Create Schema documents
      • Include as much information as possible in the schema document so it can serve as a reference if anyone has questions about the data in the future, instead of those questions coming to you
      • Add any business logic that needs to be applied when reading in the data, such as filters and transformation logic, and note the time frame of data needed and the frequency of updates
      • Example

        • Notice the business logic for Field Name: wau (weekly active users) is the distinct count of users whose login date falls between current date-7 and current date-1. We use current date-1 because the most recent data is typically from yesterday, and yesterday plus the six days before it gives the 7 days needed to calculate wau (see the sketch at the end of this section).
        • Field Name: update_date should be set to the last time the ETL was run for this table, so users know when the data was last refreshed. ETL jobs occasionally fail, and this field helps confirm whether the table was refreshed for the day
    • Create sample tables according to the schema document
      • Source production data so you can validate the tables
      • After the sample tables pass your QA (quality assurance) checks, work with the data engineer to backfill any history if needed and then have them put the ETL code into production
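  • A rough sketch of the wau and update_date logic from the schema-document example above, written in Python with pandas (the logins table, its column names, and the sample dates are assumptions for illustration, not a definitive implementation):

        from datetime import date, timedelta
        import pandas as pd

        # Stand-in source data: one row per login event.
        logins = pd.DataFrame({
            "user_id": [1, 1, 2, 3, 2],
            "login_date": pd.to_datetime(
                ["2024-05-01", "2024-05-03", "2024-05-05",
                 "2024-05-06", "2024-05-06"]),
        })

        run_date = date(2024, 5, 7)                         # stand-in for the current date
        end = pd.Timestamp(run_date - timedelta(days=1))    # current date - 1 (yesterday)
        start = pd.Timestamp(run_date - timedelta(days=7))  # current date - 7, so the window
                                                            # covers yesterday plus the six days before it

        # wau = distinct count of users whose login date falls in the 7-day window.
        in_window = (logins["login_date"] >= start) & (logins["login_date"] <= end)
        wau = logins.loc[in_window, "user_id"].nunique()

        mart_row = {
            "snapshot_date": end.date(),  # the day the metric describes
            "wau": wau,
            "update_date": run_date,      # when this ETL run happened
        }
        print(mart_row)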

Transactional Store

  • Optimized for row-based operations such as reading and writing individual records while maintaining data integrity
  • Not specifically built for analytics, but can be used for analytic queries as well as low-latency monitoring
  • ACID (atomicity, consistency, isolation, durability) compliant, meaning they guarantee data validity despite errors and ensure that data does not become corrupted by failures (see the sketch after this list).
    • Crucial for business use cases that require a high level of data integrity, such as transactions in banking.
  • Row-based storage makes them better at writing data.
    • In contrast, data warehouses are column-based and better for reading data
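  • A minimal sketch of the atomicity guarantee using Python's built-in sqlite3 (SQLite is only a stand-in for a transactional store here; the accounts table, balances, and the failing transfer are made up for illustration):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE accounts (name TEXT PRIMARY KEY,"
            " balance INTEGER CHECK (balance >= 0))")
        conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                         [("alice", 100), ("bob", 50)])
        conn.commit()

        try:
            with conn:  # one transaction: both statements commit together or not at all
                conn.execute(
                    "UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
                # This debit violates the CHECK constraint, so the whole
                # transaction (including bob's credit above) is rolled back.
                conn.execute(
                    "UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        except sqlite3.IntegrityError:
            pass

        # Balances are unchanged: the half-finished transfer never became visible.
        print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())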

Operational Data Stores

  • Like a staging area for data required for projects
  • Provide fine-grained, non-aggregated data
  • Usually complementary to a data warehouse
  • The general purpose is to integrate data from different sources into a single structure via data cleaning, resolving redundancies, and establishing business rules (sketched below).
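  • A rough sketch of that integration step in Python with pandas (the two source systems, their columns, and the business rule are assumptions for illustration):

        import pandas as pd

        # Stand-in records pulled from two different source systems.
        crm = pd.DataFrame({
            "customer_id": [1, 2],
            "email": ["a@example.com", "B@example.com "],
            "status": ["active", "ACTIVE"],
        })
        billing = pd.DataFrame({
            "customer_id": [2, 3],
            "email": ["b@example.com", "c@example.com"],
            "status": ["active", "closed"],
        })

        ods = pd.concat([crm, billing], ignore_index=True)

        # Data cleaning: normalize formatting before comparing records.
        ods["email"] = ods["email"].str.strip().str.lower()
        ods["status"] = ods["status"].str.lower()

        # Resolve redundancies: keep one row per customer.
        ods = ods.drop_duplicates(subset="customer_id", keep="first")

        # Business rule: only non-closed customers flow downstream.
        ods = ods[ods["status"] != "closed"]
        print(ods)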