Lakes

Misc

  • Data is stored either in a structured format or in its raw, native format, without any transformation, at any scale.
    • Handling both types allows all data to be centralized, which means it can be better organized and more easily accessed.
  • An optimal fit for bulk data types such as server logs, clickstreams, social media, or sensor data.
  • Ideal use cases
    • Backup for logs
    • Raw sensor data for your IoT application
    • Text files from user interviews
    • Images
    • Trained machine learning models (with the database simply storing the path to the object)
  • Tools
    • Rclone - A command-line program to manage files on cloud storage. It is a feature-rich alternative to cloud vendors’ web storage interfaces. Over 70 cloud storage products support rclone, including S3 object stores.
  • Lower storage costs, owing to the largely open-source ecosystem around data lakes and their lack of an enforced structure.
  • On-prem set-ups have to manage hardware and environments.
    • If you wanted to separate stuff like test data from production data, you also probably had to set up new hardware.
    • If you had data in one physical environment that had to be used for analytical purposes in another physical environment, you probably had to copy that data over to the new replica environment.
      • You have to keep a link back to the source environment to ensure that the replica environment stays up-to-date, and your operational source data most likely isn’t in one single environment; you likely have tens, if not hundreds, of operational sources where you gather data.
    • Where on-prem set-ups focus on isolating data with physical infrastructure, cloud computing shifts to focus on isolating data using security policies.
  • Object Storage Systems
    • Cloud data lakes provide organizations with additional opportunities to simplify data management by being accessible everywhere to all applications as needed
    • Organized as collections of files within directory structures, often with multiple files in one directory representing a single table.
      • Pros: highly accessible and flexible
      • Metadata Catalogs are used to answer these questions:
        • What is the schema of a dataset, including columns and data types?
        • Which files comprise the dataset, and how are they organized (e.g., partitions)?
        • How do different applications coordinate changes to the dataset, including both changes to the definition of the dataset and changes to the data?
      • Hive Metastore (HMS) and AWS Glue Data Catalog are two popular catalog options
        • Contain the schema, table structure, and data location for datasets within data lake storage; a minimal sketch of reading these from Glue appears at the end of this list.
    • Issues:
      • Does not coordinate data changes or schema evolution between applications in a transactionally consistent manner.
        • This creates the need for data staging areas, and that extra layer makes project pipelines brittle.
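
A minimal sketch of how the first two catalog questions get answered in practice, assuming a table is already registered in the AWS Glue Data Catalog; the `analytics`/`events` names, region, and bucket layout are made up for illustration:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# 1. What is the schema of the dataset? Ask the catalog, not the files.
table = glue.get_table(DatabaseName="analytics", Name="events")["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
print("Partitioned by:", [k["Name"] for k in table.get("PartitionKeys", [])])

# 2. Which files comprise the dataset? The catalog points at an S3 prefix;
#    the "table" is simply the collection of objects under that prefix.
location = table["StorageDescriptor"]["Location"]  # e.g. s3://my-lake/analytics/events/
bucket, _, prefix = location.removeprefix("s3://").partition("/")
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
    print(obj["Key"], obj["Size"])
```

Note that the third question, coordinating concurrent changes transactionally, is exactly what HMS and Glue do not answer on their own; that is the gap the table formats below (Iceberg, Hudi, Delta Lake) aim to fill.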

Brands

  • Hadoop
    • Traditional format for data lakes
  • Amazon S3
    • Add a hash to your bucket names and explicitly specify your bucket region!! (See the sketch at the end of this list.)
      • Some dude casually named his bucket and was charged >$1K in a day because third-party software was unintentionally hitting it. Even without access, the unauthorized requests alone generated charges. (link)
      • Evidently AWS is coming up with a solution (link)
    • Try to stay under 1,000 entries per level of hierarchy when designing the partitioning scheme. Beyond that, listings get paginated and things get expensive.
    • AWS Athena ($5/TB scanned)
      • AWS Athena is serverless and intended for ad-hoc SQL queries against data on AWS S3
  • Microsoft Azure Data Lake Storage (ADLS)
  • Minio
    • Open-Source alternative to AWS S3 storage.
    • Given that S3 often stores customer PII (either inadvertently via screenshots or in actual structured JSON files), Minio is a great alternative for companies mindful of who has access to user data.
      • Of course, AWS claims that its personnel don’t have direct access to customer data, but since AWS is closed-source, that statement is ultimately a matter of trust.
  • Databricks Delta Lake
  • Google Cloud Storage
    • 5 GB of US regional storage free per month, not charged against your credits.
  • Apache Hudi - A transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow, old-school batch processing with an incremental processing framework for low-latency, minute-level analytics.
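
A hedged sketch of the two S3/Athena tips above using boto3; the bucket name, database, table, and output location are placeholders, not real resources:

```python
import secrets
import boto3

region = "us-west-2"

# Explicit region plus a random suffix so the bucket name can't be guessed;
# requests against a guessable name can still show up on your bill.
bucket = f"my-data-lake-{secrets.token_hex(4)}"
s3 = boto3.client("s3", region_name=region)
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# Ad-hoc SQL over files already sitting in S3 via Athena. Billing is per TB
# scanned, so restricting the query to a partition keeps costs down.
athena = boto3.client("athena", region_name=region)
run = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM requests WHERE day = '2024-01-01' GROUP BY status",
    QueryExecutionContext={"Database": "logs"},
    ResultConfiguration={"OutputLocation": f"s3://{bucket}/athena-results/"},
)
print("Started query:", run["QueryExecutionId"])
```

Athena writes its result files back to the given OutputLocation, so that prefix itself becomes part of the lake and is worth keeping separate from raw data.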

Apache Iceberg

  • Open source table format that addresses the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments.
    • Other currently popular open table formats are Hudi and Delta Lake.
  • Interfaces
    • DuckDB can query Iceberg tables in S3 with its iceberg extension (docs); see the sketch at the end of this list.
    • Athena can create Iceberg Tables
    • Google Cloud has something called BigLake that can create Iceberg tables over data in Cloud Storage
  • Features
    • Transactional consistency between multiple applications where files can be added, removed or modified atomically, with full read isolation and multiple concurrent writes
    • Full schema evolution to track changes to a table over time
    • Time travel to query historical data and verify changes between updates
    • Partition layout and evolution enabling updates to partition schemes as queries and data volumes change without relying on hidden partitions or physical directories
    • Rollback to prior versions to quickly correct issues and return tables to a known good state
    • Advanced planning and filtering capabilities for high performance on large data volumes
    • The full history is maintained within the Iceberg table format and without storage system dependencies
  • Components
    • Iceberg Catalog - Maps table names to the location of their current metadata file, and must support atomic operations to update that pointer when needed.
    • Metadata Layer (with metadata files, manifest lists, and manifest files) - Stores all the enriching information about the constituent files for every snapshot/transaction
      • e.g. table schema, configurations for the partitioning, etc.
    • Data Layer - Associated with the raw data files
  • Supports common industry-standard file formats, including Parquet, ORC and Avro
  • Supported by major data lake engines including Dremio, Spark, Hive and Presto
  • Queries on tables that do not use or save file-level metadata (e.g., Hive) typically involve costly list and scan operations
  • Any application that can deal with Parquet files can use Iceberg tables and the Iceberg API to query more efficiently
  • Comparison
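
A minimal sketch of the DuckDB route mentioned under Interfaces, assuming an Iceberg table already exists at the given S3 prefix; the bucket/path and credentials are placeholders, and the extension/secret syntax can differ between DuckDB versions:

```python
import duckdb

con = duckdb.connect()

# The iceberg extension reads the table's metadata and manifest files directly;
# httpfs provides the S3 filesystem underneath.
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("""
    CREATE SECRET lake_creds (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    );
""")

# Point iceberg_scan at the table's root prefix; column selection and filters
# are pushed down using Iceberg's file-level metadata instead of listing S3.
df = con.execute("""
    SELECT event_type, COUNT(*) AS n
    FROM iceberg_scan('s3://my-lake/warehouse/analytics/events')
    GROUP BY event_type
""").df()
print(df)
```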

Lakehouse

  • The key idea behind a Lakehouse is to combine the best of a Data Lake and a Data Warehouse (a short sketch follows this list).
    • Data Lakes provide a lot of flexibility (e.g. handling both structured and unstructured data) and low storage costs.
    • Data Warehouses can provide really good query performance and ACID guarantees.
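
A small sketch of that combination using the `deltalake` Python package (delta-rs) against a local path, purely as an illustration; the path and data are made up, and the same calls also work against object storage URIs:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "/tmp/lakehouse/events"  # could equally be an s3:// URI

# Each write is an atomic transaction recorded in the table's log, even though
# the underlying storage is just Parquet files sitting in a lake.
write_deltalake(path, pd.DataFrame({"user": ["a", "b"], "clicks": [3, 5]}))
write_deltalake(path, pd.DataFrame({"user": ["c"], "clicks": [7]}), mode="append")

dt = DeltaTable(path)
print(dt.version())    # latest version (1 after the two writes above)
print(dt.to_pandas())  # warehouse-style reads over lake storage

# Time travel: read the table as it was after the first transaction.
print(DeltaTable(path, version=0).to_pandas())
```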