Lakes
Misc
- Data is stored in structured format or in its raw native format without any transformation at any scale.
- Handling both types allows all data to be centralized, which means it can be better organized and more easily accessed.
- A good fit for bulk data types such as server logs, clickstreams, social media, or sensor data.
- Ideal use cases
- Backup for logs
- Raw sensor data for your IoT application
- Text files from user interviews
- Images
- Trained machine learning models (with the database simply storing the path to the object)
- Tools
- Rclone - A command-line program to manage files on cloud storage. It is a feature-rich alternative to cloud vendors’ web storage interfaces. Over 70 cloud storage products support rclone including S3 object stores
- {pins}, {{pins}}
- Posit’s Pins Docs
- Convenient storage method
- Can be automatically versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes
- Needs to be manually refreshed
- i.e. update the data model, etc., then run the script that writes it to the board.
- Use when:
- Object is less than 1 GB
- Use {butcher} for large model objects
- Some model objects store training data
- Benefits
- Just need the pins board name and name of pinned object
- Think the set-up is supposed to be easy
- Easy to share; don’t need to understand databases
- Boards
- Folders to share on a networked drive or with services like Dropbox
- Posit Connect, Amazon S3, Google Cloud Storage, Azure storage, Databricks, and Microsoft 365 (OneDrive and SharePoint)
Example: Pull data, clean and write to board
    board <- pins::board_connect(
      auth = "manual",
      server = Sys.getenv("CONNECT_SERVER"),
      key = Sys.getenv("CONNECT_API_KEY")
    )
    # code to pull and clean data
    pins::pin_write(
      board = board,
      x = clean_data,
      name = "isabella.velasquez/shiny-calendar-pin"
    )
- object-store-rs
- Features
- Easy to install with no Python dependencies.
- Sync and async API.
- Streaming downloads with configurable chunking.
- Automatically supports multipart uploads under the hood for large file objects.
- The underlying Rust library is production quality and used in large scale production systems, such as the Rust package registry crates.io.
- Simple API with static type checking.
- Helpers for constructing from environment variables and boto3.Session objects
- Supported object storage providers include:
- Amazon S3 and S3-compliant APIs like Cloudflare R2
- Google Cloud Storage
- Azure Blob Gen1 and Gen2 accounts (including ADLS Gen2)
- Local filesystem
- In-memory storage
- Features
- Lower storage costs due to their more open-source nature and undefined structure
- On-prem set-ups have to manage hardware and environments
- If you wanted to separate stuff like test data from production data, you also probably had to set up new hardware.
- If you had data in one physical environment that had to be used for analytical purposes in another physical environment, you probably had to copy that data over to the new replica environment.
- You have to keep a tie to the source environment to ensure that the data in the replica environment is still up to date, and your operational source data most likely isn’t in one single environment. You likely have tens, if not hundreds, of operational sources where you gather data.
- Where on-prem set-ups focus on isolating data with physical infrastructure, cloud computing shifts to focus on isolating data using security policies.
- Object Storage Systems
- Cloud data lakes provide organizations with additional opportunities to simplify data management by being accessible everywhere to all applications as needed
- Organized as collections of files within directory structures, often with multiple files in one directory representing a single table.
- Pros: highly accessible and flexible
- Metadata Catalogs are used to answer these questions:
- What is the schema of a dataset, including columns and data types
- Which files comprise the dataset and how are they organized (e.g., partitions)
- How different applications coordinate changes to the dataset, including both changes to the definition of the dataset and changes to data
- Hive Metastore (HMS) and AWS Glue Data Catalog are two popular catalog options
- Contain the schema, table structure and data location for datasets within data lake storage
- Issues:
- Does not coordinate data changes or schema evolution between applications in a transactionally consistent manner.
- Creates the necessity for data staging areas and this extra layer makes project pipelines brittle
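To make the catalog questions above concrete, here is a minimal Python sketch of the kind of record a Hive-style catalog keeps for one table: schema, partition keys, and data location. The names `TableEntry`, `Column`, `add_partition`, and the example bucket path are made up for illustration; this is not the Hive Metastore or Glue API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str


@dataclass
class TableEntry:
    """Hypothetical catalog record: schema, partition keys, data location."""

    name: str
    location: str                  # root directory of the table in object storage
    columns: list[Column]
    partition_keys: list[str] = field(default_factory=list)
    partitions: dict[str, str] = field(default_factory=dict)  # spec -> directory

    def add_partition(self, **values: str) -> str:
        """Register a partition directory such as <location>/dt=2024-01-01."""
        missing = [k for k in self.partition_keys if k not in values]
        if missing:
            raise ValueError(f"missing partition keys: {missing}")
        spec = "/".join(f"{k}={values[k]}" for k in self.partition_keys)
        path = f"{self.location}/{spec}"
        self.partitions[spec] = path
        return path


events = TableEntry(
    name="events",
    location="s3://example-bucket/warehouse/events",  # placeholder path
    columns=[Column("user_id", "bigint"), Column("action", "string")],
    partition_keys=["dt"],
)
events.add_partition(dt="2024-01-01")
events.add_partition(dt="2024-01-02")

# The record points at directories, not individual files, so a reader
# still has to list each directory to find the files behind the table.
print(sorted(events.partitions))
```

Nothing in this record makes two writers' updates atomic, which is the gap listed under Issues.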
Brands
- Hadoop
- Traditional format for data lakes
- Amazon S3
- Add a hash to your bucket names and explicitly specify your bucket region!!
- Try to stay <1000 entries per level of hierarchy when designing the partitioning format. Otherwise there is paging and things get expensive.
- AWS Athena ($5/TB scanned)
- AWS Athena is serverless and intended for ad-hoc SQL queries against data on AWS S3
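A rough Python sketch of the two sizing notes above: checking fan-out per level of a partitioned key layout, and estimating Athena cost from bytes scanned. The `year=/month=/day=` layout, the helper names, and the file counts are illustrative assumptions. The cost function treats 1 TB as 10^12 bytes, which is also an assumption; check the current pricing page for the exact rules.

```python
from collections import defaultdict
from datetime import date, timedelta

ATHENA_USD_PER_TB = 5.0        # price quoted in these notes
BYTES_PER_TB = 10**12          # assumption: decimal terabytes
MAX_ENTRIES_PER_LEVEL = 1000   # guideline from these notes


def partition_key(table: str, day: date, filename: str) -> str:
    """Hive-style key: table/year=YYYY/month=MM/day=DD/filename."""
    return f"{table}/year={day.year}/month={day.month:02d}/day={day.day:02d}/{filename}"


def max_fanout(keys: list[str]) -> int:
    """Largest number of distinct children under any single prefix."""
    children = defaultdict(set)
    for key in keys:
        parts = key.split("/")
        for depth in range(len(parts)):
            children["/".join(parts[:depth])].add(parts[depth])
    return max(len(c) for c in children.values())


def athena_cost_usd(bytes_scanned: int) -> float:
    """Estimated query cost from bytes scanned, ignoring any minimums."""
    return bytes_scanned / BYTES_PER_TB * ATHENA_USD_PER_TB


start = date(2023, 1, 1)
days = [start + timedelta(days=d) for d in range(365)]

# One year of data, four files per day, in a nested layout and a flat one.
keys = [partition_key("events", d, f"part-{i:04d}.parquet") for d in days for i in range(4)]
flat_keys = [f"events/{d.isoformat()}-{i:04d}.parquet" for d in days for i in range(4)]

print(max_fanout(keys))                # 31: the busiest level is days within a month
print(max_fanout(flat_keys))           # 1460: everything sits under one prefix
print(max_fanout(flat_keys) > MAX_ENTRIES_PER_LEVEL)
print(athena_cost_usd(250 * 10**9))    # 1.25 USD for 250 GB scanned
```

The nested layout also lets a query that filters on date read only the matching prefixes, which reduces the bytes scanned that Athena bills for.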
- Microsoft Azure Data Lake Storage (ADLS)
- Minio
- Open-Source alternative to AWS S3 storage.
- Given that S3 often stores customer PII (either inadvertently via screenshots or in actual structured JSON files), Minio is a great alternative for companies mindful of who has access to user data.
- Of course, AWS claims that AWS personnel don’t have direct access to customer data, but because AWS is closed-source, that claim has to be taken on trust.
- Databricks Delta Lake
- Google Cloud Storage
- 5 GB of US regional storage free per month, not charged against your credits.
- Apache Hudi - A transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow old-school batch data processing with a powerful new incremental processing framework for low latency minute-level analytics.
Apache Iceberg
- Open source table format that addresses the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments.
- Other currently popular open table formats are Hudi and Delta Lake.
- Interfaces
- Features
- Transactional consistency between multiple applications where files can be added, removed or modified atomically, with full read isolation and multiple concurrent writes
- Full schema evolution to track changes to a table over time
- Time travel to query historical data and verify changes between updates
- Partition layout and evolution, enabling updates to partition schemes as queries and data volumes change without relying on physical directories (Iceberg tracks partition values in table metadata, which it calls hidden partitioning)
- Rollback to prior versions to quickly correct issues and return tables to a known good state
- Advanced planning and filtering capabilities for high performance on large data volumes
- The full history is maintained within the Iceberg table format and without storage system dependencies
- Components
- Iceberg Catalog - Used to map table names to locations and must be able to support atomic operations to update referenced pointers if needed.
- Metadata Layer (with metadata files, manifest lists, and manifest files) - Stores the descriptive information about the constituent files for every snapshot/transaction
- e.g. table schema, configurations for the partitioning, etc.
- Data Layer - Associated with the raw data files
- Supports common industry-standard file formats, including Parquet, ORC and Avro
- Supported by major data lake engines including Dremio, Spark, Hive and Presto
- Queries on tables that do not use or save file-level metadata (e.g., Hive) typically involve costly list and scan operations
- Any application that can read Parquet files can use Iceberg tables and the Iceberg API to query more efficiently
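The snapshot idea behind the features above (atomic commits, time travel, rollback) can be sketched as a toy in-memory model. This is not the Iceberg API or file layout: `ToyTable`, `Snapshot`, and their methods are hypothetical, and the metadata files, manifest lists, and manifest files are collapsed into a single dictionary. The `current_id` field stands in for the pointer that the Iceberg catalog updates atomically.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset          # data files visible in this version of the table


@dataclass
class ToyTable:
    snapshots: dict = field(default_factory=dict)   # full history, never rewritten
    current_id: Optional[int] = None                # pointer a catalog would hold

    def files(self, snapshot_id: Optional[int] = None) -> frozenset:
        """Read the current snapshot, or an older one (time travel)."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        if sid is None:
            return frozenset()
        return self.snapshots[sid].files

    def commit(self, expected_id: Optional[int],
               add: Iterable[str] = (), remove: Iterable[str] = ()) -> int:
        """Publish a new snapshot only if nobody else committed first."""
        if expected_id != self.current_id:
            raise RuntimeError("concurrent commit detected; re-read and retry")
        new_files = (self.files() - frozenset(remove)) | frozenset(add)
        new_id = len(self.snapshots) + 1
        self.snapshots[new_id] = Snapshot(new_id, new_files)
        self.current_id = new_id   # single pointer swap makes the commit visible
        return new_id

    def rollback(self, snapshot_id: int) -> None:
        """Point the table back at a known good snapshot."""
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id


table = ToyTable()
v1 = table.commit(None, add={"a.parquet", "b.parquet"})
v2 = table.commit(v1, add={"c.parquet"}, remove={"a.parquet"})

print(sorted(table.files()))     # ['b.parquet', 'c.parquet']
print(sorted(table.files(v1)))   # ['a.parquet', 'b.parquet']  (time travel)

table.rollback(v1)
print(sorted(table.files()))     # ['a.parquet', 'b.parquet']
```

Readers always see one complete snapshot, and a writer holding a stale `expected_id` is rejected instead of silently overwriting another writer's change.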
- Comparison
Lakehouse
- The key idea behind a Lakehouse is to be able to take the best of a Data Lake and a Data Warehouse.
- Data Lakes can in fact provide a lot of flexibility (e.g. handle structured and unstructured data) and low storage cost.
- Data Warehouses can provide really good query performance and ACID guarantees.
- “Cloudflare R2 with Iceberg or Delta Lake and polars (automated with GitHub actions) is a free data lakehouse.”