Alex Merced || 2024-10-05
Data Lakehouse || data lakehouse - data engineering - apache iceberg
This article is a comprehensive directory of Apache Iceberg resources, including educational materials, tutorials, and hands-on exercises. Whether you’re a beginner or an experienced data engineer, this guide will help you navigate the world of Apache Iceberg and its applications.
Apache Iceberg?
What is Apache Iceberg?
Apache Iceberg is an open-source data lakehouse table format. That means it is a standard for how the metadata that defines a group of files as a table is stored. This metadata lets any tool that supports the standard read and write those files as it would a table in a data warehouse, with the same features and ACID guarantees.
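As a rough illustration of the idea that metadata, rather than a directory listing, defines which files make up a table, here is a toy Python model. It is a sketch only and not Iceberg's actual metadata layout or API: the class names (`TableMetadata`, `Snapshot`), the method names, and the file paths are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One committed version of the table: the full set of data files it contains."""
    snapshot_id: int
    data_files: tuple[str, ...]


@dataclass
class TableMetadata:
    """Toy stand-in for table metadata: a schema plus a history of snapshots."""
    schema: dict[str, str]
    snapshots: list[Snapshot] = field(default_factory=list)

    def current_files(self) -> tuple[str, ...]:
        # Readers ask the metadata which files belong to the table.
        return self.snapshots[-1].data_files if self.snapshots else ()

    def commit_append(self, new_files: list[str]) -> Snapshot:
        # A write adds a new snapshot; earlier snapshots are left unchanged,
        # so a reader holding an older snapshot still sees a consistent file set.
        snap = Snapshot(len(self.snapshots) + 1, self.current_files() + tuple(new_files))
        self.snapshots.append(snap)
        return snap


# Hypothetical table with two appends.
table = TableMetadata(schema={"id": "long", "name": "string"})
first = table.commit_append(["data/file-a.parquet"])
table.commit_append(["data/file-b.parquet"])

print(table.current_files())  # files in the latest snapshot
print(first.data_files)       # files as of the first snapshot
```

Because every tool reads the same metadata to find the table's files, they all agree on the table's contents at any given snapshot, which is what makes a single shared copy of the data workable.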
Why Does it Matter?
- By operating on tables in a separate storage layer, you can use all your favorite analytical tools on a single copy of your data.
- Reducing the number of copies needed can reduce the compute, storage, and network costs of your overall data platform.
- Storing your data in a standard format reduces future migration costs when changing tooling or adopting new tools.
Who does Apache Iceberg benefit?
- Data Engineers, since less data movement means fewer data pipelines to manage.
- Data Analysts, since fewer data movements mean more immediate access to data, especially when paired with the data virtualization available in tools like Dremio, which allows for Lakehouse Querying and Federated Querying (Virtualization) on one platform.
- Data Scientists, because they can also have more immediate data access when training their AI/ML models.
- Data Leaders, since they can reduce their overall platform costs, making it easier to fund other data initiatives.
Apache Iceberg Directory
Apache Iceberg Education
Here is a list of resources to help you learn Apache Iceberg:
- Apache Iceberg Crash Course: What is a Data Lakehouse and a Table Format?
- Free Copy of Apache Iceberg the Definitive Guide
- Free Apache Iceberg Crash Course
- Iceberg Lakehouse Engineering Video Playlist
Apache Iceberg Hands-on Tutorials
Here is a list of hands-on tutorials that will help you get started with Apache Iceberg:
- Hands-on Intro with Apache Iceberg
- Intro to Apache Iceberg, Nessie and Dremio on your Laptop
- JSON/CSV/Parquet to Apache Iceberg to BI Dashboard
- From MongoDB to Apache Iceberg to BI Dashboard
- From SQLServer to Apache Iceberg to BI Dashboard
- From Postgres to Apache Iceberg to BI Dashboard
- Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT
- Elasticsearch to Apache Iceberg to BI Dashboard
- MySQL to Apache Iceberg to BI Dashboard
- Apache Druid to Apache Iceberg to BI Dashboard
- BI Dashboards with Apache Iceberg Using AWS Glue and Apache Superset
- End-to-End Basic Data Engineering Tutorial (Spark, Apache Iceberg, Dremio, Superset)
Apache Iceberg’s Architecture
Here is a list of resources to help you learn Apache Iceberg’s architecture and internals:
- The Life of a Read Query for Apache Iceberg Tables
- The Life of a Write Query for Apache Iceberg Tables
- Understanding Apache Iceberg’s Metadata.json
- Understanding the Apache Iceberg Manifest List (Snapshot)
- Understanding the Apache Iceberg Manifest
- Understanding Apache Iceberg Delete Files
- Puffins and Icebergs: Additional Stats for Apache Iceberg Tables
- How Apache Iceberg is Built for Open Optimized Performance
- Ensuring High Performance at Any Scale with Apache Iceberg’s Object Store File Layout
- Row-Level Changes on the Lakehouse: Copy-On-Write vs. Merge-On-Read in Apache Iceberg
- ACID Guarantees and Apache Iceberg: Turning Any Storage into a Data Warehouse
- Apache Iceberg Reliability
Getting Data into Apache Iceberg
Here is a list of resources to help you get data into Apache Iceberg:
- 8 Tools For Ingesting Data Into Apache Iceberg
- Event Based Ingestion for Apache Iceberg Tables
- Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg
- How to Create a Lakehouse with Airbyte, S3, Apache Iceberg, and Dremio
- How to Convert JSON Files Into an Apache Iceberg Table with Dremio
- How to Convert CSV Files into an Apache Iceberg table with Dremio
- Ingesting Data into Apache Iceberg using Fivetran
Apache Iceberg Migration
Here is a list of resources to help you migrate your data to Apache Iceberg:
- Migration Guide for Apache Iceberg Lakehouses
- Apache XTable: Converting Between Apache Iceberg, Delta Lake, and Apache Hudi
- 3 Ways to Convert a Delta Lake Table Into an Apache Iceberg Table
- How to Migrate a Hive Table to an Iceberg Table
- Migrating a Hive Table to an Iceberg Table Hands-on Tutorial
Streaming with Apache Iceberg
Here is a list of resources to help you stream data into Apache Iceberg:
- A Guide to Change Data Capture (CDC) with Apache Iceberg
- Apache Kafka to Apache Iceberg to Dremio
- Streaming and Batch Data Lakehouses with Apache Iceberg, Dremio and Upsolver
- Using Flink with Apache Iceberg and Nessie
- Streaming Data into Apache Iceberg Tables Using AWS Kinesis and AWS Glue
- Adapting Iceberg for high-scale streaming data
Partitioning with Apache Iceberg
Here is a list of resources to help you learn how to partition your data with Apache Iceberg:
- Simplifying Your Partition Strategies with Dremio Reflections and Apache Iceberg
- Partition Evolution: Future-Proof Partitioning and Fewer Table Rewrites with Apache Iceberg
- Fewer Accidental Full Table Scans Brought to You by Apache Iceberg’s Hidden Partitioning
Maintaining and Auditing Apache Iceberg Tables
- Guide to Maintaining an Apache Iceberg Lakehouse
- Compaction in Apache Iceberg: Fine-Tuning Your Iceberg Table’s Data Files
- Leveraging Apache Iceberg Metadata Tables in Dremio for Effective Data Lakehouse Auditing
- What is DataOps? Automating Data Management on the Apache Iceberg Lakehouse
- How Z-Ordering in Apache Iceberg Helps Improve Performance
- Maintaining Iceberg Tables – Compaction, Expiring Snapshots, and More
Apache Iceberg Catalogs
Here is a list of resources to help you learn about Apache Iceberg Catalogs:
- The Evolution of Apache Iceberg Catalogs
- Introducing the Apache Iceberg Catalog Migration Tool
- What Iceberg REST Catalog Is and Isn’t
- Why Thinking about Apache Iceberg Catalogs Like Nessie and Apache Polaris (incubating) Matters
- Using Nessie’s REST Catalog Support for Working with Apache Iceberg Tables
- The Nessie Ecosystem and the Reach of Git for Data for Apache Iceberg
- Introduction to Apache Polaris (incubating) Data Catalog
- Understanding the Polaris Iceberg Catalog and Its Architecture
- Getting Hands-on with Snowflake Managed Polaris
- Getting Hands-on with Polaris OSS, Apache Iceberg and Apache Spark
- The Importance of Versioning in Modern Data Platforms: Catalog Versioning with Nessie vs. Code Versioning with dbt
Querying Apache Iceberg Tables
Here is a list of resources to help you query your Apache Iceberg tables:
- Query Iceberg Tables on MinIO with Dremio
- Run Graph Queries on Apache Iceberg Tables with Dremio & Puppygraph
Hybrid Apache Iceberg Lakehouses
Here is a list of resources about implementing hybrid on-premises and cloud Apache Iceberg lakehouses:
- 3 Reasons to Create Hybrid Apache Iceberg Data Lakehouses
- Hybrid Iceberg Lakehouse Storage Solutions: NetApp
- Hybrid Iceberg Lakehouse Storage Solutions: MinIO
- Hybrid Iceberg Lakehouse Infrastructure Solutions: VAST Data
- Hybrid Lakehouse Storage Solutions: Pure Storage
Apache Iceberg and Other Formats
Here is a list of resources about Apache Iceberg and other formats (Apache Hudi, Apache Paimon, Delta Lake):
- Comparing Apache Iceberg to Other Data Lakehouse Solutions
- Exploring the Architecture of Apache Iceberg, Delta Lake, and Apache Hudi
- Comparison of Data Lake Table Formats (Apache Iceberg, Apache Hudi and Delta Lake)
- Table Format Partitioning Comparison: Apache Iceberg, Apache Hudi, and Delta Lake
- Table Format Governance and Community Contributions: Apache Iceberg, Apache Hudi, and Delta Lake
Python and Apache Iceberg
Here is a list of resources about Apache Iceberg and Python:
- 3 Ways to Use Python with Apache Iceberg
- PyIceberg Docs
- Hands-on with Apache Iceberg Tables using PyIceberg using Nessie and Minio
Governing Apache Iceberg Tables
- Apache Iceberg and the Right to Be Forgotten
- A Brief Guide to the Governance of Apache Iceberg Tables
Miscellaneous Apache Iceberg Resources
Here is a list of miscellaneous resources to help you learn Apache Iceberg:
- Introduction to the Iceberg Data Lakehouse
- The Iceberg Lakehouse: Key Benefits for Your Business
- Evolving the Data Lake: From CSV/JSON to Parquet to Apache Iceberg
- Data Sharing of Apache Iceberg tables and other data in the Dremio Lakehouse
- The Value of Dremio’s Semantic Layer and The Apache Iceberg Lakehouse to the Snowflake User
- The Who, What and Why of Data Reflections and Apache Iceberg for Query Acceleration
- How Apache Iceberg, Dremio and Lakehouse Architecture can optimize your Cloud Data Platform Costs
- Dremio’s Commitment to being the Ideal Platform for Apache Iceberg Data Lakehouses
- Open Source and the Data Lakehouse: Apache Arrow, Apache Iceberg, Nessie and Dremio
- The Why and How of Using Apache Iceberg on Databricks
- Deep Dive Into Configuring Your Apache Iceberg Catalog with Apache Spark
- Connecting Tableau to Apache Iceberg Tables with Dremio
- Apache Iceberg 101
- Apache Iceberg FAQ
- Why Data Analysts, Engineers, Architects and Scientists Should Care about Dremio and Apache Iceberg
- Data Lake Mysteries Unveiled: Nessie, Dremio, and MinIO Make Waves