Alex Merced || 2024-10-31
Data Lakehouse || data lakehouse - data engineering - apache iceberg
- Free Copy of Apache Iceberg the Definitive Guide
- Free Apache Iceberg Crash Course
- Iceberg Lakehouse Engineering Video Playlist
Efficiently managing and analyzing data is essential for business success, and the data lakehouse architecture is leading the way in making this easier and more cost-effective. By combining the flexibility of data lakes with the structured performance of data warehouses, lakehouses offer a powerful solution for data storage, querying, and governance.
For this hands-on guide, we’ll dive into setting up a data lakehouse on your own laptop in just ten minutes using Dremio, Nessie, and Apache Iceberg. This setup will enable you to perform analytics on your data seamlessly and leverage a versioned, Git-like approach to data management with pre-configured storage buckets for simplicity.
Tools We’ll Use:
- Dremio: A lakehouse platform that organizes, documents, and queries data from databases, data warehouses, data lakes and lakehouse catalogs in a unified semantic layer, providing seamless access to data for analytics and reporting.
- Nessie: A transactional catalog that enables Git-like branching and merging capabilities for data, allowing for easier experimentation and version control.
- Apache Iceberg: A data lakehouse table format that turns your data lake into an ACID-compliant structure, supporting operations like time travel, schema evolution, and advanced partitioning.
By the end of this tutorial, you’ll be ready to set up a local lakehouse environment quickly, complete with sample data to explore. Let’s get started and see how easy it can be to work with Dremio and Apache Iceberg on your laptop!
Environment Setup
Before diving into the data lakehouse setup, let’s ensure your environment is ready. We’ll use Docker, a tool that allows you to run applications in isolated environments called “containers.” If you’re new to Docker, don’t worry—this guide will walk you through each step!
Step 1: Install Docker
- Download Docker: Go to docker.com and download Docker Desktop for your operating system (Windows, macOS, or Linux).
- Install Docker: Follow the installation instructions for your operating system. This will include some on-screen prompts to complete the installation process.
- Verify Installation: After installing Docker, open a terminal (Command Prompt, PowerShell, or a terminal app on Linux/macOS) and type:
docker --version
This command should display the version number if Docker is successfully installed.
Once Docker is installed and running, you’ll have the core tool needed to set up our data lakehouse.
Step 2: Create a Docker Compose File
With Docker installed, let’s move on to Docker Compose, a tool that helps you define and manage multiple containers with a single configuration file. We’ll use it to set up and start Dremio, Nessie, and MinIO (an S3-compatible storage solution). Docker Compose will also automatically create the storage “buckets” needed in MinIO, so you won’t need to configure them manually.
Open a Text Editor: Open any text editor (like VS Code, Notepad, or Sublime Text) and create a new file called docker-compose.yml in a new, empty folder. This file will contain all the configuration needed to launch our environment.
Add the Docker Compose Configuration: Copy the following code and paste it into the docker-compose.yml file:
version: "3"
services:
# Nessie Catalog Server Using In-Memory Store
nessie:
image: projectnessie/nessie:latest
container_name: nessie
networks:
- iceberg
ports:
- 19120:19120
# MinIO Storage Server
## Creates two buckets named lakehouse and lake
## tail -f /dev/null is to keep the container running
minio:
image: minio/minio:latest
container_name: minio
environment:
- MINIO_ROOT_USER=admin
- MINIO_ROOT_PASSWORD=password
networks:
- iceberg
ports:
- 9001:9001
- 9000:9000
command: ["server", "/data", "--console-address", ":9001"]
entrypoint: >
/bin/sh -c "
minio server /data --console-address ':9001' &
sleep 5 &&
mc alias set myminio http://localhost:9000 admin password &&
mc mb myminio/lakehouse &&
mc mb myminio/lake &&
tail -f /dev/null
"
# Dremio
dremio:
platform: linux/x86_64
image: dremio/dremio-oss:latest
ports:
- 9047:9047
- 31010:31010
- 32010:32010
container_name: dremio
environment:
- DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
networks:
- iceberg
networks:
iceberg:
Explanation of the Code:
This file defines three services:
- nessie (the catalog)
- minio (the storage server)
- dremio (the query engine).
Each service has specific network settings, ports, and configurations to allow them to communicate with each other.
Step 3: Start Your Environment
With your docker-compose.yml file saved, it’s time to start your data lakehouse environment!
Open a Terminal: Navigate to the folder where you saved the docker-compose.yml file.
Run Docker Compose: In your terminal, type:
docker-compose up -d
This command tells Docker to start each of the services specified in docker-compose.yml and run them in the background (the -d flag).
Wait for Setup to Complete: It may take a few minutes for all services to start. You’ll see a lot of text in your terminal as each service starts up. When you see lines indicating that each service is “running,” the setup is complete.
Step 4: Verify Each Service is Running
Now that the environment is up, let’s verify that each service is accessible:
- Dremio: Open a web browser and go to http://localhost:9047. You should see a Dremio login screen.
- MinIO: In a new browser tab, go to http://localhost:9001. Log in with the username admin and password password. You should see the MinIO console, where you can view storage “buckets.”
Step 5: Optional - Shutting Down the Environment
When you’re done with the setup and want to stop the services, simply open a terminal in the same folder where you created the docker-compose.yml
file and run:
docker-compose down -v
This command will stop and remove all containers, so you can start fresh next time.
The -v
flag removes any volumes associated with the containers, which is important if you want to start fresh next time.
Congratulations! You now have a fully functional data lakehouse environment running on your laptop. In the next section, we’ll connect Dremio to Nessie and MinIO and start creating and querying tables.
Getting Started with Dremio: Connecting the Nessie and MinIO Sources
Now that Dremio is up and running, let’s connect it to our MinIO buckets, lakehouse
and lake
, which will act as the main data sources in our local lakehouse environment. This section will guide you through connecting both the Nessie catalog (using the lakehouse
bucket) and a general S3-like data lake connection (using the lake
bucket) in Dremio.
Step 1: Adding the Nessie Source in Dremio
-
Open Dremio: In your web browser, navigate to http://localhost:9047 to access the Dremio UI.
-
Add the Nessie Source:
- Click on the “Add Source” button in the bottom left corner of the Dremio interface.
- Select Nessie from the list of available sources.
-
Configure the Nessie Source:
- You’ll need to fill out both the General and Storage settings as follows:
General Settings:
- Name: Set the source name to
lakehouse
. - Endpoint URL: Enter the Nessie API endpoint URL:
http://nessie:19120/api/v2
- Authentication: Select None (no additional credentials are required).
Storage Settings:
- Access Key: Set to
admin
(MinIO username). - Secret Key: Set to
password
(MinIO password). - Root Path: Set to
lakehouse
(this is the bucket where our Iceberg tables will be stored).
Connection Properties:
- fs.s3a.path.style.access: Set this to
true
. - fs.s3a.endpoint: Set to
minio:9000
. - dremio.s3.compat: Set to
true
. - Encrypt Connection: Uncheck this option since we’re running Nessie locally on HTTP.
-
Save the Source: Once all settings are configured, click Save. The
lakehouse
source will now be connected in Dremio, allowing you to browse and query tables stored in the Nessie catalog.
Step 2: Adding MinIO as an S3 Source in Dremio (Data Lake Connection)
In addition to Nessie, we’ll set up a general-purpose data lake connection using the lake
bucket in MinIO. This bucket can store non-Iceberg table data, making it suitable for raw data or other types of files. So if you wanted to upload CSV, JSON, XLS or Parquet files you can put them in the “lake” bucket and view them from this source in Dremio.
-
Add an S3 Source:
- Click the “Add Source” button again and select S3 from the list of sources.
-
Configure the S3 Source for MinIO:
- Use the following settings to connect the
lake
bucket as a secondary source.
General Settings:
- Name: Set the source name to
lake
. - Credentials: Choose AWS access key.
- Access Key: Set to
admin
(MinIO username). - Secret Key: Set to
password
(MinIO password). - Encrypt Connection: Uncheck this option since MinIO is running locally.
Advanced Options:
- Enable Compatibility Mode: Set to
true
to ensure compatibility with MinIO. - Root Path: Set to
/lake
(the bucket name for general storage).
Connection Properties:
- fs.s3a.path.style.access: Set this to
true
. - fs.s3a.endpoint: Set to
minio:9000
.
- Use the following settings to connect the
-
Save the Source: After filling out the configuration, click Save. The
lake
bucket is now accessible in Dremio, and you can query the raw data stored in this bucket.
Next Steps
With both sources connected, you now have access to structured, versioned data in the lakehouse
bucket and general-purpose data in the lake
bucket. In the next section, we’ll explore creating and querying Apache Iceberg tables in Dremio to see how easy it is to get started with data lakehouse workflows.
Running Transactions on Apache Iceberg Tables and Inspecting the Storage
With our environment set up and sources connected, we’re ready to perform some transactions on an Apache Iceberg table in Dremio. After creating and inserting data, we’ll inspect MinIO to see how Dremio stores files in the lakehouse
bucket. Additionally, we’ll make a curl
request to Nessie to check the catalog state, confirming our transactions.
Step 1: Creating an Iceberg Table in Dremio
-
Open the SQL Editor in Dremio:
- In the Dremio UI, select SQL Runner from the menu on the left.
-
Set the Context to Nessie:
- In the SQL editor, click on Context (top right of the editor) and set it to our Nessie source
lakehouse
. If you don’t do this then you’ll need to include fully qualified table names in your queries likelakehouse.customers
.
- In the SQL editor, click on Context (top right of the editor) and set it to our Nessie source
-
Create an Iceberg Table:
- Run the following SQL to create a new table named
customers
in thelakehouse
bucket:CREATE TABLE customers ( id INT, first_name VARCHAR, last_name VARCHAR, age INT ) PARTITION BY (truncate(1, last_name));
- This SQL creates an Apache Iceberg table with a partition on the first letter of
last_name
. The partitioning is handled by Apache Iceberg’s Hidden Partitioning feature, which allows for advanced partitioning without additional columns in the schema.
- Run the following SQL to create a new table named
-
Insert Data into the Table:
- Now, add some sample data to the
customers
table:INSERT INTO customers (id, first_name, last_name, age) VALUES (1, 'John', 'Doe', 28), (2, 'Jane', 'Smith', 34), (3, 'Alice', 'Johnson', 22), (4, 'Bob', 'Williams', 45), (5, 'Charlie', 'Brown', 30);
- This will insert five records into the
customers
table, each automatically stored and partitioned in thelakehouse
bucket.
- Now, add some sample data to the
Step 2: Inspecting Files in MinIO
With data inserted into the customers
table, let’s take a look at MinIO to verify the files were created as expected.
-
Open MinIO:
- Go to http://localhost:9001 in your browser, and log in with:
- Username:
admin
- Password:
password
- Username:
- Go to http://localhost:9001 in your browser, and log in with:
-
Navigate to the
lakehouse
Bucket:- From the MinIO dashboard, click on Buckets and select the
lakehouse
bucket. - Inside the
lakehouse
bucket, you should see a directory for thecustomers
table. - Browse through the folders to locate the partitioned files based on the
last_name
column. You’ll find subfolders that store the data by partition, along with metadata files that track the state of the table.
- From the MinIO dashboard, click on Buckets and select the
This inspection verifies that Dremio is writing data to the lakehouse
bucket in Apache Iceberg format, which organizes the data into Parquet files and metadata files.
Step 3: Checking the State of the Nessie Catalog with curl
Now, let’s make a curl
request to the Nessie catalog to confirm that the customers
table was created successfully and that its metadata is stored correctly.
-
Open a Terminal:
- In your terminal, run the following command to view the contents of the main branch in Nessie:
curl -X GET "http://localhost:19120/api/v2/trees/main/entries"
- This command retrieves a list of all entries (tables) in the
main
branch of the Nessie catalog.
- In your terminal, run the following command to view the contents of the main branch in Nessie:
-
Review the Response:
- The JSON response will contain details about the
customers
table. You should see an entry indicating the presence ofcustomers
in the catalog, confirming that the table is tracked in Nessie.
- The JSON response will contain details about the
-
Inspect Specific Commit History (Optional):
- To view the specific commit history for transactions on this branch, you can run:
curl -X GET "http://localhost:19120/api/v2/trees/tree/main/log" \ -H "Content-Type: application/json"
- This command shows a log of all changes made on the
main
branch, providing a Git-like commit history for your data transactions.
- To view the specific commit history for transactions on this branch, you can run:
Next Steps
Now that you have verified your transactions and inspected the storage, you can confidently work with Apache Iceberg tables in Dremio, knowing that both the data and metadata are tracked in the Nessie catalog and accessible in MinIO. In the next section, we’ll explore making additional table modifications, like updating partitioning rules, and see how Apache Iceberg handles these changes seamlessly.
Modifying the Apache Iceberg Table Schema and Partitioning
With our initial customers
table set up in Dremio, we can take advantage of Apache Iceberg’s flexibility to make schema and partition modifications without requiring a data rewrite. In this section, we’ll add a new column to the table, adjust partitioning, and observe how these changes reflect in MinIO and the Nessie catalog.
Step 1: Adding a New Column
Suppose we want to add a new column to store customer email addresses. We can easily update the table schema with the following ALTER TABLE
statement:
-
Open the SQL Editor in Dremio:
- Navigate back to the SQL Runner.
-
Add the Column:
- Run the following SQL to add an
email
column to thecustomers
table:ALTER TABLE customers ADD COLUMNS (email VARCHAR);
- This command adds the
email
column to the existing table without affecting the existing data.
- Run the following SQL to add an
-
Verify the Column Addition:
- After running the command, you can confirm the addition by querying the
customers
table in Dremio:SELECT * FROM customers;
- You’ll see an
email
column now appears, ready for data to be added.
- After running the command, you can confirm the addition by querying the
Step 2: Updating Partitioning Rules
Iceberg allows for flexible partitioning rules through Partition Evolution, meaning we can change how data is partitioned without rewriting all existing data. Let’s add a new partition rule that organizes data based on the first letter of the first_name
as well.
-
Add a Partition Field:
- To partition data by the first letter of
first_name
, use the following SQL:ALTER TABLE customers ADD PARTITION FIELD truncate(1, first_name);
- This command instructs Iceberg to partition any new data by both the first letters of
last_name
andfirst_name
.
- To partition data by the first letter of
-
Insert Additional Data to Test the New Partitioning:
- Let’s insert some more records to see how the new partition structure organizes the data:
INSERT INTO customers (id, first_name, last_name, age, email) VALUES (6, 'Emily', 'Adams', 29, 'emily.adams@example.com'), (7, 'Frank', 'Baker', 35, 'frank.baker@example.com'), (8, 'Grace', 'Clark', 41, 'grace.clark@example.com');
- This data will be partitioned according to both
first_name
andlast_name
, following the new rules we set.
- Let’s insert some more records to see how the new partition structure organizes the data:
Step 3: Inspect the New Partitions in MinIO
-
Open MinIO and navigate to the
lakehouse
bucket:- Go to http://localhost:9001, and log in with:
- Username:
admin
- Password:
password
- Username:
- Go to http://localhost:9001, and log in with:
-
Locate the Updated
customers
Folder:- Within the
lakehouse
bucket, locate thecustomers
table folder. - Open the folder structure to view the newly created subfolders, representing the partitioning by
last_name
andfirst_name
that we configured. You should see the additional folders and Parquet files for each new partition based onfirst_name
.
- Within the
Step 4: Confirm the Changes in Nessie with curl
Finally, let’s make a curl
request to the Nessie catalog to verify that the schema and partitioning changes are recorded in the catalog’s metadata.
- Open a Terminal and run the following command to check the schema:
curl -X GET "http://localhost:19120/api/v2/trees/main/history"
This will return a JSON response listing the recent commits to the main
branch, including the schema and partitioning updates.
Summary
We’ve successfully modified the schema and partitioning of an Apache Iceberg table in Dremio, and we can observe these changes directly in MinIO’s file structure and the Nessie catalog’s metadata. This example demonstrates the flexibility of Iceberg in managing evolving data schemas and partitioning strategies in real-time, without requiring downtime or data rewrites. In the next section, we’ll explore how to utilize Iceberg’s version control capabilities for branching and merging datasets within the Nessie catalog.
Branching and Merging with Nessie: Version Control for Data
One of the powerful features of using Nessie with Apache Iceberg is its Git-like branching and merging functionality. Branching allows you to create isolated environments for data modifications, which can then be merged back into the main branch once verified. This section will walk you through creating a branch, performing data modifications within that branch, and then merging those changes back to the main branch.
Step 1: Creating a Branch
Let’s start by creating a new branch in Nessie. This branch will allow us to perform data transactions without impacting the main data branch, ideal for testing and experimenting.
-
Open the SQL Editor in Dremio.
-
Create a New Branch:
- Run the following SQL to create a new branch named
development
in thelakehouse
catalog:CREATE BRANCH development IN lakehouse;
- This command creates a new branch in the Nessie catalog, providing an isolated environment for data changes.
- Run the following SQL to create a new branch named
-
Switch to the Development Branch:
- Now, let’s set our context to the
development
branch either using the context selector or using the following sql before any queries so that any changes we make only affect this branch:USE BRANCH development IN lakehouse;
- Now, let’s set our context to the
Step 2: Performing Data Modifications on the Branch
With the development
branch active, let’s modify the customers
table by adding new data. This data will remain isolated on the development
branch until we choose to merge it back to main
.
-
Insert Additional Records:
- Run the following SQL to add new entries to the
customers
table (make sure to either use the context selector or use theuse branch
sql before any queries so that any changes we make only affect this branch):INSERT INTO customers (id, first_name, last_name, age, email) VALUES (9, 'Holly', 'Grant', 31, 'holly.grant@example.com'), (10, 'Ian', 'Young', 27, 'ian.young@example.com'), (11, 'Jack', 'Diaz', 39, 'jack.diaz@example.com');
- These records are added to the
customers
table on thedevelopment
branch only, meaning they won’t affect the main branch until merged.
- Run the following SQL to add new entries to the
-
Verify the Records in the Development Branch:
- You can verify the new records by running:
SELECT * FROM customers AT BRANCH development; SELECT * FROM customers AT BRANCH main;
- This query will display the data, including the recently inserted records, as it is within the context of the
development
andmain
branches.
- You can verify the new records by running:
Step 3: Merging Changes Back to the Main Branch
Once satisfied with the changes in development
, we can merge the development
branch back into main
, making these records available to all users accessing the main branch.
-
Switch to the Main Branch:
- First, change the context back to the
main
branch:USE BRANCH main IN lakehouse;
- First, change the context back to the
-
Merge the Development Branch:
- Now, merge the
development
branch intomain
using the following SQL:MERGE BRANCH development INTO main IN lakehouse;
- This command brings all changes from
development
intomain
, adding the new records to the main version of thecustomers
table.
- Now, merge the
-
Verify the Merge:
- To confirm the records are now in
main
, run:SELECT * FROM customers AT BRANCH main;
- You should see all records, including those added in the
development
branch, are now present in themain
branch.
- To confirm the records are now in
Step 4: Verifying the Branching Activity in Nessie with curl
You can use curl
commands to check the branch status and view commit logs in Nessie, providing additional validation of the branching and merging activity.
-
List Branches:
- Run the following
curl
command to list all branches in thelakehouse
catalog:curl -X GET "http://localhost:19120/api/v2/trees/"
- The response will include the
main
anddevelopment
branches, confirming the branch creation.
- Run the following
-
Check the Commit Log:
-
To view a log of commits, including the merge from
development
tomain
, run:curl -X GET "http://localhost:19120/api/v2/trees/main/history" curl -X GET "http://localhost:19120/api/v2/trees/development/history"
-
This log will show each commit, giving you a clear view of data versioning over time.
-
Summary
Branching and merging in Nessie allows you to safely experiment with data modifications in an isolated environment, integrating those changes back into the main dataset only when ready. This workflow is invaluable for testing data updates, creating data snapshots, or managing changes for compliance purposes. In the next section, we’ll explore how to use Nessie tags to mark important states in your data, further enhancing data version control.
Tagging Important States with Nessie: Creating Data Snapshots
In addition to branching, Nessie also offers the ability to tag specific states of your data, making it easy to create snapshots at critical moments. Tags allow you to mark key data versions—such as a quarterly report cutoff or pre-migration data state—so you can refer back to them if needed.
In this section, we’ll walk through creating tags in Nessie to capture the current state of the data and explore how to use tags for historical analysis or recovery.
Step 1: Creating a Tag
Let’s create a tag on the main
branch to mark an important point in the dataset, such as the completion of initial data loading. This tag will serve as a snapshot that we can return to if necessary.
- Open the SQL Editor in Dremio.
- Create a Tag:
- Run the following SQL command to create a tag called
initial_load
on themain
branch:CREATE TAG initial_load AT BRANCH main IN lakehouse;
- This tag marks the state of all tables in the
lakehouse
catalog on themain
branch at the current moment, capturing the data exactly as it is now.
- Run the following SQL command to create a tag called
Step 2: Modifying the Data on the Main Branch
To understand the usefulness of tags, let’s make a few changes to the customers
table on the main
branch. Later, we can use the tag to compare or even restore to the original dataset state if needed.
-
Insert Additional Records:
- Add some new data to the
customers
table to simulate further data processing:INSERT INTO customers (id, first_name, last_name, age, email) VALUES (12, 'Kate', 'Morgan', 45, 'kate.morgan@example.com'), (13, 'Luke', 'Rogers', 33, 'luke.rogers@example.com');
- Add some new data to the
-
Verify Changes:
- Run the following query to confirm that the new records have been added:
SELECT * FROM customers;
- Run the following query to confirm that the new records have been added:
Step 3: Accessing Data from a Specific Tag
Tags in Nessie allow you to view the dataset as it was at the time the tag was created. To access the data at the initial_load
state, we can specify the tag as the reference point in our queries.
-
Query the Data Using the Tag:
- Use the following SQL command to switch to the
initial_load
tag and view the dataset as it was at that point:USE TAG initial_load IN lakehouse; SELECT * FROM customers;
- This query will display the
customers
table as it was when theinitial_load
tag was created, without the new records that were added afterward.
- Use the following SQL command to switch to the
-
Return to the Main Branch:
- Once you are done exploring the
initial_load
state, switch back to themain
branch to continue working with the latest data:USE BRANCH main IN lakehouse;
- Once you are done exploring the
Step 4: Verifying the Tag Creation with curl
To verify the tag’s existence in the Nessie catalog, we can make a curl
request to list all tags, including initial_load
.
-
List Tags:
- Run the following
curl
command to retrieve all tags in thelakehouse
catalog:curl -X GET "http://localhost:19120/api/v2/trees/tags" \ -H "Content-Type: application/json"
- The JSON response will list all tags, including the
initial_load
tag you created.
- Run the following
-
Review Tag Details (Optional):
- To get detailed information about the
initial_load
tag, including its exact commit reference, you can use:curl -X GET "http://localhost:19120/api/v2/trees/tags/initial_load" \ -H "Content-Type: application/json"
- To get detailed information about the
Summary
Tags in Nessie provide a reliable way to snapshot important states of your data. By creating tags at critical points, you can easily access previous states of your data, helping to support data auditing, historical reporting, and data recovery. In the next section, we’ll cover querying the Apache Iceberg Metadata tables.
Exploring Iceberg Metadata Tables in Dremio
Iceberg metadata tables offer insights into the underlying structure and evolution of your data. These tables contain information about data files, snapshots, partition details, and more, allowing you to track changes, troubleshoot issues, and optimize queries. Dremio makes querying Iceberg metadata simple, giving you valuable context on your data lakehouse.
In this section, we’ll explore the following Iceberg metadata tables:
table_files
: Lists data files and their statistics.table_history
: Displays historical snapshots.table_manifests
: Shows metadata about manifest files.table_partitions
: Provides details on partitions.table_snapshot
: Shows information on each snapshot.
Step 1: Querying Data File Metadata with table_files
The table_files
metadata table provides details on each data file in the table, such as the file path, size, record count, and more. This is useful for understanding storage distribution and optimizing queries.
- Query the Data Files:
- Run the following SQL command to retrieve data file information for the
customers
table:SELECT * FROM TABLE(table_files('customers'));
- You’ll see results with columns like
file_path
,file_size_in_bytes
,record_count
, and more, giving insights into each file’s specifics.
- Run the following SQL command to retrieve data file information for the
Step 2: Exploring Table History with table_history
Iceberg tracks the history of a table’s snapshots, which allows you to review past states or even perform time-travel queries. The table_history
table displays each snapshot’s ID and timestamp.
- Query the Table History:
- Use the following SQL to retrieve the history of the
customers
table:SELECT * FROM TABLE(table_history('customers'));
- This query will return a list of snapshots, showing when each snapshot was created (
made_current_at
), thesnapshot_id
, and anyparent_id
linking to previous snapshots.
- Use the following SQL to retrieve the history of the
Step 3: Analyzing Manifests with table_manifests
Manifest files are metadata files in Iceberg that track changes in data files. The table_manifests
table lets you inspect details like the number of files added or removed per snapshot, helping you monitor data evolution and resource usage.
- Query Manifest Metadata:
- Run the following SQL to view manifest metadata for the
customers
table:SELECT * FROM TABLE(table_manifests('customers'));
- The results will include fields like
path
,added_data_files_count
, anddeleted_data_files_count
, which show how each manifest contributes to the table’s state.
- Run the following SQL to view manifest metadata for the
Step 4: Reviewing Partition Information with table_partitions
The table_partitions
table provides details on each partition in the table, including the number of records and files in each partition. This helps with understanding how data is distributed across partitions and can be used to fine-tune partitioning strategies.
- Query Partition Statistics:
- Run the following query to get partition statistics for the
customers
table:SELECT * FROM TABLE(table_partitions('customers'));
- You’ll see fields such as
partition
,record_count
, andfile_count
, which show the breakdown of data across partitions, helping identify skewed partitions or performance bottlenecks.
- Run the following query to get partition statistics for the
Step 5: Examining Snapshots with table_snapshot
The table_snapshot
table provides a summary of each snapshot, including the operation (e.g., append
, overwrite
), the commit timestamp, and any manifest files associated with the snapshot.
- Query Snapshot Information:
- Run the following SQL to see snapshot details for the
customers
table:SELECT * FROM TABLE(table_snapshot('customers'));
- The result will include fields like
committed_at
,operation
, andsummary
, providing a high-level view of each snapshot and its impact on the table.
- Run the following SQL to see snapshot details for the
Using Metadata for Time-Travel Queries
The Iceberg metadata tables also support time-travel queries, enabling you to query the data as it was at a specific snapshot or timestamp. This can be especially useful for auditing, troubleshooting, or recreating analysis from past data states.
- Perform a Time-Travel Query:
- Suppose you want to view the data in the
customers
table at a specific snapshot. First, retrieve thesnapshot_id
using thetable_history
ortable_snapshot
table. - Then, run a query like the following to access data at that snapshot:
SELECT * FROM customers AT SNAPSHOT '<snapshot_id>';
- Replace
<snapshot_id>
with the ID from the metadata tables to view the data as it was at that specific point.
- Suppose you want to view the data in the
Summary
Iceberg metadata tables in Dremio provide a wealth of information on table structure, partitioning, and versioning. These tables are essential for monitoring table evolution, diagnosing performance issues, and executing advanced analytics tasks like time travel.
Conclusion
Congratulations! You’ve just set up a powerful data lakehouse environment on your laptop with Apache Iceberg, Dremio, and Nessie, and explored hands-on techniques for managing and analyzing data. By leveraging the strengths of these open-source tools, you now have the flexibility of data lakes with the performance and reliability of data warehouses—right on your local machine.
From creating and querying Iceberg tables to managing branches and snapshots with Nessie’s Git-like controls, you’ve seen how this stack can simplify complex data workflows. Using Dremio’s intuitive interface, you connected sources, ran queries, explored metadata, and learned how to use Iceberg’s versioning and partitioning capabilities for powerful insights. Iceberg metadata tables also provide detailed information on data structure, making it easy to track changes, optimize storage, and even run time-travel queries.
This hands-on setup is just the beginning. As your data grows, you can explore Dremio’s cloud deployment options and advanced features like reflections and incremental refreshes for scaling analytics. By mastering this foundational environment, you’re well-prepared to build efficient, scalable data lakehouse solutions that balance data accessibility, cost savings, and performance.
If you enjoyed this experience, consider diving deeper into Dremio Cloud or exploring further capabilities with Iceberg and Nessie by deploying a self-managed single node instance. Happy querying!