Alex Merced || 2022-07-18
frontend || data engineering - data lake
As a Developer Advocate for Dremio I spend a lot of time researching technology and best practices around engineering Data Lakehouses and sharing what I learn through content for Subsurface - The Data Lakehouse Community. One of the major topics I’ve been diving deep into is Data Lakehouse Table Formats, which allow you to take the files in your data lake and group them into tables that data processing engines like Dremio can operate on.
What I’d like to do today is show you how to very quickly get a Docker container up and running so you can get hands-on and try Apache Iceberg with Spark. Do keep an eye out for an even more in-depth introduction on Subsurface. Before we get into our exercise, here is some content to help introduce you to Apache Iceberg and the world of Data Lakehouse table formats.
Introduction to Table Formats and Apache Iceberg
- Meetup: Comparison of Data Lakehouse Table Formats
- Meetup: Apache Iceberg and Architectural Look Under the Covers
- DataNation Podcast: Episode on Table Formats
Other Content on Apache Iceberg
- Blog: How to Maintain Apache Iceberg Tables
- Blog: Apache Iceberg’s Hidden Partitioning
- Blog: Migrating Apache Iceberg tables from Hive
- Blog: Hands-on Hive Migration Exercise
- Blog: Table Format Comparison (Iceberg, Hudi, Delta Lake)
- Blog: Table Format Comparison - Governance
- Blog: Table Format Comparison - Partitioning
Setting Up a Practice Environment
For this tutorial you do need to have Docker installed, as we will be using this Docker image I created for easy hands-on experimenting with Apache Iceberg, Apache Hudi and Delta Lake.
alexmerced/table-format-playground
You can get this up and running easily with the following command:
docker run -it --name format-playground alexmerced/table-format-playground
Note: This container was built on a 64-bit Linux machine, so the image may not work on an M1/ARM chipset. All you have to do is rebuild the image; you can find the Dockerfiles for this image in this repo.
Once the docker image is running you can easily open up Spark with any of the table formats with the following commands:
- iceberg-init - to open Spark Shell with Apache Iceberg configured
- hudi-init - to open Spark Shell with Apache Hudi configured
- delta-init - to open Spark Shell with Delta Lake configured
This blog will focus on Apache Iceberg, but feel free to play with the other table formats using their documentation.
Getting Hands On with Apache Iceberg
- Start the Docker container:
docker run -it --name format-playground alexmerced/table-format-playground
- Open Spark with Iceberg:
iceberg-init
Now we are inside SparkSQL, where we can run SQL statements against the Iceberg catalog that was configured by the iceberg-init script. If you are curious about the settings I used, you can run cat iceberg-init.bash back in the terminal.
Creating a Table in Iceberg
Keep in mind, we are not working with a traditional database but with a data lakehouse, so we are creating and reading files that would exist in your data lake storage (AWS/Azure/Google Cloud). It may still feel like working with a traditional database, and that is the beauty that table formats like Iceberg enable: working with files stored in our data lake the same way we work with data in a database or data warehouse.
To create a new Iceberg table we can just run the following command.
CREATE TABLE iceberg.cool_people (name string) using ICEBERG;
This looks like a regular, run-of-the-mill SQL statement (if unfamiliar with SQL, learn more here), but there are a few things to call out.
- iceberg.cool_people - in Spark we have to configure a name for our catalog of tables. In my script I called it “iceberg”, so iceberg.cool_people means I’m creating a table called cool_people in my catalog called iceberg.
- The using ICEBERG clause tells Spark to use Iceberg to create the table instead of its default of Hive.
Note: the time it takes to complete these statements may vary since we’re running them in a Docker container on a single computer. Spark is software meant to run across many computers in a cluster, so keep that in mind when working with Spark or any MPP (Massively Parallel Processing) tool on a single machine.
Adding Some Records
Run the following:
INSERT INTO iceberg.cool_people VALUES ("Claudio Sanchez"), ("Freddie Mercury"), ("Cedric Bixler");
Bonus points if you get the musical references.
Querying the Records
Run the following:
SELECT * FROM iceberg.cool_people;
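Because every write to an Iceberg table produces a new snapshot, you can also peek at the table’s history through Iceberg’s metadata tables, and even read the table as of an earlier snapshot. A quick sketch (the snapshot ID below is a placeholder you would copy from the first query’s output, and the VERSION AS OF syntax requires a recent Spark version; on older Spark you would pass a snapshot-id read option instead):

```sql
-- List the snapshots Iceberg has recorded for this table
SELECT snapshot_id, committed_at FROM iceberg.cool_people.snapshots;

-- Time travel: read the table as of a specific snapshot
-- (replace 1234567890 with a real snapshot_id from the query above)
SELECT * FROM iceberg.cool_people VERSION AS OF 1234567890;
```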
Ending the Session
- To quit out of SparkSQL:
exit;
- To quit out of the Docker container:
exit
If you want to use this container again in the future:
- docker start format-playground
- docker attach format-playground
Conclusion
Now you know how to quickly set yourself up so you can experiment with Apache Iceberg. Check out their docs for many of the great features that exist in Iceberg such as Time Travel, Hidden Partitioning, Partition Evolution, Schema Evolution, ACID transactions and more.
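Schema evolution, for example, is a metadata-only change in Iceberg, so you can try it right in this session. A small sketch (the band column is my own illustration, not part of the tutorial’s table):

```sql
-- Add a column: no data files are rewritten, only table metadata changes
ALTER TABLE iceberg.cool_people ADD COLUMN band string;

-- Existing rows read back with NULL for the new column
SELECT name, band FROM iceberg.cool_people;
```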
One of the best aspects of Iceberg is that so many tools are building support for it, including Dremio (which is also an Iceberg contributor). Dremio provides the following for working with Iceberg on its Dremio Cloud platform, which lets you create an open data lakehouse free of software/licensing costs, enabling companies of any size to start building sophisticated and open data pipelines:
- The ability to query Iceberg tables using the Sonar Query Engine
- Full Iceberg DML to run deletes/updates/upserts on your Iceberg tables from the Sonar Query Engine
- The ability to use a Nessie-based catalog for your Iceberg tables using the Arctic Intelligent Metastore (enables Git-like features on your data)
Keep an eye out for more in-depth Iceberg tutorials on the Subsurface website, and make sure to follow me on Twitter so you don’t miss any of my future Data Lakehouse content.
Note: for web development content, follow this Twitter account.