Alex Merced || 2023-05-10T12:12:03.284Z

data engineering || data engineering - data lakehouse - dremio

Data quality, governance, observability, and disaster recovery are areas where best practices are still being worked out in the world of the data lakehouse. A rising trend borrows the practices that software developers use to manage these same issues in code bases. This trend is called “Data as Code”. The practices it aims to bring to the lakehouse include:

  • Versioning to enable isolating work on branches or marking particular reproducible states through tagging
  • Commits to enable time travel and rollbacks
  • The ability to use branching and merging to make atomic changes to multiple objects at the same time (in data terms, multi-table transactions)
  • Capturing data in commits to build auditability of who is making what changes and when
  • The ability to govern who can access which data and what operations they can run on it
  • Automating the integration of changes (Continuous Integration) and automating publishing of those changes (Continuous Deployment) via CI/CD pipelines

Project Nessie

Several solutions are emerging that approach this problem at different layers, such as the catalog, table, and file levels. Project Nessie is an open-source project that addresses these problems at the catalog level. Benefits of Nessie’s approach:

  • Isolate ingestion across your entire catalog by branching it, allowing you to audit and inspect data before publishing, without exposing it to consumers and without having to make a “staging” copy of the data (branches do not create data copies, just as Git branches don’t duplicate your code).
  • Make changes to multiple tables from a branch, then merge those changes as one atomic multi-table transaction.
  • If a job fails or works in unintended ways, instead of rolling back several tables individually, you can roll back all your tables by rolling back your catalog.
  • Manage access to the catalog, limiting which branches/tables a user can access and what kind of operations they can run on it.
  • Commit logs can be used as an audit log, giving you visibility into your catalog updates.
  • Nessie operations can all be done via SQL, making it more accessible to data consumers.
  • Portability of your tables as they can be accessed by any tool with Nessie support, such as Apache Spark, Apache Flink, Dremio, Presto, and more.

Project Nessie Resources

Tutorials:

Dremio Arctic

While you can deploy your own Nessie server, you can have a cloud-managed one with some extra features using the Dremio Arctic service. Beyond the amazing catalog-level versioning features that you get with having a Nessie catalog for your tables, Dremio Arctic also provides:

  • Automatic table optimization services
  • Easy and Intuitive UI to view commit logs, manage branches, and more
  • Easy integration with the Dremio Sonar Lakehouse query engine
  • Zero Cost to get a catalog up and running in moments with a Dremio Cloud account

Dremio Arctic Resources

CI/CD

Essentially, you can create automated pipelines that take advantage of Nessie's branching using any tool that supports Nessie. These pipelines can be driven by mechanisms such as:

  • Orchestration Tools
  • CRON Jobs
  • Serverless Functions

These mechanisms can be used to send instructions to tools that support Nessie, like Dremio and Apache Spark. For example:

  • Data lands on S3, triggering a Python script that sends the appropriate SQL queries to Dremio via Arrow Flight, ODBC, or REST
  • A PySpark script that runs on a schedule, sending instructions to Spark
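As a rough illustration of the first example, here is a minimal sketch of a serverless handler that reacts to an S3 event notification. It only works out which objects landed and which branch each ingestion job could use. The `ingest_` naming scheme and the shape of the returned job descriptions are assumptions made for illustration; a real deployment would hand each job to code that sends SQL to Dremio over Arrow Flight, ODBC, or REST.

```python
import re
from typing import Dict, List, Tuple
from urllib.parse import unquote_plus


def landed_objects(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def branch_name_for(key: str) -> str:
    """Derive a branch name from an object key (hypothetical naming scheme)."""
    return "ingest_" + re.sub(r"[^A-Za-z0-9]+", "_", key).strip("_")


def handler(event: dict, context: object = None) -> List[Dict[str, str]]:
    """Describe one ingestion job per landed object."""
    return [
        {"bucket": bucket, "key": key, "branch": branch_name_for(key)}
        for bucket, key in landed_objects(event)
    ]
```

Giving each landed file its own branch keeps concurrent ingestion jobs isolated from one another, though a single shared branch per run would work just as well for simpler pipelines.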

These jobs would all follow a similar pattern:

  • Create a branch
  • Switch to the branch
  • Make updates
  • Validate updates
  • If validations are successful, merge changes
  • If validations fail, generate an error with details for remediation (consumers are never exposed to inconsistent or incorrect data)
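The pattern above can be sketched as a small Python function. This is only an illustration under stated assumptions: `execute` is any callable that sends one SQL string to a Nessie-aware engine (for instance `spark.sql` in a Spark session configured with the Nessie SQL extensions), the statements follow the Nessie Spark SQL syntax (Dremio's branch commands differ slightly), and the catalog name `nessie`, the branch name `etl_job`, and the `validate` callback are hypothetical.

```python
from typing import Callable, Iterable, List


def run_branch_job(
    execute: Callable[[str], object],
    updates: Iterable[str],
    validate: Callable[[], List[str]],
    branch: str = "etl_job",  # hypothetical branch name
    catalog: str = "nessie",  # hypothetical catalog name
    main: str = "main",
) -> None:
    """Run updates on an isolated branch and merge only if validation passes."""
    # Create the branch from main, then switch to it.
    execute(f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {main}")
    execute(f"USE REFERENCE {branch} IN {catalog}")

    # Make updates; they are visible only on the branch.
    for statement in updates:
        execute(statement)

    # Validate; the callback returns a list of problems (empty means OK).
    problems = validate()
    if problems:
        # Fail with details for remediation; main was never touched.
        raise RuntimeError(
            f"Validation failed on branch {branch!r}: " + "; ".join(problems)
        )

    # Publish every change on the branch in one merge.
    execute(f"MERGE BRANCH {branch} INTO {main} IN {catalog}")
```

With Spark, a call might look like `run_branch_job(spark.sql, ["INSERT INTO ..."], my_checks)`, where `my_checks` is whatever validation routine you write to run quality queries against the branch.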

Resources on CI/CD with Arctic/Nessie