5 Open Source Data Projects You Should Be Following

🗓 2024-03-19 👤 Alex Merced ⏱ 4 min read

Tags: Data Lakehouse, Data Architecture, Apache Iceberg, Open Source


Open source technology shapes nearly every area of software development, and the data sector is no exception. Today's datasets keep growing, often depend on external sources around the world, and need to be turned into insights quickly. The proprietary formats and platforms of the past, with artificial barriers designed to maintain vendor lock-in, get in the way of that. Fortunately, a number of open source projects are changing how data is stored, cataloged, moved, and queried. Open data platforms like Dremio, among others, already build on these projects, and some of them are still early in their disruptive journey.

Apache Iceberg (Lakehouse Table Format)

The first noteworthy technology is Apache Iceberg, a data lakehouse table format. It adds a metadata layer over the data files in your data lake (typically Parquet, with ORC and Avro also supported), so that different tools can treat those files as database tables, complete with ACID transactions, time travel, schema and partition evolution, and more. Apache Iceberg stands out for its ease of use, particularly with lakehouse platforms like Dremio, for its robust ecosystem of compatible tools and integrations, and for its community-driven culture. With Apache Iceberg, your preferred tools and platforms can all work against the same tables, without keeping multiple copies of the data.
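To make the idea of a metadata layer more concrete, here is a minimal, standard-library-only sketch of how snapshots can enable time travel. `ToyTable`, `Snapshot`, and the file paths are hypothetical names invented for illustration. This is not the Iceberg specification or any Iceberg library API, only a toy model of the concept.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # data file paths visible in this snapshot


@dataclass
class ToyTable:
    """Toy stand-in for a table's metadata layer (not the Iceberg spec or API)."""
    snapshots: list = field(default_factory=list)

    def append(self, *new_files):
        # A commit never rewrites an old snapshot; it adds a new one.
        current = self.snapshots[-1].data_files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, current + tuple(new_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def scan(self, snapshot_id=None):
        # Default: latest snapshot. Passing an older id is "time travel".
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(f"unknown snapshot {snapshot_id}")


table = ToyTable()
first = table.append("data/part-0.parquet")   # hypothetical file path
second = table.append("data/part-1.parquet")  # hypothetical file path

latest_files = table.scan()        # files visible now
files_at_first = table.scan(first)  # files visible at the first commit
```

Because old snapshots are never modified, a reader pinned to `first` keeps seeing a consistent view while writers add new snapshots.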

Nessie (Lakehouse Catalog with Git-Like Catalog Versioning)


Nessie is an open-source lakehouse catalog that tracks the Apache Iceberg tables and views in your data lake and integrates with popular tools like Dremio, Apache Spark, Apache Flink, and others. What sets Nessie apart is catalog versioning: you can work on your catalog in an isolated branch and then publish all of your changes at once by merging that branch. The same mechanism enables catalog-level rollbacks, improves data reproducibility through tagging, and supports branches for experimentation and validation as well as zero-copy environments for development. Nessie can be deployed in a self-managed manner, and it is also integrated into the Dremio Cloud Lakehouse platform.
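The branch, merge, and tag workflow described above can be sketched with plain dictionaries. `ToyCatalog` and the metadata pointer strings are hypothetical, and the merge here is deliberately naive (the source branch simply wins). This is not Nessie's real API, and a real catalog tracks commit history and detects conflicts.

```python
class ToyCatalog:
    """Toy model of a versioned catalog (not Nessie's real API)."""

    def __init__(self):
        # Each branch maps table name -> pointer to that table's current metadata.
        self.branches = {"main": {}}
        self.tags = {}

    def create_branch(self, name, source="main"):
        # Only pointers are copied, never data files ("zero-copy").
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, metadata_pointer):
        self.branches[branch][table] = metadata_pointer

    def merge(self, source, target="main"):
        # Build the new state first, then swap it in with one assignment,
        # so readers of the target see all of the changes or none of them.
        merged = {**self.branches[target], **self.branches[source]}
        self.branches[target] = merged

    def tag(self, name, branch="main"):
        # A tag is a frozen copy of a branch's pointers at this moment.
        self.tags[name] = dict(self.branches[branch])


catalog = ToyCatalog()
catalog.commit("main", "orders", "orders/v1.metadata.json")
catalog.tag("release-1")

catalog.create_branch("etl")
catalog.commit("etl", "orders", "orders/v2.metadata.json")
catalog.commit("etl", "customers", "customers/v1.metadata.json")

main_before_merge = dict(catalog.branches["main"])  # still the old state
catalog.merge("etl")                                # publish both changes together
main_after_merge = catalog.branches["main"]
```

Until `merge` runs, readers of `main` see only the original `orders` pointer, and the `release-1` tag keeps pointing at that state afterwards, which is what makes rollback and reproducibility possible.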

Apache Arrow (Standard In-Memory Format and Transfer Protocol)

Apache Arrow introduces a range of standards to the data arena. Its in-memory format sets a standard for columnar data, enabling fast analytical processing while reducing the serialization and deserialization overhead of moving data between systems. The Apache Arrow Flight gRPC protocol transfers data between systems in the Arrow format, further minimizing the need for conversion and improving performance. This protocol can be used directly or through the Arrow Flight SQL JDBC/ODBC drivers that Dremio has contributed to the community, allowing connections to any Arrow Flight SQL server with a single driver. A notable addition to the project is ADBC (Arrow Database Connectivity), an API standard for database access that returns results as Arrow data. ADBC is a client API rather than a wire protocol: behind the same API you can use an Arrow Flight SQL driver for Arrow Flight SQL servers, while other platforms can provide their own ADBC drivers for their protocols and formats, so applications keep the benefits of columnar transfer. Essentially, Apache Arrow enables faster data processing and transfer, meeting the increasing demand for speed in data handling.
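A rough feel for why a columnar in-memory layout helps can be had with Python's standard `array` module. This is an illustrative sketch only: it does not use the Arrow libraries, and the real Arrow format adds schemas, validity bitmaps, offsets, and more. The sample rows are made up.

```python
from array import array

# Row-oriented data, as a typical application might hold it.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 20.0},
    {"id": 3, "price": 0.5},
]

# Columnar layout: one contiguous, typed buffer per column.
columns = {
    "id": array("q", (r["id"] for r in rows)),        # 64-bit signed integers
    "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
}

# An aggregate reads a single buffer and never touches the other columns.
total_price = sum(columns["price"])

# "Transfer": ship the raw buffer and reinterpret it on the other side.
# No per-value parsing or text conversion is needed.
price_bytes = columns["price"].tobytes()
received = array("d")
received.frombytes(price_bytes)
```

When both sides agree on the buffer layout, the receiver can use the bytes as they arrive, which is the kind of conversion cost a shared columnar format and transfer protocol aim to avoid.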

Ibis (Portable Python Dataframe API)

Ibis is a project that decouples the Python dataframe API from the compute layer, providing one dataframe API that can run against many different compute engines. Analysts write their logic once as dataframe expressions, and the chosen backend executes it. This lets them work with data in different systems without learning a new API or rewriting code for each one.
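The separation of API and compute layer can be illustrated with a toy deferred expression that two different "backends" consume. `Query`, `to_sql`, and `run_in_memory` are hypothetical names for this sketch and are not part of the Ibis API, which is far richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical deferred expression: it describes work and runs nothing."""
    table: str
    column: str
    min_value: float


def to_sql(query):
    # Backend 1: hand the expression to a SQL engine as text.
    return f"SELECT * FROM {query.table} WHERE {query.column} >= {query.min_value}"


def run_in_memory(query, tables):
    # Backend 2: evaluate the same expression over local Python rows.
    return [row for row in tables[query.table] if row[query.column] >= query.min_value]


# The analyst writes the logic once...
query = Query(table="orders", column="amount", min_value=100)

# ...and chooses where it runs afterwards.
sql_text = to_sql(query)
local_tables = {"orders": [{"id": 1, "amount": 50}, {"id": 2, "amount": 150}]}
local_rows = run_in_memory(query, local_tables)
```

The key design choice is that the expression carries no execution logic of its own, so adding a new engine means adding a new backend function, not changing the analyst's code.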

Substrait (Cross-Engine Query Plan Standard)

Substrait is a distinctive project that aims to standardize a layer users rarely see: the algebraic representation of queries. When a query written in a query language is executed, it is first compiled into an intermediate representation that the engine uses to decide which operations to run. Typically this representation is unique to each engine, so a plan produced by one system cannot be handed to another. Substrait's goal is to establish a standard intermediate format that sits between the user's query and the relational algebra the query engine processes, so that plans can be produced and consumed across different languages, tools, and platforms.
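As a rough illustration of what a shared intermediate format buys you, the sketch below describes a query as a small tree of relational operators, serializes it, and lets a toy engine execute it. The operator names and the JSON encoding are assumptions made up for this example; they are not the Substrait specification, which defines its own plan schema and serialization.

```python
import json

# Hypothetical plan: project(name) <- filter(age > 30) <- read(people)
plan = {
    "op": "project",
    "columns": ["name"],
    "input": {
        "op": "filter",
        "column": "age",
        "greater_than": 30,
        "input": {"op": "read", "table": "people"},
    },
}


def execute(node, tables):
    """A toy engine: evaluates the plan tree from the leaves up and returns rows."""
    if node["op"] == "read":
        return list(tables[node["table"]])
    rows = execute(node["input"], tables)
    if node["op"] == "filter":
        return [r for r in rows if r[node["column"]] > node["greater_than"]]
    if node["op"] == "project":
        return [{c: r[c] for c in node["columns"]} for r in rows]
    raise ValueError(f"unsupported operator: {node['op']}")


# The producer serializes the plan; any consumer that understands
# the format can deserialize and run it, whatever language it is written in.
wire = json.dumps(plan)
tables = {"people": [{"name": "Ada", "age": 36}, {"name": "Bo", "age": 25}]}
result = execute(json.loads(wire), tables)
```

Here the producer never needs to know which engine will run the plan, and the engine never needs to parse the user's original query language.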