Polaris vs. Unity Catalog: Clearing Up the Confusion

Suteja Kanuri
3 min read · Jun 30, 2024


June 2024 has been a particularly confusing month for data and AI professionals, especially those trying to understand the evolving landscape of data catalogs and metastores. With Apache Iceberg gaining momentum, Snowflake’s Polaris Catalog being open-sourced, Unity Catalog also being open-sourced, and Databricks acquiring Tabular, the situation has only grown more complex. In this article, I aim to demystify these developments. Read on for some clarity.

Definition of catalog

The term “catalog” is often overused and can mean different things in the data platform landscape:

  1. Catalog as a governance layer: Manages data access, lineage, and compliance.
  2. Catalog as a metastore mechanism: Provides transactional guarantees for reading and writing data.
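The two roles can be told apart with a toy sketch. This is only an illustration under stated assumptions: the class names, the table name, and the privilege string are hypothetical, and the compare-and-swap commit is a simplified stand-in for the transactional guarantee, not the implementation of any real catalog.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when another writer committed first."""


@dataclass
class ToyMetastore:
    """Metastore role: track each table's current metadata pointer."""
    pointers: dict = field(default_factory=dict)

    def commit(self, table, expected, new):
        # Compare-and-swap: succeed only if nobody changed the pointer
        # since this writer last read it.
        current = self.pointers.get(table)
        if current != expected:
            raise CommitConflict(f"{table}: expected {expected!r}, found {current!r}")
        self.pointers[table] = new


@dataclass
class ToyGovernance:
    """Governance role: decide who may do what."""
    grants: set = field(default_factory=set)  # (principal, table, privilege)

    def grant(self, principal, table, privilege):
        self.grants.add((principal, table, privilege))

    def is_allowed(self, principal, table, privilege):
        return (principal, table, privilege) in self.grants


metastore = ToyMetastore()
governance = ToyGovernance()
governance.grant("alice", "sales.orders", "MODIFY")  # hypothetical grant

# Two successive commits; each one names the pointer it expects to replace.
metastore.commit("sales.orders", expected=None, new="v1.metadata.json")
metastore.commit("sales.orders", expected="v1.metadata.json", new="v2.metadata.json")
```

A writer that still expects `v1.metadata.json` after the second commit would get a `CommitConflict`, which is the metastore concern. Whether `alice` may write at all is a separate question, and that one belongs to the governance layer.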

It’s worth noting the differences between the various catalogs.

A metastore is primarily focused on storing metadata about the structure of data. A catalog, in contrast, extends the concept of a metastore with a broader range of data management features, such as data discovery and governance, a unified view across data sources, schema discovery, and fine-grained access control. There are three distinct types of catalogs in the data management space, each addressing a different set of concerns:

  1. Data lake catalog (interchangeable with metastore)
  2. Lakehouse Data & AI catalog
  3. Enterprise-wide catalogs

  • Data lake catalog/metastore: Apache Hive Metastore (HMS) is a metastore, and it lacks many of the extended features found in a typical catalog, such as data discovery, unified views across multiple data sources, advanced data governance, and comprehensive data lineage tracking. AWS Glue and Snowflake Horizon are other examples.
  • Lakehouse Data & AI catalog: Databricks’ Unity Catalog, which is a bit more than just a metastore and provides comprehensive data management capabilities.
  • Enterprise-wide catalogs: These centralize metadata from different metastores, providing a unified view and enhancing overall data management capabilities across the organization. Examples: Collibra, Atlan, Ataccama ONE, etc.

*Polaris (neither a catalog nor a metastore): Polaris is a REST API implementation for managing ACID guarantees on Apache Iceberg tables. Note that Delta and Hudi do not need this.

**In terms of feature breadth: Enterprise-wide catalogs >> Lakehouse Data & AI catalog >> Data lake catalog (metastore)

What does Polaris being open sourced mean

Until now, there has been no unification across Iceberg catalogs. Each catalog in the market, like Glue and HMS, had its own implementation for managing transactional guarantees on Iceberg tables, typically exposed through its own API.

With Polaris being open-sourced, there is a promise of unifying the Iceberg REST API implementation across existing platforms like Glue and HMS, each of which currently has its own API implementation for Iceberg data.
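To make the idea of a shared REST interface concrete, here is a minimal sketch of how a client might address a catalog that follows the Iceberg REST catalog specification. The route shapes (`/v1/config`, `/v1/{prefix}/namespaces/...`) come from that specification. The base URL, the `demo` prefix, and the namespace and table names are hypothetical, and the requests are only built here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical endpoint; any catalog implementing the Iceberg REST spec
# would be addressed the same way.
BASE_URL = "https://catalog.example.com/api/catalog"


def catalog_path(*segments):
    """Join path segments, URL-encoding each one and skipping empty ones."""
    return "/".join(quote(s, safe="") for s in segments if s)


def build_request(path, method="GET"):
    """Build (but do not send) a request against the catalog."""
    url = f"{BASE_URL.rstrip('/')}/{path}"
    return Request(url, method=method, headers={"Accept": "application/json"})


# Routes defined by the Iceberg REST catalog specification.
config_req = build_request(catalog_path("v1", "config"))
list_ns_req = build_request(catalog_path("v1", "demo", "namespaces"))
load_table_req = build_request(
    catalog_path("v1", "demo", "namespaces", "sales", "tables", "orders")
)

for req in (config_req, list_ns_req, load_table_req):
    print(req.get_method(), req.full_url)
```

The point of unification is that only `BASE_URL` should need to change when a client switches between catalogs that implement the same specification.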

This is indeed a significant development.

Unknowns About Polaris (Not yet open sourced)

While the announcement of Polaris being open-sourced is promising, several questions remain unanswered:

  • Scope: Is Polaris a full governance layer like Unity Catalog, or just an implementation of read/write operations for the Iceberg format? The launch blog states clearly that Polaris will need to be “integrated” with Horizon for governance (ACL) implementation.
  • Integration: How will Polaris integrate with external catalogs like Glue and HMS?
  • Snowflake Compatibility: How will Polaris integrate with Snowflake’s native catalog, Horizon? Will both coexist?
  • Governance and ACID: Will Horizon continue to manage access control while Polaris handles ACID capabilities for Iceberg tables?
  • Redundancy: Will Polaris duplicate existing catalogs such as HMS for EMR, Dataplex for GCP, or Glue for AWS? Will these services have to adopt Polaris?

What does Unity Catalog being open sourced mean

With Databricks’ managed Unity Catalog, you gain comprehensive governance features such as data lineage, observability, monitoring, and auditing capabilities. Permissions can be managed in a cloud-agnostic manner through Unity Catalog, eliminating the need to delve into IAM or service principal-level permissioning for data and AI assets.

Known unknowns about Unity Catalog OSS

Feature Parity: The open-source version of Unity Catalog does not include many of the bundled features of Databricks’ managed offering, such as data lineage, observability, monitoring, and auditing capabilities.

The open-source version is at its first release (0.1), so today it only supports basic metadata management via open APIs. It is not yet recommended for production use; Unity Catalog should first be evaluated with the SaaS offering from Databricks.
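As a rough idea of what "basic metadata management via open APIs" looks like, the sketch below builds a request to list the tables in one schema. It rests on assumptions worth checking against the project's documentation for your release: that a local open-source server listens on `localhost:8080`, that the route mirrors the managed Unity Catalog REST path `/api/2.1/unity-catalog/tables`, and that a catalog named `unity` with a schema named `default` exists. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local open-source Unity Catalog server on port 8080.
UC_BASE_URL = "http://localhost:8080/api/2.1/unity-catalog"


def list_tables_request(catalog_name, schema_name):
    """Build (but do not send) a request listing the tables in one schema."""
    query = urlencode({"catalog_name": catalog_name, "schema_name": schema_name})
    return Request(f"{UC_BASE_URL}/tables?{query}", method="GET")


# Assumed sample names: catalog "unity", schema "default".
req = list_tables_request("unity", "default")
print(req.get_method(), req.full_url)
```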

In conclusion

If you are considering exploring Polaris or Unity Catalog, it’s essential to assess each from different perspectives, as they address distinct aspects of data management. Both have their own caveats, given their recent introduction to the open-source community. Open-source technologies take time to evolve and mature, so it’s a matter of waiting and observing their development and adoption.
