Why Iceberg requires a Catalog for ACID while Delta doesn’t

Suteja Kanuri
5 min read · Aug 18, 2024

Both Apache Iceberg and Delta Lake offer ACID guarantees for Parquet files, each using a different mechanism to achieve this. Data lakes on the cloud lacked these guarantees because cloud storage systems do not natively provide ACID properties.

Do cloud storage systems provide ACID guarantees?

No — Cloud storage systems, such as Amazon S3, Google Cloud Storage, and Azure Blob Storage, were designed to be scalable, durable, and highly available for storing and retrieving large amounts of unstructured data. However, these systems were not originally built with the stringent requirements needed to ensure ACID (Atomicity, Consistency, Isolation, Durability) guarantees in the context of database-like operations on table formats.

The eventual consistency model that some object stores have historically offered, the lack of atomic operations across multiple objects, the absence of native transaction support, and the challenges of maintaining consistency in distributed environments make it difficult for cloud storage systems to provide ACID guarantees on their own. As a result, table formats like Delta Lake and Apache Iceberg introduce additional mechanisms (such as transaction logs, metadata layers, and catalogs) to manage these complexities and ensure ACID properties in the cloud.

What about ACID on a local file system?

Local file systems provide strong consistency and atomic operations by default, so Delta Lake can rely on native file operations to maintain ACID properties. In contrast, Apache Iceberg keeps a more complex mechanism for maintaining ACID properties, even on local file systems: its snapshot-based architecture, hierarchical metadata, and reliance on a catalog (even a file-based one) remain in place even when the underlying file system provides strong guarantees.

This makes Iceberg more consistent across different environments, whether on cloud storage systems, distributed systems, or local file systems.

I will not go into the details of why ACID guarantees are needed.

Below are the architectural differences in how Iceberg and Delta achieve ACID guarantees.

Iceberg uses a snapshot-based architecture with a hierarchical structure of metadata files (e.g., manifest lists, manifest files). Each snapshot captures the state of the table at a specific point in time. To manage these snapshots and the associated metadata, Iceberg requires a catalog that stores the location of the latest metadata file, ensuring that all clients access the correct table state. The catalog acts as the central registry, enabling consistency and facilitating metadata updates across distributed systems.
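To make the catalog's role concrete, here is a minimal sketch of the pointer swap it performs. It assumes a purely in-memory catalog; the class `InMemoryCatalog`, its methods, and the file names are hypothetical and are not Iceberg's real API. The idea it illustrates is that a commit succeeds only if the pointer still holds the value the writer started from.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog: maps a table name to its current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        with self._lock:
            return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Swap the pointer only if nobody else committed first."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # a concurrent writer won; the caller must retry
            self._pointers[table] = new
            return True


catalog = InMemoryCatalog()
created = catalog.commit("db.events", None, "metadata/v1.metadata.json")

# Two writers both start from v1 and each prepare a new metadata file.
base = catalog.current("db.events")
writer_a_ok = catalog.commit("db.events", base, "metadata/v2-a.metadata.json")
writer_b_ok = catalog.commit("db.events", base, "metadata/v2-b.metadata.json")

print(writer_a_ok, writer_b_ok)      # True False
print(catalog.current("db.events"))  # metadata/v2-a.metadata.json
```

Writer B loses the race and would have to re-read the table state and try again, which is how a single pointer can give all clients the same view of the table.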

Delta Lake uses a transaction log (_delta_log) that stores all changes to the table, including data file additions, deletions, schema changes, and other metadata operations. This log is stored alongside the data files, making Delta Lake tables self-contained. The transaction log is the single source of truth for the table’s state. When a query or operation is executed, Delta Lake reads the log to determine the current state of the table, ensuring ACID properties without needing an external catalog to manage metadata.
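As a rough illustration of "reading the log to determine the current state", the sketch below replays a tiny, hand-written log. It assumes commits are newline-delimited JSON files named by zero-padded version number and handles only `add` and `remove` actions; real Delta logs carry more action types and checkpoints, which are ignored here. The helper names `write_commit` and `live_files` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_commit(log_dir, version, actions):
    """Write one commit as newline-delimited JSON, named by its version."""
    path = log_dir / f"{version:020d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions) + "\n")


def live_files(log_dir):
    """Replay commits in version order and return the set of live data files."""
    files = set()
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "_delta_log"
    log_dir.mkdir()
    write_commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
    write_commit(log_dir, 1, [{"add": {"path": "part-0001.parquet"}}])
    # A compaction-style commit: two files are replaced by one.
    write_commit(log_dir, 2, [
        {"remove": {"path": "part-0000.parquet"}},
        {"remove": {"path": "part-0001.parquet"}},
        {"add": {"path": "part-0002.parquet"}},
    ])
    state = live_files(log_dir)

print(sorted(state))  # ['part-0002.parquet']
```

Because the log lives next to the data, any reader that can list and read this directory can reconstruct the table state without asking an external service.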

Delta X ACID guarantees

The LogStore API is a component of Delta Lake's own architecture. It is not a standalone API that exists outside of Delta Lake; it was developed as part of the Delta Lake project to handle the challenges of ensuring ACID transactions across different storage environments (distributed and cloud-based storage systems).

Key functions of the LogStore API

  • File Operations: The LogStore API provides methods for reading, writing, and listing files in a way that ensures Delta Lake’s ACID guarantees are upheld, even on storage systems with eventual consistency or other limitations.
  • Custom Implementations: Delta Lake provides default implementations of the LogStore API for common storage systems (e.g., HDFSLogStore, AzureLogStore, S3SingleDriverLogStore), but users can also create custom implementations if they need to support other storage systems or handle specific requirements.
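The core property such a store has to provide can be sketched as "create this log file only if it does not already exist". The class below is a hypothetical stand-in written for a local file system, not Delta Lake's real LogStore interface; it only illustrates the mutual exclusion and ordered listing that a commit protocol of this kind relies on, and it does not address making the file contents visible atomically.

```python
import os
import tempfile


class LocalPutIfAbsentStore:
    """Hypothetical stand-in for a log store; not Delta Lake's real API."""

    def __init__(self, root):
        self.root = root

    def write(self, name, data):
        """Create the file only if it does not exist yet (mutual exclusion)."""
        path = os.path.join(self.root, name)
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        with os.fdopen(fd, "w") as f:
            f.write(data)

    def list_from(self, name):
        """Return the file names that sort at or after `name`, in order."""
        return sorted(n for n in os.listdir(self.root) if n >= name)


with tempfile.TemporaryDirectory() as root:
    store = LocalPutIfAbsentStore(root)
    store.write("00000000000000000000.json", '{"writer": "A"}\n')
    try:
        store.write("00000000000000000000.json", '{"writer": "B"}\n')
        second_write_succeeded = True
    except FileExistsError:
        second_write_succeeded = False  # writer B must retry as version 1
    store.write("00000000000000000001.json", '{"writer": "B"}\n')
    versions = store.list_from("00000000000000000000.json")

print(second_write_succeeded)  # False
print(versions)  # ['00000000000000000000.json', '00000000000000000001.json']
```

On a local file system the operating system gives this guarantee directly; on object stores without an equivalent primitive, a storage-specific implementation has to supply it some other way.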

Do we need to implement LogStore API for Delta?

The LogStore API comes built in with the Delta Lake distribution. You don’t need to implement it yourself; Delta Lake ships implementations for storage systems whose native file operations might not guarantee ACID properties.

The LogStore API abstracts the underlying file system operations, providing the necessary mechanisms to ensure that Delta Lake can perform transactions reliably, whether on cloud storage systems like S3, Azure Blob Storage, or Google Cloud Storage, or on local file systems.

Iceberg X ACID guarantees

Iceberg Catalog — The catalog keeps track of where each table's current metadata file is located and provides an entry point to access and manage the table. It is a global namespace, ensuring that all clients see a consistent view of the tables. It can be implemented using various systems, such as Hive Metastore, AWS Glue, or a file-based system.

Metadata layer — It is central to Iceberg’s ability to manage data efficiently and consists of several components:

  • Metadata file — The top-level file that describes the structure and state of an Iceberg table. It tracks everything about the table, including its schema, partitioning, and snapshots.
  • Manifest list — A metadata file that references the manifest files that make up a snapshot.
  • Manifest file — A metadata file that lists individual data files in a snapshot. Manifest files act as the bridge between the high-level table metadata and the actual data files.

Data layer — This is where the actual data resides, in the form of immutable files. The data layer consists of the physical files that store the table’s data, in formats like Parquet, Avro, or ORC. Each data file is referenced by a manifest file, ensuring that the metadata layer knows exactly where each piece of data is located.

How do these layers work together — When a query or operation is performed on an Iceberg table, the system interacts with the catalog to locate the relevant metadata files, traverses the metadata layer to understand the structure and state of the table, and then reads or writes to the data layer as needed. This architecture allows Iceberg to efficiently manage large datasets with ACID guarantees, support time travel, and perform complex operations like schema and partition evolution.
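The traversal described above can be sketched with plain dictionaries standing in for the catalog and the files in storage. Everything here is a simplified assumption for illustration: the table name, paths, dictionary layout, and the function `plan_scan` are hypothetical, and real Iceberg metadata carries far more detail (schemas, partition specs, statistics).

```python
# Hypothetical in-memory stand-ins for the catalog and the files in storage.
catalog = {"db.events": "metadata/v2.metadata.json"}

files = {
    # Metadata file: knows the snapshots and which one is current.
    "metadata/v2.metadata.json": {
        "current-snapshot-id": 2,
        "snapshots": {
            1: "metadata/snap-1.avro",
            2: "metadata/snap-2.avro",
        },
    },
    # Manifest lists: one per snapshot, each pointing at manifest files.
    "metadata/snap-1.avro": ["metadata/m1.avro"],
    "metadata/snap-2.avro": ["metadata/m1.avro", "metadata/m2.avro"],
    # Manifest files: each lists data files.
    "metadata/m1.avro": ["data/a.parquet", "data/b.parquet"],
    "metadata/m2.avro": ["data/c.parquet"],
}


def plan_scan(table, snapshot_id=None):
    """Walk catalog -> metadata file -> manifest list -> manifests -> data files."""
    metadata = files[catalog[table]]
    if snapshot_id is None:
        snapshot_id = metadata["current-snapshot-id"]
    manifest_list = files[metadata["snapshots"][snapshot_id]]
    data_files = []
    for manifest in manifest_list:
        data_files.extend(files[manifest])
    return data_files


current_files = plan_scan("db.events")
old_files = plan_scan("db.events", snapshot_id=1)

print(current_files)  # ['data/a.parquet', 'data/b.parquet', 'data/c.parquet']
print(old_files)      # ['data/a.parquet', 'data/b.parquet']
```

Passing an older snapshot id is the essence of time travel in this model: the older manifest list is still there, so the earlier set of data files can still be resolved.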

Does Iceberg always need an Iceberg catalog?

Yes, Iceberg always needs a catalog. However, the catalog does not strictly need to be a database: a local file system can act as the catalog, storing metadata files and managing table versions directly on the file system.

File-based catalog — Ideal for simpler setups, such as single-node environments or when using Iceberg on a local machine for development purposes. It’s also useful in environments where you don’t want to set up or rely on an external database for catalog management.

Database catalog (e.g., Hive Metastore, AWS Glue, JDBC-based Catalogs) — Suitable for environments where tables are shared across multiple users or systems, providing centralized management and access controls.
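To show why a file-based catalog suits single-node setups, here is a minimal sketch of one. It assumes a single JSON file holding one pointer per table; the class `FileCatalog` and its layout are hypothetical and are not how any real Iceberg catalog stores its state.

```python
import json
import os
import tempfile


class FileCatalog:
    """Hypothetical file-backed catalog: one JSON file of table pointers."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def current(self, table):
        return self._load().get(table)

    def commit(self, table, expected, new):
        """Update the pointer if it still matches what the writer started from.

        The check and the write are not atomic across processes, so this
        sketch is only safe with a single writer.
        """
        pointers = self._load()
        if pointers.get(table) != expected:
            return False
        pointers[table] = new
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(pointers, f)
        os.replace(tmp, self.path)  # swap the new file into place in one step
        return True


with tempfile.TemporaryDirectory() as root:
    catalog = FileCatalog(os.path.join(root, "catalog.json"))
    first = catalog.commit("db.events", None, "metadata/v1.metadata.json")
    stale = catalog.commit("db.events", None, "metadata/v1-other.metadata.json")
    second = catalog.commit(
        "db.events", "metadata/v1.metadata.json", "metadata/v2.metadata.json"
    )
    pointer = catalog.current("db.events")

print(first, stale, second)  # True False True
print(pointer)               # metadata/v2.metadata.json
```

A database-backed catalog can perform the same compare-and-update inside a transaction, which is one reason it is the more natural fit when many users or engines write to the same tables.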

References

  1. Delta — https://docs.delta.io/latest/delta-storage.html
  2. Iceberg — https://docs.delta.io/latest/delta-storage.html
