Google’s lakehouse offering, BigLake: A deep dive into various BigLake tables

Feb 11, 2025

Introduction

For years, BigQuery has been one of the most powerful cloud data warehouses, offering unmatched scalability and performance. However, one major challenge has always loomed over its adoption — vendor lock-in. Data stored in BigQuery’s proprietary storage format required users to stay within the BigQuery ecosystem, limiting flexibility and multi-cloud interoperability.

Now, Google has made a strategic shift with BigLake, an offering that blends the best of both data lakes and warehouses while breaking down silos. But how far does it go in addressing lock-in concerns? Let’s dive in.

BigQuery’s Lock-In Challenge

While BigQuery has been popular for analytics workloads, its reliance on a proprietary storage engine has posed several challenges:

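Storage and Compute Coupling: Although BigQuery scales storage and compute independently, data in its native storage is reached mainly through BigQuery’s own engine and APIs, unlike open lakehouse architectures where any engine can read the underlying files directly.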
Limited Multi-Engine Support: Data in BigQuery-native tables could not be efficiently queried by other processing engines like Spark, Trino, or Presto.
Governance Constraints: While BigQuery offers row- and column-level access control, it’s tied to the Google ecosystem.
Cost and Egress Fees: Moving data out of BigQuery incurs additional costs, discouraging cross-cloud flexibility.

With BigLake, Google aims to solve these issues by supporting open table formats and multi-engine processing.

In this article, we’ll explore BigLake in the context of the Google Cloud ecosystem and break it down into five key areas:

BigQuery Native Tables
BigQuery External Tables
BigLake Iceberg Tables via BigLake Metastore (BLMS)
BigLake Managed Tables (BLMT)
BigLake Self-Managed External Tables

We’ll dive into these components, compare their use cases, and explore how BigLake integrates with Google Cloud services to provide a streamlined, unified data management approach.

1. BigQuery Tables

BigQuery is Google Cloud’s fully managed data warehouse that enables fast SQL queries on large datasets. Within BigQuery, there are two primary table types to consider:

Native BigQuery Tables (BQ Tables)

These tables are fully managed within the BigQuery ecosystem, and the data resides entirely within BigQuery’s native storage layer. The data is stored in a format optimized for analytics, and BigQuery handles metadata, schema management, and storage maintenance automatically. Performance features such as partitioning and clustering are declared when the table is created and then maintained by BigQuery.

Advantages of Native BigQuery Tables:

Fully Managed: BigQuery takes care of the underlying storage and metadata management.
High Performance: Optimized for complex analytics and large datasets.
Scalable: Handles large data volumes with ease using BigQuery’s scalable architecture.
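As a rough illustration of a native table, the sketch below assembles a BigQuery DDL statement for a partitioned, clustered table. The project, dataset, table, and column names are hypothetical placeholders, and the helper only builds the SQL string; actually running it would require the BigQuery console or a client library.

```python
def native_table_ddl(table, columns, partition_expr, cluster_cols):
    """Build a CREATE TABLE statement for a native BigQuery table."""
    # One "name TYPE" line per column, in the order given.
    col_lines = ",\n".join(f"  {name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE `{table}` (\n{col_lines}\n)\n"
        f"PARTITION BY {partition_expr}\n"
        f"CLUSTER BY {', '.join(cluster_cols)}"
    )


# Hypothetical names, for illustration only.
ddl = native_table_ddl(
    "my_project.analytics.events",
    {"event_ts": "TIMESTAMP", "user_id": "STRING", "payload": "JSON"},
    partition_expr="DATE(event_ts)",
    cluster_cols=["user_id"],
)
print(ddl)
```

Partitioning and clustering are chosen by the table owner; BigQuery then maintains the physical layout without further intervention.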

External BigQuery Tables

External tables allow BigQuery to access data stored outside of BigQuery’s own storage, whether in Google Cloud Storage (GCS) or in other clouds such as Amazon S3 or Azure Data Lake. This is useful when the data resides in external storage but needs to be queried or processed without moving it into BigQuery.

Advantages of External BigQuery Tables:

No Data Duplication: Data stays in its external location, reducing costs for data transfer.
Flexibility: Allows for querying data from diverse sources directly within BigQuery.
Lower Cost: Avoids the cost of moving data into BigQuery storage, making it cost-effective for certain use cases.
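To make the idea concrete, here is a minimal sketch that builds the DDL for an external table over Parquet files in GCS. The dataset, table, and bucket names are made up for illustration, and the helper only produces the SQL text.

```python
def external_table_ddl(table, file_format, uris):
    """Build a CREATE EXTERNAL TABLE statement over files in object storage."""
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"OPTIONS (\n"
        f"  format = '{file_format}',\n"
        f"  uris = [{uri_list}]\n"
        f")"
    )


# Hypothetical bucket and table names.
ext_ddl = external_table_ddl(
    "my_project.lake.raw_events",
    "PARQUET",
    ["gs://my-bucket/raw/events/*.parquet"],
)
print(ext_ddl)
```

Because no column list is given, BigQuery would be expected to infer the schema from the self-describing Parquet files; the data itself never leaves the bucket.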

2. BigLake Tables

[Variant 1] BigLake Self-Managed External Tables

While BigLake Managed Tables and Iceberg tables offer powerful managed solutions, Self-Managed External Tables allow users to keep more control over their data by using external storage solutions like Google Cloud Storage (GCS), Amazon S3, or Azure Data Lake. With self-managed external tables, users can define and manage their metadata externally, leveraging custom catalogs or systems like Apache Hive or Apache Iceberg.

Advantages of Self-Managed External Tables:

Full Control: Users have complete control over their metadata and data storage, making it more flexible for certain use cases.
Integration with External Tools: Work with external engines such as Apache Spark, Flink, or Trino for custom workflows.
Cost-Efficiency: External data can be queried directly without moving it into BigQuery’s storage, saving costs for large datasets.

Use Case: Best for organizations that have large datasets already stored externally and need to maintain their metadata separately while still querying it efficiently in BigQuery.
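A self-managed table of this kind can be sketched as an external table that goes through a BigQuery connection and points at metadata written by another engine. The example below assumes an Iceberg table whose metadata file was produced by, say, Spark; the connection ID, bucket path, and metadata file name are hypothetical.

```python
def self_managed_iceberg_ddl(table, connection, metadata_uri):
    """Build DDL for a BigLake external table over externally managed Iceberg metadata."""
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (\n"
        f"  format = 'ICEBERG',\n"
        f"  uris = ['{metadata_uri}']\n"
        f")"
    )


# Hypothetical connection and path; the metadata file is owned by the external engine.
sm_ddl = self_managed_iceberg_ddl(
    "my_project.lake.orders",
    "my_project.us.lake_connection",
    "gs://my-bucket/warehouse/orders/metadata/v3.metadata.json",
)
print(sm_ddl)
```

The trade-off of this approach is that BigQuery sees only the snapshot named in the URI, so the table definition has to be updated whenever the external engine commits a new metadata file.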

[Variant 2] BigLake Iceberg Tables via BigLake Metastore (BLMS)

BigLake Iceberg Tables via BigLake Metastore (BLMS) is a specialized table variant designed for organizations leveraging Apache Iceberg as their open-source table format. Iceberg provides advanced features such as schema evolution, ACID transactions, and support for partitioning data in a flexible way.

With BLMS, these Iceberg tables are managed through the BigLake Metastore, allowing them to be accessed seamlessly across different engines like BigQuery and Apache Spark.

Advantages of BLMS:

Seamless Integration: Access Iceberg tables directly from BigQuery, using the BigLake Metastore to manage metadata.
Cross-Engine Compatibility: Easily integrate with other engines like Apache Spark, Trino, or Flink.
No Metadata Refresh: Iceberg tables under BigLake do not require manual metadata refresh, as metadata is automatically handled by BigLake.
Unified Governance: By centralizing metadata management, BigLake ensures consistent access and governance across multiple systems.

Use Case: Ideal for organizations that need to manage large, evolving datasets in an open table format while integrating with both cloud and on-prem systems.
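On the Spark side, pointing an Iceberg catalog at BigLake Metastore is mostly a matter of configuration. The sketch below assembles the relevant Spark properties as a dictionary. The property names and the catalog implementation class are recalled from Google's Iceberg catalog plugin and should be verified against current documentation; the catalog, project, and bucket values are hypothetical.

```python
def blms_spark_conf(catalog, project, location, blms_catalog, warehouse):
    """Assemble Spark properties for an Iceberg catalog backed by BigLake Metastore.

    Property names are an assumption based on the Iceberg BigLake catalog plugin.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.gcp.biglake.BigLakeCatalog",
        f"{prefix}.gcp_project": project,
        f"{prefix}.gcp_location": location,
        f"{prefix}.blms_catalog": blms_catalog,
        f"{prefix}.warehouse": warehouse,
    }


# Hypothetical values, for illustration only.
conf = blms_spark_conf("lake", "my_project", "us", "my_blms_catalog",
                       "gs://my-bucket/warehouse")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

With a catalog configured this way, tables created from Spark are registered in the metastore, which is what lets other engines find them without a separate metadata refresh step.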

[Variant 3] BigLake Managed Tables (BLMT)

BigLake Managed Tables (BLMT) are fully managed Iceberg tables that reside on Google Cloud Storage (GCS) but are fully integrated with BigQuery. Unlike Iceberg tables managed externally, BLMT offers full CRUD support (Create, Read, Update, Delete) directly through BigQuery, simplifying the data management process.

BigLake Managed Tables provide users with a seamless experience for managing their data without worrying about underlying storage or metadata management.

Advantages of BLMT:

Fully Managed: BigQuery automatically manages both the data and the metadata, making it easier to work with compared to self-managed solutions.
Seamless Integration: CRUD operations on Iceberg tables can be done directly through BigQuery without needing external systems or services.
Optimized for Analytics: Since the data is fully integrated into BigQuery, it can take advantage of all BigQuery’s performance optimizations like clustering, partitioning, and security features.

Use Case: Best suited for enterprises that want the flexibility of Apache Iceberg while enjoying the benefits of BigQuery’s management and analytics capabilities.
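For comparison with the external variants, the following sketch builds the DDL for a managed Iceberg table plus a few DML statements that BigQuery would run against it. All names (project, dataset, connection, bucket, columns) are hypothetical, and the code only prints SQL rather than executing it.

```python
def managed_iceberg_ddl(table, columns, connection, storage_uri):
    """Build DDL for a BigQuery-managed Iceberg table stored in GCS."""
    col_lines = ",\n".join(f"  {name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE `{table}` (\n{col_lines}\n)\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (\n"
        f"  file_format = 'PARQUET',\n"
        f"  table_format = 'ICEBERG',\n"
        f"  storage_uri = '{storage_uri}'\n"
        f")"
    )


table = "my_project.lake.customers"  # hypothetical
blmt_ddl = managed_iceberg_ddl(
    table,
    {"customer_id": "INT64", "tier": "STRING"},
    "my_project.us.lake_connection",
    "gs://my-bucket/managed/customers",
)

# Create, update, and delete all go through ordinary BigQuery SQL.
dml = [
    f"INSERT INTO `{table}` (customer_id, tier) VALUES (1, 'gold')",
    f"UPDATE `{table}` SET tier = 'silver' WHERE customer_id = 1",
    f"DELETE FROM `{table}` WHERE customer_id = 1",
]
print(blmt_ddl)
print(";\n".join(dml))
```

Note that this is a plain CREATE TABLE rather than CREATE EXTERNAL TABLE: BigQuery owns both the data files and the Iceberg metadata under the given storage URI.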

Fragmentation of metastores across table variants

The fragmentation of the metastore across different table variants like BigLake Managed Tables (BLMT), BigLake Iceberg Tables via BigLake Metastore (BLMS), and BigQuery external tables can lead to challenges in metadata consistency and unified management. Each table type has its own metadata structure and storage model, which can create silos and hinder seamless integration.

Introducing BigQuery Metastore

However, using the BigQuery Metastore service can help address this fragmentation by providing a central repository for metadata management, improving accessibility and reducing the complexity of managing multiple engines and storage types.

BigQuery metastore is designed for the lakehouse architecture, which combines the benefits of data lakes and data warehouses without having to manage both a data lake and a data warehouse — any data, any user, any workload, on a unified platform. It supports open data formats such as Apache Iceberg that are accessible by a variety of processing engines, including BigQuery, Spark, Flink and Hive. The unification of metadata across engines makes it easier to discover and use data, supporting self-service BI and ML tools to drive innovation, while maintaining data governance.

Furthermore, BigQuery metastore is serverless with no setup or configuration required and automatically scales with your workloads. This no-ops environment reduces TCO and democratizes your data for data analysts, data engineers and data scientists.

BigQuery Metastore vs. BigLake Metastore vs. Dataplex: Key Differences & Roles

When to Use What?

✅ Use BigQuery Metastore if you only need metadata, governance, and security for BigQuery & BigLake tables.

✅ Use BigLake Metastore if you need to manage metadata for tables stored in external storage systems, such as GCS.

✅ Use Dataplex if you need centralized governance and metadata for a hybrid/multicloud environment, managing data across BigQuery, GCS, and external lakes.
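The three rules of thumb above can be written down as a small decision helper. This is only a restatement of the guidance in this article, not an official selection procedure, and the function and parameter names are invented for illustration.

```python
def suggest_metadata_service(multicloud_governance, external_storage_tables):
    """Map the article's rules of thumb to a service name."""
    if multicloud_governance:
        # Centralized governance across BigQuery, GCS, and external lakes.
        return "Dataplex"
    if external_storage_tables:
        # Metadata for tables whose data lives in external storage such as GCS.
        return "BigLake Metastore"
    # Metadata, governance, and security for BigQuery and BigLake tables only.
    return "BigQuery Metastore"


print(suggest_metadata_service(multicloud_governance=False,
                               external_storage_tables=True))
```

In practice the choices are not mutually exclusive, since Dataplex is described as sitting on top of the other two.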

How They Work Together

BigQuery Metastore and BigLake Metastore handle metadata for their respective storage systems. Dataplex acts as the centralized data governance layer, unifying metadata across BigLake Managed Tables, BigQuery Native Tables, and other BigLake tables. It ensures that data access policies, security controls, and lineage are consistent across all platforms.

Confusion with multiple metastores

The issue arises when organizations must navigate multiple metastores, each with different access controls, governance policies, and storage systems. If these metastores are not kept in sync, the result can be confusion, inefficiency, and mismanaged data.

Conclusion

BigLake is still evolving, and as a result, the different table variants (BLMT, BLMS, and external BigQuery tables) can be confusing. These variants cater to different use cases, but they lack a unified structure and clarity, especially in terms of metadata management. The product will likely benefit from further refinement and consolidation of its table variants to make it more user-friendly and efficient in the long term. Unifying the metadata model would streamline operations and reduce fragmentation across these table types.