Can Hive Metastore ride the wave of table formats and data modernization?

Suteja Kanuri
3 min read · Sep 4, 2024


Hive Metastore (HMS) has remained an important part of the Hadoop ecosystem. From the beginning, it has held the promise of an unopinionated catalog for that ecosystem.

  • Purpose: HMS is a centralized metadata repository used primarily with Hadoop-based data processing frameworks. It stores metadata about data structures, such as table schemas, partitions, and data locations.
  • Architecture: HMS operates as a standalone service and uses a traditional relational database (like MySQL or PostgreSQL) to store metadata. It supports a variety of data processing tools like Apache Hive, Apache Spark, and Apache Impala.
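To make the "metadata repository on a relational database" idea concrete, here is a greatly simplified sketch in Python using SQLite. The table and column names are loosely inspired by the real HMS schema (which has tables like DBS, TBLS, and PARTITIONS) but are not the actual schema; the point is only that HMS stores locations and schemas, while the data itself lives elsewhere.

```python
import sqlite3

# Greatly simplified sketch of the kind of relational schema a metastore keeps.
# NOT the real HMS schema -- just an illustration of metadata-only storage.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dbs  (db_id INTEGER PRIMARY KEY, name TEXT, location TEXT);
CREATE TABLE tbls (tbl_id INTEGER PRIMARY KEY, db_id INTEGER, name TEXT,
                   file_format TEXT, location TEXT);
CREATE TABLE partitions (part_id INTEGER PRIMARY KEY, tbl_id INTEGER,
                         part_name TEXT, location TEXT);
""")

# Register a database, a table, and one partition -- pure metadata;
# the actual files would live in HDFS, S3, etc.
conn.execute("INSERT INTO dbs VALUES (1, 'sales', 's3://bucket/sales/')")
conn.execute("""INSERT INTO tbls VALUES
    (1, 1, 'orders', 'parquet', 's3://bucket/sales/orders/')""")
conn.execute("""INSERT INTO partitions VALUES
    (1, 1, 'ds=2024-09-04', 's3://bucket/sales/orders/ds=2024-09-04/')""")

# A query engine asks the metastore: where do the files for this partition live?
row = conn.execute("""
    SELECT p.location FROM partitions p
    JOIN tbls t ON p.tbl_id = t.tbl_id
    JOIN dbs  d ON t.db_id  = d.db_id
    WHERE d.name = 'sales' AND t.name = 'orders'
      AND p.part_name = 'ds=2024-09-04'
""").fetchone()
print(row[0])  # s3://bucket/sales/orders/ds=2024-09-04/
```

This separation is exactly why so many engines can share one metastore: each engine only needs the answer to "where is the data and what does it look like", not the data itself.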

Which components of Hadoop got replaced?

  • MapReduce → Spark: Spark offers faster, in-memory processing, making it the preferred choice over the traditional MapReduce framework.
  • HDFS → Cloud Storage (Amazon S3, Google Cloud Storage, Azure Blob Storage): Cloud storage solutions provide scalable, flexible, and cost-effective alternatives to HDFS, with easier integration into modern data architectures.
  • Pig → Spark: Spark has also taken over Pig’s role, offering more powerful and flexible data processing capabilities.
  • Oozie → Apache Airflow and Modern Orchestration Tools: Workflow orchestration has shifted to more flexible and user-friendly tools like Apache Airflow, which provide better integration with diverse data ecosystems.
  • YARN → Kubernetes: Kubernetes is used for resource management and orchestration, offering greater flexibility and support for containerized workloads.
  • Hive Query Engine → Presto/Trino: The query engine component of Hive has been outpaced by Presto and Trino, which are now favored for their superior performance and scalability in distributed SQL query execution.
  • and many more…

Why is Hive Metastore so widely compatible?

HMS is accessible through a standard API built on Apache Thrift, which abstracts away the complexity of the backend database (MySQL, PostgreSQL, Oracle, Microsoft SQL Server, Apache Derby, etc.), transaction management, security mechanisms, and caching strategies. Clients simply interact with a unified interface, and HMS takes care of the rest.
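In practice, "unified interface" means a client engine only needs the Thrift URI of the metastore. A minimal hive-site.xml fragment is enough (the hostname below is a placeholder; 9083 is the conventional default port):

```xml
<!-- hive-site.xml: the only thing a client needs to reach HMS is the
     Thrift URI; the backing RDBMS, caching, and security sit behind it. -->
<property>
  <name>hive.metastore.uris</name>
  <!-- placeholder host; 9083 is the default metastore port -->
  <value>thrift://metastore-host:9083</value>
</property>
```

Spark, Presto/Trino, and other engines accept the same style of setting, which is why swapping the engine rarely means swapping the catalog.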

How does Hive Metastore work in the cloud?

Cloud providers often offer their own metadata management services (e.g., AWS Glue Data Catalog, Google Cloud Data Catalog, Azure Data Catalog) that can complement or replace Hive Metastore. Some also provide a serverless HMS option, where the entire infrastructure, including versioning, scaling, and maintenance, is managed for you. These services expose HMS through Thrift endpoints that the cloud provider manages and maintains, which means you can focus on your data and analytics without worrying about the underlying HMS infrastructure.
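As one concrete example of "replace", AWS documents a pattern on EMR where Hive and Spark talk to the Glue Data Catalog instead of a self-hosted HMS by swapping a single client factory class; application code is unchanged. A sketch of the relevant setting:

```xml
<!-- Pattern documented for AWS EMR: point the Hive metastore client at
     the Glue Data Catalog instead of a self-managed HMS. -->
<property>
  <name>hive.metastore.client.factory.class</name>
  <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
</property>
```

The fact that a cloud catalog can be dropped in behind the same client interface is itself a testament to how entrenched the HMS interface is.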

Issues with Hive Metastore

  • Maintenance Overhead: Regular maintenance tasks, such as backups, upgrades, and tuning of the relational database, are required to keep HMS running efficiently.
  • Integration and compatibility issues: HMS was not architected to be cloud-native, which complicates managed-service implementations.
  • Single Point of Failure: HMS can become a bottleneck or single point of failure if not properly scaled or managed, potentially affecting the entire data processing ecosystem.
  • Database Performance: The performance of HMS is heavily dependent on the underlying relational database. If the database is not properly tuned or scaled, it can lead to performance issues in metadata retrieval.

What has kept Hive Metastore irreplaceable until now?

It is the last component of the glorious Hadoop ecosystem still standing intact and irreplaceable. For modern data platforms that choose to be open, it is not necessarily the data that has to be open, but rather the interface it exposes.

Many tools and frameworks in the big data ecosystem, such as Apache Hive, Apache Spark, and Presto/Trino, were built with Hive Metastore compatibility in mind. This historical integration created a strong dependency that makes transitioning to alternative solutions complex. Hive Metastore has a long history, and it was the de facto default when these tools got their start.

Is Hive Metastore open, and in what sense?

The term “open metastore” generally refers to a metadata management service or repository that supports open standards and is designed to interoperate with a variety of data processing systems. While Hive Metastore is open source, it was designed primarily for integration with the Hadoop ecosystem and may not fully align with all aspects of an open metastore as defined by newer standards.

APIs for Integration: Hive Metastore exposes a set of open, Thrift-based APIs that enable interaction with its metadata repository. Various tools and applications use these APIs to access and manage the metadata stored in HMS.
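To give a flavor of what that API surface looks like, here is an abridged, paraphrased excerpt in Thrift IDL style. The real service definition (hive_metastore.thrift) is far larger and includes exception declarations and many more calls; this sketch only names a few representative operations:

```thrift
// Abridged, paraphrased sketch of the ThriftHiveMetastore service --
// not the complete or exact IDL.
service ThriftHiveMetastore {
  list<string> get_all_databases()
  Table get_table(1: string dbname, 2: string tbl_name)
  list<Partition> get_partitions(1: string db_name, 2: string tbl_name,
                                 3: i16 max_parts)
}
```

Because the contract is expressed in Thrift, clients can be generated for many languages, which is a large part of why HMS became a cross-engine lingua franca.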

Support for Various Formats: HMS is designed to work with a range of data formats and storage systems, including traditional file formats like Parquet and ORC as well as newer open table formats like Apache Iceberg and Delta Lake. This support for multiple formats demonstrates its openness in handling diverse data types.
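For instance, Apache Iceberg can use HMS as its catalog: in Spark you register an Iceberg catalog of type `hive` and point it at the metastore's Thrift URI. The catalog name and host below are placeholders; the property names follow Iceberg's documented Spark catalog configuration:

```properties
# spark-defaults.conf sketch: an Iceberg catalog backed by Hive Metastore.
# "hms_iceberg" and the host are placeholders.
spark.sql.catalog.hms_iceberg        org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hms_iceberg.type   hive
spark.sql.catalog.hms_iceberg.uri    thrift://metastore-host:9083
```

So even the newest table formats can ride on the oldest surviving piece of Hadoop.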
