Suteja Kanuri
Jun 23, 2024

What is Apache XTable™

Apache XTable™ (Incubating) is a cross-table converter for table formats that facilitates omni-directional interoperability across data processing systems and query engines.

In simple words, XTable™ is not a table format like Delta, Hudi, or Iceberg. Instead, it is a sync operation that creates only metadata for the specified table formats.

Essentially, when you run an XTable™ sync operation, metadata for Iceberg, Hudi, or Delta is created on top of the existing data files.
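To make this concrete, after a sync the table directory simply gains extra metadata folders next to the shared data files (the directory names below are the standard ones each format uses; the table name and partition are illustrative):

users/
  city=nyc/part-00000.parquet   <- shared parquet data files
  _delta_log/                   <- Delta metadata
  metadata/                     <- Iceberg metadata
  .hoodie/                      <- Hudi metadata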

Why is Apache XTable™ gaining traction

As the table format war is underway with no clear winner yet, XTable™ and similar tools will be a huge game changer for end users who want to move between the table formats (Delta, Hudi, Iceberg).

What are the advantages of Apache XTable™

Apache XTable™ aids interoperability between the table formats by making metadata available in multiple formats, so that readers can use their preferred reading mechanism (Delta/Iceberg/Hudi).

Even though Apache Iceberg appears to be emerging as the winner of the table format war on the internet, we still need to account for existing Hudi and Delta tables, as each format has its own advantages and niche use cases where it works extremely well.

Hudi — Focuses on incremental processing, file indexing, and clustering. Best for use cases involving frequent updates and real-time data ingestion

Delta — Strong in data skipping, Z-ordering, and compaction. Suitable for scenarios requiring efficient query performance and storage optimization with comprehensive ACID transaction support

Iceberg — Excels in partition evolution, hidden partitioning, and detailed metadata management. Ideal for environments needing flexible partition strategies and efficient query execution

Caveats to using Apache XTable™

It only creates metadata and does not do any optimizations out of the box for the formats

After writing the data in the required format, an additional sync command must be run after every commit to the table so that the metadata for all three formats stays in sync; Iceberg readers should not see one version of the table while Delta readers see another.

Metadata synchronizations must be explicitly performed

Detailed Explanation

If I am using Databricks to write a Delta table and want to read it in Iceberg format, I can do that with Apache XTable™ by running the sync operation it provides on the Delta table. This creates the necessary Iceberg metadata, and a downstream compute engine can then read the Delta table as Iceberg, i.e., read the parquet data with the Iceberg metadata.
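To illustrate, the sync in this scenario is driven by a small dataset config file; the format below follows the XTable™ how-to (link #2 in the links section), and the bucket path and table name are placeholders:

sourceFormat: DELTA
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: gs://bucket_name/delta/users
    tableName: users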

Essentially Delta, Iceberg and Hudi are all parquet under the hood. The only difference amongst them is their metadata and the file layout optimizations they provide. For example, Delta has auto compaction, liquid clustering, and change data feed capabilities which the others lack.

What are the competing solutions to Apache XTable™

Only one for now, which is Delta UniForm by Databricks.

Delta UniForm works by automatically generating the metadata, whereas with XTable™ metadata synchronizations must be explicitly performed. UniForm has been open sourced.

Caveats to using Delta UniForm

  1. UniForm writes in the Delta table format only for now
  2. UniForm cannot write in Iceberg/Hudi format yet. UniForm’s base format is Delta; Iceberg and Hudi metadata are additionally generated

Advantages of using Delta UniForm

There is no need for explicit synchronizations like XTable™, since metadata for the other formats is automatically updated after every data commit.
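For reference, here is a minimal sketch of enabling UniForm on a new Delta table; the property name is per the Delta UniForm documentation, the table schema is illustrative, and a Delta Lake 3.x / Databricks runtime is assumed (newer runtimes may additionally require 'delta.enableIcebergCompatV2' = 'true'):

spark.sql("""
    CREATE TABLE users (name STRING, city STRING)
    USING DELTA
    TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg')
""")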

Links I referred to for the POC

  1. Apache XTable™ git repository — https://github.com/apache/incubator-xtable
  2. Apache XTable™ how-to — https://xtable.apache.org/docs/how-to
  3. Instructions to install Spark locally on a Mac — https://sparkbyexamples.com/pyspark/how-to-install-pyspark-on-mac/
  4. Maven repository link — https://mvnrepository.com/

Setting up XTable™

I used Spark on my local machine to write data to local disk and to GCS.

I used Spark on my local machine to read the data back from local disk and from GCS.
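As a quick sketch of the read-back side (the local and GCS paths here are the illustrative ones used in the write examples later in this post):

# read the table back: as Delta from local disk, as Iceberg from GCS
local_df = spark.read.format("delta").load("/tmp/delta/users")
gcs_df = spark.read.format("iceberg").load("gs://bucket_name/iceberg/users")
gcs_df.show()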

Steps to follow to set up XTable™ on your local machine

  1. Download the Apache XTable™ source from link #1
  2. Build the bundled jar using a Java 11 environment (and not any later version). Use jenv so that you can switch between Java versions; link #1 above has details about jenv (see the build sketch after this list)
  3. Download Spark on your local machine using link #3
  4. While opening the spark shell, make sure to refer to link #2 or the configs below to incorporate the right Spark configs for each of Hudi, Delta and Iceberg
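Here is a minimal sketch of the build steps from item 2, assuming git, jenv and Maven are already installed (the exact jenv version alias and the jar name under xtable-utilities/target vary by setup and release):

git clone https://github.com/apache/incubator-xtable.git
cd incubator-xtable
jenv local 11                   # pin Java 11 for this directory; your jenv alias may differ
mvn clean package -DskipTests   # the bundled jar lands under xtable-utilities/target/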

Hudi ~~ for local

pyspark \
--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.14.0 \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
--conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"

Delta ~~ for local

pyspark \
--packages io.delta:delta-core_2.12:2.1.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

Iceberg ~~ for local

pyspark \
--packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.4.1 \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog"

Delta and Iceberg ~~ for writing to GCS

Note that repeating --conf for the same key makes the last value win, so the Delta and Iceberg extensions must be comma-separated in a single spark.sql.extensions entry, and only one implementation can own spark_catalog (Delta here; the Iceberg writes in this post address tables by path, so they do not need the session catalog):

pyspark \
--packages io.delta:delta-core_2.12:2.1.0,org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.4.1 \
--jars https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
--conf "spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"

5. Run simple commands to create a dataset and save it as a Hudi, Iceberg or Delta table using link #2 (a sketch follows the GCS example below)

Code to save the data to GCS

df.write.format("iceberg").partitionBy("city").save("gs://bucket_name/iceberg/users")
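For completeness, a minimal sketch of how the df used above can be created (the schema and values are illustrative, not from link #2):

# inside the pyspark shell, `spark` already exists
records = [("alice", "nyc"), ("bob", "seattle"), ("carol", "nyc")]
df = spark.createDataFrame(records, ["name", "city"])

# the same DataFrame can also be saved locally, e.g. as Delta:
df.write.format("delta").partitionBy("city").save("/tmp/delta/users")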

6. Run the sync using the commands from link #2 (this is where I ran into issues, as mentioned below)
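For reference, the sync itself is a single java invocation of the bundled jar against a dataset config like the one shown earlier; the jar name/version below is an example and varies by release:

java -jar xtable-utilities/target/xtable-utilities-0.1.0-SNAPSHOT-bundled.jar \
  --datasetConfig my_config.yaml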

Caveats to setting up and configuring XTable™

  1. Upon running the sync command for Hudi specifically, it took a long time and the program would not exit. This did not happen with the Iceberg and Delta formats for some reason
  2. XTable™ runs only on Java 11 for now; this needs to be taken into consideration
  3. The Python version needs to be 3.9 and not any later version

With all the above checks, XTable™ works. It would be good to use a managed version of XTable™ so that one doesn't need to manually keep a check on the Maven package versions/Java versions/Python versions.