The volumes of data required for machine learning projects are continuously growing, and data scientists and data engineers need to access huge amounts of data in a timely manner.
In order to build data platforms that can handle massive quantities of data ingestion and processing, a number of technologies have emerged. This article focuses on Apache Iceberg and how it solves many of the challenges faced by its predecessors, but let’s start with the evolution of data systems.
First, there were databases…
A database stores real-time information and can run simple queries very quickly.
In general, a database stores data related to a certain part of your business; for example, in a microservices architecture, each service typically has its own database.
The ability to scale a database is limited, because you scale computing power, RAM, and disks at the same time.
If you are running a DB cluster on X machines and need to scale it, you must add machines with RAM, disks, and CPUs. Regardless of which resource is the bottleneck, you end up scaling all of them at once, adding unnecessary costs.
Data warehouse
Data warehouses are very similar to databases in terms of architecture and how they scale, but they were built to handle much bigger data volumes.
The key difference between regular databases and data warehouses is that data warehouses were built to be a centralized data repository that pulls together data from different sources and exposes it for reporting and analytics. They are not updated in real time and contain historical data with longer SLAs.
Some of the challenges presented by the huge quantities of data were similar to the challenges solved by traditional databases, but further scaling databases or data warehouses was inefficient and expensive. Something new had to be done.
Separating storage and compute
Separating the compute layer from the storage layer made it possible to isolate compute costs: customers pay only for the processing power their queries need, further lowering costs. Scaling storage and compute independently allows elastic tuning of resources, bringing flexibility to the emerging data lake architectures.
And then there was the Data Lake…
A Data Lake is a central repository of unstructured data from different sources that can be accessed to create insights and analytics, similar to a data warehouse.
The key difference is that Data Lake architectures can scale to hundreds of terabytes with acceptable read/write speeds.
To scale storage and compute power separately, solutions such as Hadoop and Hive, among others, were developed.
Hadoop was first developed at Yahoo in 2005. Apache Hadoop is an open-source software framework for the storage and large-scale processing of datasets on clusters of commodity hardware.
In 2009, Facebook understood that while Hadoop solved many of its requirements, it also had issues that needed improvement, for example:
- In order to query data in Hadoop, a user had to write a MapReduce job in Java.
- There was no metadata or schema layer describing the data.
So, they built Hive.
Apache Hive started as an SQL-on-Hadoop tool and then evolved into a separate framework when it started supporting object stores such as S3.
Hive is a data warehouse infrastructure that queries data stored on HDFS or S3 using an SQL-like language.
What is Apache Hive?
In Hive you organize data in a directory tree, defining a table format by mapping a table or part of a table (a partition) to a physical directory.
This mapping is stored in the Hive metastore. You can host the Hive metastore in almost any SQL database, such as MySQL, MariaDB, or PostgreSQL.
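As a minimal sketch of what this looks like in practice (the table, columns, and location are all hypothetical), a partitioned Hive table can be declared like this; the metastore then records the mapping from each partition to its physical directory:

CREATE TABLE events (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (`date` STRING)  -- `date` is a reserved word in Hive, hence the backticks
STORED AS PARQUET
LOCATION 's3://my-bucket/warehouse/events/';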
In the real world, the number of files and directories can be enormous, and scanning all of them during a search can have serious performance implications. This is why Hive supports partitions and buckets.
Partitioning is a way of dividing a table into related parts, stored in separate directories, based on the values of particular columns. The need for partitioning arises from the huge volumes of data stored in too many files. When data is stored in partitions (different directories and files) and a query is made, the query engine scans only the relevant directories, reading far fewer files.
For example, if we partition by the “date” column, we will have one directory per date:
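On object storage, the resulting layout looks something like this (the bucket and file names are illustrative):

s3://my-bucket/warehouse/events/date=2022-02-10/000000_0.parquet
s3://my-bucket/warehouse/events/date=2022-02-11/000000_0.parquet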

So when querying something of the form:
SELECT * FROM events WHERE `date` = '2022-02-11'
the query engine will only open the files located in the matching directory (date=2022-02-11/), skipping all the others.

Bucketing in Hive distributes the data of a specific column across a fixed number of buckets.
For example, bucketing on a customer-id column uses a hash function that maps each customer id to a number in the range defined by the number of buckets.
Bucketing allows more efficient queries by limiting the search to the bucket the data was assigned to, thus shortening the search.
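A minimal sketch, again with hypothetical names and an arbitrary bucket count:

CREATE TABLE customers (
  customer_id BIGINT,
  name        STRING
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS PARQUET;

-- Hive hashes customer_id into one of the 32 buckets, so a lookup like
-- SELECT * FROM customers WHERE customer_id = 42
-- can read only the bucket file the hash points to, rather than the whole table.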
Other advantages of Hive include being file-format agnostic, supporting file formats developed by others that are better suited for analytics, such as Parquet and ORC.
Hive has been the de facto standard for the past 12 or more years, since it is supported by almost every query engine.
But Hive has some disadvantages:
- Hive does not manage a list of files, so directory-listing operations against object stores such as S3 take too long, creating performance problems.
- Table state is stored in two places: partitions in the metastore and files in a file system.
- Storing the partitions transactionally in a database but the files in a file system creates issues when updating or deleting data: moving or adding files across multiple partitions cannot be handled in a transactional manner, creating consistency problems.
Big tech companies handling big data volumes were struggling, building too many workarounds for Hive’s problems.
Netflix engineers stepped up to solve the performance and usability challenges of using Apache Hive tables in large, demanding data lake environments by creating and then open-sourcing Apache Iceberg in 2018. They understood that most of Hive’s issues were due to tracking directories rather than files.
To solve the slow file listings and unify the table state, they defined a table as a canonical list of files.
What is Apache Iceberg? (**)
Iceberg is a high-performance format for huge analytics tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, and Hive to safely work with the same tables, at the same time. Main features are:
- Expressive SQL
Iceberg supports flexible SQL commands to merge new data, update existing rows, and perform targeted deletes (see the SQL sketch after this list). Iceberg can eagerly rewrite data files for read performance, or it can use delete deltas for faster updates.

- Schema evolution supports add, drop, update, or rename, and has no side-effects.
Schema evolution just works. Adding a column won’t bring back “zombie” data. Columns can be renamed and reordered. Best of all, schema changes never require rewriting your table.

- Hidden partitioning prevents user mistakes that cause silently incorrect results or extremely slow queries. Iceberg handles the tedious and error-prone task of producing partition values for rows in a table and skips unnecessary partitions and files automatically. No extra filters are needed for fast queries, and table layout can be updated as data or queries change.
- Partition layout evolution can update the layout of a table as data volume or query patterns change
- Time travel enables reproducible queries that use exactly the same table snapshot, or lets users easily examine changes. Version rollback allows users to quickly correct problems by resetting tables to a good state

- Scan planning is fast
- A distributed SQL engine isn’t needed to read a table or find files
- Advanced filtering – data files are pruned with partition and column-level stats, using table metadata
- Data compaction is supported out-of-the-box and you can choose from different rewrite strategies such as bin-packing or sorting to optimize file layout and size.
Compaction is an asynchronous background process that compacts small files into fewer larger files. Since it’s asynchronous and uses the snapshot mechanism, it has no negative impact on your users.
Compaction helps balance low latency on write and high throughput on read.
When writing, we want the data available as soon as possible, so we write as fast as we can, creating many small files. This creates the overhead of opening too many files at read time.
After compaction we get bigger files with more records per file, improving read performance.
**source: Apache Iceberg’s website
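As a hedged illustration of these features (assuming a Spark SQL session with the Iceberg runtime and a catalog named catalog; all table, column, and timestamp values are hypothetical):

-- Expressive SQL: merge incoming rows into an existing table
MERGE INTO catalog.db.events t
USING catalog.db.events_updates s
ON t.user_id = s.user_id
WHEN MATCHED THEN UPDATE SET t.action = s.action
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: query the table as it was at an earlier point in time
SELECT * FROM catalog.db.events TIMESTAMP AS OF '2022-02-11 00:00:00';

-- Compaction: bin-pack small files into larger ones (Spark stored procedure)
CALL catalog.system.rewrite_data_files(table => 'db.events', strategy => 'binpack');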
Data Lake architecture with Iceberg, Presto and Hive
Apache Iceberg Architecture
The table format tracks individual data files in a table instead of directories. This allows writers to create data files in place, adding files to the table only in an explicit commit.
The Iceberg Catalog
- Within the catalog, there is a reference or pointer for each table to that table’s current metadata file.
- When a query is made, the query engine accesses the Iceberg catalog, retrieves the location of the current metadata file for the table, then opens that file.
Metadata layer
Metadata files
- Maintain the table state
- All changes to table state create a new metadata file that replaces the old one via an atomic swap
- Track the table schema, partitioning config, custom properties, and snapshots of the table contents
Snapshots
- A complete list of files in a table
- Represent the state of a table at some time and are used to access the complete set of data files in the table.
- Each write creates and commits a new snapshot
- Snapshots create isolation without locking
- Writers produce a new snapshot then commit it. Snapshots can be rolled back.
- The data in a snapshot is the union of all files in its manifests
Manifest files
- Snapshots are split across many manifest files
- Contain
- File locations and formats
- Values used for filtering: partition values, lower and upper bounds per file
- Metrics for cost-based optimization
- File stats
Manifest list
- Is a list of manifest files.
- Has information about each manifest file that makes up that snapshot
Data layer
Comprises the actual data files (Parquet, ORC) stored in a physical storage layer (S3, HDFS).
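Assuming a Spark SQL session with Iceberg (a sketch, not the only way to do this; names and ids are illustrative), the metadata layer can be inspected directly through Iceberg’s metadata tables:

-- List the table's snapshots (the targets for time travel and rollback)
SELECT snapshot_id, committed_at, operation FROM catalog.db.events.snapshots;

-- List the manifest files behind the current snapshot
SELECT path, length, added_data_files_count FROM catalog.db.events.manifests;

-- List the data files, with partition values and stats used for pruning
SELECT file_path, partition, record_count FROM catalog.db.events.files;

-- Roll the table back to an earlier snapshot (the id is illustrative)
CALL catalog.system.rollback_to_snapshot('db.events', 4358109269873137077);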
Iceberg’s advantages
- Reads and writes are guaranteed to be isolated, and concurrent writes are supported.
- By using snapshots, Iceberg enables time-travel operations, allowing users to query different versions of the data and revert to a past version if needed.
- All writes are atomic.
- Faster planning and execution: instead of listing a table’s partitions during job planning, the engine performs a single read of the snapshot.
- Schema changes are free of side effects: fields have unique identifiers mapped to their names, so a field can be renamed without touching the underlying data.
- Partitions and partition granularity can be changed, as the sketch below shows.
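For illustration (assuming Spark SQL with the Iceberg SQL extensions enabled; the table and the ts column are hypothetical), schema and partition evolution are plain DDL statements:

-- Rename a column; Iceberg tracks fields by id, so no data is rewritten
ALTER TABLE catalog.db.events RENAME COLUMN action TO event_type;

-- Evolve the partition layout from daily to hourly; existing data stays as-is
ALTER TABLE catalog.db.events DROP PARTITION FIELD days(ts);
ALTER TABLE catalog.db.events ADD PARTITION FIELD hours(ts);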
Conclusions
We have reviewed Apache Iceberg; you can also research other players in this area, such as Apache Hudi and Delta Lake, which solve these issues in similar ways.
Fast queries at petabyte scale require modern data architectures. Iceberg allows organizations to move into the future by gracefully solving many of the issues we encounter at huge data scale.
Thanks for reading!