Statistical metadata is to SQL Engines as maps are to delivery drivers. Without maps, drivers get lost; without statistical metadata, SQL Engines do things in the wrong order and also get lost. This happens rather more often than we'd like; read on to learn why!
Metadata
There are many types of metadata at work in a data lakehouse, which I'll group into two categories: structural and statistical. Structural metadata tells us what we've got: the data schema and its evolution, partitioning, versioning, snapshots, how to find files, that sort of thing. The catalogue specifications handle structural metadata really well, although the encodings of the manifests within them are pretty verbose.
Statistical metadata describes characteristics of the data within the files: statistics used to plan queries in the optimal way, along with indexing information needed to locate relevant data as fast as possible. Little is written about statistical metadata beyond file and partition pruning, and both the catalogue and Parquet format specifications are weak in this area. This article talks about some of the problems.
What? Statistical metadata?
Sometimes referred to as "column statistics," this type of metadata tells us all about the data within a table. In the current Parquet and Iceberg specifications, statistical metadata isn't standardised particularly well and a bunch of it is optional. There are a few different types, available at various resolutions:
Simple counts
Within a column of data within a file, planners and executors need to know at least the number of null values present, the size of the data, and the number of records present. You might want to know the average and maximum lengths of data within the column to estimate memory usage of operators like aggregates and joins.
It's also necessary to know the cardinality of a column to order joins correctly, as well as to size aggregates. The size of the hash table in the join and aggregate is calculated based on the number of distinct values. If this information is missing – and unfortunately, it's often unpopulated in Iceberg – the planner has no way to order joins correctly and just has to guess based on row counts, leading to wrongly-shaped query plans or incorrect selection of join strategy (merge vs hash, distribute vs broadcast, etc).
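To see why a missing distinct-value count matters, here's a minimal sketch of the textbook equi-join estimate: |L JOIN R| ≈ |L| × |R| / max(NDV(L.key), NDV(R.key)). The table sizes, NDVs, and the 1% fallback selectivity below are invented for illustration, not taken from any particular engine.

```python
def estimated_join_rows(rows_left, rows_right, ndv_left, ndv_right):
    """Textbook equi-join cardinality estimate:
    |L JOIN R| ~= |L| * |R| / max(NDV of join key on each side)."""
    return rows_left * rows_right / max(ndv_left, ndv_right)

# With statistics: 1M orders joined to 50k customers on customer_id.
with_stats = estimated_join_rows(1_000_000, 50_000, 40_000, 50_000)

# Without NDV, a planner has to fall back to something like a fixed
# selectivity guess, which can be off by orders of magnitude.
blind_guess = 1_000_000 * 50_000 * 0.01  # hypothetical 1% selectivity

print(f"estimate with NDV: {with_stats:,.0f} rows")   # 1,000,000
print(f"blind guess:       {blind_guess:,.0f} rows")  # 500,000,000
```

A 500x error in an intermediate row-count estimate is easily enough to flip a planner from a broadcast join to a distributed hash join, or to put the joins in the wrong order entirely.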
Sketches - probabilistic structures
Sketches are like a low-resolution version of a photo: you can preview what's there, understand the colour scheme and scene, and identify the objects, but all of the detail is missing. Ideally, the sketch acts as a substitute for "all of the data," allowing you to compute the approximate outcome of operations while processing tens of kilobytes instead of terabytes. Here are some examples of useful sketches:
- HyperLogLog (HLL): A sketch that tells us the approximate cardinality of the data i.e. how many unique values are present. Used to plan joins correctly and size aggregates.
- Theta: This is a cunning data structure that lets us perform set operations on the data - union, intersection etc - and get an approximate resulting count without needing all of the data. This sketch lets us estimate row counts coming out of multiple joins and set operations more accurately. It's often preferred to HLL due to ease of merging.
- KLL: A quantile sketch that lets us see the distribution of data without needing to look at all of the rows. This is a more advanced version of a traditional database histogram that can be incrementally merged. It can approximately answer questions like what's the median of the data, or how many values are within a certain range, to estimate predicate selectivity.
- T-Digest: A quantile sketch that's similar to KLL but weighted towards accuracy at the extremes (e.g. the 99th percentile) rather than across the whole range.
- Frequent Items aka Most Common Values: This lists the most common values and counts in a column of data and is necessary to correctly handle skew in data. Skewed data sets can easily cause wrong algorithm choices and mess up parallel execution. There are a couple of algorithms for doing this - a sample scan to make a list of MCVs, like PostgreSQL, or a Frequent Items Sketch that can be incrementally updated on an ongoing basis.
- Bloom filter: A Bloom filter allows us to definitively state that a data item isn't present in a set of data. It might have false positives but never has false negatives. This lets us skip irrelevant rows of data very efficiently.
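The Bloom filter is the easiest of these to demonstrate. Here's a minimal, illustrative implementation (production filters size the bit array and hash count from a target false-positive rate; the m and k values below are arbitrary):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: no false negatives, small chance of false positives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k  # m bits, k hash functions
        self.bits = 0          # a plain int doubles as the bit array

    def _positions(self, item):
        # Derive k independent bit positions from a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means "possibly present".
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for user_id in ["alice", "bob", "carol"]:
    bf.add(user_id)

print(bf.might_contain("bob"))      # True: added items are never missed
print(bf.might_contain("mallory"))  # False here means definitely absent
```

A storage engine keeps one of these per file or row group for a column; if `might_contain` returns False for a point predicate, that whole chunk of data can be skipped without reading it.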
There are many more probabilistic structures being created, such as SuRF (succinct range filter - works like a bloom filter, but for ranges of data) and prefix bloom filters (useful for URLs).
Most of these probabilistic data structures are used extensively in planning queries to determine the optimal shape of the operator tree by allowing fairly accurate estimations of actual row counts. Some are used by storage engines to skip processing data that's not of interest to a given SQL query. Some can also be exposed through SQL to allow "approximate" answers to queries in far less time than it would take to crunch all of the data.
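HLL's bit-level machinery is too fiddly to show inline, but its older cousin, the K-Minimum-Values (KMV) sketch, demonstrates the same core trick in a few lines: hash every value to a uniform number in [0, 1), keep only the k smallest hashes, and infer the cardinality from how tightly packed they are. This is a simplified illustration, not any engine's actual implementation:

```python
import hashlib

def kmv_estimate(values, k=256):
    """K-Minimum-Values cardinality estimate: NDV ~= (k - 1) / (k-th
    smallest hash), where hashes are uniform in [0, 1)."""
    hashes = set()
    for v in values:
        h = int.from_bytes(hashlib.sha256(str(v).encode()).digest()[:8], "big")
        hashes.add(h / 2**64)  # map to [0, 1)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)  # fewer than k distinct values: count is exact
    return int((k - 1) / smallest[-1])

# 100,000 rows but only 1,000 distinct values; the sketch state is just
# 256 floats, regardless of how many rows we feed it.
data = [i % 1_000 for i in range(100_000)]
print(kmv_estimate(data))  # roughly 1,000
```

The crucial property for lakehouses is that sketches like this are mergeable: each file can carry its own sketch, and a planner can combine them to estimate table- or partition-level cardinality without rescanning anything.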
Range (min/max) values
Range values tell us, within a given column and row range, what's the smallest and largest value present. Range values are primarily used for skipping data in table scans. The range data can either be very coarse – stored at just a file level, for example – or very granular, stored for each small disc block. Sometimes range data is augmented with a simple bitmap to store which values in between a range are actually present.
For example, if we know we're scanning a set of files for people where age=40, and a subset of the files contains an age range under 21 or over 65, those files can be skipped by the storage engine. Bloom filters also allow us to skip files, as do some of the more advanced probabilistic structures mentioned above.
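The age example above boils down to a "zone map" lookup. Here's a toy version; the file names and ranges are made up:

```python
# Per-file (min, max) for an "age" column -- the file-level range
# metadata a catalogue would hold.
zone_map = {
    "part-000.parquet": (18, 21),
    "part-001.parquet": (22, 64),
    "part-002.parquet": (65, 99),
}

def files_to_scan(value):
    """For an equality predicate, only open files whose range could
    contain the value; everything else is skipped without any I/O."""
    return [f for f, (lo, hi) in zone_map.items() if lo <= value <= hi]

print(files_to_scan(40))  # ['part-001.parquet']
```

The same check generalises to range predicates (skip the file if the predicate's range and the file's range don't overlap), and to finer granularities like row groups or pages when that metadata exists.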
Data partitioning, and advanced clustering such as Databricks Liquid Clustering, try to group data such that data with similar ranges are kept close to one another to enable more efficient data skipping.
Range data can be stored at varying resolution. The Iceberg specification, for example, stores it in the catalogue for whole files, and even that's optional; more granular data, at the row-group or page level, is optional in the underlying Parquet files themselves. If a whole file contains age data for people aged 22-64, we have to read and process all of the column data, even though only 3% of it might be relevant. It would be far nicer if we could just seek straight to that 3%.
What metadata can we rely on being present?
The answer is: not much, unfortunately. Pretty much every count or statistic mentioned here is optional in Iceberg, Delta Lake, or both!
When a query planner has no statistics to count on, it falls back to 'guessing' and will produce wrongly shaped plans, causing long execution times, excessive memory usage and spilling. Some queries may never finish. This is the biggest issue with lakehouse query engines today.
Likewise, when a storage engine has no statistics to count on, it will read and throw away far too much data, increasing cost, slowing execution, and reading data in the wrong order.
This happens often in the Lakehouse world because some of the simple counts might be missing, all of the sketches might be missing, and the range values may only exist at a whole-file level. Some statistical metadata is optional at the catalogue level, and some is optional at the Parquet-file level. Between Databricks and Iceberg, these different types of statistics are stored in completely different places. Iceberg has "Puffin files," which contain serialised binary blobs that can be missing or incompatible between engines. In Databricks, there's the Delta log and API-only access to the Unity catalogue.
It's a mess. At Floe, we are working on Floecat and (yet to be announced) FloeScan, which clean up the mess and guarantee to always provide the accurate statistical metadata that query planners and execution engines count on while remaining compatible with open standards.
In future blogs, I will explain how we're fixing this for both Delta Lake and Iceberg.