Statistical metadata is to SQL Engines as maps are to delivery drivers. Without maps, drivers get lost; without statistical metadata, SQL Engines do things in the wrong order and also get lost. This happens rather more often than we'd like; read on to learn why!
There are many types of metadata at work in a data lakehouse which I'll group into two categories: Structural and statistical. Structural metadata tells us what we've got: The data schema and its evolution, partitioning, versioning, snapshots, how to find files, that sort of thing. The specs of the catalogues deal really well with structural metadata, although the encodings of the manifests within are pretty verbose.
Statistical metadata describes characteristics of the data within the files: Statistics on the data itself used to plan queries in the optimal way along with indexing information needed to locate relevant data as fast as possible. Little is written about statistical metadata beyond file and partition pruning, and both the catalogue and Parquet format specifications are weak in this area. This article talks about some of the problems.
Sometimes referred to as "column statistics," this type of metadata tells us all about the data within a table. In the current Parquet and Iceberg specifications, statistical metadata isn't standardised particularly well and a bunch of it is optional. There are a few different types, available at various resolutions:
For each column within a file, planners and executors need to know at least the number of null values present, the size of the data, and the number of records. You might also want to know the average and maximum lengths of values within the column to estimate the memory usage of operators like aggregates and joins.
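To make that concrete, here's a rough sketch of pulling those per-column counts out of a Parquet file's footer with pyarrow. The file name and column position are placeholders, and the statistics object may simply be absent if the writer didn't populate it:

```python
# Sketch: inspect per-column counts in a Parquet footer with pyarrow.
# "people.parquet" and column index 0 are placeholders for illustration.
import pyarrow.parquet as pq

pf = pq.ParquetFile("people.parquet")
for rg in range(pf.metadata.num_row_groups):
    rg_meta = pf.metadata.row_group(rg)
    col = rg_meta.column(0)                  # first column chunk in this row group
    print("records:", rg_meta.num_rows)
    print("compressed bytes:", col.total_compressed_size)
    stats = col.statistics                   # may be None – stats are optional
    if stats is not None:
        print("nulls:", stats.null_count, "non-null values:", stats.num_values)
```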
It's also necessary to know the cardinality of a column to order joins correctly, as well as to size aggregates. The size of the hash table used by a join or aggregate is calculated from the number of distinct values. If this information is missing – and unfortunately, it's often unpopulated in Iceberg – the planner has no way to order joins correctly and just has to guess based on row counts, leading to wrongly-shaped query plans or an incorrect choice of join strategy (merge vs hash, distribute vs broadcast, etc).
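As a hand-wavy illustration of why this matters – using the textbook equi-join estimate, not any specific engine's planner, and with invented row counts and NDVs – here's how badly the estimate drifts when the distinct count has to be guessed:

```python
# Sketch of the textbook estimate for an equi-join on a single key:
#   |R join S| ~= |R| * |S| / max(NDV(R.key), NDV(S.key))
# All row counts and distinct counts below are made up for illustration.
def estimate_join_rows(rows_left, rows_right, ndv_left, ndv_right):
    return rows_left * rows_right / max(ndv_left, ndv_right)

rows_orders, rows_customers = 10_000_000, 1_000_000

# With real NDVs (customer_id is unique on the customers side):
print(estimate_join_rows(rows_orders, rows_customers, 1_000_000, 1_000_000))
# -> 10,000,000: build the hash table on the small side; memory budget is sane.

# If the NDV is missing and the planner has to guess (say it assumes 1,000
# distinct values), the estimate is off by three orders of magnitude:
print(estimate_join_rows(rows_orders, rows_customers, 1_000, 1_000))
# -> 10,000,000,000: a very different join order, strategy and memory budget.
```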
Sketches are like a low-resolution version of a photo: You can preview what's there, understand the colour scheme and scene, identify the objects in the photo, but all of the detail is missing. Ideally, the sketch acts as a substitute for "all of the data," allowing you to compute the approximate outcome of operations on the data while processing tens of KB of data instead of terabytes. Here are some examples of useful sketches:
There are many more probabilistic structures being created, such as SuRF (succinct range filter - works like a bloom filter, but for ranges of data) and prefix bloom filters (useful for URLs).
Most of these probabilistic data structures are used extensively in planning queries to determine the optimal shape of the operator tree by allowing fairly accurate estimations of actual row counts. Some are used by storage engines to skip processing data that's not of interest to a given SQL query. Some can also be exposed through SQL to allow "approximate" answers to queries in far less time than it would take to crunch all of the data.
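To make the idea tangible, here is a toy Bloom filter in pure Python: a small bitmap (1 KB here) that can answer "is this value definitely not in this file?" without touching the data. The sizes and hashing scheme are illustrative only, not what any particular engine ships:

```python
# Toy Bloom filter: set membership with false positives but no false negatives.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))

bf = BloomFilter()
for customer_id in ("c-001", "c-002", "c-003"):
    bf.add(customer_id)

print(bf.might_contain("c-002"))   # True – the value was added
print(bf.might_contain("c-999"))   # False means "this file can be skipped entirely"
```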
Range values tell us, within a given column and row range, what's the smallest and largest value present. Range values are primarily used for skipping data in table scans. The range data can either be very coarse – stored at just a file level, for example – or very granular, stored for each small disc block. Sometimes range data is augmented with a simple bitmap to store which values in between a range are actually present.
For example, if we know we're scanning a set of files for people where age=40, and a subset of the files only contain ages under 21 or over 65, they can be skipped by the storage engine. Bloom filters also allow us to skip files, as do some of the more advanced probabilistic structures mentioned above.
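A minimal sketch of that pruning decision, with invented file names and age ranges standing in for the catalogue's per-file range metadata:

```python
# Sketch: file-level pruning using per-file min/max values for the "age" column.
# File names and ranges are invented for illustration.
files = [
    {"path": "part-000.parquet", "age_min": 0,  "age_max": 20},
    {"path": "part-001.parquet", "age_min": 22, "age_max": 64},
    {"path": "part-002.parquet", "age_min": 66, "age_max": 99},
]

def files_to_scan(predicate_value, files):
    # Keep only files whose [min, max] range could contain the value.
    return [f["path"] for f in files
            if f["age_min"] <= predicate_value <= f["age_max"]]

print(files_to_scan(40, files))   # ['part-001.parquet'] – the other two are skipped
```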
Data partitioning, and advanced clustering such as Databricks Liquid Clustering, try to group data so that rows with similar values are kept close to one another, enabling more efficient data skipping.
Range data can be stored at varying resolution. The Iceberg specification, for example, stores it in the catalogue for whole files, and even that's optional; more granular data, at the row group or page level, is optional in the underlying Parquet files themselves. If a whole file contains age data for people aged 22-64, we have to read and process all of the column data, even though only 3% of it might be relevant. It would be far nicer if we could just seek straight to that 3%.
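When the finer-grained statistics do exist, the same skipping logic can be applied within a file. Here's a rough sketch with pyarrow (file and column names are placeholders) that reads only the row groups whose min/max range could contain age = 40, and falls back to reading everything when the stats are absent:

```python
# Sketch: row-group-level pruning inside a single Parquet file.
# "people.parquet" and the "age" column are placeholders for illustration.
import pyarrow.parquet as pq

pf = pq.ParquetFile("people.parquet")
age_idx = pf.schema_arrow.get_field_index("age")

wanted = []
for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(age_idx).statistics
    if stats is None or not stats.has_min_max:
        wanted.append(rg)                  # no stats: we have to read it anyway
    elif stats.min <= 40 <= stats.max:
        wanted.append(rg)                  # range could contain age = 40

table = pf.read_row_groups(wanted) if wanted else None
```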
The answer is: not much, unfortunately. Pretty much every single count or statistic mentioned here is optional in either Iceberg or Delta Lake or both!
When a query planner has no statistics to count on, it falls back to 'guessing' and will produce wrongly shaped plans, causing long execution times, excessive memory usage and spilling. Some queries may never finish. This is the biggest issue with lakehouse query engines today.
Likewise, when a storage engine has no statistics to count on, it will read and throw away far too much data, increasing cost, slowing execution, and reading data in the wrong order.
This happens often in the Lakehouse world because some of the simple counts might be missing, all of the sketches might be missing, and the range values may only exist at a whole-file level. Some statistical metadata is optional at the catalogue level, and some is optional at the Parquet-file level. Between Databricks and Iceberg, these different types of statistics are stored in completely different places. Iceberg has "Puffin files," which contain serialized binary blobs that may be missing or incompatible between engines. In Databricks, there's the Delta log and API-only access to the Unity catalogue.
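If you want to see how much of this your own tables actually carry, one way (assuming a pyiceberg-accessible catalogue; the catalogue and table names below are placeholders, and the exact columns returned can vary by version) is to dump Iceberg's files metadata and count how often the planner-critical fields are null:

```python
# Sketch: check which per-file statistics are actually populated for a table.
# Catalogue and table names are hypothetical; treat this as illustrative only.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")                       # hypothetical catalogue name
files = catalog.load_table("sales.orders").inspect.files()

# How many data files are missing the stats a planner depends on?
for col in ("value_counts", "null_value_counts", "lower_bounds", "upper_bounds"):
    missing = files.column(col).null_count
    print(f"{col}: missing for {missing} of {files.num_rows} files")
```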
It's a mess. At Floe, we are working on Floecat and (yet to be announced) FloeScan, which clean up the mess and guarantee to always provide the accurate statistical metadata that query planners and execution engines count on while remaining compatible with open standards.
In future blogs, I will explain how we're fixing this for both Delta Lake and Iceberg.