Introducing Floecat: A Catalog of Catalogs for the Modern Lakehouse

Written by Mark | Feb 9, 2026 2:59:59 PM

In this post, I introduce Floecat, an open-source catalog-of-catalogs designed to address a growing gap in modern lakehouse architectures: the lack of a coherent planner-friendly metadata layer across heterogeneous catalogs.

Most modern data teams do not operate a single lakehouse catalog. In practice, Iceberg REST, Hive Metastore, AWS Glue, Polaris, Unity Catalog, and a range of custom or filesystem-backed catalogs coexist across clouds, business units, and generations of infrastructure. This fragmentation is rarely accidental. It emerges naturally as organizations grow, adopt new tools, acquire other companies, or migrate incrementally rather than all at once.

Query engines have made meaningful progress in federating data across these systems; metadata, however, has not kept pace. Each catalog brings its own conventions for schemas, statistics, naming, access control, and lifecycle management. The result is a metadata landscape that is siloed, uneven, and increasingly difficult to reason about at scale.

Floecat is an open-source project built to address this reality. It is a catalog-of-catalogs that sits in front of existing Iceberg and Delta catalogs and presents them through a single, vendor-neutral interface. It does not replace upstream systems and it does not require data migration. Instead, it aggregates, normalizes, and enriches metadata so engines, tools, and increasingly AI agents can discover tables, schemas, snapshots, and statistics across heterogeneous environments without needing to understand the quirks of each catalog.

The Missing Layer in the Lakehouse Stack

Data federation has advanced rapidly. Engines like Trino, Spark, and DuckDB can query across object storage, warehouses, and lakehouse tables with increasing flexibility. Metadata federation, by contrast, remains largely unresolved.

Every catalog encodes its own worldview. Statistics may be present in one system and absent in another, schema representations may differ, and authorization models rarely align cleanly. Even basic concepts such as namespace structure or snapshot identity vary across implementations. Over time, this fragmentation becomes structural: governance logic is duplicated, and optimizations are constrained to individual systems. Automation and agent-driven workflows struggle because metadata is inconsistent or incomplete. Floecat starts from the premise that metadata itself needs a unifying layer, independent of transactional ownership.

What Floecat Is and Is Not

Floecat acts as an aggregation and normalization layer for metadata. It connects to existing catalogs, extracts metadata in their native form, and translates it into a canonical representation. That representation is enriched with snapshot history, statistics, and planning context, then served through consistent APIs. Floecat's goal is not to redefine how metadata is authored or committed, but to make it coherent and consumable across systems.

Equally important is what Floecat does not attempt to do: it does not own data, and it does not dictate storage formats. Neither does it impose a single commit or branching model, and upstream catalogs remain authoritative for their data and transactional semantics. Instead, Floecat observes and organizes what already exists.

Why Another Catalog?

The natural question is why introduce another catalog when projects like Apache Gravitino, Apache Polaris, Nessie, and others already exist. Most existing catalogs are designed to be the catalog. Even when they support multiple backends or table formats, they assume a single authoritative metadata plane with well-defined transactional semantics. They focus on managing commits, schema evolution, branching, and access control, and they expect engines to integrate with them directly as a primary system. That model works well when an organization can standardize on one catalog, but many organizations cannot do this.

Large environments often already have multiple catalogs in production for valid reasons. Different teams standardize on different platforms. Mergers and acquisitions introduce parallel systems. Regulatory or cloud constraints prevent consolidation. Even within a single stack, migrations from Hive or Glue to Iceberg REST or Unity Catalog tend to be gradual rather than abrupt.

Floecat starts from the assumption that consolidation is usually promised and rarely delivered. It assumes plurality is normal across the analytical ecosystem. Rather than asking organizations to replace catalogs, Floecat aggregates them. Rather than enforcing a single transactional abstraction, it normalizes metadata after the fact. In this sense, Floecat is not a competitor to catalogs like Nessie or Polaris. It can sit in front of them. A Nessie-backed Iceberg catalog or a Polaris deployment is simply another upstream source. Floecat fills a gap that those systems are not designed to fill: a neutral layer that federates metadata across catalogs without owning their transactional semantics.

Compared to Apache Gravitino, Floecat is optimized around query planning and SQL engine integration. Gravitino emphasizes unified governance and metadata management across sources, while Floecat prioritizes planner-grade metadata, snapshot pinning, scan bundles, and Arrow-first system scans. This makes Floecat a stronger fit for engines that need a high-performance catalog surface and engine-specific system catalogs rather than a general governance hub.

A Canonical Metadata Plane

Internally, Floecat organizes metadata as a canonical graph rooted at an account boundary. Catalogs contain namespaces, namespaces contain tables and views, and tables evolve through snapshots over time.
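To make the hierarchy concrete, here is a minimal sketch of such a graph as plain Python dataclasses. All class and field names are hypothetical and chosen for illustration; they are not taken from Floecat's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical model: account -> catalogs -> namespaces -> tables/views -> snapshots.

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    schema_json: str  # schema as of this snapshot

@dataclass
class Table:
    name: str
    snapshots: List[Snapshot] = field(default_factory=list)  # oldest first

    def current_snapshot(self) -> Optional[Snapshot]:
        # The latest snapshot, or None for a table with no history yet.
        return self.snapshots[-1] if self.snapshots else None

@dataclass
class Namespace:
    name: str
    tables: Dict[str, Table] = field(default_factory=dict)
    views: Dict[str, str] = field(default_factory=dict)  # view name -> SQL text

@dataclass
class Catalog:
    name: str
    namespaces: Dict[str, Namespace] = field(default_factory=dict)

@dataclass
class Account:
    account_id: str
    catalogs: Dict[str, Catalog] = field(default_factory=dict)

    def resolve(self, catalog: str, namespace: str, table: str) -> Table:
        # Name resolution walks the same path regardless of the upstream source.
        return self.catalogs[catalog].namespaces[namespace].tables[table]
```

In this sketch, nothing in the model records whether a table came from Glue or from Iceberg REST, which is the property the canonical graph is meant to provide.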

This structure is familiar by design. A table discovered through Iceberg REST and a table discovered through Glue are represented using the same conceptual model. That consistency allows planners, governance tools, and automation systems to reason across catalogs without special-case logic.

Metadata is persisted in an immutable, versioned form. Versions are deterministic and downstream caches can be invalidated precisely. This approach favors reproducibility and predictability over in-place mutation, which is critical for systems that need to plan and reason under concurrency.
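One common way to get deterministic versions is to derive them from the content itself. The sketch below assumes content hashing over a canonical JSON encoding; the post does not say how Floecat computes versions, so treat `metadata_version` and `VersionedStore` as hypothetical names for illustration only.

```python
import hashlib
import json

def metadata_version(metadata: dict) -> str:
    # Canonical encoding: sorted keys and fixed separators, so the same
    # content always hashes to the same version string.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class VersionedStore:
    """Append-only store: versions are never overwritten, only heads move."""

    def __init__(self):
        self._versions = {}  # version -> canonical JSON text (immutable)
        self._heads = {}     # table key -> current version

    def put(self, key: str, metadata: dict) -> str:
        version = metadata_version(metadata)
        canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
        self._versions.setdefault(version, canonical)
        self._heads[key] = version
        return version

    def head(self, key: str) -> str:
        return self._heads[key]

    def get(self, version: str) -> dict:
        return json.loads(self._versions[version])
```

Under this assumption, a cache keyed by version never serves stale data: when the head for a table moves, consumers know exactly which entry to drop.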

From Observation to Canonical State

Floecat deliberately decouples metadata ingestion from user-facing access. Connectors are responsible for interacting with upstream catalogs. Each connector understands how to discover namespaces, tables, schemas, snapshots, and statistics for a particular system. Its job is to translate upstream metadata into the canonical model, not merely to pass it through.
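As a rough illustration of translation, the sketch below maps two upstream column descriptions onto one canonical form. The connector classes, the `columns` method, and the canonical type names (`int64`, `float64`) are all assumptions made for this example; only the upstream type names (`bigint` in Hive, `long` in Iceberg) reflect the real systems.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol

@dataclass(frozen=True)
class CanonicalColumn:
    name: str
    type: str  # hypothetical canonical type name

class Connector(Protocol):
    def columns(self, raw: List[Dict[str, str]]) -> List[CanonicalColumn]: ...

class HiveStyleConnector:
    # Hive-style type names mapped to the illustrative canonical types.
    TYPE_MAP = {"bigint": "int64", "string": "string", "double": "float64"}

    def columns(self, raw):
        return [CanonicalColumn(c["name"], self.TYPE_MAP[c["type"]]) for c in raw]

class IcebergStyleConnector:
    # Iceberg-style type names mapped to the same canonical types.
    TYPE_MAP = {"long": "int64", "string": "string", "double": "float64"}

    def columns(self, raw):
        return [CanonicalColumn(f["name"], self.TYPE_MAP[f["type"]]) for f in raw]

hive_cols = HiveStyleConnector().columns(
    [{"name": "id", "type": "bigint"}, {"name": "total", "type": "double"}]
)
iceberg_cols = IcebergStyleConnector().columns(
    [{"name": "id", "type": "long"}, {"name": "total", "type": "double"}]
)
```

After translation, `hive_cols` and `iceberg_cols` compare equal, so downstream consumers need no per-source logic.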

Ingestion runs asynchronously. When reconciliation is triggered, Floecat schedules background work to scan upstream catalogs and persist normalized metadata into its canonical repository. Interactive APIs never block on upstream catalog calls. Planning latency is not tied to external system performance. The result is a durable, versioned view of metadata that reflects upstream systems without being dependent on them in the query critical path.
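The decoupling described above can be sketched with `asyncio`. This is a toy model, not Floecat's scheduler: `fetch_upstream`, `Repository`, and `reconcile` are hypothetical names, and the sleep stands in for a slow upstream catalog call.

```python
import asyncio

async def fetch_upstream(catalog_name: str) -> dict:
    # Stand-in for a slow call to an upstream catalog.
    await asyncio.sleep(0.01)
    return {"catalog": catalog_name, "tables": ["orders"]}

class Repository:
    """Canonical store that interactive reads are served from."""

    def __init__(self):
        self._state = {}

    def get(self, name: str):
        # Purely local lookup: never touches the upstream catalog.
        return self._state.get(name)

    def put(self, name: str, metadata: dict) -> None:
        self._state[name] = metadata

async def reconcile(repo: Repository, catalog_name: str) -> None:
    # Background work: scan upstream, then persist the normalized result.
    metadata = await fetch_upstream(catalog_name)
    repo.put(catalog_name, metadata)

async def main():
    repo = Repository()
    task = asyncio.create_task(reconcile(repo, "glue_prod"))
    before = repo.get("glue_prod")  # answers immediately, before the scan runs
    await task                      # reconciliation finishes in the background
    after = repo.get("glue_prod")
    return before, after
```

Running `asyncio.run(main())` shows the read path returning at once from local state while the upstream scan completes on its own schedule.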

Serving Metadata Where Engines Expect It

Floecat exposes metadata through multiple surfaces, all backed by the same core contracts. At its core is a strongly typed gRPC API that covers discovery, snapshot access, statistics retrieval, connector management, and query lifecycle coordination. Every internal component uses these same contracts, which keeps behavior consistent.

On top of this, Floecat provides an Iceberg REST gateway. Engines that already speak Iceberg REST can use Floecat as a catalog without modification. HTTP requests are translated into the same internal API calls, ensuring identical semantics.

For system metadata and diagnostics, Floecat favors Arrow-based streaming. This allows metadata to be consumed efficiently at scale, while still supporting more traditional access patterns when required.

Metadata in Service of Query Planning

Floecat is not just a discovery layer. It is explicitly designed to support query planning. Between the metadata repository and query planners sits a metadata graph cache. This cache provides fast name resolution, stable snapshot views, and deterministic invalidation. Planners can operate locally without repeatedly reaching into storage.

Query lifecycles are modeled explicitly. When a query begins, relevant snapshots are pinned to provide a consistent view. Scan metadata, including file lists, delete information, and statistics, is fetched as a coherent bundle. When the query completes, those pins are released.
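A pin-and-release lifecycle like this maps naturally onto a context manager. The following is a minimal sketch with hypothetical names (`SnapshotPins`, `query_lifecycle`); it models only the pinning, not the scan bundle, and is not Floecat's API.

```python
from collections import Counter
from contextlib import contextmanager

class SnapshotPins:
    """Reference-counted pins, so concurrent queries can share a snapshot."""

    def __init__(self):
        self._pins = Counter()

    def pin(self, table: str, snapshot_id: int) -> None:
        self._pins[(table, snapshot_id)] += 1

    def release(self, table: str, snapshot_id: int) -> None:
        key = (table, snapshot_id)
        self._pins[key] -= 1
        if self._pins[key] <= 0:
            del self._pins[key]

    def is_pinned(self, table: str, snapshot_id: int) -> bool:
        return self._pins[(table, snapshot_id)] > 0

@contextmanager
def query_lifecycle(pins: SnapshotPins, current_snapshots: dict):
    # Freeze the table -> snapshot mapping as of query start.
    pinned = dict(current_snapshots)
    for table, snapshot_id in pinned.items():
        pins.pin(table, snapshot_id)
    try:
        yield pinned
    finally:
        # Pins are released even if planning or execution fails.
        for table, snapshot_id in pinned.items():
            pins.release(table, snapshot_id)
```

Inside a `with query_lifecycle(pins, heads) as view:` block, `view` keeps pointing at the snapshots that were current when the query began, even if `heads` advances in the meantime.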

Because scan metadata and statistics are populated during reconciliation, planners never need to call back into upstream catalogs. Planning becomes predictable, fast, and isolated from external availability.

A Practical Motivation: Powering the Floe SQL Engine

There is also a very pragmatic reason Floecat exists. Floecat was built to serve as the primary catalog for the Floe SQL compute platform that will be available later this year.

Floe is designed for highly interactive, ad hoc SQL workloads over lakehouse data. Queries are unpredictable, cross-domain, and increasingly generated by agents rather than humans. Latency matters. Planning quality matters. Reaching into multiple upstream catalogs during planning is not viable at scale.

Existing catalogs are not optimized for this role. They are designed around correctness of metadata mutation, not around high fan-out, low-latency metadata consumption. Using them directly in the query critical path ties planning latency to external services, complicates caching, and amplifies failure modes.

By aggregating, normalizing, and deriving new metadata ahead of time, Floecat gives Floe a catalog that is purpose-built for planning. Snapshots are already enumerated, and statistics are already extracted. Metadata is local, canonical, and stable under concurrency. Query planning becomes a predictable operation rather than a distributed negotiation.

This is not just an optimization; it reflects a deeper shift happening in the industry. Query engines are evolving faster than catalogs. They are being asked to reason semantically, to plan across heterogeneous sources, and to support agent-driven access patterns. Floecat provides a catalog substrate that allows an engine like Floe to evolve without inheriting the full complexity of the catalog landscape underneath.

System Metadata and Statistics as First-Class Inputs

Floecat treats system metadata as real metadata. System catalogs are assembled from built-in definitions and pluggable providers, filtered by engine type and version so planners see only what is relevant. Functions, operators, and other planner-visible constructs can be exposed consistently across engines.
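Filtering by engine type and version could look something like the sketch below. The `SystemObject` record, the `min_version` field, and the example entries are invented for illustration and do not describe Floecat's actual provider interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class SystemObject:
    name: str
    engine: str                   # engine type this definition applies to
    min_version: Tuple[int, int]  # first engine version that supports it

def visible_objects(
    objects: List[SystemObject], engine: str, version: Tuple[int, int]
) -> List[str]:
    # Keep only definitions for this engine that its version already supports.
    return [
        o.name
        for o in objects
        if o.engine == engine and o.min_version <= version
    ]

# Hypothetical definitions, as if merged from built-ins and pluggable providers.
definitions = [
    SystemObject("approx_distinct", "engine_a", (1, 0)),
    SystemObject("array_union", "engine_a", (2, 0)),
    SystemObject("list_sort", "engine_b", (1, 0)),
]
```

With these definitions, a planner for `engine_a` at version `(1, 5)` would see only `approx_distinct`, while version `(2, 0)` would also see `array_union`.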

Statistics are treated as shared infrastructure. Table-level, column-level, and file-level statistics are collected during reconciliation and stored alongside snapshot metadata. They are exposed through the same APIs used by planners and scanners. This makes statistics reusable, consistent, and available without rescanning data.

Why This Matters

By federating metadata without forcing migration, Floecat enables better discovery, more consistent governance, and higher-quality planning across diverse environments. It provides a neutral metadata plane that engines, tools, and automation systems can rely on, even as the underlying catalog landscape continues to evolve.

For Floe, this makes fast, predictable, ad hoc querying possible. For the broader ecosystem, it defines a pattern for how metadata can be aggregated and consumed without centralizing ownership.

Looking Ahead

Floecat is still evolving, but the core idea is stable. If you are living in a multi-catalog reality, you should not have to pretend you can consolidate your way out of it.

You can explore Floecat through the CLI and APIs, run it as an Iceberg REST catalog for engines like Trino and DuckDB, or extend it with connectors and system catalog overlays that match how your organization actually works.

Floecat is open source and designed to grow alongside the ecosystems it connects. The code is here: Floecat GitHub repo. If you try it and hit sharp edges, please open an issue. If you know exactly what you want it to do next, file a feature request. If you want to build, PRs are welcome.

Lakehouse stacks are becoming more pluralistic, not less, and metadata federation is no longer optional. Floecat is an attempt to make that federation explicit, principled, and durable.
