Floe Blog

How Variant Will Replace JSONB in Lakehouses

Written by Jin | Apr 22, 2026 4:28:35 PM

In the last post, we explored how binary encoding can speed up retrieval for JSON records, and semi-structured data in general. That was a significant improvement from storing plain-text JSON in a database. Now, with Iceberg v3 introducing a native variant type and Parquet standardizing a variant representation, adoption is accelerating across the tooling ecosystem. In this post, we’ll explore how lakehouse users can benefit from variant.

What Is variant?

variant is a binary format for semi-structured data and can represent data beyond JSON. For example, consider a record like this:

{
  "event_type": "checkout_started",
  "ts": "2026-04-14T10:15:21Z",
  "session": {
    "device": "macos",
    "app_version": "26.0.1"
  }
}

After the data is stored as variant, we can query it with JSONPath-style expressions such as:

SELECT payload:$.session.device
FROM event_table;

In Parquet, the most common file format for lakehouses, a variant value is typically stored as a group with two binary fields, metadata and value, annotated with the VARIANT logical type:

optional group payload (VARIANT) {
  required binary metadata;
  required binary value;
}

Why Use variant?

There are four main reasons to use variant:

1. Richer Types

variant goes beyond JSON primitive types. It supports richer primitive types such as date, timestamp, timestamptz, decimal, and UUID, so query engines do not have to fall back to representing every field as a JSON string. That leads to more accurate data representation and fewer type-related compromises.
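For example (a sketch; path-extraction and cast syntax varies by engine), a nested timestamp can come back as a real timestamp instead of a string:

```sql
-- Hypothetical query against the earlier record; cast syntax is engine-specific.
SELECT
  CAST(payload:$.ts AS TIMESTAMP) AS event_ts,  -- a true timestamp, not text
  payload:$.session.device        AS device
FROM event_table;
```

With jsonb-style types, the same query would typically extract `ts` as a string and cast it on every read; with variant, the timestamp is already stored in a typed binary form.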

2. More Efficient Storage

Because variant is binary, it stores values more compactly than plain-text JSON: numbers, timestamps, and booleans are encoded in their binary representations rather than as text. Object keys can be deduplicated in the metadata field (seen in the payload schema above) rather than repeated in every value, resulting in lower storage overhead.
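Conceptually (a simplified sketch, not the exact byte layout defined by the Parquet spec), the record from earlier splits into a key dictionary and compact values that reference keys by index:

```
metadata: ["event_type", "ts", "session", "device", "app_version"]   -- each key stored once
value:    {0: "checkout_started", 1: <timestamp>, 2: {3: "macos", 4: "26.0.1"}}
```

If millions of records share the same keys, those key strings are not repeated per record the way they are in JSON text.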

3. Faster Retrieval

variant avoids much of the repeated structural parsing work required with plain JSON. Instead of reparsing JSON text on every read, the query engine can navigate the binary structure directly and fetch only the fields it needs. This is especially useful for analytical workloads, where queries often touch only a small part of a larger payload. With plain JSON, extracting $.session.device requires parsing the entire string and walking the object tree. With variant, the engine can use the binary metadata to jump more directly to the nested field. The encoding also supports direct array access: large arrays store offsets to their children, so extracting an element by index is O(1).
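For instance (hypothetical `items` field; path syntax varies by engine), an engine can project one nested field and one array element without materializing the rest of the payload:

```sql
-- Only the requested paths are decoded; sibling fields are skipped.
SELECT
  payload:$.session.device AS device,
  payload:$.items[0].sku   AS first_sku   -- O(1) element lookup via array offsets
FROM event_table;
```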

4. An Open Ecosystem Standard

This is the part that matters most for lakehouses. Traditional databases each define their own jsonb-style encoding, which ties data to a specific system. variant is different: it is an open standardized type in Parquet, with Iceberg defining the table semantics around it. That means the same files can be read and written across different engines.

In other words, variant is not just a better encoding. It is a shared one. That interoperability is why adoption is moving quickly across the ecosystem: once the format is standardized, libraries and query engines can all support the same representation.

Putting variant to Work

variant is most valuable when you need flexibility without giving up structure, performance, or usability. It offers many benefits while requiring little extra setup, and lets you keep semi-structured data in its original shape while still making it queryable.

Use Case 1: Observability Data

Observability payloads have a common envelope, but different services emit different attributes. OpenTelemetry standardizes many of these fields, yet the attribute set still varies from service to service, which makes it a natural fit for variant.

A realistic log or span event might look like this:

{
  "timestamp": "2026-04-14T10:15:21.231Z",
  "severity_text": "INFO",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "resource": {
    "service.name": "checkout",
    "service.namespace": "store",
    "service.instance.id": "checkout-7f9c6b87d9-rx2pl",
    "deployment.environment.name": "production",
    "host.name": "ip-10-42-1-17",
    "cloud.region": "eu-west-1"
  },
  "attributes": {
    "http.request.method": "POST",
    "url.path": "/api/checkout",
    "server.address": "api.store.example",
    "http.response.status_code": 200,
    "request.body.size": 1834,
    "cart.value": 89.5,
    "cart.currency": "GBP"
  }
}

This is hard to model well with a fixed relational schema. Some fields are always present, some appear only for HTTP traffic, and some are application-specific. variant provides a useful middle ground. Here's one way to keep a few important fields as top-level columns:

CREATE TABLE observability_events (
  event_date DATE,
  signal_type STRING,
  service_name STRING,
  payload VARIANT
);

With this design, we can keep a few stable relational columns, but the evolving payloads are stored in variant. Then retrieval uses path expressions instead of reparsing a JSON string.
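A query against this table might look like the following (a sketch; path syntax and the quoting of dotted keys like `http.request.method` vary by engine):

```sql
SELECT
  service_name,
  payload:$.attributes['http.response.status_code'] AS status_code
FROM observability_events
WHERE event_date = DATE '2026-04-14'
  AND signal_type = 'log'
  AND payload:$.attributes['http.request.method'] = 'POST';
```

The stable columns (`event_date`, `signal_type`, `service_name`) handle partitioning and common filters, while the variant column absorbs whatever shape each service emits.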

Use Case 2: Key/Value Attributes

Key-value attributes show up everywhere, for example in:

  • product catalog metadata
  • user profile custom fields
  • tags and labels
  • metrics dimensions
  • feature flags
  • security context

For example, a product record might look like this:

{
  "sku": "shoe-123",
  "category": "running_shoes",
  "price": 129.99,
  "attributes": {
    "color": "blue",
    "size": 8,
    "material": "mesh",
    "waterproof": false,
    "release_date": "2026-03-01"
  }
}

Or a sample hotel listing:

{
  "listing_id": "htl-9831",
  "city": "London",
  "price_gbp": 249.0,
  "attributes": {
    "stars": 5,
    "has_breakfast": true,
    "pet_friendly": false,
    "amenities": ["wifi", "gym", "air_conditioning"]
  }
}

Some attributes might be missing, while others are added over time. variant lets us store them flexibly, without forcing every possible attribute into the table schema. The alternative is a wide table full of mostly-null columns.
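As a sketch (assuming a hypothetical products table whose records are stored in a product variant column), optional attributes can be filtered directly, and a missing attribute simply evaluates to null:

```sql
SELECT product:$.sku
FROM products
WHERE product:$.attributes.waterproof = false
  AND product:$.attributes.size >= 8;
```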

Typical Workflow

In practice, a common variant workflow in a lakehouse looks like this:

  • ingest data directly into a variant column, instead of storing raw JSON text
  • keep a small set of stable top-level columns
  • use JSON-path-style accessors to extract optional fields

If we flatten everything, the schema becomes sparse and wide. If we keep it as JSON text, every query becomes a parsing exercise. variant offers a middle path: it preserves flexibility while still enabling efficient access.
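Put together, the workflow might look like this (hypothetical table names; PARSE_JSON-style conversion functions differ by engine):

```sql
-- 1. A few stable columns plus a variant column for everything else
CREATE TABLE events (
  event_date DATE,
  event_type STRING,
  payload    VARIANT
);

-- 2. Ingest straight into the variant column
INSERT INTO events
SELECT event_date, event_type, PARSE_JSON(raw_payload)
FROM staging_events;

-- 3. Extract optional fields with path accessors at query time
SELECT payload:$.session.device
FROM events
WHERE event_type = 'checkout_started';
```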

Optimization Opportunity

Parquet’s variant shredding spec lets writers extract selected fields from a variant payload into strongly typed Parquet columns. Commonly queried fields can be read much more efficiently. Instead of scanning the full semi-structured payload, readers can project only the typed columns they need.
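As a simplified sketch (the shredding spec defines the exact layout, including how unshredded leftovers remain in value), a shredded version of the earlier payload might look like:

```
optional group payload (VARIANT) {
  required binary metadata;
  optional binary value;                 -- leftover fields that were not shredded
  optional group typed_value {
    optional group session {
      optional binary device (STRING);   -- hot field promoted to a typed column
    }
  }
}
```

A query touching only $.session.device can then read the typed column directly, with ordinary Parquet column pruning and encodings applying to it.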

Iceberg builds on that at the planning layer. It already uses column-level statistics such as value counts, null counts, lower bounds, and upper bounds to prune files before execution. In Iceberg v3, that idea extends to fields inside a variant column: statistics can be recorded for normalized JSON paths within the payload. That means predicates on nested fields can eliminate more files before the engine reads any data.

This is where variant goes beyond a convenience feature for semi-structured data and becomes a real performance tool in lakehouse systems.

Conclusion

Semi-structured data is a natural part of the data lakehouse. Logs, traces, events, attributes, and evolving business objects are not going away, so the lakehouse needs a format that can handle them natively.

variant is a major step in that direction. It provides the flexibility of JSON while adding richer types, efficient binary encoding, direct access to nested fields, and open interoperability through Parquet and Iceberg. With Iceberg v3 defining the table semantics and Parquet defining the file encoding, variant is emerging as a true open standard rather than just another proprietary jsonb dialect.

That is why adoption is accelerating across catalogs, libraries, and query engines. Once the representation is standardized, tools can understand the same column format instead of inventing incompatible encodings!

In a future post, we will look at how query engines can optimize access using shredded fields, nested-field stats, and smarter planning.