Floe Blog

Why bother with Data Modelling - Part 4: Composite Keys

Written by Database Doctor | Apr 29, 2026 10:46:19 AM

In the last three chapters, we established the basic terminology we will use when talking about data modelling. You learned about various types of keys.

When you combine multiple attributes to form a unique key, we call that a "Composite key". There are good reasons for avoiding them, and that is what we will talk about today.

I will also provide some examples of when composite keys add useful structure to a data model.

A Classic Example: Cities in Countries

In our fictional retailer data model, we previously encountered the problem of city names in countries. It turns out there is a "Paris" in both "Denmark" and "France". When we model the entity City, we therefore know that the name of the city alone isn't the natural key. A city only becomes unique when combined with the country that city is in. Thus, we can say that the natural key for City is the combination of City.name and City.country_id. Apologies to German readers who live in one of the two Frankfurts.

Every time we want to refer to a City from another Entity Type, we must store both City.name and City.country_id in every table that needs to reference the City entity.

Imagine that we decided to track visitors to our web site. With a bit of IP magic, we can find their origin city and we could have a table like this:

Table: WebTraffic

city_name  country_id  ip             browser
Aarhus     DK          87.49.216.34   Chrome 135
Paris      DK          80.62.117.203  Safari 18
Paris      FR          90.84.193.57   Firefox 137
Paris      FR          176.134.22.91  Chrome 134
Paris      FR          82.64.201.144  Edge 135
Paris      FR          86.212.73.18   Mobile Safari 18
Berlin     DE          91.64.118.76   Firefox 136
Berlin     DE          79.192.44.211  Chrome 135
Aarhus     DK          85.27.166.59   Edge 134
Paris      DK          93.165.14.122  Firefox 137
Berlin     DE          84.56.209.33   Safari 18
Paris      FR          109.17.88.240  Chrome 135

Assume we also declare that the combination WebTraffic.city_name, WebTraffic.country_id uniquely identifies a City. We have now told our LLM and our query tools that both columns must be used to determine what city we are dealing with.

Notice that the combination of city_name and country_id is not a compact way to represent City in WebTraffic - at least not compared to using a Surrogate Key.

But worse, we are now introducing two new problems: resource estimation errors and slower query execution.

To understand why, let us look at an example. Let us assume our WebTraffic Entity Type has lots of entities (=rows) - several billion of them for example.

How much traffic do I have per city?

Consider the question: How much traffic do we have per city?

I promised I would not require you to know SQL to understand this series. But for those of you that do know it, this is the query we care about:

SELECT city_name, country_id, COUNT(*)
FROM WebTraffic
GROUP BY city_name, country_id

Problem 1: Resource Estimation Errors

Most databases - and that includes nearly every database in the cloud - are very good at estimating the number of distinct values in a single column. If we asked the question "How many different countries are in WebTraffic?" the database could confidently tell us approximately how many there are - within an error margin of only a few percent from the real value - without even looking at the billions of rows that are actually in the table.

Several algorithms exist that allow for good estimation; the best known is HyperLogLog.

With this knowledge of the rough number of unique values (often called the "N distinct"), the database knows how much memory it needs to run our query - before it looks at the actual data. This in turn leads to higher concurrency and better utilisation of hardware.

But once we have a composite key, things get tricky. We can still use HyperLogLog and similar algorithms to store estimates of how many unique composite keys exist. But which combinations of columns should we now compute statistics for? In a 1000 column table, there are nearly half a million different 2 column combinations (499,500, to be exact) that we might be interested in. And that assumes that the database is even able to compute such multi column statistics. This is not a widespread ability for databases, particularly not those that run in the cloud. Incidentally, this is one of the things that FloeCat is trying to address.
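To make the idea concrete, here is a toy HyperLogLog in Python - a deliberately minimal sketch of the algorithm, not the implementation any particular engine ships. Each value is hashed; the first p bits of the hash pick a register, and each register remembers the longest run of leading zero bits it has observed. A harmonic mean over the registers then yields a distinct-count estimate from a few kilobytes of state:

```python
import hashlib
import math

class HyperLogLog:
    """Toy distinct-count estimator (for illustration only)."""

    def __init__(self, p: int = 12):
        self.p = p                    # 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(value: str) -> int:
        # Stable 64-bit hash derived from SHA-1 (any good hash works)
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def add(self, value: str) -> None:
        h = self._hash64(value)
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in `rest` (64 - p + 1 if all zero)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias correction for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:         # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog(p=12)
for i in range(100_000):
    hll.add(f"user-{i}")
est = hll.estimate()
print(f"estimated ~{est:,.0f} distinct values (true count: 100,000)")
```

With p = 12 (4096 registers, roughly 4 KB of state), the estimate typically lands within a couple of percent of the true count - which is why the database can afford to keep such a sketch per column, but not per column combination.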

Problem 2: Performance

Consider two competing models for WebTraffic.

Model          Key for city
Composite Key  WebTraffic.city_name, WebTraffic.country_id
Surrogate Key  WebTraffic.city_sk -> City.city_sk

If you want to try this yourself, I have included a small DuckDB dataset generator in the appendix. It creates both models from the same synthetic traffic data so you can compare the group-by on composite string keys with the group-by on a surrogate integer key.

Recall that we are interested in the traffic per city in WebTraffic and that this table is very large.

What does the composite key require the database to do?

  • Look at every row in WebTraffic
  • Compare the attributes WebTraffic.city_name, WebTraffic.country_id with the values we already have aggregated in some data structure
    • Typically this is a hash table
    • That is two string comparisons for every row in WebTraffic

Now, consider the same thing in a model using a surrogate key:

  • Look at every row in WebTraffic
  • Compare the integer WebTraffic.city_sk with the values we already have
  • After aggregating all the data - join to City via WebTraffic.city_sk -> City.city_sk to find city_name and country_id

As long as the number of distinct cities is significantly smaller than the number of rows in WebTraffic, the surrogate key model is significantly faster.
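A quick way to convince yourself that the two formulations answer the same question is to build both models on toy data. The sketch below uses Python's stdlib sqlite3 for portability (the article's measurements were done on DuckDB); the table and column names follow the article, the data is synthetic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE city (city_sk INTEGER PRIMARY KEY, city_name TEXT, country_id TEXT)")
cur.executemany("INSERT INTO city VALUES (?, ?, ?)",
                [(1, "Aarhus", "DK"), (2, "Paris", "DK"),
                 (3, "Paris", "FR"), (4, "Berlin", "DE")])

# Surrogate-key model: the fact table carries only a small integer per row.
cur.execute("CREATE TABLE webtraffic_sk (city_sk INTEGER)")
cur.executemany("INSERT INTO webtraffic_sk VALUES (?)",
                [((i % 4) + 1,) for i in range(10_000)])

# Composite-key model: the fact table repeats both string columns on every row.
cur.execute("""CREATE TABLE webtraffic_comp AS
               SELECT city_name, country_id
               FROM webtraffic_sk JOIN city USING (city_sk)""")

# Composite key: group directly on the two strings.
comp = cur.execute("""SELECT city_name, country_id, COUNT(*)
                      FROM webtraffic_comp
                      GROUP BY city_name, country_id
                      ORDER BY city_name, country_id""").fetchall()

# Surrogate key: aggregate on the integer first, join to city afterwards.
sk = cur.execute("""SELECT city_name, country_id, c
                    FROM city
                    JOIN (SELECT city_sk, COUNT(*) AS c
                          FROM webtraffic_sk GROUP BY city_sk) AS g
                          USING (city_sk)
                    ORDER BY city_name, country_id""").fetchall()
print(comp == sk)  # the two models agree on the answer
```

The results are identical; the difference is that the second query does its billions of comparisons on integers and only touches the strings for a handful of rows at the end - which only becomes visible at scale.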

I did a little experiment with DuckDB with 1B rows in WebTraffic and 16 different cities (script at the end of this blog - feel free to try it out on other engines).

Results:

Model          Key for city                                  Runtime
Composite Key  WebTraffic.city_name, WebTraffic.country_id  3.3s
Surrogate Key  WebTraffic.city_sk -> City.city_sk           1.0s

Even with the join (and joins are not expensive when done right) the surrogate key model is about 3x faster.

Example: Time Tracking

Composite keys often hide business rules that must be carefully applied when querying the data model. Let us see why, by digging into an example from Part 2. Recall that we can use tables to model the history of entities.

In our specific case, we cared about the Product changing name. This is the model we used:

Table: ProductHistory

sku   valid_from        product
2001  2026-05-01 09:00  Winter Jacket
2001  2026-05-03 12:00  Alpine Shell
3104  2026-05-20 14:40  Trail Runner
4110  2026-05-02 10:00  Espresso Grinder
5102  2026-05-06 14:20  Jeans
7305  2026-05-04 10:00  Coffee Cup
8402  2026-05-12 08:00  Aeropress
8601  2026-05-12 08:25  French Press
8702  2026-05-14 11:00  Pour Over Kettle
8805  2026-05-19 10:00  Burr Grinder

What is the key of this table? Consider this (true) statement about reality:

  • "The same sku cannot have two different names at the same time!"

Once we realise that, we can say that a composite key exists on this table, namely: ProductHistory.sku, ProductHistory.valid_from. This prevents the same sku from having two records that start at the same time. It also builds an implicit rule:

  • "A record for a sku is valid until the next valid_from of the same sku"

Recall that whenever we have rules instead of structure, queries accessing the data model increase in complexity.

Alternatively, we could represent the data like this:

Table: ProductHistory

sku   valid_from        valid_to          product
2001  2026-05-01 09:00  2026-05-03 12:00  Winter Jacket
2001  2026-05-03 12:00  9999-12-31 00:00  Alpine Shell

We can keep the same key - but now we must run the rule (to keep valid_to updated) every time we insert a record into the ProductHistory table. We "materialise" the rule into a separate column.
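Materialising the rule at load time can be sketched like this, again using stdlib sqlite3 for portability. The helper name insert_product_version is invented for this illustration; the point is that every insert closes the currently open record for that sku:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE ProductHistory (
    sku INTEGER, valid_from TEXT, valid_to TEXT, product TEXT,
    PRIMARY KEY (sku, valid_from))""")

OPEN_END = "9999-12-31 00:00"   # sentinel for "still valid"

def insert_product_version(cur, sku, valid_from, product):
    # Materialise the rule: the previous open record for this sku
    # now ends where the new one begins.
    cur.execute("UPDATE ProductHistory SET valid_to = ? WHERE sku = ? AND valid_to = ?",
                (valid_from, sku, OPEN_END))
    cur.execute("INSERT INTO ProductHistory VALUES (?, ?, ?, ?)",
                (sku, valid_from, OPEN_END, product))

insert_product_version(cur, 2001, "2026-05-01 09:00", "Winter Jacket")
insert_product_version(cur, 2001, "2026-05-03 12:00", "Alpine Shell")

rows = cur.execute("""SELECT sku, valid_from, valid_to, product
                      FROM ProductHistory ORDER BY valid_from""").fetchall()
for row in rows:
    print(row)
```

After the second insert, the "Winter Jacket" record has been closed with valid_to = 2026-05-03 12:00 - exactly the table shown above.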

But even with valid_to materialised and our composite key of ProductHistory.sku, ProductHistory.valid_from in place, things remain complicated and unwieldy. The rules of the model are still not turned into the structure of the model.

To see why, recall that we also have this table in our data model:

Table: ActionTransaction

event_ts          event_type  user_id  sku   value
2026-05-02 14:00  purchase    U001     2001  129
2026-05-01 09:10  purchase    U001     2001  129
2026-05-04 10:00  purchase    U001     2001  100
2026-05-05 10:30  purchase    U001     2001  100

How are we to answer this question:

  • "Show me sales of sku "2001" using the correct name of the Product at the time of the sale"

What must we now do?

  1. Travelling via the relationship ActionTransaction.sku -> ProductHistory.sku, find all rows with matching sku in the two tables
  2. Using ActionTransaction.event_ts, go over all ranges of ProductHistory.valid_from and ProductHistory.valid_to to find the matching entry of product

Notice how we still have an implicit rule in step 2. The range matching done in step 2 is also difficult for many databases to deal with - it increases the cost of the join significantly.
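The two steps correspond to a range join. A minimal sketch, using stdlib sqlite3 and the sample rows from the tables above (ISO-8601 timestamps stored as text compare correctly, so the range predicate works on strings):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ProductHistory (sku INTEGER, valid_from TEXT, valid_to TEXT, product TEXT)")
cur.executemany("INSERT INTO ProductHistory VALUES (?, ?, ?, ?)",
                [(2001, "2026-05-01 09:00", "2026-05-03 12:00", "Winter Jacket"),
                 (2001, "2026-05-03 12:00", "9999-12-31 00:00", "Alpine Shell")])
cur.execute("CREATE TABLE ActionTransaction (event_ts TEXT, sku INTEGER, value INTEGER)")
cur.executemany("INSERT INTO ActionTransaction VALUES (?, ?, ?)",
                [("2026-05-01 09:10", 2001, 129), ("2026-05-02 14:00", 2001, 129),
                 ("2026-05-04 10:00", 2001, 100), ("2026-05-05 10:30", 2001, 100)])

# Step 1 is the equality on sku; step 2 is the range predicate on event_ts.
sales = cur.execute("""SELECT a.event_ts, p.product
                       FROM ActionTransaction a
                       JOIN ProductHistory p
                         ON a.sku = p.sku
                        AND a.event_ts >= p.valid_from
                        AND a.event_ts <  p.valid_to
                       WHERE a.sku = 2001
                       ORDER BY a.event_ts""").fetchall()
for row in sales:
    print(row)
```

Every query that needs the historically correct name has to repeat this range predicate - and the non-equality condition is what makes the join expensive for the engine.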

Capturing History with Surrogate Keys

Is there an alternative solution where we can just join tables without needing to deal with rule 2? There is, but it requires redundancy in the model and additional structure. We could introduce a new surrogate key that represents the historical record uniquely:

Table: ProductHistory

sku   valid_from        valid_to          product        sku_sk
2001  2026-05-01 09:00  2026-05-03 12:00  Winter Jacket  1
2001  2026-05-03 12:00  9999-12-31 00:00  Alpine Shell   2

By applying our historical matching rules during data loading, we can do this for ActionTransaction:

event_ts          event_type  user_id  sku   sku_sk  value
2026-05-01 09:10  purchase    U001     2001  1       129
2026-05-02 14:00  purchase    U001     2001  1       129
2026-05-04 10:00  purchase    U001     2001  2       100
2026-05-05 10:30  purchase    U001     2001  2       100

Here, we again create structure that captures the rules of the model - instead of applying the rules while we query it.

Answering the question now becomes as simple as:

  • By travelling via the relationship ActionTransaction.sku_sk -> ProductHistory.sku_sk, find the matching ProductHistory.product

In other words, we moved query-time logic to load-time logic.
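A sketch of what that looks like, again with stdlib sqlite3 and the sample data above. The helper name load_transaction is invented for this illustration; it runs the range rule once per incoming row, so queries never have to:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE ProductHistory
               (sku INTEGER, valid_from TEXT, valid_to TEXT, product TEXT,
                sku_sk INTEGER PRIMARY KEY)""")
cur.executemany("INSERT INTO ProductHistory VALUES (?, ?, ?, ?, ?)",
                [(2001, "2026-05-01 09:00", "2026-05-03 12:00", "Winter Jacket", 1),
                 (2001, "2026-05-03 12:00", "9999-12-31 00:00", "Alpine Shell", 2)])
cur.execute("CREATE TABLE ActionTransaction (event_ts TEXT, sku INTEGER, sku_sk INTEGER, value INTEGER)")

def load_transaction(cur, event_ts, sku, value):
    # Load time: run the range rule ONCE to resolve the historical record.
    (sku_sk,) = cur.execute("""SELECT sku_sk FROM ProductHistory
                               WHERE sku = ? AND valid_from <= ? AND ? < valid_to""",
                            (sku, event_ts, event_ts)).fetchone()
    cur.execute("INSERT INTO ActionTransaction VALUES (?, ?, ?, ?)",
                (event_ts, sku, sku_sk, value))

for ts, v in [("2026-05-01 09:10", 129), ("2026-05-02 14:00", 129),
              ("2026-05-04 10:00", 100), ("2026-05-05 10:30", 100)]:
    load_transaction(cur, ts, 2001, v)

# Query time: a plain equi-join - no range predicate left to apply.
sales = cur.execute("""SELECT a.event_ts, p.product
                       FROM ActionTransaction a JOIN ProductHistory p USING (sku_sk)
                       ORDER BY a.event_ts""").fetchall()
for row in sales:
    print(row)
```

The result is identical to the range join, but the expensive matching now happens once per row loaded instead of once per row queried.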

Example: Many-to-Many relationships

We have previously seen that attributes in entities can form relationships. For example, we have the relationships User.city_id -> City and City.country_id -> Country to capture the hierarchical relationship between users and the cities and countries they live in.

These are all examples of one-to-many relationships: Every Country can have many City, but each City belongs to one Country.

Imagine that our fictional stores also want to track who can supply products. We introduce a new entity type: Supplier. A simple model would be:

Table: Supplier

supplier_sk  supplier
60000        Rocket Coffee
60001        Surf Cafe

In our reality, we observe:

  • "Rocket Coffee": Supplier sells the "Pour Over Kettle": Product with "8702": Product.sku.
  • Both "Rocket Coffee" and "Surf Cafe" sell the "Espresso Grinder": Product with "4110": Product.sku

How do we model that the same product can be sold by more than one supplier? This is where many-to-many relationships come in. These are typically modelled by creating an intermediate table. In our example, we could capture reality like this:

Table: SupplierSellProduct

supplier_sk  sku
60000        8702
60000        4110
60001        4110

And we would declare these relationships:

  • SupplierSellProduct.supplier_sk -> Supplier
  • SupplierSellProduct.sku -> Product

We have modelled a many-to-many relationship. And we can make the model even stronger with a composite key. We know that a supplier can only sell a product once; it makes no sense to have duplicates in SupplierSellProduct. We can therefore declare a composite key, namely: SupplierSellProduct.supplier_sk, SupplierSellProduct.sku. We could instead invent a surrogate key that represents each many-to-many entry in SupplierSellProduct, but nothing is really gained from that - and so we have found one of the cases where using a composite key makes a lot of sense.
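Declaring the composite key turns the "no duplicates" rule into structure the database enforces for us. A minimal sketch with stdlib sqlite3, using the rows from the table above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# The composite primary key IS the structure: (supplier_sk, sku) must be unique.
cur.execute("""CREATE TABLE SupplierSellProduct (
    supplier_sk INTEGER,
    sku INTEGER,
    PRIMARY KEY (supplier_sk, sku))""")
cur.executemany("INSERT INTO SupplierSellProduct VALUES (?, ?)",
                [(60000, 8702), (60000, 4110), (60001, 4110)])

# A duplicate entry violates the key and is rejected by the engine.
try:
    cur.execute("INSERT INTO SupplierSellProduct VALUES (?, ?)", (60000, 8702))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # the duplicate never makes it into the table
```

No load-time logic and no query-time rule needed - the key itself captures the business fact that a supplier sells a given product at most once.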

Summary

In today's chapter, we went over the controversial subject of composite keys.

Composite keys can be powerful - but be aware of the downsides they introduce:

  • They often confuse the database engine's statistics and resource estimation, particularly when used in joins and aggregates
  • Using a surrogate key often leads to a more compressed and higher-performing data model than the same model with a composite key
    • And surrogate keys allow certain "shortcuts" through the model assisting with joins and aggregates.
  • When used to model history, composite keys implicitly introduce rules that users of the model (including LLMs) must carefully understand and apply
    • These rules introduce a brittleness, both for users and the database engine trying to run them
    • Often, you are better off inventing a surrogate key that represents the historical record
  • When modelling many-to-many relationships, a composite key captures important information about the structure of the model

Appendix: Model Compare Script

DROP TABLE IF EXISTS city;
DROP TABLE IF EXISTS webtraffic_comp;
DROP TABLE IF EXISTS webtraffic_sk;

CREATE TABLE city AS
SELECT *
FROM (
    VALUES
        (1, 'Aarhus', 'DK'),
        (2, 'Paris', 'DK'),
        (3, 'Paris', 'FR'),
        (4, 'Berlin', 'DE'),
        (5, 'Frankfurt', 'DE'),
        (6, 'Frankfurt', 'PL'),
        (7, 'Springfield', 'US'),
        (8, 'Springfield', 'CA'),
        (9, 'San Jose', 'US'),
        (10, 'San Jose', 'CR'),
        (11, 'Melbourne', 'AU'),
        (12, 'Melbourne', 'US'),
        (13, 'Valencia', 'ES'),
        (14, 'Valencia', 'VE'),
        (15, 'London', 'GB'),
        (16, 'London', 'CA')
) AS t(city_sk, city_name, country_id);

CREATE TABLE webtraffic_sk AS
WITH base AS (
    SELECT
        i AS visit_id,
        ((i - 1) % (SELECT COUNT(*) FROM city)) + 1 AS city_sk,
        CASE i % 6
            WHEN 0 THEN 'Chrome 135'
            WHEN 1 THEN 'Safari 18'
            WHEN 2 THEN 'Firefox 137'
            WHEN 3 THEN 'Edge 135'
            WHEN 4 THEN 'Mobile Safari 18'
            ELSE 'Chrome 134'
        END AS browser,
        format(
            '{}.{}.{}.{}',
            11 + (hash(i * 17) % 209),
            hash(i * 31) % 256,
            hash(i * 47) % 256,
            1 + (hash(i * 53) % 254)
        ) AS ip
    FROM generate_series(1, 1_000_000_000) AS gs(i)
)
SELECT
    visit_id,
    city_sk,
    ip,
    browser
FROM base;

CREATE TABLE webtraffic_comp AS
SELECT
    wt.visit_id,
    c.city_name,
    c.country_id,
    wt.ip,
    wt.browser
FROM webtraffic_sk wt
JOIN city c USING (city_sk);

ANALYZE city;
ANALYZE webtraffic_sk;
ANALYZE webtraffic_comp;

SELECT 'row_count_city' AS metric, COUNT(*)::VARCHAR AS value FROM city
UNION ALL
SELECT 'row_count_webtraffic_sk', COUNT(*)::VARCHAR FROM webtraffic_sk
UNION ALL
SELECT 'row_count_webtraffic_comp', COUNT(*)::VARCHAR FROM webtraffic_comp
UNION ALL
SELECT 'distinct_city_keys_in_comp', COUNT(*)::VARCHAR
FROM (SELECT DISTINCT city_name, country_id FROM webtraffic_comp)
UNION ALL
SELECT 'distinct_city_keys_in_sk', COUNT(DISTINCT city_sk)::VARCHAR
FROM webtraffic_sk;

Queries:

-- Surrogate key model: aggregate on the integer, then join to city
SELECT city_name, country_id, c
FROM city
JOIN (SELECT city_sk, COUNT(*) AS c FROM webtraffic_sk GROUP BY ALL) AS g
  USING (city_sk);

-- Composite key model: aggregate directly on the two string columns
SELECT city_name, country_id, COUNT(*)
FROM webtraffic_comp
GROUP BY ALL;