In the previous instalment we introduced the terminology and described why data models help both humans and machines make sense of data.
Today, we will apply these new skills to the test problem introduced in Part 1.
Building Rules or Building Models
Let us summarise where we are:
- If we have a series of observations about the world, we can interpret meaning from those observations
- We create meaning by classifying observations into:
  - Entity Types
  - Entities
  - Attributes / relationships
- This classification removes ambiguity in data, helping humans and machines reason.
How do we actually encode this interpretation in a consumable form?
Here, we reach a choice that has divided software professionals for generations. Two distinct philosophies exist:
- Use rules to interpret data into meaning (sometimes called "business logic")
- Use the structure of the data to represent its meaning
The Rules-Based Approach
A common approach to capturing meaning is to encode it as a set of rules and heuristics. Those rules can be written into documents (this is what we do when creating legal systems) or they could be written in code.
For example:
When encountering relationship User.city -> City - translate it to English before displaying it
- This helps us deal with cases like "Prague" = "Praha".
- But it might not help us deal with "Aarhus" being the current spelling of the older official "Århus" (or its even older spelling of "Aros").
When encountering attribute Product.sku - always pick the latest attribute Product.name value as the product name for that Product.sku.
- This helps us deal with "Winter Jacket" = "Alpine Shell"
- But it isn't much good if we want to print an invoice for "Lisa": User, who has no idea what "Alpine Shell": Product.name is - because she thinks she bought a "Winter Jacket": Product.name.
- It also means we must enumerate all observations before we can know the correct Product.name for each Product.sku
- The time needed is proportional to the total number of observations, not the total number of Product entities
- Fundamentally unscalable.
Define revenue as: pick all observations with "Purchase": Action and subtract all observations with "Refund": Action
- This helps us report on revenue.
- But it breaks the moment someone introduces a "Discount": Action
If an Action has value "view", the Action.country is the country the user browses from
- This clarifies how to read "view" actions.
- But it leaves it unclear what happens if the city column is present.
As we expand our rule set, we approximate the correct interpretations of the observations more closely. The big problem with the rules-based approach is that everyone who accesses the observations must agree on what rules to apply to correctly interpret the meaning of the data. In other words: the meaning of the data lives in the system accessing the data, not in the data structure itself.
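To make the cost concrete, here is a minimal Python sketch of the rules-based approach. All function names and rule encodings are invented for illustration, and refund values are assumed to be recorded as positive amounts in the raw feed:

```python
# Sketch of the rules-based approach: the meaning lives in this code,
# not in the data. Every consumer must carry (and agree on) these rules.

CITY_TRANSLATIONS = {"Praha": "Prague"}  # rule: translate city names

def interpret_city(raw_city):
    # Rule: translate known local spellings to English before display.
    return CITY_TRANSLATIONS.get(raw_city, raw_city)

def latest_product_names(observations):
    # Rule: the latest observed Product.name wins for each Product.sku.
    # Note: requires scanning ALL observations - cost grows with history.
    names = {}
    for obs in sorted(observations, key=lambda o: o["ts"]):
        if "sku" in obs and "product" in obs:
            names[obs["sku"]] = obs["product"]
    return names

def revenue(observations):
    # Rule: revenue = purchases minus refunds (refund values assumed
    # positive). Breaks silently if a new action like "discount" appears.
    total = 0.0
    for obs in observations:
        if obs.get("action") == "purchase":
            total += obs["value"]
        elif obs.get("action") == "refund":
            total -= obs["value"]
    return total
```

Notice that none of this logic is visible to anyone who only sees the stored data: a second consumer must re-implement exactly the same rules to reach the same answers.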
Example: Rules-Based GA4
Rules-based data interpretation is perhaps most prominently used in Google Analytics (GA4). Most data in GA4 is stored in a single family of event tables (one table per day plus intra-day tables) with a moderately complex structure. To correctly interpret GA4 data, you must rely on the application interpreting the structure of this event table. For example, we may not know exactly who a visiting user is (unless they are authenticated). Instead, we apply heuristics to make an educated guess. You end up with the complicated dashboards you see in Google Analytics.
This way of organising data works for Google because tracking website events does not have to be precise! The structure of the data we receive cannot be fully known in advance, and the answers we get from that data have a level of ambiguity that is acceptable: close enough for users to make reasonable decisions.
The Structure Approach - Meaning Encoded in the Data Model
Data modellers take a different approach to extracting meaning from observations.
Instead of applying a set of rules on every data access, they store data in such a way that the interpretation is already clear from the layout of the data.
Let us make this a concrete exercise with the sample dataset.
Observations about Entity Type: User
We have identified that there exists an entity type User. Instead of interpreting the data every time we access it, we declare what is true about User:
- A User has an attribute: User.name
- Observing a User.city -> City value in the data means that the user lives in that City
- Each City belongs to one Country, so instead of storing User.country -> Country we can infer it via City.country -> Country and just keep track of User.city -> City.
- There exists a correct list of Country entities, and all values in City.country -> Country must be matched to this list.
- Two City entities are different if they have different City.country -> Country relationships.
- Unless we can get information elsewhere, two entities of type User are the same if they live in the same City and have the same User.name
- We want to keep the option to refine this model as our knowledge grows
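The identity rule above ("same City, same User.name") can be sketched as a simple keying function. This is purely illustrative, not a real deduplication pipeline:

```python
# Sketch of the stated identity rule for User: until we learn more,
# (User.name, User.city) identifies an entity.

def user_key(observation):
    # Two User observations with the same key are treated as one entity.
    return (observation["name"], observation["city"])

def resolve_users(observations):
    # Collapse raw observations into distinct User entities.
    entities = {}
    for obs in observations:
        entities.setdefault(user_key(obs), obs)
    return list(entities.values())
```

Under this rule the two Hans observations (Berlin and Aarhus) stay distinct entities, while repeated sightings of Lisa in the same city collapse into one.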
Observations about Entity Type: Product
About Product we can say:
- A product entity is uniquely identified by its Product.sku (data modellers call this a key)
- Product entities have a Product.name and this value can change over time
We can also refine our understanding of the entity type Action and say:
- Some actions such as "view", "add_to_cart", "wishlist_add" and "shipping_quote" are different from purchases and refunds
- The interpretation of these browse-like actions is different from that of the other actions, and Action.country refers to the country the browser is visiting from.
- Browse-like actions can still be associated with a known User, but the browsing location is not the same thing as User.city
- The Action.country for purchases and refunds refers to the user's home address
- This means that we are really dealing with at least two different kinds of actions: Action is not a single entity type
This will soon lead us to split our observations into a different data representation.
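The split can be sketched as a simple classification over the action value (the sets below just mirror the lists above). Note how an unanticipated value such as "discount" now fails loudly instead of being silently misread:

```python
# Sketch: the two kinds of Action, classified by their action value.
BROWSE_ACTIONS = {"view", "add_to_cart", "wishlist_add", "shipping_quote"}
TRANSACTION_ACTIONS = {"purchase", "refund"}

def split_actions(observations):
    views, transactions = [], []
    for obs in observations:
        action = obs["action"]
        if action in BROWSE_ACTIONS:
            views.append(obs)          # country = where the browser is
        elif action in TRANSACTION_ACTIONS:
            transactions.append(obs)   # country = the user's home address
        else:
            raise ValueError(f"unknown Action type: {action!r}")
    return views, transactions
```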
Data Model for: Country
We know that a master list of countries exists (at least in principle). That means we can safely state:
When we observe something that looks like a country, it must map to our master list of countries.
For our test data, these are the relevant countries:
Table: Country
| country_id | country |
|---|---|
| CA | Canada |
| DK | Denmark |
| DE | Germany |
| CZ | Czech Republic |
| FR | France |
| GB | United Kingdom |
| US | United States |
Notes:
- Instead of letting Country.country uniquely identify the entity, it is often preferable to use a more compact representation.
- ISO-3166 is an obvious choice here
- For readers from island nations, the distinction between England, Great Britain, and the United Kingdom may feel unusually slippery and this model of countries may seem insufficient.
- For the purpose of this blog, we map England into the GB ISO code.
Data Model for: City
Like Country, we know that City entities are actual places on Earth. Not only that, we know that cities belong to countries, and we have identified the relationship: City.country_id -> Country.
Inspired by airport codes, we can invent a simple set of identifiers for each city entity that allows us to tell "Paris" in "Denmark" apart from "Paris" in "France", and "London" in "Canada" apart from "London" in "United Kingdom".
Table: City
| city_id | city | country_id |
|---|---|---|
| AAR | Aarhus | DK |
| BER | Berlin | DE |
| LON_CA | London | CA |
| LON_GB | London | GB |
| PAR_DK | Paris | DK |
| PAR_FR | Paris | FR |
| PRG | Prague | CZ |
| SJC | San Jose | US |
We can now ask the same question we asked before. Using an anonymous context with ChatGPT:
Me: "Given the data above, how confident are you that cities belong to countries?"
And the answer is now:
ChatGPT: "From a relational integrity perspective: Confidence ≈ 100% that cities belong to countries (by design). Why semantically questionable? The data itself contradicts real-world expectations: Paris → DK (PAR_DK). This is either a data error, or non-canonical naming (there is no well-known Paris in Denmark)."
First, notice that ChatGPT now believes Paris in Denmark to be a data error, not evidence that the structure of the data is wrong.
You might want to have a conversation with ChatGPT about Paris in Denmark and see how it ties itself in knots.
(There is also a Rome in Denmark, but that's a different story.)
Data Model for: User
Since we now know that City determines country via City.country_id -> Country, we can safely get rid of country from User and restructure our data to make sure it is no longer possible to store users that have invalid City/Country combinations.
Notice that we are not doing this to save space; we are doing it to make the semantics of our data clear to consumers of that data.
Our user table now becomes:
Table: User
| user_id | user | city_id |
|---|---|---|
| U001 | Lisa | PAR_DK |
| U002 | Emma | LON_GB |
| U003 | Hans | BER |
| U004 | Hans | AAR |
| U005 | Noah | LON_CA |
| U006 | Sofia | PAR_DK |
| U007 | Morgane | PAR_FR |
| U008 | Michael | SJC |
| U009 | Vlad | PRG |
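These constraints can also be written down as a schema, so the structure itself rejects invalid combinations. A minimal sketch using SQLite (with a hypothetical subset of the tables above):

```python
import sqlite3

# Sketch: encode the model's constraints in the schema itself, so an
# invalid City/Country combination simply cannot be stored.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in
con.executescript("""
CREATE TABLE Country (country_id TEXT PRIMARY KEY, country TEXT NOT NULL);
CREATE TABLE City (
    city_id    TEXT PRIMARY KEY,
    city       TEXT NOT NULL,
    country_id TEXT NOT NULL REFERENCES Country(country_id)
);
CREATE TABLE User (
    user_id TEXT PRIMARY KEY,
    user    TEXT NOT NULL,
    city_id TEXT NOT NULL REFERENCES City(city_id)
);
INSERT INTO Country VALUES ('DK', 'Denmark'), ('FR', 'France');
INSERT INTO City VALUES ('PAR_DK', 'Paris', 'DK'), ('PAR_FR', 'Paris', 'FR');
INSERT INTO User VALUES ('U001', 'Lisa', 'PAR_DK');
""")

# A user in a city we never declared is rejected by the structure:
try:
    con.execute("INSERT INTO User VALUES ('U999', 'Eve', 'ATLANTIS')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

No rule book needed: any consumer of this schema can see, from the foreign keys alone, which combinations of data are valid.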
Data Model for: Product
We have identified that entities of type Product can be uniquely identified by their Product.sku - at least if we accept the interpretation that "Alpine Shell" is a rename of "Winter Jacket".
With this interpretation, we can now structure the data in the following way:
Table: Product
| sku | product | size |
|---|---|---|
| 2001 | Alpine Shell | M |
| 3104 | Trail Runner | 42 |
| 4110 | Espresso Grinder | |
| 5102 | Jeans | L |
| 7305 | Coffee Cup | |
| 8402 | Aeropress | |
| 8601 | French Press | |
| 8702 | Pour Over Kettle | |
| 8805 | Burr Grinder | |
And if we care about previous product names, we can create an additional table that tracks historical names:
Table: ProductHistory
| sku | valid_from | product |
|---|---|---|
| 2001 | 2026-05-01 09:00 | Winter Jacket |
| 2001 | 2026-05-03 12:00 | Alpine Shell |
| 3104 | 2026-05-20 14:40 | Trail Runner |
| 4110 | 2026-05-02 10:00 | Espresso Grinder |
| 5102 | 2026-05-06 14:20 | Jeans |
| 7305 | 2026-05-04 10:00 | Coffee Cup |
| 8402 | 2026-05-12 08:00 | Aeropress |
| 8601 | 2026-05-12 08:25 | French Press |
| 8702 | 2026-05-14 11:00 | Pour Over Kettle |
| 8805 | 2026-05-19 10:00 | Burr Grinder |
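Resolving the current name from ProductHistory is now a bounded lookup over name changes, not a scan of all observations. A minimal Python sketch, using an illustrative subset of the rows above:

```python
# Sketch: derive the current Product.name from ProductHistory by taking,
# per sku, the row with the latest valid_from.
history = [
    ("2001", "2026-05-01 09:00", "Winter Jacket"),
    ("2001", "2026-05-03 12:00", "Alpine Shell"),
    ("3104", "2026-05-20 14:40", "Trail Runner"),
]

def current_names(rows):
    latest = {}  # sku -> (valid_from, product)
    for sku, valid_from, product in rows:
        # ISO-like timestamps compare correctly as strings.
        if sku not in latest or valid_from > latest[sku][0]:
            latest[sku] = (valid_from, product)
    return {sku: product for sku, (_, product) in latest.items()}
```

Unlike the rules-based version, the scan is scoped to name changes only, and the Product table already stores the resolved name for consumers who do not care about history.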
Recall that we also considered a special entity type called Apparel which is a subset of all products and that we assumed that Apparel has sizes and other products do not.
If that is the interpretation we want to communicate, we can split the product structure further:
Table: Product
| sku | product |
|---|---|
| 2001 | Alpine Shell |
| 3104 | Trail Runner |
| 4110 | Espresso Grinder |
| 5102 | Jeans |
| 7305 | Coffee Cup |
| 8402 | Aeropress |
| 8601 | French Press |
| 8702 | Pour Over Kettle |
| 8805 | Burr Grinder |
... and track Apparel.size (previously associated directly with Product.size) separately:
Table: Apparel
| sku | product | size |
|---|---|---|
| 2001 | Alpine Shell | M |
| 3104 | Trail Runner | 42 |
| 5102 | Jeans | L |
Data Model for: Action
We identified that we are actually dealing with two different types of actions in our observations - each with their own attributes and relationships.
This knowledge can be communicated by splitting the browse-like Actions into their own entity type ActionView.
Table: ActionView
| viewed_at | action | user_id | browsing_city_id | sku |
|---|---|---|---|---|
| 2026-05-01 09:00 | view | | LON_GB | 2001 |
| 2026-05-06 14:20 | add_to_cart | U002 | LON_GB | 5102 |
| 2026-05-12 08:00 | view | U005 | LON_GB | 8402 |
| 2026-05-12 08:05 | add_to_cart | U005 | LON_GB | 8402 |
| 2026-05-14 11:00 | wishlist_add | U006 | PAR_FR | 8702 |
| 2026-05-14 11:03 | shipping_quote | U006 | PAR_FR | 8702 |
The rest of the actions, which we could call ActionTransaction, can now be structured like this:
Table: ActionTransaction
| event_ts | event_type | user_id | sku | value |
|---|---|---|---|---|
| 2026-05-01 09:10 | purchase | U001 | 2001 | 129.9900 |
| 2026-05-03 12:00 | refund | U001 | 2001 | -99.9900 |
| 2026-05-10 11:00 | purchase | U002 | 5102 | 59.0000 |
| 2026-05-11 15:05 | refund | U002 | 5102 | -29.0000 |
| 2026-05-20 14:40 | purchase | U003 | 3104 | 89.5000 |
| 2026-05-02 10:00 | purchase | U004 | 4110 | 249.0000 |
| 2026-05-04 10:00 | purchase | U004 | 7305 | 18.0000 |
| 2026-05-12 08:10 | purchase | U005 | 8402 | 24.5000 |
| 2026-05-12 08:25 | purchase | U005 | 8601 | 34.0000 |
| 2026-05-13 09:00 | refund | U005 | 8601 | -10.0000 |
| 2026-05-14 11:10 | purchase | U006 | 8702 | 44.0000 |
| 2026-05-15 18:30 | purchase | U007 | 8402 | 24.5000 |
| 2026-05-17 07:55 | purchase | U008 | 7305 | 19.9900 |
| 2026-05-17 08:02 | refund | U008 | 7305 | -19.9900 |
| 2026-05-19 10:00 | purchase | U009 | 8805 | 99.0000 |
| 2026-05-19 10:10 | refund | U009 | 8805 | -20.0000 |
Notice that we only need to know the user (via the relationship ActionTransaction.user_id -> User) - the City and Country are now fully derived from the user.
For browse-like actions, the browsing location is tracked separately from the user's home city. That means "Noah" can still be the same User while browsing from LON_GB and purchasing as a resident of LON_CA.
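With the model in place, a question like "revenue per city and country" reduces to a plain join, with no disambiguation rules applied at query time. A minimal SQLite sketch over an illustrative subset of the rows above:

```python
import sqlite3

# Sketch: revenue per city and country is now just a join - the two
# Parises are kept apart by structure, not by heuristics.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Country (country_id TEXT PRIMARY KEY, country TEXT);
CREATE TABLE City (city_id TEXT PRIMARY KEY, city TEXT,
                   country_id TEXT REFERENCES Country(country_id));
CREATE TABLE User (user_id TEXT PRIMARY KEY, user TEXT,
                   city_id TEXT REFERENCES City(city_id));
CREATE TABLE ActionTransaction (event_ts TEXT, event_type TEXT,
                                user_id TEXT REFERENCES User(user_id),
                                sku TEXT, value REAL);
INSERT INTO Country VALUES ('DK','Denmark'), ('FR','France');
INSERT INTO City VALUES ('PAR_DK','Paris','DK'), ('PAR_FR','Paris','FR');
INSERT INTO User VALUES ('U001','Lisa','PAR_DK'), ('U007','Morgane','PAR_FR');
INSERT INTO ActionTransaction VALUES
  ('2026-05-01 09:10','purchase','U001','2001',129.99),
  ('2026-05-03 12:00','refund',  'U001','2001',-99.99),
  ('2026-05-15 18:30','purchase','U007','8402',24.50);
""")

rows = con.execute("""
SELECT ci.city, co.country, ROUND(SUM(t.value), 2) AS revenue
FROM ActionTransaction t
JOIN User    u  ON u.user_id     = t.user_id
JOIN City    ci ON ci.city_id    = u.city_id
JOIN Country co ON co.country_id = ci.country_id
GROUP BY ci.city, co.country
ORDER BY co.country, ci.city
""").fetchall()
for row in rows:
    print(row)
```

The join path User -> City -> Country is exactly the chain of relationships we declared; any query tool (or LLM) can discover it from the schema alone.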
Side note: Why table names are often singular not plural
If you already have some experience with databases, you will no doubt have noticed that tables in databases are often named in the singular. Why do data modellers call the table "User" and not "Users"? Isn't the table just a collection of User entities? No, the table is a type of something, namely User: Entity Type. That is why we typically use singular naming.
For you Python programmers out there, you don't write x: ints; you write x: int.
Isn't data modelling a lot of work?
The examples we just walked through can seem overwhelming, and we only handled a very small number of entity types.
By splitting our observations into tables that represent each entity type, we have also created a new problem for ourselves: "How do we reassemble the full observation from each of these small tables?"
The answer to that is of course: "We join the tables back together again".
By communicating meaning via the structure of the data, we have told our data consumers (including the LLM) that only some combinations of data are valid. We have created constraints, which is a form of reasoning.
As a human, this may seem complicated. For a machine, we have just provided exactly what was needed to correctly query the data.
Tokens and Reasoning
Let's try out our two models with a clean context in the LLM.
First, I provide ChatGPT with the raw observations and ask it to resolve the ambiguities for me.
Me: Resolve the location values to the real city and country they most likely refer to, then compute revenue per city and country. (paste raw data)
ChatGPT:
| City | Country | Revenue |
|---|---|---|
| Aarhus | Denmark | 267.00 |
| Berlin | Germany | 89.50 |
| London | United Kingdom | 78.50 |
| Paris | France | 98.50 |
| Prague | Czech Republic | 79.00 |
| San Jose | United States | 0.00 |
Estimated Tokens for Response: ~600 tokens
Now, let's serve the modelled data instead of asking for the same disambiguation:
Me: Resolve the location values to the real city and country they most likely refer to, then compute revenue per city and country. [paste the modelled data]
ChatGPT:
| City | Country | Revenue |
|---|---|---|
| Aarhus | Denmark | 267.00 |
| Berlin | Germany | 89.50 |
| London | Canada | 48.50 |
| London | United Kingdom | 30.00 |
| Paris | Denmark | 74.00 |
| Paris | France | 24.50 |
| Prague | Czech Republic | 79.00 |
| San Jose | United States | 0.00 |
(incidentally, ChatGPT also outputs the SQL to run this query)
Estimated Tokens for Response: ~250-300 tokens
Obviously this is a crude test. But the key observation here is that the LLM needs to think less when it works on data that has structure. Even though ChatGPT resists the idea of a Paris in Denmark, it still honours the data model over the disambiguation.
A spectrum of models
Let us assume you buy my premise: meaning is best encoded in the structure of the data, not in complex rules operating on top of it.
The question becomes: "Who or what brings about this structure?"
This is where data engineers come in! These hard-working individuals toil away with tools that take data that lacks structure and try to bring order to the chaos.
Source systems, and even some analytical systems, often provide data created by individuals who believe that rules, not structure, are the way to capture meaning. The data engineer reverses this choice, carefully reconstructing structure.
There are different ways to apply structure, and we can view data modelling as a spectrum spanning two major strategies:
- Specialised Tables
- Fully modelled, generic data
The data engineer must carefully navigate the space between these two extremes.
Specialised Tables (aka: One Big Table)
The specialised tables often take the form of large, wide tables. Each table is carefully crafted by a data engineer to serve a specific business purpose. The structure of such tables resembles a cleaned-up observation structure, making sure "Praha" is stored as "Prague", that "Paris" in "Denmark" is treated properly, and so on. In a certain way, they are gigantic spreadsheets where every datapoint is one wide row. Entity types are mixed up freely in the same row - but carefully curated. Exploration of such tables is simple and often easy to understand, even for untrained users - at least if you ask the questions that table is designed to answer.
But once you start to explore the data in complex ways, the implicit rules we used to build the table show up. Interpretation becomes difficult and you often wish that the table would be broken into the entity types it is made of.
Iterating quickly, as LLMs invite you to do, costs significant time and tokens - because the meaning is not encoded in the structure of the data and must instead be re-derived from the wide rows on every query.
Rich, fully modelled and "normalised" data
This represents the splitting of all entity types into individual tables, carefully organised. In database theory, this type of modelling is called "normalisation". In the extreme cases, we end up with Data Vaults and other highly artificial structures which break the data into its atomic forms. These models are difficult to query - even for machines.
Intermediate forms exist that model entity types in separate objects, such as dimensional models (in business speak often known as "cubes"). These models take the best traits from normalised models and combine them with the ease of use that comes from specialised tables.
It is a careful balancing act to pick the right modelling technique.
Moving from one model to another
Irrespective of what model (or lack of one) you pick, you are faced with how to bring your existing data into that model.
Data engineers, and the people who came before them under different titles, use specialised tools to do this. Through the years, the tools have had many different names:
Examples:
- Batch Processing
- Data Integration
- ETL and ELT
- Data Pipelines and Data Engineering
The tools and their operators, irrespective of their changing names, all aim to do the same thing: bring data from one form into another.
What we call these desired model forms, and which one is currently considered the best one, have also changed frequently.
Examples:
- DSS (Decision Support System)
- OLAP / Dimensional Model and Staging areas
- Star Schemas and Snowflake Schemas (not to be confused with Snowflake the company)
- Enterprise Data Warehouse in 3rd Normal Form
- Data Vaults in higher normal forms (with temporal tracking)
- Medallion Architecture (Bronze, Silver, Gold)
Again, the idea remains the same: pick a data model that best serves the purpose of making decisions.
In this age of LLMs, my bet is that we are going to see a return to more normalised models.
Summary of Part 2
We have seen that the meaning and interpretation of entity types and their relations can be captured in two ways:
- Via complex rules applied while querying the data
- By restructuring the data itself so its form contains the interpretation.
- This is typically done via a process called "normalisation"
It is worth noting a common myth that won't go away.
- Myth: Normalisation was historically done to save space and I/O.
- Fact: While Normalisation does save space (and sometimes I/O), it is done to constrain data and structure information in such a way that the interpretation of that data is clear from its structure.
Where does Floe Fit?
At Floe, we believe that data lakes, particularly with Iceberg or Delta, can support rich, structured models that are normalised and use joins to query data, much like the models we used in older relational databases. You need a powerful query optimiser and a strong compute infrastructure to do this.
With Floe, you can apply richer modelling techniques. There is no need to constrain yourself to large, wide tables or long, complex and boring data pipelines. We think that more "classic" modelling techniques are making a comeback and that this will help LLMs reason better, faster and more cost-effectively over your data.
But of course, to have this conversation, we must continue talking about data modelling and the sacrifices our industry has made recently.
Until we meet again - may the data be with you.