In the previous instalment we introduced the terminology and described why data models help both humans and machines make sense of data.
Today, we will apply these new skills to the test problem introduced in Part 1.
Building Rules or Building Models
Let us summarise where we are:
- If we have a series of observations about the world, we can interpret meaning from those observations
- We create meaning by classifying observations into:
  - Entity Types
  - Entities
  - Attributes / relationships
- This classification removes ambiguity in data, helping humans and machines reason.
How do we actually encode this interpretation in a consumable form?
Here, we reach a choice that has divided software professionals for generations. Two distinct philosophies exist:
- Use rules to interpret data into meaning (sometimes called "business logic")
- Use the structure of the data to represent its meaning
The Rules-Based Approach
A common approach to capturing meaning is to encode it as a set of rules and heuristics. Those rules can be written into documents (this is what we do when creating legal systems) or they could be written in code.
For example:
When encountering relationship User.city -> City - translate it to English before displaying it
- This helps us deal with cases like "Prague" = "Praha".
- But it might not help us deal with "Aarhus" being the current spelling of the older official "Århus" (or its even older spelling of "Aros").
When encountering attribute Product.sku - always pick the latest attribute Product.name value as the product name for that Product.sku.
- This helps us deal with "Winter Jacket" = "Alpine Shell"
- But it isn't much good if we want to print an invoice for "Lisa": User, who has no idea what "Alpine Shell": Product.name is - because she thinks she bought a "Winter Jacket": Product.name.
- It also means we must enumerate all observations before we can know the correct Product.name for each Product.sku
- The time needed is proportional to the total number of observations, not the total number of Product entities
- Fundamentally unscalable.
Define revenue as: pick all observations with "Purchase": Action and subtract all observations with "Refund": Action
- This helps us report on revenue.
- But it breaks the moment someone introduces a "Discount": Action
If an Action has value "view", the Action.country is the country the user browses from
- This clarifies how to read "view" actions.
- But it leaves it unclear what happens if the city column is present.
As we expand our rule set, we approximate the correct interpretations of the observations more closely. The big problem with the rules-based approach is that everyone who accesses the observations must agree on what rules to apply to correctly interpret the meaning of the data. In other words: the meaning of the data lives in the system accessing the data, not in the data structure itself.
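To make the cost concrete, here is a minimal Python sketch of the rules-based approach. All function names and rule encodings are invented for illustration, and refund values are assumed to be recorded as positive amounts in the raw feed:

```python
# Sketch of the rules-based approach: the meaning lives in this code,
# not in the data. Every consumer must carry (and agree on) these rules.

CITY_TRANSLATIONS = {"Praha": "Prague"}  # rule: translate city names

def interpret_city(raw_city):
    # Rule: translate known local spellings to English before display.
    return CITY_TRANSLATIONS.get(raw_city, raw_city)

def latest_product_names(observations):
    # Rule: the latest observed Product.name wins for each Product.sku.
    # Note: requires scanning ALL observations - cost grows with history.
    names = {}
    for obs in sorted(observations, key=lambda o: o["ts"]):
        if "sku" in obs and "product" in obs:
            names[obs["sku"]] = obs["product"]
    return names

def revenue(observations):
    # Rule: revenue = purchases minus refunds (refund values assumed
    # positive). Breaks silently if a new action like "discount" appears.
    total = 0.0
    for obs in observations:
        if obs.get("action") == "purchase":
            total += obs["value"]
        elif obs.get("action") == "refund":
            total -= obs["value"]
    return total
```

Notice that none of this logic is visible to anyone who only sees the stored data: a second consumer must re-implement exactly the same rules to reach the same answers.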
Example: Rules-Based GA4
Rules-based data interpretation is perhaps most prominently used in Google Analytics (GA4). Most data in GA4 is stored in a single family of event tables (one table per day plus intra-day tables) with a moderately complex structure. To correctly interpret GA4 data, you must rely on the application interpreting the structure of this event table. For example, we may not know exactly who a visiting user is (unless they are authenticated). Instead, we apply heuristics to make an educated guess. You end up with the complicated dashboards you see in Google Analytics.
This way of organising data works for Google because tracking website events does not have to be precise! The structure of the data we receive cannot be fully known in advance, and the answers we get from that data have a level of ambiguity that is acceptable: close enough for users to make reasonable decisions.
The Structure Approach - Meaning Encoded in the Data Model
Data modellers take a different approach to extracting meaning from observations.
Instead of applying a set of rules on every data access, they store data in such a way that the interpretation is already clear from the layout of the data.
Let us make this a concrete exercise with the sample dataset.
Observations about Entity Type: User
We have identified that there exists an entity type User. Instead of interpreting the data every time we access it, we declare what is true about User:
- A User has an attribute: User.name
- Observing a User.city -> City value in the data means that the user lives in that City
- Each City belongs to one Country, so instead of storing User.country -> Country we can infer it via City.country -> Country and just keep track of User.city -> City.
- There exists a correct list of Country entities, and all values in City.country -> Country must be matched to this list.
- Two City entities are different if they have different City.country -> Country relationships.
- Unless we can get information elsewhere, two entities of type User are the same if they live in the same City and have the same User.name
- We want to keep the option to refine this model as our knowledge grows
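The identity rule above ("same City, same User.name") can be sketched as a simple keying function. This is purely illustrative, not a real deduplication pipeline:

```python
# Sketch of the stated identity rule for User: until we learn more,
# (User.name, User.city) identifies an entity.

def user_key(observation):
    # Two User observations with the same key are treated as one entity.
    return (observation["name"], observation["city"])

def resolve_users(observations):
    # Collapse raw observations into distinct User entities.
    entities = {}
    for obs in observations:
        entities.setdefault(user_key(obs), obs)
    return list(entities.values())
```

Under this rule the two Hans observations (Berlin and Aarhus) stay distinct entities, while repeated sightings of Lisa in the same city collapse into one.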
Observations about Entity Type: Product
About Product we can say:
- A product entity is uniquely identified by its Product.sku (data modellers call this a key)
- Product entities have a Product.name and this value can change over time
We can also refine our understanding of the entity type Action and say:
- Some actions such as "view", "add_to_cart", "wishlist_add" and "shipping_quote" are different from purchases and refunds
- The interpretation of these browse-like actions is different from that of the other actions, and Action.country refers to the country the browser is visiting from.
- Browse-like actions can still be associated with a known User, but the browsing location is not the same thing as User.city
- The Action.country for purchases and refunds refers to the user's home address
- This means that we are really dealing with at least two different kinds of actions: Action is not a single entity type
This will soon lead us to split our observations into a different data representation.
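The split can be sketched as a simple classification over the action value (the sets below just mirror the lists above). Note how an unanticipated value such as "discount" now fails loudly instead of being silently misread:

```python
# Sketch: the two kinds of Action, classified by their action value.
BROWSE_ACTIONS = {"view", "add_to_cart", "wishlist_add", "shipping_quote"}
TRANSACTION_ACTIONS = {"purchase", "refund"}

def split_actions(observations):
    views, transactions = [], []
    for obs in observations:
        action = obs["action"]
        if action in BROWSE_ACTIONS:
            views.append(obs)          # country = where the browser is
        elif action in TRANSACTION_ACTIONS:
            transactions.append(obs)   # country = the user's home address
        else:
            raise ValueError(f"unknown Action type: {action!r}")
    return views, transactions
```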
Data Model for: Country
We know that a master list of countries exists (at least in principle). That means we can safely state:
When we observe something that looks like a country, it must map to our master list of countries.
For our test data, these are the relevant countries:
Table: Country
| country_id | country |
|---|---|
| CA | Canada |
| DK | Denmark |
| DE | Germany |
| CZ | Czech Republic |
| FR | France |
| GB | United Kingdom |
| US | United States |
Notes:
- Instead of letting Country.country uniquely identify the entity, it is often preferable to use a more compact representation.
- ISO-3166 is an obvious choice here
- For readers from island nations, the distinction between England, Great Britain, and the United Kingdom may feel unusually slippery and this model of countries may seem insufficient.
- For the purpose of this blog, we map England into the GB ISO code.
Data Model for: City
Like Country, we know that City entities are actual places on Earth. Not only that, we know that cities belong to countries, and we have identified the relationship: City.country_id -> Country.
Inspired by airport codes, we can invent a simple set of identifiers for each city entity that allows us to tell "Paris" in "Denmark" apart from "Paris" in "France", and "London" in "Canada" apart from "London" in "United Kingdom".
Table: City
| city_id | city | country_id |
|---|---|---|
| AAR | Aarhus | DK |
| BER | Berlin | DE |
| LON_CA | London | CA |
| LON_GB | London | GB |
| PAR_DK | Paris | DK |
| PAR_FR | Paris | FR |
| PRG | Prague | CZ |
| SJC | San Jose | US |
We can now ask the same question we asked before. Using an anonymous context with ChatGPT:
Me: "Given the data above, how confident are you that cities belong to countries?"
And the answer is now:
ChatGPT: "From a relational integrity perspective: Confidence ≈ 100% that cities belong to countries (by design). Why semantically questionable? The data itself contradicts real-world expectations: Paris → DK (PAR_DK). This is either a data error, or non-canonical naming (there is no well-known Paris in Denmark)."
First, notice that ChatGPT now believes Paris in Denmark to be a data error, not evidence that the structure of the data is wrong.
You might want to have a conversation with ChatGPT about Paris in Denmark and see how it ties itself in knots.
(There is also a Rome in Denmark, but that's a different story.)
Data Model for: User
Since we now know that City determines country via City.country_id -> Country, we can safely get rid of country from User and restructure our data to make sure it is no longer possible to store users that have invalid City/Country combinations.
Notice that we are not doing this to save space; we are doing it to make the semantics of our data clear to consumers of that data.
Our user table now becomes:
Table: User
| user_id | user | city_id |
|---|---|---|
| U001 | Lisa | PAR_DK |
| U002 | Emma | LON_GB |
| U003 | Hans | BER |
| U004 | Hans | AAR |
| U005 | Noah | LON_CA |
| U006 | Sofia | PAR_DK |
| U007 | Morgane | PAR_FR |
| U008 | Michael | SJC |
| U009 | Vlad | PRG |
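These constraints can also be written down as a schema, so the structure itself rejects invalid combinations. A minimal sketch using SQLite (with a hypothetical subset of the tables above):

```python
import sqlite3

# Sketch: encode the model's constraints in the schema itself, so an
# invalid City/Country combination simply cannot be stored.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in
con.executescript("""
CREATE TABLE Country (country_id TEXT PRIMARY KEY, country TEXT NOT NULL);
CREATE TABLE City (
    city_id    TEXT PRIMARY KEY,
    city       TEXT NOT NULL,
    country_id TEXT NOT NULL REFERENCES Country(country_id)
);
CREATE TABLE User (
    user_id TEXT PRIMARY KEY,
    user    TEXT NOT NULL,
    city_id TEXT NOT NULL REFERENCES City(city_id)
);
INSERT INTO Country VALUES ('DK', 'Denmark'), ('FR', 'France');
INSERT INTO City VALUES ('PAR_DK', 'Paris', 'DK'), ('PAR_FR', 'Paris', 'FR');
INSERT INTO User VALUES ('U001', 'Lisa', 'PAR_DK');
""")

# A user in a city we never declared is rejected by the structure:
try:
    con.execute("INSERT INTO User VALUES ('U999', 'Eve', 'ATLANTIS')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

No rule book needed: any consumer of this schema can see, from the foreign keys alone, which combinations of data are valid.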
Data Model for: Product
We have identified that entities of type Product can be uniquely identified by their Product.sku - at least if we accept the interpretation that "Alpine Shell" is a rename of "Winter Jacket".
With this interpretation, we can now structure the data in the following way:
Table: Product
| sku | product | size |
|---|---|---|
| 2001 | Alpine Shell | M |
| 3104 | Trail Runner | 42 |
| 4110 | Espresso Grinder | |
| 5102 | Jeans | L |
| 7305 | Coffee Cup | |
| 8402 | Aeropress | |
| 8601 | French Press | |
| 8702 | Pour Over Kettle | |
| 8805 | Burr Grinder | |
And if we care about previous product names, we can create an additional table that tracks historical names:
Table: ProductHistory
| sku | valid_from | product |
|---|---|---|
| 2001 | 2026-05-01 09:00 | Winter Jacket |
| 2001 | 2026-05-03 12:00 | Alpine Shell |
| 3104 | 2026-05-20 14:40 | Trail Runner |
| 4110 | 2026-05-02 10:00 | Espresso Grinder |
| 5102 | 2026-05-06 14:20 | Jeans |
| 7305 | 2026-05-04 10:00 | Coffee Cup |
| 8402 | 2026-05-12 08:00 | Aeropress |
| 8601 | 2026-05-12 08:25 | French Press |
| 8702 | 2026-05-14 11:00 | Pour Over Kettle |
| 8805 | 2026-05-19 10:00 | Burr Grinder |
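Resolving the current name from ProductHistory is now a bounded lookup over name changes, not a scan of all observations. A minimal Python sketch, using an illustrative subset of the rows above:

```python
# Sketch: derive the current Product.name from ProductHistory by taking,
# per sku, the row with the latest valid_from.
history = [
    ("2001", "2026-05-01 09:00", "Winter Jacket"),
    ("2001", "2026-05-03 12:00", "Alpine Shell"),
    ("3104", "2026-05-20 14:40", "Trail Runner"),
]

def current_names(rows):
    latest = {}  # sku -> (valid_from, product)
    for sku, valid_from, product in rows:
        # ISO-like timestamps compare correctly as strings.
        if sku not in latest or valid_from > latest[sku][0]:
            latest[sku] = (valid_from, product)
    return {sku: product for sku, (_, product) in latest.items()}
```

Unlike the rules-based version, the scan is scoped to name changes only, and the Product table already stores the resolved name for consumers who do not care about history.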
Recall that we also considered a special entity type called Apparel which is a subset of all products and that we assumed that Apparel has sizes and other products do not.
If that is the interpretation we want to communicate, we can split the product structure further:
Table: Product
| sku | product |
|---|---|
| 2001 | Alpine Shell |
| 3104 | Trail Runner |
| 4110 | Espresso Grinder |
| 5102 | Jeans |
| 7305 | Coffee Cup |
| 8402 | Aeropress |
| 8601 | French Press |
| 8702 | Pour Over Kettle |
| 8805 | Burr Grinder |
... and track Apparel.size (previously associated directly with Product.size) separately:
Table: Apparel
| sku | product | size |
|---|---|---|
| 2001 | Alpine Shell | M |
| 3104 | Trail Runner | 42 |
| 5102 | Jeans | L |
Data Model for: Action
We identified that we are actually dealing with two different types of actions in our observations - each with their own attributes and relationships.
This knowledge can be communicated by splitting the browse-like Actions into their own entity type ActionView.
Table: ActionView
| viewed_at | action | user_id | browsing_city_id | sku |
|---|---|---|---|---|
| 2026-05-01 09:00 | view | | LON_GB | 2001 |
| 2026-05-06 14:20 | add_to_cart | U002 | LON_GB | 5102 |
| 2026-05-12 08:00 | view | U005 | LON_GB | 8402 |
| 2026-05-12 08:05 | add_to_cart | U005 | LON_GB | 8402 |
| 2026-05-14 11:00 | wishlist_add | U006 | PAR_FR | 8702 |
| 2026-05-14 11:03 | shipping_quote | U006 | PAR_FR | 8702 |
The rest of the actions, which we could call ActionTransaction, can now be structured like this:
Table: ActionTransaction
| event_ts | event_type | user_id | sku | value |
|---|---|---|---|---|
| 2026-05-01 09:10 | purchase | U001 | 2001 | 129.9900 |
| 2026-05-03 12:00 | refund | U001 | 2001 | -99.9900 |
| 2026-05-10 11:00 | purchase | U002 | 5102 | 59.0000 |
| 2026-05-11 15:05 | refund | U002 | 5102 | -29.0000 |
| 2026-05-20 14:40 | purchase | U003 | 3104 | 89.5000 |
| 2026-05-02 10:00 | purchase | U004 | 4110 | 249.0000 |
| 2026-05-04 10:00 | purchase | U004 | 7305 | 18.0000 |
| 2026-05-12 08:10 | purchase | U005 | 8402 | 24.5000 |
| 2026-05-12 08:25 | purchase | U005 | 8601 | 34.0000 |
| 2026-05-13 09:00 | refund | U005 | 8601 | -10.0000 |
| 2026-05-14 11:10 | purchase | U006 | 8702 | 44.0000 |
| 2026-05-15 18:30 | purchase | U007 | 8402 | 24.5000 |
| 2026-05-17 07:55 | purchase | U008 | 7305 | 19.9900 |
| 2026-05-17 08:02 | refund | U008 | 7305 | -19.9900 |
| 2026-05-19 10:00 | purchase | U009 | 8805 | 99.0000 |
| 2026-05-19 10:10 | refund | U009 | 8805 | -20.0000 |
Notice that we only need to know the user (via the relationship ActionTransaction.user_id -> User) - the City and Country are now fully derived from the user.
For browse-like actions, the browsing location is tracked separately from the user's home city. That means "Noah" can still be the same User while browsing from LON_GB and purchasing as a resident of LON_CA.
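With the model in place, a question like "revenue per city and country" reduces to a plain join, with no disambiguation rules applied at query time. A minimal SQLite sketch over an illustrative subset of the rows above:

```python
import sqlite3

# Sketch: revenue per city and country is now just a join - the two
# Parises are kept apart by structure, not by heuristics.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Country (country_id TEXT PRIMARY KEY, country TEXT);
CREATE TABLE City (city_id TEXT PRIMARY KEY, city TEXT,
                   country_id TEXT REFERENCES Country(country_id));
CREATE TABLE User (user_id TEXT PRIMARY KEY, user TEXT,
                   city_id TEXT REFERENCES City(city_id));
CREATE TABLE ActionTransaction (event_ts TEXT, event_type TEXT,
                                user_id TEXT REFERENCES User(user_id),
                                sku TEXT, value REAL);
INSERT INTO Country VALUES ('DK','Denmark'), ('FR','France');
INSERT INTO City VALUES ('PAR_DK','Paris','DK'), ('PAR_FR','Paris','FR');
INSERT INTO User VALUES ('U001','Lisa','PAR_DK'), ('U007','Morgane','PAR_FR');
INSERT INTO ActionTransaction VALUES
  ('2026-05-01 09:10','purchase','U001','2001',129.99),
  ('2026-05-03 12:00','refund',  'U001','2001',-99.99),
  ('2026-05-15 18:30','purchase','U007','8402',24.50);
""")

rows = con.execute("""
SELECT ci.city, co.country, ROUND(SUM(t.value), 2) AS revenue
FROM ActionTransaction t
JOIN User    u  ON u.user_id     = t.user_id
JOIN City    ci ON ci.city_id    = u.city_id
JOIN Country co ON co.country_id = ci.country_id
GROUP BY ci.city, co.country
ORDER BY co.country, ci.city
""").fetchall()
for row in rows:
    print(row)
```

The join path User -> City -> Country is exactly the chain of relationships we declared; any query tool (or LLM) can discover it from the schema alone.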
Side note: Why table names are often singular not plural
If you already have some experience with databases, you will no doubt have noticed that tables in databases are often named in the singular. Why do data modellers call the table "User" and not "Users"? Isn't the table just a collection of User entities? No, the table is a type of something, namely User: Entity Type. That is why we typically use singular naming.
For you Python programmers out there, you don't write x: ints; you write x: int.
Isn't data modelling a lot of work?
The examples we just walked through can seem overwhelming, and we only handled a very small number of entity types.
By splitting our observations into tables that represent each entity type, we have also created a new problem for ourselves: "How do we reassemble the full observation from each of these small tables?"
The answer to that is of course: "We join the tables back together again".
By communicating meaning via the structure of the data, we have told our data consumers (including the LLM) that only some combinations of data are valid. We have created constraints, which is a form of reasoning.
As a human, this may seem complicated. For a machine, we have just provided exactly what was needed to correctly query the data.
Tokens and Reasoning
Let's try out our two models with a clean context in the LLM.
First, I provide ChatGPT with the raw observations and ask it to resolve the ambiguities for me.
Me: Resolve the location values to the real city and country they most likely refer to, then compute revenue per city and country. (paste raw data)
ChatGPT:
| City | Country | Revenue |
|---|---|---|
| Aarhus | Denmark | 267.00 |
| Berlin | Germany | 89.50 |
| London | United Kingdom | 78.50 |
| Paris | France | 98.50 |
| Prague | Czech Republic | 79.00 |
| San Jose | United States | 0.00 |
Estimated Tokens for Response: ~600 tokens
Now, let's serve the modelled data instead of asking for the same disambiguation:
Me: Resolve the location values to the real city and country they most likely refer to, then compute revenue per city and country. [paste the modelled data]
ChatGPT:
| City | Country | Revenue |
|---|---|---|
| Aarhus | Denmark | 267.00 |
| Berlin | Germany | 89.50 |
| London | Canada | 48.50 |
| London | United Kingdom | 30.00 |
| Paris | Denmark | 74.00 |
| Paris | France | 24.50 |
| Prague | Czech Republic | 79.00 |
| San Jose | United States | 0.00 |
(incidentally, ChatGPT also outputs the SQL to run this query)
Estimated Tokens for Response: ~250-300 tokens
Obviously this is a crude test. But the key observation here is that the LLM needs to think less when it works on data that has structure. Even though ChatGPT resists the idea of a Paris in Denmark, it still honours the data model over the disambiguation.
A spectrum of models
Let us assume you buy my premise: meaning is best encoded in the structure of the data, not in complex rules operating on top of it.
The question becomes: "Who or what brings about this structure?"
This is where data engineers come in! These hard-working individuals toil away with tools that take data that lacks structure and try to bring order to the chaos.
Source systems, and even some analytical systems, often provide data created by individuals who believe that rules, not structure, are the way to capture meaning. The data engineer reverses this choice, carefully reconstructing structure.
There are different ways to apply structure, and we can view data modelling as a spectrum spanning two major strategies:
- Specialised Tables
- Fully modelled, generic data
The data engineer must carefully navigate the space between these two extremes.
Specialised Tables (aka: One Big Table)
The specialised tables often take the form of large, wide tables. Each table is carefully crafted by a data engineer to serve a specific business purpose. The structure of such tables resembles a cleaned-up observation structure, making sure "Praha" is stored as "Prague", that "Paris" in "Denmark" is treated properly, and so on. In a certain way, they are gigantic spreadsheets where every datapoint is one wide row. Entity types are mixed up freely in the same row - but carefully curated. Exploration of such tables is simple and often easy to understand, even for untrained users - at least if you ask the questions that table is designed to answer.
But once you start to explore the data in complex ways, the implicit rules we used to build the table show up. Interpretation becomes difficult and you often wish that the table would be broken into the entity types it is made of.
Iterating quickly, as LLMs invite you to do, costs significant time and tokens - because the meaning is not encoded in the structure of the data and must instead be re-derived from the wide rows on every query.
Rich, fully modelled and "normalised" data
This represents the splitting of all entity types into individual tables, carefully organised. In database theory, this type of modelling is called "normalisation". In the extreme cases, we end up with Data Vaults and other highly artificial structures which break the data into its atomic forms. These models are difficult to query - even for machines.
Intermediate forms exist that model entity types in separate objects, such as dimensional models (in business speak often known as "cubes"). These models take the best traits from normalised models and combine them with the ease of use that comes from specialised tables.
It is a careful balancing act to pick the right modelling technique.
Moving from one model to another
Irrespective of what model (or lack of one) you pick, you are faced with how to bring your existing data into that model.
Data engineers, and the people who came before them under different titles, use specialised tools to do this. Through the years, the tools have had many different names:
Examples:
- Batch Processing
- Data Integration
- ETL and ELT
- Data Pipelines and Data Engineering
The tools and their operators, irrespective of their changing names, all aim to do the same thing: bring data from one form into another.
What we call these desired model forms, and which one is currently considered the best one, have also changed frequently.
Examples:
- DSS (Decision Support System)
- OLAP / Dimensional Model and Staging areas
- Star Schemas and Snowflake Schemas (not to be confused with Snowflake the company)
- Enterprise Data Warehouse in 3rd Normal Form
- Data Vaults in higher normal forms (with temporal tracking)
- Medallion Architecture (Bronze, Silver, Gold)
Again, the idea remains the same: pick a data model that best serves the purpose of making decisions.
In this age of LLMs, my bet is that we are going to see a return to more normalised models.
Summary of Part 2
We have seen that the meaning and interpretation of entity types and their relations can be captured in two ways:
- Via complex rules applied while querying the data
- By restructuring the data itself so its form contains the interpretation.
- This is typically done via a process called "normalisation"
It is worth noting a common myth that won't go away.
- Myth: Normalisation was historically done to save space and I/O.
- Fact: While Normalisation does save space (and sometimes I/O), it is done to constrain data and structure information in such a way that the interpretation of that data is clear from its structure.
Where does Floe Fit?
At Floe, we believe that data lakes, particularly with Iceberg or Delta, can support rich, structured models that are normalised and use joins to query data, much like the models we used in older relational databases. You need a powerful query optimiser and a strong compute infrastructure to do this.
With Floe, you can apply richer modelling techniques. There is no need to constrain yourself to large, wide tables or long, complex and boring data pipelines. We think that more "classic" modelling techniques are making a comeback and that this will help LLMs reason better, faster and more cost-effectively over your data.
But of course, to have this conversation, we must continue talking about data modelling and the sacrifices our industry has made recently.
Until we meet again - may the data be with you.