In the previous instalment we introduced the terminology and described why data models help both humans and machines make sense of data.
Today, we will apply these new skills to the test problem introduced in Part 1.
Let us summarise where we are:
How do we actually encode this interpretation in a consumable form?
Here, we reach a choice that has divided software professionals for generations. Two distinct philosophies exist:
A common approach to capturing meaning is to encode it as a set of rules and heuristics. Those rules can be written into documents (this is what we do when creating legal systems) or they could be written in code.
For example:
- When encountering the relationship User.city -> City, translate it to English before displaying it.
- When encountering the attribute Product.sku, always pick the latest Product.name value as the product name for that Product.sku.
- Define revenue as: sum all observations with Action = "Purchase" and subtract all observations with Action = "Refund".
- If an Action has the value "view", then Action.country is the country the user browses from.
As we expand our rule set, we approximate the correct interpretations of the observations more closely. The big problem with the rules-based approach is that everyone who accesses the observations must agree on what rules to apply to correctly interpret the meaning of the data. In other words: the meaning of the data lives in the system accessing the data, not in the data structure itself.
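To make this concrete, here is a minimal sketch of what rules-based interpretation looks like in code. The field names and sample observations are illustrative, and it assumes raw observations arrive as plain dictionaries with positive amounts; the point is that every consumer must re-run these functions on every access:

```python
# Rules-based interpretation: the meaning lives in this code,
# not in the data itself. Every consumer must apply the same rules.

def latest_product_names(observations):
    """Rule: for each sku, the latest observed name wins."""
    names = {}
    for obs in sorted(observations, key=lambda o: o["ts"]):
        if "sku" in obs and "product" in obs:
            names[obs["sku"]] = obs["product"]
    return names

def revenue(observations):
    """Rule: revenue = purchases minus refunds."""
    total = 0.0
    for obs in observations:
        if obs.get("action") == "purchase":
            total += obs["value"]
        elif obs.get("action") == "refund":
            total -= abs(obs["value"])
    return total

observations = [
    {"ts": "2026-05-01 09:10", "action": "purchase", "sku": 2001,
     "product": "Winter Jacket", "value": 129.99},
    {"ts": "2026-05-03 12:00", "action": "refund", "sku": 2001,
     "product": "Alpine Shell", "value": 99.99},
]

print(latest_product_names(observations))   # {2001: 'Alpine Shell'}
print(round(revenue(observations), 2))      # 30.0
```

If a second consumer forgets (or disagrees about) even one of these rules, they silently get a different answer from the same data.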
Rules-based data interpretation is perhaps most prominently used in Google Analytics (GA4). Most data in GA4 is stored in a single family of event tables (one table per day plus intra-day tables) with a moderately complex structure. To correctly interpret GA4 data, you must rely on the application interpreting the structure of this event table. For example, we may not know exactly who a visiting user is (unless they are authenticated). Instead, we apply heuristics to make an educated guess. You end up with the complicated dashboards you see in Google Analytics.
This way of organising data works for Google because tracking website events does not have to be precise. The structure of the data we receive cannot be fully known in advance, and the answers we get from that data carry a level of ambiguity that is acceptable: close enough for users to make reasonable decisions.
Data modellers take a different approach to extracting meaning from observations.
Instead of applying a set of rules on every data access, they store data in such a way that the interpretation is already clear from the layout of the data.
Let us make this a concrete exercise with the sample dataset.
We have identified that there exists an entity type User. Instead of interpreting the data every time we access it, we declare what is true about User:
About Product we can say:
We can also refine our understanding of the entity type Action and say:
This will soon lead us to split our observations into a different data representation.
We know that a master list of countries exists (at least in principle). That means we can safely state:
When we observe something that looks like a country, it must map to our master list of countries.
For our test data, these are the relevant countries:
Table: Country
| country_id | country |
|---|---|
| CA | Canada |
| DK | Denmark |
| DE | Germany |
| CZ | Czech Republic |
| FR | France |
| GB | United Kingdom |
| US | United States |
Notes:
Like Country, we know that City entities are actual places on Earth. Not only that, we know that cities belong to countries, and we have identified the relationship: City.country_id -> Country.
Inspired by airport codes, we can invent a simple set of identifiers for each city entity that allows us to tell "Paris" in "Denmark" apart from "Paris" in "France", and "London" in "Canada" apart from "London" in "United Kingdom".
Table: City
| city_id | city | country_id |
|---|---|---|
| AAR | Aarhus | DK |
| BER | Berlin | DE |
| LON_CA | London | CA |
| LON_GB | London | GB |
| PAR_DK | Paris | DK |
| PAR_FR | Paris | FR |
| PRG | Prague | CZ |
| SJC | San Jose | US |
We can now ask the same question we asked before. Using an anonymous context with ChatGPT:
Me: "Given the data above, how confident are you that cities belong to countries?"
And the answer is now:
ChatGPT: From a relational integrity perspective: Confidence ≈ 100% that cities belong to countries (by design).
Why semantically questionable? The data itself contradicts real-world expectations: Paris → DK (PAR_DK). This is either: a data error, or a non-canonical naming (there is no well-known Paris in Denmark).
First, notice that ChatGPT now believes Paris in Denmark to be a data error, not evidence that the structure of the data is wrong.
You might want to have a conversation with ChatGPT about Paris in Denmark and see how it ties itself in knots.
Evidence:
(There is also a Rome in Denmark, but that's a different story.)
Since we now know that City determines country via City.country_id -> Country, we can safely get rid of country from User and restructure our data to make sure it is no longer possible to store users that have invalid City/Country combinations.
Notice that we are not doing this to save space; we are doing it to make the semantics of our data clear to consumers of that data.
Our user table now becomes:
Table: User
| user_id | user | city_id |
|---|---|---|
| U001 | Lisa | PAR_DK |
| U002 | Emma | LON_GB |
| U003 | Hans | BER |
| U004 | Hans | AAR |
| U005 | Noah | LON_CA |
| U006 | Sofia | PAR_DK |
| U007 | Morgane | PAR_FR |
| U008 | Michael | SJC |
| U009 | Vlad | PRG |
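The guarantee that no user can reference an invalid City/Country combination can be made mechanical with a foreign-key constraint. A sketch using Python's built-in sqlite3 (the table and column names follow the tables above; note that SQLite only enforces foreign keys when the pragma is switched on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this explicitly

conn.execute("CREATE TABLE Country (country_id TEXT PRIMARY KEY, country TEXT)")
conn.execute("""CREATE TABLE City (
    city_id TEXT PRIMARY KEY,
    city TEXT,
    country_id TEXT REFERENCES Country(country_id))""")
conn.execute("""CREATE TABLE User (
    user_id TEXT PRIMARY KEY,
    user TEXT,
    city_id TEXT REFERENCES City(city_id))""")

conn.execute("INSERT INTO Country VALUES ('DK', 'Denmark')")
conn.execute("INSERT INTO City VALUES ('PAR_DK', 'Paris', 'DK')")
conn.execute("INSERT INTO User VALUES ('U001', 'Lisa', 'PAR_DK')")  # valid

try:
    # A city that is not in the master list is rejected by the structure itself
    conn.execute("INSERT INTO User VALUES ('U999', 'Eve', 'ATLANTIS')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # rejected: FOREIGN KEY constraint failed
```

The invalid row never enters the table; no consumer ever has to apply a "check the city exists" rule at read time.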
We have identified that entities of type Product can be uniquely identified by their Product.sku - at least if we accept the interpretation that "Alpine Shell" is a rename of "Winter Jacket".
With this interpretation, we can now structure the data in the following way:
Table: Product
| sku | product | size |
|---|---|---|
| 2001 | Alpine Shell | M |
| 3104 | Trail Runner | 42 |
| 4110 | Espresso Grinder | |
| 5102 | Jeans | L |
| 7305 | Coffee Cup | |
| 8402 | Aeropress | |
| 8601 | French Press | |
| 8702 | Pour Over Kettle | |
| 8805 | Burr Grinder | |
And if we care about previous product names, we can create an additional table that tracks historical names:
Table: ProductHistory
| sku | valid_from | product |
|---|---|---|
| 2001 | 2026-05-01 09:00 | Winter Jacket |
| 2001 | 2026-05-03 12:00 | Alpine Shell |
| 3104 | 2026-05-20 14:40 | Trail Runner |
| 4110 | 2026-05-02 10:00 | Espresso Grinder |
| 5102 | 2026-05-06 14:20 | Jeans |
| 7305 | 2026-05-04 10:00 | Coffee Cup |
| 8402 | 2026-05-12 08:00 | Aeropress |
| 8601 | 2026-05-12 08:25 | French Press |
| 8702 | 2026-05-14 11:00 | Pour Over Kettle |
| 8805 | 2026-05-19 10:00 | Burr Grinder |
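With history stored structurally, the earlier "latest name wins" rule becomes an ordinary query instead of a convention every consumer must remember. A sketch with Python's sqlite3, loading a few of the ProductHistory rows above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ProductHistory (sku INT, valid_from TEXT, product TEXT)")
conn.executemany("INSERT INTO ProductHistory VALUES (?, ?, ?)", [
    (2001, "2026-05-01 09:00", "Winter Jacket"),
    (2001, "2026-05-03 12:00", "Alpine Shell"),
    (3104, "2026-05-20 14:40", "Trail Runner"),
])

# Current name = the row with the latest valid_from for each sku
rows = conn.execute("""
    SELECT sku, product
    FROM ProductHistory AS ph
    WHERE valid_from = (SELECT MAX(valid_from)
                        FROM ProductHistory
                        WHERE sku = ph.sku)
    ORDER BY sku
""").fetchall()
print(rows)  # [(2001, 'Alpine Shell'), (3104, 'Trail Runner')]
```

The ISO-style timestamps sort correctly as text, so MAX(valid_from) picks the latest rename without any special parsing.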
Recall that we also considered a special entity type called Apparel, a subset of all products, and that we assumed Apparel has sizes while other products do not.
If that is the interpretation we want to communicate, we can split the product structure further:
Table: Product
| sku | product |
|---|---|
| 2001 | Alpine Shell |
| 3104 | Trail Runner |
| 4110 | Espresso Grinder |
| 5102 | Jeans |
| 7305 | Coffee Cup |
| 8402 | Aeropress |
| 8601 | French Press |
| 8702 | Pour Over Kettle |
| 8805 | Burr Grinder |
... and track Apparel.size (previously associated directly with Product.size) separately:
Table: Apparel
| sku | product | size |
|---|---|---|
| 2001 | Alpine Shell | M |
| 3104 | Trail Runner | 42 |
| 5102 | Jeans | L |
We identified that we are actually dealing with two different types of actions in our observations - each with their own attributes and relationships.
This knowledge can be communicated by splitting the browse-like Actions into their own entity type ActionView.
Table: ActionView
| viewed_at | action | user_id | browsing_city_id | sku |
|---|---|---|---|---|
| 2026-05-01 09:00 | view | | LON_GB | 2001 |
| 2026-05-06 14:20 | add_to_cart | U002 | LON_GB | 5102 |
| 2026-05-12 08:00 | view | U005 | LON_GB | 8402 |
| 2026-05-12 08:05 | add_to_cart | U005 | LON_GB | 8402 |
| 2026-05-14 11:00 | wishlist_add | U006 | PAR_FR | 8702 |
| 2026-05-14 11:03 | shipping_quote | U006 | PAR_FR | 8702 |
The rest of the actions, which we could call ActionTransaction, can now be structured like this:
Table: ActionTransaction
| event_ts | event_type | user_id | sku | value |
|---|---|---|---|---|
| 2026-05-01 09:10 | purchase | U001 | 2001 | 129.9900 |
| 2026-05-03 12:00 | refund | U001 | 2001 | -99.9900 |
| 2026-05-10 11:00 | purchase | U002 | 5102 | 59.0000 |
| 2026-05-11 15:05 | refund | U002 | 5102 | -29.0000 |
| 2026-05-20 14:40 | purchase | U003 | 3104 | 89.5000 |
| 2026-05-02 10:00 | purchase | U004 | 4110 | 249.0000 |
| 2026-05-04 10:00 | purchase | U004 | 7305 | 18.0000 |
| 2026-05-12 08:10 | purchase | U005 | 8402 | 24.5000 |
| 2026-05-12 08:25 | purchase | U005 | 8601 | 34.0000 |
| 2026-05-13 09:00 | refund | U005 | 8601 | -10.0000 |
| 2026-05-14 11:10 | purchase | U006 | 8702 | 44.0000 |
| 2026-05-15 18:30 | purchase | U007 | 8402 | 24.5000 |
| 2026-05-17 07:55 | purchase | U008 | 7305 | 19.9900 |
| 2026-05-17 08:02 | refund | U008 | 7305 | -19.9900 |
| 2026-05-19 10:00 | purchase | U009 | 8805 | 99.0000 |
| 2026-05-19 10:10 | refund | U009 | 8805 | -20.0000 |
Notice that we only need to know the user (via the relationship ActionTransaction.user_id -> User) - the City and Country are now fully derived from the user.
For browse-like actions, the browsing location is tracked separately from the user's home city. That means "Noah" can still be the same User while browsing from LON_GB and purchasing as a resident of LON_CA.
If you already have some experience with databases, you will no doubt have noticed that tables in databases are often named in the singular. Why do data modellers call the table "User" and not "Users"? Isn't the table just a collection of User entities? No, the table is a type of something, namely User: Entity Type. That is why we typically use singular naming.
For you Python programmers out there, you don't write x: ints; you write x: int.
The examples we just walked through can seem overwhelming, and we only handled a very small number of entity types.
By splitting our observations into tables that represent each entity type, we have also created a new problem for ourselves: "How do we reassemble the full observation from each of these small tables?"
The answer to that is of course: "We join the tables back together again".
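That reassembly is exactly the join chain ActionTransaction -> User -> City -> Country. A sketch with sqlite3, loading a subset of the rows above (enough to cover the interesting cities) and reproducing the revenue-per-city figures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Country (country_id TEXT PRIMARY KEY, country TEXT);
CREATE TABLE City (city_id TEXT PRIMARY KEY, city TEXT,
                   country_id TEXT REFERENCES Country(country_id));
CREATE TABLE User (user_id TEXT PRIMARY KEY, user TEXT,
                   city_id TEXT REFERENCES City(city_id));
CREATE TABLE ActionTransaction (event_ts TEXT, event_type TEXT,
                                user_id TEXT REFERENCES User(user_id),
                                sku INT, value REAL);
""")
conn.executemany("INSERT INTO Country VALUES (?, ?)",
    [("CA", "Canada"), ("DK", "Denmark"), ("FR", "France"),
     ("GB", "United Kingdom")])
conn.executemany("INSERT INTO City VALUES (?, ?, ?)",
    [("LON_CA", "London", "CA"), ("LON_GB", "London", "GB"),
     ("PAR_DK", "Paris", "DK"), ("PAR_FR", "Paris", "FR")])
conn.executemany("INSERT INTO User VALUES (?, ?, ?)",
    [("U001", "Lisa", "PAR_DK"), ("U002", "Emma", "LON_GB"),
     ("U005", "Noah", "LON_CA"), ("U006", "Sofia", "PAR_DK"),
     ("U007", "Morgane", "PAR_FR")])
conn.executemany("INSERT INTO ActionTransaction VALUES (?, ?, ?, ?, ?)",
    [("2026-05-01 09:10", "purchase", "U001", 2001, 129.99),
     ("2026-05-03 12:00", "refund",   "U001", 2001, -99.99),
     ("2026-05-10 11:00", "purchase", "U002", 5102, 59.00),
     ("2026-05-11 15:05", "refund",   "U002", 5102, -29.00),
     ("2026-05-12 08:10", "purchase", "U005", 8402, 24.50),
     ("2026-05-12 08:25", "purchase", "U005", 8601, 34.00),
     ("2026-05-13 09:00", "refund",   "U005", 8601, -10.00),
     ("2026-05-14 11:10", "purchase", "U006", 8702, 44.00),
     ("2026-05-15 18:30", "purchase", "U007", 8402, 24.50)])

# Refunds are stored as negative values, so a plain SUM over the
# join chain does the "purchases minus refunds" interpretation for us.
rows = conn.execute("""
    SELECT ci.city, co.country, ROUND(SUM(t.value), 2) AS revenue
    FROM ActionTransaction t
    JOIN User u     ON u.user_id = t.user_id
    JOIN City ci    ON ci.city_id = u.city_id
    JOIN Country co ON co.country_id = ci.country_id
    GROUP BY ci.city, co.country
    ORDER BY ci.city, co.country
""").fetchall()
for row in rows:
    print(row)
```

Notice that the query contains no interpretation rules at all: the two Londons and the two Parises come apart correctly because the structure already encodes which city each user belongs to.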
By communicating meaning via the structure of the data, we have told our data consumers (including the LLM) that only some combinations of data are valid. We have created constraints, which is a form of reasoning.
As a human, this may seem complicated. For a machine, we have just provided exactly what was needed to correctly query the data.
Let's try out our two models with a clean context in the LLM.
First, I provide ChatGPT with the raw observations and ask it to resolve the ambiguities for me.
Me: Resolve the location values to the real city and country they most likely refer to, then compute revenue per city and country. (paste raw data)
ChatGPT:
| city | country | revenue |
|---|---|---|
| Aarhus | Denmark | 267.00 |
| Berlin | Germany | 89.50 |
| London | United Kingdom | 78.50 |
| Paris | France | 98.50 |
| Prague | Czech Republic | 79.00 |
| San Jose | United States | 0.00 |
Estimated Tokens for Response: ~600 tokens
Now, let's serve the modelled data instead of asking for the same disambiguation:
Me: Resolve the location values to the real city and country they most likely refer to, then compute revenue per city and country. [paste the modelled data]
ChatGPT:
| city | country | revenue |
|---|---|---|
| Aarhus | Denmark | 267.00 |
| Berlin | Germany | 89.50 |
| London | Canada | 48.50 |
| London | United Kingdom | 30.00 |
| Paris | Denmark | 74.00 |
| Paris | France | 24.50 |
| Prague | Czech Republic | 79.00 |
| San Jose | United States | 0.00 |
(incidentally, ChatGPT also outputs the SQL to run this query)
Estimated Tokens for Response: ~250-300 tokens
Obviously this is a crude test. But the key observation here is that the LLM needs to think less when it works on data that has structure. Even though ChatGPT resists the idea of a Paris in Denmark, it still honours the data model over its own intuition about the world.
Let us assume you buy my premise: meaning is best encoded in the structure of the data, not in complex rules operating on top of it.
The question becomes: "Who or what brings about this structure?"
This is where data engineers come in! These hard-working individuals toil away with tools that take data that lacks structure and try to bring order to the chaos.
Source systems, and even some analytical systems, often provide data created by individuals who believe that rules, not structure, are the way to capture meaning. The data engineer reverses this choice, carefully reconstructing structure.
There are different ways to apply structure, and we can view data modelling as a spectrum spanning two major strategies:
The data engineer must carefully navigate the space between these two extremes.
The specialised tables often take the form of large, wide tables. Each table is carefully crafted by a data engineer to serve a specific business purpose. The structure of such tables resembles a cleaned-up observation structure, making sure "Praha" is stored as "Prague", that "Paris" in "Denmark" is treated properly, and so on. In a certain way, they are gigantic spreadsheets where every datapoint is one wide row. Entity types are mixed up freely in the same row - but carefully curated. Exploration of such tables is simple and often easy to understand, even for untrained users - at least if you ask the questions that table is designed to answer.
But once you start to explore the data in complex ways, the implicit rules we used to build the table show up. Interpretation becomes difficult and you often wish that the table would be broken into the entity types it is made of.
Iterating quickly, as LLMs allow you to do, takes significant time and tokens - because meaning is no longer encoded in the structure of the data but must be derived from wide rows.
At the other end of the spectrum lies the splitting of all entity types into individual, carefully organised tables. In database theory, this type of modelling is called "normalisation". In the extreme cases, we end up with Data Vaults and other highly artificial structures that break the data into its atomic forms. Such models are difficult to query - even for machines.
Intermediate forms exist that model entity types in separate objects, such as dimensional models (in business speak often known as "cubes"). These models take the best traits from normalised models and combine them with the ease of use that comes from specialised tables.
It is a careful balancing act to pick the right modelling technique.
Irrespective of what model (or lack of one) you pick, you are faced with how to bring your existing data into that model.
Data engineers, and the people who came before them under different titles, use specialised tools to do this. Through the years, the tools have had many different names:
Examples:
The tools and their operators, irrespective of their changing names, all aim to do the same thing: bring data from one form into another.
What we call these desired model forms, and which one is currently considered the best one, have also changed frequently.
Examples:
Again, the idea remains the same: pick a data model that best serves the purpose of making decisions.
In this age of LLMs, my bet is that we are going to see a return to more normalised models.
We have seen that the meaning and interpretation of entity types and their relations can be captured in two ways:
It is worth noting a common myth that won't go away.
At Floe, we believe that data lakes, particularly with Iceberg or Delta, can support rich, structured models that are normalised and use joins to query data, much like the models we used in older relational databases. You need a powerful query optimiser and a strong compute infrastructure to do this.
With Floe, you can apply richer modelling techniques. There is no need to constrain yourself to large, wide tables or long, complex and boring data pipelines. We think that more "classic" modelling techniques are making a comeback and that this will help LLMs reason better, faster and more cost-effectively over your data.
But of course, to have this conversation, we must continue talking about data modelling and the sacrifices our industry has made recently.
Until we meet again - may the data be with you.