Data modelling is the discipline of interpreting data and extracting meaning from it by imposing structure on it. Like humans, LLMs benefit from such structure.
As we move into a future dominated by machines thinking for us, I expect that data modelling will be more important than ever.
Unfortunately, the discipline of data modelling is often obscured by changing terminologies that vary by vendor and by the never-ending hype cycles that plague our industry.
In this new blog series, I attempt to explain data modelling and data architecture in terms that can be understood by everyone - even if you have no formal training in computer science or relational algebra. This is no easy feat and it is one I take on with hesitation. It is my hope that as the series progresses, discussions will occur that help steer the course.
Example Data Problem
It is easier to talk about data modelling if we have a concrete example.
Imagine an online business with two divisions selling:
- Apparel
- Coffee Products
Our business wants to view all the data and analyse it across divisions. Both divisions load their data into a big table designed by a recently hired data expert.
Apparel Store
This store is the most advanced. There is an automated checkout process in place built on a modern shopping platform.
The web store sells these products:
- Winter Jacket
- Trail Runner
- Jeans
The apparel business has a busy marketing department, and they recently renamed their "Winter Jacket" to "Alpine Shell".
Customer addresses are entered via a validated web form.
Coffee Store
The coffee store tracks all its products in an Excel spreadsheet and manually dispatches products to customers. Customers enter their address in a freeform text field.
Products sold are:
- Coffee Cup
- Aeropress
- Espresso Grinder
- French Press
- Pour Over Kettle
- Burr Grinder
The price of "Coffee Cup" was recently increased.
Observations
Our recently hired data expert designed the following table to hold all information from our business. It is simply a raw representation of the input data (a layer sometimes referred to as "Bronze" or "Staging").
| time | action | name | city | country | sku | product | size | value |
|---|---|---|---|---|---|---|---|---|
| 2026-05-01 09:00 | view | | London | England | 2001 | Winter Jacket | M | 0.0000 |
| 2026-05-01 09:10 | purchase | Lisa | Paris | Denmark | 2001 | Winter Jacket | M | 129.9900 |
| 2026-05-03 12:00 | refund | Lisa | Paris | Denmark | 2001 | Alpine Shell | M | -99.9900 |
| 2026-05-06 14:20 | add_to_cart | Emma | London | United Kingdom | 5102 | Jeans | L | 0.0000 |
| 2026-05-10 11:00 | purchase | Emma | London | United Kingdom | 5102 | Jeans | L | 59.0000 |
| 2026-05-11 15:05 | refund | Emma | London | United Kingdom | 5102 | Jeans | L | -29.0000 |
| 2026-05-20 14:40 | purchase | Hans | Berlin | Germany | 3104 | Trail Runner | 42 | 89.5000 |
| 2026-05-02 10:00 | purchase | Hans | Århus | Denmark | 4110 | Espresso Grinder | | 249.0000 |
| 2026-05-04 10:00 | purchase | Hans | Aarhus | DK | 7305 | Coffee Cup | | 18.0000 |
| 2026-05-12 08:00 | view | Noah | London | United Kingdom | 8402 | Aeropress | | 0.0000 |
| 2026-05-12 08:05 | add_to_cart | Noah | London | UK | 8402 | Aeropress | | 0.0000 |
| 2026-05-12 08:10 | purchase | Noah | London | Canada | 8402 | Aeropress | | 24.5000 |
| 2026-05-12 08:25 | purchase | Noah | London | CA | 8601 | French Press | | 34.0000 |
| 2026-05-13 09:00 | refund | Noah | London | Canada | 8601 | French Press | | -10.0000 |
| 2026-05-14 11:00 | wishlist_add | Sofia | Paris | France | 8702 | Pour Over Kettle | | 0.0000 |
| 2026-05-14 11:03 | shipping_quote | Sofia | Paris | France | 8702 | Pour Over Kettle | | 0.0000 |
| 2026-05-14 11:10 | purchase | Sofia | Paris | Denmark | 8702 | Pour Over Kettle | | 44.0000 |
| 2026-05-15 18:30 | purchase | Morgane | Paris | France | 8402 | Aeropress | | 24.5000 |
| 2026-05-17 07:55 | purchase | Michael | San Jose | United States | 7305 | Coffee Cup | | 19.9900 |
| 2026-05-17 08:02 | refund | Michael | San José | USA | 7305 | Coffee Cup | | -19.9900 |
| 2026-05-19 10:00 | purchase | Vlad | Praha | Czechia | 8805 | Burr Grinder | | 99.0000 |
| 2026-05-19 10:10 | refund | Vlad | Prague | Czech Republic | 8805 | Burr Grinder | -20.0000 |
There are clearly patterns here that both humans and machines can spot easily.
We have data, but we have not yet captured its meaning.
Entities and Entity Types from Observations
Let us ask ChatGPT a question about our data, a question whose answer is obvious to a human, but which can confuse the machine:
Me: "Given the data above, how confident are you that cities belong to countries?"
Depending on how you prompt your LLM, your mileage will vary. I get this:
ChatGPT: Short answer: low confidence overall (~55-65%), with strong variance by row.
... blah blah blah...
Confidence interpretation:
City field itself: high confidence (~90%)
Country field: low confidence (~55%)
City-country pairing: moderate-low (~60%)
The LLM tries to infer meaning from observations that do not have a clearly defined structure.
Data represented purely as observations leads to ambiguity and bad reasoning. Speaking only about observations is like speaking about the weather: it carries no meaning. Intuitively, we know this: naming "things" and using those names consistently to communicate is how we make sense of the world.
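To make the ambiguity concrete, here is a minimal sketch that groups the country strings observed for each city string. The pairs are copied from a handful of rows in the table; the function name `countries_per_city` is my own invention for illustration.

```python
from collections import defaultdict

# (city, country) pairs copied from a handful of rows in the table above
observations = [
    ("London", "England"),
    ("London", "United Kingdom"),
    ("London", "UK"),
    ("London", "Canada"),
    ("London", "CA"),
    ("Paris", "Denmark"),
    ("Paris", "France"),
    ("Berlin", "Germany"),
]


def countries_per_city(pairs):
    """Group the distinct country strings seen for each city string."""
    seen = defaultdict(set)
    for city, country in pairs:
        seen[city].add(country)
    return dict(seen)


grouped = countries_per_city(observations)
for city, countries in sorted(grouped.items()):
    # One line per city: how many distinct country strings, and which ones
    print(city, len(countries), sorted(countries))
```

The code can count how many different strings appear next to "London", but it cannot tell us whether "UK" and "United Kingdom" are the same thing. That knowledge is not in the observations.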
Data modelling is the discipline of organising observations and data into "types of things".
In our sample, we can identify various "things", for example:
| "Things" | "Type of Thing" |
|---|---|
| "Michael", "Lisa", "Hans" | User |
| "Germany", "Denmark", "USA" | Country |
| "London", "Berlin", "Aarhus", "Paris" | City |
| "Winter Jacket", "French Press", "Espresso Grinder" | Product |
| "view", "purchase", "add_to_cart" | Action |
Terms like "Things" and "Types of Thing" are a bit clunky, so let me introduce some terminology:
- Entity Type ::= Type of thing
  - The term we use to group "things" together into general types of those things
    - Ex: User, Country
  - When I talk about entity types, I will use capitalised Boldface.
    - Ex: User, Country
- Entity ::= Thing
  - The specific thing
    - Ex: The User called "Michael" and the User called "Lisa"
  - I will talk about these by "quoting" them. Ex: "Lisa"
  - If I want to clearly call out what entity type an entity has, I will use this notation: "Lisa": User (inspired by Python)
Note that even if we know nothing about a problem domain, we can still identify entity types by looking at data and generalise from what we see.
LLMs are particularly good at this kind of inference and can help us model data.
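As a rough sketch of that generalisation step, we can collect the distinct values in each column and treat them as candidate entities of a candidate entity type. The rows are a small subset of the table, and the mapping from column name to entity type (`COLUMN_TO_ENTITY_TYPE`) is an assumption I made by eyeballing the data. It is exactly the kind of choice a modeller, or an LLM, has to propose.

```python
# A few rows from the table above, reduced to the columns we care about here
rows = [
    {"action": "purchase", "name": "Lisa", "city": "Paris",
     "country": "Denmark", "product": "Winter Jacket"},
    {"action": "purchase", "name": "Hans", "city": "Berlin",
     "country": "Germany", "product": "Trail Runner"},
    {"action": "view", "name": "Noah", "city": "London",
     "country": "United Kingdom", "product": "Aeropress"},
    {"action": "refund", "name": "Michael", "city": "San José",
     "country": "USA", "product": "Coffee Cup"},
]

# Assumed mapping from column name to the entity type it hints at
COLUMN_TO_ENTITY_TYPE = {
    "name": "User",
    "city": "City",
    "country": "Country",
    "product": "Product",
    "action": "Action",
}


def candidate_entities(rows, mapping):
    """Collect the distinct values per column as candidate entities."""
    entities = {entity_type: set() for entity_type in mapping.values()}
    for row in rows:
        for column, entity_type in mapping.items():
            value = row.get(column)
            if value:  # skip empty cells, e.g. an anonymous view
                entities[entity_type].add(value)
    return entities


entities = candidate_entities(rows, COLUMN_TO_ENTITY_TYPE)
for entity_type, values in entities.items():
    print(entity_type, sorted(values))
```

Note that this only finds candidates. Whether two candidates are really the same entity is a separate question.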
Attributes and Relationships in Entity Types
Once we have identified entity types, we can refine our understanding and ask: "What values in our observations belong to each entity type?"
For example, we can ask: "What values belong to Users?" Our answer could be:
- name
- city
- country
Here, we notice that there is something different about name and city/country.
- name is specific to the User entity type and not itself an entity type
  - We shall call these: attributes
  - I will refer to attributes with this notation: User.name - the attribute name in the entity type User
- city appears to indicate that another entity type exists called City
  - The entity type User is related to the City entity type.
  - We shall call these relationships between entity types: relationships
  - We refer to relationships with this notation: User.city -> City.
    - Which means: The city value in Entity Type User is a relationship to the Entity Type City
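Since the notation is inspired by Python, here is one way the distinction could be sketched with dataclasses. This is an illustration under my own assumptions, not a prescribed implementation: I assume that City and Country each carry a `name` attribute, which the text above has not yet established.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Country:
    name: str  # assumed attribute: Country.name


@dataclass(frozen=True)
class City:
    name: str  # assumed attribute: City.name


@dataclass(frozen=True)
class User:
    name: str         # attribute: User.name (a plain value)
    city: City        # relationship: User.city -> City
    country: Country  # relationship: User.country -> Country


# "Lisa": User, exactly as she appears in the observations
lisa = User(name="Lisa", city=City("Paris"), country=Country("Denmark"))
print(lisa.name, lisa.city.name, lisa.country.name)
```

An attribute holds a plain value, while a relationship holds a reference to an entity of another entity type. The type annotations make that difference visible.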
Narrowing meaning with Relationships and Attributes
By listing the attributes and relationships for entity type User we are forced to think carefully about what our observations actually mean.
- Is User.city -> City defined as the place the user lives?
  - Perhaps it is the shipping address used for things the user buys?
  - Maybe it is the location the user is browsing from?
- Consider "Vlad": User, who has two different User.country -> Country values: "Czechia" and "Czech Republic"
  - Are these two different people called "Vlad" who spell their country differently?
  - Or are they the same person who entered the country name in an unvalidated form?
  - Is there a way to tell?
- "2001": Product.sku appears with two product names: "Winter Jacket" and "Alpine Shell"
  - Do these represent the same entity of type Product, but with different names?
  - ...Or was the Product.sku reused and they are entirely different entities?
- What entities of type Country actually exist and which ones are the same but with different names?
  - Are "United States" and "USA" the same Country entity?
  - How about "United Kingdom" and "England"?
- There are two conflicting "Hans" entries:
  - "Hans": User in "Aarhus": City
  - "Hans": User in "Berlin": City
  - Are they two different people, or did the same "Hans" move to another City?
- Is "Paris": City located in "France": Country or located in "Denmark": Country?
  - There is in fact a "Paris" in "Denmark" too (more about that in a later post).
- If we want to report on our profit per Country:
  - Do we mean the Country the User lived in when the item was purchased?
  - Or the Country the User currently lives in?
  - Again, the data itself does not tell us - we must choose.
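Questions like the ones above can at least be surfaced mechanically. The sketch below, using a few rows from the table and a hypothetical helper named `conflicts`, lists every key that appears with more than one value. It finds the disagreements, but it deliberately does not resolve them.

```python
from collections import defaultdict

# A few rows from the table above, reduced to the columns in question
rows = [
    {"name": "Lisa", "city": "Paris", "country": "Denmark",
     "sku": "2001", "product": "Winter Jacket"},
    {"name": "Lisa", "city": "Paris", "country": "Denmark",
     "sku": "2001", "product": "Alpine Shell"},
    {"name": "Hans", "city": "Berlin", "country": "Germany",
     "sku": "3104", "product": "Trail Runner"},
    {"name": "Hans", "city": "Aarhus", "country": "DK",
     "sku": "7305", "product": "Coffee Cup"},
    {"name": "Vlad", "city": "Praha", "country": "Czechia",
     "sku": "8805", "product": "Burr Grinder"},
    {"name": "Vlad", "city": "Prague", "country": "Czech Republic",
     "sku": "8805", "product": "Burr Grinder"},
]


def conflicts(rows, key, value):
    """Return each `key` seen with more than one distinct `value`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[value])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


sku_conflicts = conflicts(rows, "sku", "product")
city_conflicts = conflicts(rows, "name", "city")
country_conflicts = conflicts(rows, "name", "country")

print(sku_conflicts)
print(city_conflicts)
print(country_conflicts)
```

The code reports that "2001" has two product names and that "Hans" has two cities. Whether that means a rename, a reused SKU, a move, or two different people is a modelling decision that no amount of grouping will make for us.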
Data modelling is the act of making choices consciously instead of hoping they emerge in observed data or that LLMs will magically recognise them.
Summary of Part 1
Today, we looked at a data example that we will use in the next part of this series. Hopefully, I have at least planted the idea that unmodelled observations are not good foundations for analytics - even with LLM support.
We then looked at how we can begin to categorise our data and make structured sense of it.
I introduced several concepts that we will be using in this series:
- Entity Types ::= The types of things we can identify in data
- Entity ::= The individual instances of entity types
- Attributes and relationships ::= The shape of entity types and how they relate to other entity types
Note that I picked terminology that hopefully does not carry too much cognitive baggage from historical terms (though the above is somewhat inspired by E/R modelling).
In the next instalment of the series, we will use our new concepts to reason over the data.
See you soon.