Data modelling is the discipline of interpreting and extracting meaning by imposing structures on data. Like humans, LLMs benefit from such structure.
As we move into a future dominated by machines thinking for us, I expect that data modelling will be more important than ever.
Unfortunately, the discipline of data modelling is often obscured by changing terminologies that vary by vendor and by the never-ending hype cycles that plague our industry.
In this new blog series, I attempt to explain data modelling and data architecture in terms that can be understood by everyone - even if you have no formal training in computer science or relational algebra. This is no easy feat and it is one I take on with hesitation. It is my hope that as the series progresses, discussions will occur that help steer the course.
It is easier to talk about data modelling if we have a concrete example.
Imagine an online business with two divisions selling:

- Apparel and footwear, through a web store
- Coffee equipment, through a small coffee store
Our business wants to view all the data and analyse it across divisions. Both divisions load their data into a big table designed by a recently hired data expert.
The web store is the more advanced of the two. It has an automated checkout process built on a modern shopping platform.
The web store sells these products:

- Winter Jacket
- Jeans
- Trail Runner
The apparel business has a busy marketing department, and they recently renamed their "Winter Jacket" to "Alpine Shell".
Customer addresses are entered via a validated web form.
The coffee store tracks all its products in an Excel spreadsheet and manually dispatches products to customers. Customers enter their address in a freeform text field.
Products sold are:

- Espresso Grinder
- Coffee Cup
- Aeropress
- French Press
- Pour Over Kettle
- Burr Grinder
The price of "Coffee Cup" was recently increased.
The data engineer designed the following table to hold all information from our business. This is simply a raw representation of the input data (which is sometimes referred to as "Bronze" or "Staging").
| time | action | name | city | country | sku | product | size | value |
|---|---|---|---|---|---|---|---|---|
| 2026-05-01 09:00 | view | | London | England | 2001 | Winter Jacket | M | 0.0000 |
| 2026-05-01 09:10 | purchase | Lisa | Paris | Denmark | 2001 | Winter Jacket | M | 129.9900 |
| 2026-05-03 12:00 | refund | Lisa | Paris | Denmark | 2001 | Alpine Shell | M | -99.9900 |
| 2026-05-06 14:20 | add_to_cart | Emma | London | United Kingdom | 5102 | Jeans | L | 0.0000 |
| 2026-05-10 11:00 | purchase | Emma | London | United Kingdom | 5102 | Jeans | L | 59.0000 |
| 2026-05-11 15:05 | refund | Emma | London | United Kingdom | 5102 | Jeans | L | -29.0000 |
| 2026-05-20 14:40 | purchase | Hans | Berlin | Germany | 3104 | Trail Runner | 42 | 89.5000 |
| 2026-05-02 10:00 | purchase | Hans | Århus | Denmark | 4110 | Espresso Grinder | | 249.0000 |
| 2026-05-04 10:00 | purchase | Hans | Aarhus | DK | 7305 | Coffee Cup | | 18.0000 |
| 2026-05-12 08:00 | view | Noah | London | United Kingdom | 8402 | Aeropress | | 0.0000 |
| 2026-05-12 08:05 | add_to_cart | Noah | London | UK | 8402 | Aeropress | | 0.0000 |
| 2026-05-12 08:10 | purchase | Noah | London | Canada | 8402 | Aeropress | | 24.5000 |
| 2026-05-12 08:25 | purchase | Noah | London | CA | 8601 | French Press | | 34.0000 |
| 2026-05-13 09:00 | refund | Noah | London | Canada | 8601 | French Press | | -10.0000 |
| 2026-05-14 11:00 | wishlist_add | Sofia | Paris | France | 8702 | Pour Over Kettle | | 0.0000 |
| 2026-05-14 11:03 | shipping_quote | Sofia | Paris | France | 8702 | Pour Over Kettle | | 0.0000 |
| 2026-05-14 11:10 | purchase | Sofia | Paris | Denmark | 8702 | Pour Over Kettle | | 44.0000 |
| 2026-05-15 18:30 | purchase | Morgane | Paris | France | 8402 | Aeropress | | 24.5000 |
| 2026-05-17 07:55 | purchase | Michael | San Jose | United States | 7305 | Coffee Cup | | 19.9900 |
| 2026-05-17 08:02 | refund | Michael | San José | USA | 7305 | Coffee Cup | | -19.9900 |
| 2026-05-19 10:00 | purchase | Vlad | Praha | Czechia | 8805 | Burr Grinder | | 99.0000 |
| 2026-05-19 10:10 | refund | Vlad | Prague | Czech Republic | 8805 | Burr Grinder | | -20.0000 |
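To see how little the raw representation tells us, here is a minimal sketch (with two rows hand-copied from the table above) of what the "Bronze" layer really is: a list of untyped records.

```python
# The raw "Bronze" layer is just a list of untyped records.
# (Two rows hand-copied from the table above.)
bronze = [
    {"time": "2026-05-02 10:00", "action": "purchase", "name": "Hans",
     "city": "Århus", "country": "Denmark", "sku": 4110,
     "product": "Espresso Grinder", "size": None, "value": 249.00},
    {"time": "2026-05-04 10:00", "action": "purchase", "name": "Hans",
     "city": "Aarhus", "country": "DK", "sku": 7305,
     "product": "Coffee Cup", "size": None, "value": 18.00},
]

# Same person, same city -- but nothing in the data says so:
locations = {(r["name"], r["city"], r["country"]) for r in bronze}
print(len(locations))  # 2 distinct tuples for what a human reads as one fact
```

Nothing in these records says that "Århus, Denmark" and "Aarhus, DK" are the same place; the data merely records what was typed in.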
There are clearly patterns here that both humans and machines can spot easily.
We have data, but we have not yet captured its meaning.
Let us ask ChatGPT a question about our data, a question whose answer is obvious to a human, but which can confuse the machine:
Me: "Given the data above, how confident are you that cities belong to countries?"
Depending on how you prompt your LLM, your mileage will vary. I get this:
ChatGPT: Short answer: low confidence overall (~55-65%), with strong variance by row.
... blah blah blah...
Confidence interpretation:
City field itself: high confidence (~90%)
Country field: low confidence (~55%)
City-country pairing: moderate-low (~60%)
The LLM tries to infer meaning from observations that do not have a clearly defined structure.
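We can reproduce the ambiguity without an LLM. This sketch groups a hand-copied subset of the (city, country) pairs from the table and shows that a single city string maps to several country strings:

```python
from collections import defaultdict

# (city, country) pairs hand-copied from the table above.
pairs = [
    ("London", "England"), ("London", "United Kingdom"), ("London", "UK"),
    ("London", "Canada"), ("London", "CA"),
    ("Paris", "Denmark"), ("Paris", "France"),
    ("Århus", "Denmark"), ("Aarhus", "DK"),
]

countries_by_city = defaultdict(set)
for city, country in pairs:
    countries_by_city[city].add(country)

# Cities whose observed country is ambiguous:
for city, countries in sorted(countries_by_city.items()):
    if len(countries) > 1:
        print(city, "->", sorted(countries))
```

With observations alone, "London" belongs to five different country strings, and nothing distinguishes a typo ("London, Canada") from a legitimate second London.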
Data represented purely as observations leads to ambiguity and bad reasoning. Speaking only about observations is like speaking about the weather: it carries no meaning. Intuitively, we know this: naming "things" and using those names consistently to communicate is how we make sense of the world.
Data modelling is the discipline of organising observations and data into "types of things".
In our sample, we can identify various "things", for example:
| "Things" | "Type of Thing" |
|---|---|
| "Michael", "Lisa", "Hans" | User |
| "Germany", "Denmark", "USA" | Country |
| "London", "Berlin", "Aarhus", "Paris" | City |
| "Winter Jacket", "French Press", "Espresso Grinder" | Product |
| "view", "purchase", "add_to_cart" | Action |
Terms like "Things" and "Types of Thing" are a bit clunky, so let me introduce some terminology:

- Entity: a "thing" we can observe and name, such as "Hans" or "Denmark"
- Entity type: a "type of thing" that entities belong to, such as User or Country
Note that even if we know nothing about a problem domain, we can still identify entity types by looking at data and generalising from what we see.
LLMs are particularly good at this kind of inference and can help us model data.
Once we have identified entity types, we can refine our understanding and ask: "What values in our observations belong to each entity type?"
For example, we can ask: "What values belong to Users?" Our answer could be:

- A name, e.g. "Lisa"
- A city, e.g. "Paris"
- A country, e.g. "Denmark"
Here, we notice that there is something different about name and city/country.
By listing the attributes and relationships for entity type User we are forced to think carefully about what our observations actually mean.
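To make the distinction concrete, here is one way (a hypothetical sketch, not the only possible design) to record those choices in code: name is an attribute of User, while city points to a separate City entity, which in turn relates to a Country entity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Country:
    name: str

@dataclass(frozen=True)
class City:
    name: str
    country: Country  # relationship: a City belongs to exactly one Country

@dataclass(frozen=True)
class User:
    name: str   # attribute: a value that describes the user
    city: City  # relationship: a link to another entity

# Modelled this way, "Århus" and "Aarhus" resolve to one City entity:
aarhus = City("Aarhus", Country("Denmark"))
hans = User("Hans", aarhus)
```

The point is not the Python syntax but the choice it forces: once City is an entity with a single country, the question that confused the LLM has an unambiguous answer.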
Data modelling is the act of making choices consciously instead of hoping they emerge from observed data or that LLMs will magically recognise them.
Today, we looked at a data example that we will use in the next part of this series. Hopefully, I have at least planted the idea that unmodelled observations are not good foundations for analytics - even with LLM support.
We then looked at how we can begin to categorise our data and make structured sense of it.
I introduced several concepts that we will be using in this series:

- Entity: a concrete "thing" we observe in the data
- Entity type: a "type of thing", such as User, City or Product
- Attribute: a value that belongs to an entity, such as a user's name
- Relationship: a link between entities, such as a city belonging to a country
Note that I picked terminology that hopefully does not carry too much cognitive baggage from historical terms (though the above is somewhat inspired by E/R modelling).
In the next instalment of the series, we will use our new concepts to reason over the data.
See you soon.