Data modelling is the discipline of interpreting and extracting meaning by imposing structures on data. Like humans, LLMs benefit from such structure.
As we move into a future dominated by machines thinking for us, I expect that data modelling will be more important than ever.
Unfortunately, the discipline of data modelling is often obscured by changing terminologies that vary by vendor and by the never-ending hype cycles that plague our industry.
In this new blog series, I attempt to explain data modelling and data architecture in terms that can be understood by everyone - even if you have no formal training in computer science or relational algebra. This is no easy feat and it is one I take on with hesitation. It is my hope that as the series progresses, discussions will occur that help steer the course.
It is easier to talk about data modelling if we have a concrete example.
Imagine an online business with two divisions selling:

- Apparel and footwear, through a web store
- Coffee equipment, through a small coffee store
Our business wants to view all the data and analyse it across divisions. Both divisions load their data into a big table designed by a recently hired data expert.
The web store is the more advanced of the two. It has an automated checkout process built on a modern shopping platform.
The web store sells these products:

- Winter Jacket
- Jeans
- Trail Runner
The apparel business has a busy marketing department, and they recently renamed their "Winter Jacket" to "Alpine Shell".
Customer addresses are entered via a validated web form.
The coffee store tracks all its products in an Excel spreadsheet and manually dispatches products to customers. Customers enter their address in a freeform text field.
Products sold are:

- Espresso Grinder
- Coffee Cup
- Aeropress
- French Press
- Pour Over Kettle
- Burr Grinder
The price of "Coffee Cup" was recently increased.
The data engineer designed the following table to hold all information from our business. This is simply a raw representation of the input data (which is sometimes referred to as "Bronze" or "Staging").
| time | action | name | city | country | sku | product | size | value |
|---|---|---|---|---|---|---|---|---|
| 2026-05-01 09:00 | view | | London | England | 2001 | Winter Jacket | M | 0.0000 |
| 2026-05-01 09:10 | purchase | Lisa | Paris | Denmark | 2001 | Winter Jacket | M | 129.9900 |
| 2026-05-03 12:00 | refund | Lisa | Paris | Denmark | 2001 | Alpine Shell | M | -99.9900 |
| 2026-05-06 14:20 | add_to_cart | Emma | London | United Kingdom | 5102 | Jeans | L | 0.0000 |
| 2026-05-10 11:00 | purchase | Emma | London | United Kingdom | 5102 | Jeans | L | 59.0000 |
| 2026-05-11 15:05 | refund | Emma | London | United Kingdom | 5102 | Jeans | L | -29.0000 |
| 2026-05-20 14:40 | purchase | Hans | Berlin | Germany | 3104 | Trail Runner | 42 | 89.5000 |
| 2026-05-02 10:00 | purchase | Hans | Århus | Denmark | 4110 | Espresso Grinder | | 249.0000 |
| 2026-05-04 10:00 | purchase | Hans | Aarhus | DK | 7305 | Coffee Cup | | 18.0000 |
| 2026-05-12 08:00 | view | Noah | London | United Kingdom | 8402 | Aeropress | | 0.0000 |
| 2026-05-12 08:05 | add_to_cart | Noah | London | UK | 8402 | Aeropress | | 0.0000 |
| 2026-05-12 08:10 | purchase | Noah | London | Canada | 8402 | Aeropress | | 24.5000 |
| 2026-05-12 08:25 | purchase | Noah | London | CA | 8601 | French Press | | 34.0000 |
| 2026-05-13 09:00 | refund | Noah | London | Canada | 8601 | French Press | | -10.0000 |
| 2026-05-14 11:00 | wishlist_add | Sofia | Paris | France | 8702 | Pour Over Kettle | | 0.0000 |
| 2026-05-14 11:03 | shipping_quote | Sofia | Paris | France | 8702 | Pour Over Kettle | | 0.0000 |
| 2026-05-14 11:10 | purchase | Sofia | Paris | Denmark | 8702 | Pour Over Kettle | | 44.0000 |
| 2026-05-15 18:30 | purchase | Morgane | Paris | France | 8402 | Aeropress | | 24.5000 |
| 2026-05-17 07:55 | purchase | Michael | San Jose | United States | 7305 | Coffee Cup | | 19.9900 |
| 2026-05-17 08:02 | refund | Michael | San José | USA | 7305 | Coffee Cup | | -19.9900 |
| 2026-05-19 10:00 | purchase | Vlad | Praha | Czechia | 8805 | Burr Grinder | | 99.0000 |
| 2026-05-19 10:10 | refund | Vlad | Prague | Czech Republic | 8805 | Burr Grinder | | -20.0000 |
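To see how little the raw representation tells us, here is a minimal sketch (with two rows hand-copied from the table above) of what the "Bronze" layer really is: a list of untyped records.

```python
# The raw "Bronze" layer is just a list of untyped records.
# (Two rows hand-copied from the table above.)
bronze = [
    {"time": "2026-05-02 10:00", "action": "purchase", "name": "Hans",
     "city": "Århus", "country": "Denmark", "sku": 4110,
     "product": "Espresso Grinder", "size": None, "value": 249.00},
    {"time": "2026-05-04 10:00", "action": "purchase", "name": "Hans",
     "city": "Aarhus", "country": "DK", "sku": 7305,
     "product": "Coffee Cup", "size": None, "value": 18.00},
]

# Same person, same city -- but nothing in the data says so:
locations = {(r["name"], r["city"], r["country"]) for r in bronze}
print(len(locations))  # 2 distinct tuples for what a human reads as one fact
```

Nothing in these records says that "Århus, Denmark" and "Aarhus, DK" are the same place; the data merely records what was typed in.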
There are clearly patterns here that both humans and machines can spot easily.
We have data, but we have not yet captured its meaning.
Let us ask ChatGPT a question about our data, a question whose answer is obvious to a human, but which can confuse the machine:
Me: "Given the data above, how confident are you that cities belong to countries?"
Depending on how you prompt your LLM, your mileage will vary. I get this:
ChatGPT: Short answer: low confidence overall (~55-65%), with strong variance by row.
... blah blah blah...
Confidence interpretation:
City field itself: high confidence (~90%)
Country field: low confidence (~55%)
City-country pairing: moderate-low (~60%)
The LLM tries to infer meaning from observations that do not have a clearly defined structure.
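We can reproduce the ambiguity without an LLM. This sketch groups a hand-copied subset of the (city, country) pairs from the table and shows that a single city string maps to several country strings:

```python
from collections import defaultdict

# (city, country) pairs hand-copied from the table above.
pairs = [
    ("London", "England"), ("London", "United Kingdom"), ("London", "UK"),
    ("London", "Canada"), ("London", "CA"),
    ("Paris", "Denmark"), ("Paris", "France"),
    ("Århus", "Denmark"), ("Aarhus", "DK"),
]

countries_by_city = defaultdict(set)
for city, country in pairs:
    countries_by_city[city].add(country)

# Cities whose observed country is ambiguous:
for city, countries in sorted(countries_by_city.items()):
    if len(countries) > 1:
        print(city, "->", sorted(countries))
```

With observations alone, "London" belongs to five different country strings, and nothing distinguishes a typo ("London, Canada") from a legitimate second London.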
Data represented purely as observations leads to ambiguity and bad reasoning. Speaking only about observations is like speaking about the weather: it carries no meaning. Intuitively, we know this: naming "things" and using those names consistently to communicate is how we make sense of the world.
Data modelling is the discipline of organising observations and data into "types of things".
In our sample, we can identify various "things", for example:
| "Things" | "Type of Thing" |
|---|---|
| "Michael", "Lisa", "Hans" | User |
| "Germany", "Denmark", "USA" | Country |
| "London", "Berlin", "Aarhus", "Paris" | City |
| "Winter Jacket", "French Press", "Espresso Grinder" | Product |
| "view", "purchase", "add_to_cart" | Action |
Terms like "Things" and "Types of Thing" are a bit clunky, so let me introduce some terminology:

- Entity: a "thing" we can observe and name, such as "Hans" or "Denmark"
- Entity type: a "type of thing" that entities belong to, such as User or Country
Note that even if we know nothing about a problem domain, we can still identify entity types by looking at data and generalising from what we see.
LLMs are particularly good at this kind of inference and can help us model data.
Once we have identified entity types, we can refine our understanding and ask: "What values in our observations belong to each entity type?"
For example, we can ask: "What values belong to Users?" Our answer could be:

- A name, e.g. "Lisa"
- A city, e.g. "Paris"
- A country, e.g. "Denmark"
Here, we notice that there is something different about name and city/country.
By listing the attributes and relationships for entity type User we are forced to think carefully about what our observations actually mean.
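To make the distinction concrete, here is one way (a hypothetical sketch, not the only possible design) to record those choices in code: name is an attribute of User, while city points to a separate City entity, which in turn relates to a Country entity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Country:
    name: str

@dataclass(frozen=True)
class City:
    name: str
    country: Country  # relationship: a City belongs to exactly one Country

@dataclass(frozen=True)
class User:
    name: str   # attribute: a value that describes the user
    city: City  # relationship: a link to another entity

# Modelled this way, "Århus" and "Aarhus" resolve to one City entity:
aarhus = City("Aarhus", Country("Denmark"))
hans = User("Hans", aarhus)
```

The point is not the Python syntax but the choice it forces: once City is an entity with a single country, the question that confused the LLM has an unambiguous answer.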
Data modelling is the act of making choices consciously instead of hoping they emerge from observed data or that LLMs will magically recognise them.
Today, we looked at a data example that we will use in the next part of this series. Hopefully, I have at least planted the idea that unmodelled observations are not good foundations for analytics - even with LLM support.
We then looked at how we can begin to categorise our data and make structured sense of it.
I introduced several concepts that we will be using in this series:

- Entity: a concrete "thing" we observe in the data
- Entity type: a "type of thing", such as User, City or Product
- Attribute: a value that belongs to an entity, such as a user's name
- Relationship: a link between entities, such as a city belonging to a country
Note that I picked terminology that hopefully does not carry too much cognitive baggage from historical terms (though the above is somewhat inspired by E/R modelling).
In the next instalment of the series, we will use our new concepts to reason over the data.
See you soon.