Why bother with Data Modelling? - Part 1

Data modelling is the discipline of interpreting and extracting meaning by imposing structures on data. Like humans, LLMs benefit from such structure.

As we move into a future dominated by machines thinking for us, I expect that data modelling will be more important than ever.

Unfortunately, the discipline of data modelling is often obscured by changing terminologies that vary by vendor and by the never-ending hype cycles that plague our industry.

In this new blog series, I attempt to explain data modelling and data architecture in terms that can be understood by everyone - even if you have no formal training in computer science or relational algebra. This is no easy feat and it is one I take on with hesitation. It is my hope that as the series progresses, discussions will occur that help steer the course.

Example Data Problem

It is easier to talk about data modelling if we have a concrete example.

Imagine an online business with two divisions selling:

  • Apparel
  • Coffee Products

Our business wants to view all the data and analyse it across divisions. Both divisions load their data into a big table designed by a recently hired data expert.

Apparel Store

This store is the most advanced. There is an automated checkout process in place built on a modern shopping platform.

The web store sells these products:

  • Winter Jacket
  • Trail Runner
  • Jeans

The apparel business has a busy marketing department, and they recently renamed their "Winter Jacket" to "Alpine Shell".

Customer addresses are entered via a validated web form.

Coffee Store

The coffee store tracks all its products in an Excel spreadsheet and manually dispatches products to customers. Customers enter their address in a freeform text field.

Products sold are:

  • Coffee Cup
  • Aeropress
  • Espresso Grinder
  • French Press
  • Pour Over Kettle
  • Burr Grinder

The price of "Coffee Cup" was recently increased.

Observations

The data engineer designed the following table to hold all information from our business. This is simply a raw representation of the input data (which is sometimes referred to as "Bronze" or "Staging").

| time | action | name | city | country | sku | product | size | value |
|---|---|---|---|---|---|---|---|---|
| 2026-05-01 09:00 | view | | London | England | 2001 | Winter Jacket | M | 0.0000 |
| 2026-05-01 09:10 | purchase | Lisa | Paris | Denmark | 2001 | Winter Jacket | M | 129.9900 |
| 2026-05-03 12:00 | refund | Lisa | Paris | Denmark | 2001 | Alpine Shell | M | -99.9900 |
| 2026-05-06 14:20 | add_to_cart | Emma | London | United Kingdom | 5102 | Jeans | L | 0.0000 |
| 2026-05-10 11:00 | purchase | Emma | London | United Kingdom | 5102 | Jeans | L | 59.0000 |
| 2026-05-11 15:05 | refund | Emma | London | United Kingdom | 5102 | Jeans | L | -29.0000 |
| 2026-05-20 14:40 | purchase | Hans | Berlin | Germany | 3104 | Trail Runner | 42 | 89.5000 |
| 2026-05-02 10:00 | purchase | Hans | Århus | Denmark | 4110 | Espresso Grinder | | 249.0000 |
| 2026-05-04 10:00 | purchase | Hans | Aarhus | DK | 7305 | Coffee Cup | | 18.0000 |
| 2026-05-12 08:00 | view | Noah | London | United Kingdom | 8402 | Aeropress | | 0.0000 |
| 2026-05-12 08:05 | add_to_cart | Noah | London | UK | 8402 | Aeropress | | 0.0000 |
| 2026-05-12 08:10 | purchase | Noah | London | Canada | 8402 | Aeropress | | 24.5000 |
| 2026-05-12 08:25 | purchase | Noah | London | CA | 8601 | French Press | | 34.0000 |
| 2026-05-13 09:00 | refund | Noah | London | Canada | 8601 | French Press | | -10.0000 |
| 2026-05-14 11:00 | wishlist_add | Sofia | Paris | France | 8702 | Pour Over Kettle | | 0.0000 |
| 2026-05-14 11:03 | shipping_quote | Sofia | Paris | France | 8702 | Pour Over Kettle | | 0.0000 |
| 2026-05-14 11:10 | purchase | Sofia | Paris | Denmark | 8702 | Pour Over Kettle | | 44.0000 |
| 2026-05-15 18:30 | purchase | Morgane | Paris | France | 8402 | Aeropress | | 24.5000 |
| 2026-05-17 07:55 | purchase | Michael | San Jose | United States | 7305 | Coffee Cup | | 19.9900 |
| 2026-05-17 08:02 | refund | Michael | San José | USA | 7305 | Coffee Cup | | -19.9900 |
| 2026-05-19 10:00 | purchase | Vlad | Praha | Czechia | 8805 | Burr Grinder | | 99.0000 |
| 2026-05-19 10:10 | refund | Vlad | Prague | Czech Republic | 8805 | Burr Grinder | | -20.0000 |

There are clearly patterns here that both humans and machines can spot easily.

We have data, but we have not yet captured its meaning.

Entities and Entity Types from Observations

Let us ask ChatGPT a question about our data, a question whose answer is obvious to a human, but which can confuse the machine:

Me: "Given the data above, how confident are you that cities belong to countries?"

Depending on how you prompt your LLM, your mileage will vary. I get this:

ChatGPT: Short answer: low confidence overall (~55-65%), with strong variance by row.

... blah blah blah...

Confidence interpretation:

City field itself: high confidence (~90%) 

Country field: low confidence (~55%)

City-country pairing: moderate-low (~60%)

The LLM tries to infer meaning from observations that do not have a clearly defined structure.

Data represented purely as observations leads to ambiguity and bad reasoning. Speaking only about observations is like speaking about the weather: it carries no meaning. Intuitively, we know this: naming "things" and using those names consistently to communicate is how we make sense of the world.
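To make the ambiguity concrete, one can tally which country strings appear next to each city. The following is a minimal Python sketch over a hand-copied subset of the (city, country) pairs from the table above; the variable names are illustrative assumptions, not part of any tool or of the original dataset.

```python
from collections import defaultdict

# Hand-copied subset of (city, country) pairs from the observation table.
observations = [
    ("London", "England"),
    ("Paris", "Denmark"),
    ("London", "United Kingdom"),
    ("Berlin", "Germany"),
    ("London", "UK"),
    ("London", "Canada"),
    ("London", "CA"),
    ("Paris", "France"),
]

# Collect every distinct country string seen alongside each city.
countries_by_city = defaultdict(set)
for city, country in observations:
    countries_by_city[city].add(country)

# Keep only the cities that were observed with more than one country string.
ambiguous = {
    city: sorted(countries)
    for city, countries in countries_by_city.items()
    if len(countries) > 1
}

for city, countries in sorted(ambiguous.items()):
    print(f"{city}: {countries}")
# London: ['CA', 'Canada', 'England', 'UK', 'United Kingdom']
# Paris: ['Denmark', 'France']
```

The tally only shows that the strings disagree. It cannot say whether "UK" and "England" name the same place, or whether the "Paris" rows describe one city or two.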

Data modelling is the discipline of organising observations and data into "types of things".

In our sample, we can identify various "things", for example:

| "Things" | "Type of Thing" |
|---|---|
| "Michael", "Lisa", "Hans" | User |
| "Germany", "Denmark", "USA" | Country |
| "London", "Berlin", "Aarhus", "Paris" | City |
| "Winter Jacket", "French Press", "Espresso Grinder" | Product |
| "view", "purchase", "add_to_cart" | Action |

Terms like "Things" and "Types of Thing" are a bit clunky, so let me introduce some terminology:

  • Entity Type ::= Type of thing
    • The term we use to group "things" together into general types of those things
    • Ex: User, Country
    • When I talk about entity types, I will use capitalised Boldface.
    • Ex: User, Country
  • Entity ::= Thing
    • The specific thing
    • Ex: The User called "Michael" and the User called "Lisa"
    • I will talk about these by "quoting" them. Ex: "Lisa"
    • If I want to clearly call out what entity type an entity has, I will use this notation: "Lisa": User (inspired by Python)
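The "Lisa": User notation reads much like a Python variable annotation. As a rough sketch, entity types can be written as classes and entities as instances of those classes. The class layout and the single `name` field are assumptions made for illustration only, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Entity type: the general kind of thing."""
    name: str


@dataclass(frozen=True)
class Country:
    """Another entity type."""
    name: str


# Entities: specific things, each belonging to one entity type.
lisa: User = User(name="Lisa")
michael: User = User(name="Michael")
denmark: Country = Country(name="Denmark")

print(type(lisa).__name__)      # User
print(type(denmark).__name__)   # Country
print(lisa == michael)          # False
```

Here `User` and `Country` play the role of entity types, while `lisa`, `michael` and `denmark` are entities.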

Note that even if we know nothing about a problem domain, we can still identify entity types by looking at the data and generalising from what we see.

LLMs are particularly good at this kind of inference and can help us model data.

Attributes and Relationships in Entity Types

Once we have identified entity types, we can refine our understanding and ask: "What values in our observations belong to each entity type?"

For example, we can ask: "What values belong to Users?" Our answer could be:

  • name
  • city
  • country

Here, we notice that name differs in kind from city and country.

  • name is specific to the User entity type and not itself an entity type
    • We shall call these: attributes
    • I will refer to attributes with this notation: User.name - the attribute name in the entity type User
  • city appears to indicate that another entity type exists called City
    • The entity type User is related to the City entity type.
    • We shall call such links between entity types: relationships
    • We refer to relationships with this notation: User.city -> City.
    • This means: the city value in entity type User is a relationship to the entity type City.
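The distinction between attributes and relationships can be sketched in Python as well. In the illustration below, a field holding a plain value stands for an attribute, and a field whose type is another entity type stands for a relationship. The classes and the `kinds` helper are hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass(frozen=True)
class City:
    name: str  # attribute: City.name


@dataclass(frozen=True)
class User:
    name: str   # attribute: User.name
    city: City  # relationship: User.city -> City


hans = User(name="Hans", city=City(name="Berlin"))

# Classify each field of User: a field typed as another entity type
# (here: another dataclass) is treated as a relationship.
kinds = {
    f.name: "relationship" if is_dataclass(f.type) else "attribute"
    for f in fields(User)
}

print(kinds)           # {'name': 'attribute', 'city': 'relationship'}
print(hans.city.name)  # Berlin
```

Following `hans.city` leads to a `City` entity, whereas `hans.name` is just a value that belongs to the `User`.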

Narrowing meaning with Relationships and Attributes

By listing the attributes and relationships for entity type User we are forced to think carefully about what our observations actually mean.

  • Is User.city -> City defined as the place the user lives?
    • Perhaps it is the shipping address used for things the user buys?
    • Maybe it is the location the user is browsing from?
  • Consider "Vlad": User, who has two different User.country -> Country values "Czechia" and "Czech Republic"
    • Are these two different people called "Vlad" who spell their country differently?
    • Or are they the same person who entered the country name in an unvalidated form?
    • Is there a way to tell?
  • "2001": Product.sku appears with two product names: "Winter Jacket" and "Alpine Shell"
    • Do these represent the same entity of type Product, but with different names?
    • ...Or was the Product.sku reused and they are entirely different entities?
  • What entities of type Country actually exist and which ones are the same but with different names?
    • Are "United States" and "USA" the same Country entity?
    • How about "United Kingdom" and "England"?
  • There are two conflicting "Hans" entries:
    • "Hans": User in "Aarhus": City
    • "Hans": User in "Berlin": City
    • Are these two different people, or did the same "Hans" move to another City?
  • Is "Paris": City located in "France": Country or located in "Denmark": Country?
    • There is in fact a "Paris" in "Denmark" too (more about that in a later blog post).
  • If we want to report on our profit per Country:
    • Do we mean the country the User lived in when the item was purchased?
    • Or the Country the User currently lives in?
    • Again, the data itself does not tell us - we must choose.

Data modelling is the act of making choices consciously instead of hoping they emerge in observed data or that LLMs will magically recognise them.
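One way to make such a choice explicit is to write it down as data. The sketch below lists every distinct country string from the observation table and applies a hypothetical alias map. The map is an assumption made for illustration: whether "England" should fold into "United Kingdom", or whether "CA" means Canada at all, is exactly the kind of decision the observations cannot make for us.

```python
# Every distinct country string that appears in the observation table.
raw_countries = {
    "England", "Denmark", "United Kingdom", "Germany", "DK", "UK",
    "Canada", "CA", "France", "United States", "USA",
    "Czechia", "Czech Republic",
}

# HYPOTHETICAL modelling choice: which strings name the same Country entity.
# A different modeller could reasonably decide otherwise.
country_aliases = {
    "DK": "Denmark",
    "UK": "United Kingdom",
    "England": "United Kingdom",
    "CA": "Canada",
    "USA": "United States",
    "Czech Republic": "Czechia",
}


def canonical_country(raw: str) -> str:
    """Map a raw country string to the Country entity we chose for it."""
    return country_aliases.get(raw, raw)


entities = sorted({canonical_country(c) for c in raw_countries})

print(len(raw_countries), "strings ->", len(entities), "Country entities")
# 13 strings -> 7 Country entities
print(entities)
# ['Canada', 'Czechia', 'Denmark', 'France', 'Germany', 'United Kingdom', 'United States']
```

Changing a single line in `country_aliases` changes how many Country entities exist, and with it every report grouped by Country.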

Summary of Part 1

Today, we looked at a data example that we will use in the next part of this series. Hopefully, I have at least planted the idea that unmodelled observations are not good foundations for analytics - even with LLM support.

We then looked at how we can begin to categorise our data and make structured sense of it.

I introduced several concepts that we will be using in this series:

  • Entity Types ::= The types of things we can identify in data
  • Entity ::= The individual instances of entity types
  • Attributes and relationships ::= The shape of entity types and how they relate to other entity types

Note that I picked a terminology that hopefully does not carry too much cognitive baggage from historical terms (though the above is somewhat inspired by E/R modelling).

In the next instalment of the series, we will use our new concepts to reason over the data.

See you soon.

Author
Database Doctor