Why bother with Data Modelling

Previously in this series, we learned about Entities and Entity Types. But, I left a question open: How do we tell one entity apart from another?

In today's entry - we need to talk about keys and how they help us identify individual entities.

Telling similar things apart

Recall our test data - we had two different "Hans": User:

One "Hans" lived in "Berlin"
The other "Hans" lives in "Aarhus" (In "Denmark").

We decided that these two entities are different entities - simply because they live in different cities. This choice seems arbitrary (and it is): Just because two User live in separate cities it does not follow they are different people in the real world. What would we do if "Hans" in "Berlin" moved to "Aarhus"?

We need another attribute of User that uniquely identifies a specific entity. Attributes that have the property of being unique to an entity are called: Keys.

Keys are not trivial

Picking the right attribute of an entity type that unique identifies each entity is tricky. This isn't a database or IT specific problem - civilisations have struggled with this all through history.

Consider the problem of uniquely identifying humans. Roman legionaries would carry a signaculum to uniquely identify individuals, similar to the dog tags used by modern soldiers.

Today, we have better keys.

Super Keys: Biological Markers

You could use my DNA sequence to identify me. But my full DNA string is several GB large. If I had an identical twin there would be two people out there with the same key.

Alternatively, we could use my fingerprint. Even if I had an identical twin brother, he would have different fingerprints. But it would not be practical if you had to store my fingerprint in every system that has a record about me. Collecting my fingerprint is a time consuming process requiring specialised equipment. We need a more compact and practical way to identify humans.

Governments and standard committees often take on the task of creating special, easy to remember, numbers that identify their citizens.

Smart Keys

Consider the Danish system for assigning keys to citizens.

Danish citizens have a unique key called a "CPR number".

A Danish CPR number has the format: DDMMYY-SSSS.

DDMMYY encodes the date of birth
The last four digits form a running number
The final digit in the running number also indicates the gender marker at birth (even = female, odd = male)
The running number also helps distinguish century for people born more than 100 years apart

The Danish CPR number is an example of a "smart key". The structure of the key carries information about the entity it identifies.

The problem with Smart Keys

At this point, you might have noticed something off about the Danish CPR system:

What happens if too many people need identifiers with the same birth date?
For people who live more than 100 years - how do we know which year they are born?
- YY is only two digits, so we need extra interpretation rules to distinguish century.
- The rules for using the running number (the SSSS part) to identify the century are complex and reduce the amount of unique keys available.
If a person has gender reassignment surgery - what does the last digit now mean? (in practise, you get a new CPR number in this case)
Is it really your business to know how old I am to unique identify me in a database system?
What to do about people who for various reasons don't actually know their exact birthday and still need a CPR number assigned?

Smart keys are generally problematic, because they contain information about the entity that may change over time or may simply not be known when they are assigned.

It is hard to design a smart key which carries information which does not change. Because it isn't always clear what information may change.

Dumb Keys

The US Social Security Number uses this format: XXX-XX-XXXX. Like the Danish CPR number, the key is small enough for a human to remember and it can easily be stored in database systems.

Since SSN randomisation was introduced in June 2011, newly assigned SSNs no longer follow the older geographic allocation pattern. Unlike the Danish CPR number, a modern US SSN does not encode additional facts about the individual it refers to.

The SSN key is primarily just a value. This is an example of a "dumb key" - it carries no additional payload.

Dumb keys are better

We are often better off using dumb keys to identify entities, and let the information that needs to be carried be part of other attributes of the entity, instead of using the key to encode it.

This is another case of the "rules vs structure" argument.

Compare the US Person entity using SSN type with the Danish Person tracking using CPR number:

Data model: Danish citizens

Person.cpr (Key)
Rule for dob: DOB found by taking leftmost 6 digits of Person.CPR and using 7th - 10th digit to determine century
Rule for gender: If last digit is even: female, else Male

Data Model: US citizens

Person.ssn (Key)
Person.dob
Person.gender

Dumb keys can be too dumb

Unfortunately, dumb keys can introduce issues too.

Going back to Denmark, the system for assigning keys to Vehicles via license plates is an example of using dumb keys that are too dumb.

In the Danish system, license plates have this format: AA XX XXX, where A is a letter and X is a digit. The numbers are assigned sequentially, i.e. FL 00 000 is followed by FL 00 001, and FL 99 999 is followed by FM 00 000.

The total number of possible license plates in Denmark (using the 26 letters in the latin-9 alphabet) is:

100000 * 26 * 26 ~ 67M.

Throughout history, more than 67M cars have been owned and registered in Denmark. This forces the reuse of license plates (though only when the old vehicle has been scrapped). But we now have database entries where a dumb key can refer to two completely different Vehicle entities.

Natural Keys

The keys we find in the real world are typically called "natural keys". Examples include:

SSN / CPR and other national ID systems
Licence plates
SKU
Serial Numbers
IMEI numbers of mobile devices
Country Codes
Currency Codes

As we have now seen - such keys must be carefully designed to avoid problems with duplication, payload info, privacy and key space.

Surrogate Keys

Sometimes, the same entity can have different keys referring to it. Consider the following case for me, the "Thomas Kejser": User entity. I have unique keys in two different countries:

In Denmark I have a CPR number:
- User.CPR: "221274-***1"
In Estonia (where I live) I am identified by a key the Estonians call my Isikukood:
- User.Isikukood: "3741222***6"

What do you do, if you are data modelling a system where multiple keys refer to the same individual? Which one do you pick?

Cases exist where we do not have a natural key for an entity. For example: My dog, before he had his chip implanted, was still an individual animal - but he did not have a key (or even a name) yet.

As we have seen, humans are also quite poor at designing key systems in the first place.

Instead of trusting natural keys to exist and behave, it is often better to invent an entirely different key to represent entities in database systems you are designing. Such invented keys are called "Surrogate keys", and they have a lot of advantages:

You can pick a data type for the key that is compact (for example, a 64 bit integer)
- This speeds up join operations in the data model
- There are a lot of 64-bit integers
- The keys only need to be read by machines, so we are not under any obligation to have humans remember them
The key can be made in such a way that it can be generated independently by many agents without coordination
- Examples: UUID or Snowflake ID.
If designed properly, surrogate keys will never change and never be reused to represent another entity
- Saving your trouble of updating historical data
- Allowing you to handle cases where either smart keys or dumb keys change over time or get reused
You always have something that identifies an entity, even if you don't yet have another "real world" key identifying the entity
You can secure or mask personally identifiable data (such as birthdays) that may be present in smart keys
- while still exposing a key that can be used for joining and filtering when users consume your data model

Because of this, it is useful to assume that your data model should always use surrogate keys - unless you have a very good reason not to use them.

Example: Surrogate Key and Natural Key Modeling

How do we model Entity Type User with surrogate keys if we also want to track the natural keys referring to the same entity?

We could have one column per natural key the user has. That would lead us to a design like this:

Table: User

user_id	dk_cpr	ee_isikukood	user	dob
42	221274-***1	3741222***6	Thomas Kejser	1974-12-22

The problem with this design is: What happens if a user does not have a key (isikukood) in "Estonia"? For example, my colleague "Torben": User lives in "Denmark" and has a User.cpr number there (which I don't know). But he has no User.isikukood in "Estonia".

Ex:

Table: User (columns for keys)

user_id	dk_cpr	ee_isikukood	user	dob
42	221274-***1	3741222***6	Thomas Kejser	1974-12-22
43	101076-***1	What goes here?	Torben Mathiassen	1976-10-10

What goes in User.isikukood for "Torben"? This is where databases typically sneak in the problem of NULL values.

Alternatively, if we know that keys are unique inside countries, we could create an alternative data model. We could say: "Natural keys referring to users are themselves an entity types". That would lead us to a model like this:

Table: User (one row per user)

user_id	user	dob
42	Thomas Kejser	1974-12-22
43	Torben Mathiassen	1976-10-10

Table: UserKey (one row per key that belongs to a user)

user*id	country_id	key
42	'dk'	221274-***_1
42	'ee'	3741222***6
43	'dk'	101076-****\

We could further enhance this model by declaring: "The combination of UserKey.user_id and UserKey.country_id is unique". This adds additional meaning to the structure of the model.

Notice how these two different data models represent the same information, with different tradeoffs:

Represent id values from different countries as individual columns
- This introduces the NULL problem (what goes in the column if there is no country specific key?)
- For some people, it makes the model easier to query (it is just one table)
Represent id values as new rows, one per country.
- This introduces another table in the data model - making queries more complex
- But it captures important semantics in the model itself (ex: Keys are unique per country)
- Adding new country key systems is now trivial: just add more rows

How do we pick between these two alternatives?

The model capturing the most semantic information is typically preferable. It is easy to transform from a rich semantic model (natural keys are rows) to one that has a less structure (natural keys as columns).

Consider this statement about the world:

"Each country has their own ID assignment system and not all users have ID values in all countries"

If you agree, then the meaning is best captured in a model which has a UserKey table - not one that uses a column for each key.

Counterexample - Height of the User

Data modelers often fall into a trap: creating highly generic models capturing theoretical states of the world that simply don't exist - even though we could imagine they exist!

Such models tend to lose their semantic structure and fall back to rules for interpreting the data.

For example, assume we wanted to capture the height of User. Two standards exist for measuring height:

SI system (meters)
Imperial units (inches)

Here, we are again presented with two different model choices.

We could capture the height systems as columns:

Table: User

user_id	user	dob	height_m	height_inches
42	Thomas Kejser	1974-12-22	1.83	72
43	Torben Mathiassen	1976-10-10	1.88	74

But, a more "imaginative" architect and data modeler may say:

"Inches and Meters are just one example of units, there could be others"

That leads to this model:

Table: User

user_id	user	dob
42	Thomas Kejser	1974-12-22
43	Torben Mathiassen	1976-10-10

Table: UserHeight

user_id	height	unit
42	1.83	m
42	72	inches
43	1.88	m
43	74	inches

Does this second model meaningfully capture information about the world worth tracking? Or: can we safely assume that, at least for now, there are only two ways to measure height worth modeling? (Japanese readers may now disagree and mention Shaku).

Conclusion - Part 3: Keys

In today's data modeling blog, we talked about keys.

Some key insights to take away:

Dumb keys contain no information about the entity they refer to
Smart keys contain additional data about the entity referred
Natural keys are those designed by humans - and we are very poor at coming up with systems for generating natural keys that don't break
Surrogate keys are keys we invent for the systems we model - they are often preferable to natural keys
When modeling data - model that which is, not that which might be
- State what you believe to be true about the world right now, then structure the model to reflect that
- This creates richer model which more accurately reflect the reality you are currently in, not some theoretical reality that might exist in the future

See you in the next blog, where we will dive into composite keys

Why bother with Data Modelling - Part 3: Keys

Telling similar things apart

Keys are not trivial

Super Keys: Biological Markers

Smart Keys

The problem with Smart Keys

Dumb Keys

Dumb keys are better

Dumb keys can be too dumb

Natural Keys

Surrogate Keys

Example: Surrogate Key and Natural Key Modeling

Counterexample - Height of the User

Conclusion - Part 3: Keys