Previously in this series, we learned about Entities and Entity Types. But, I left a question open: How do we tell one entity apart from another?
In today's entry - we need to talk about keys and how they help us identify individual entities.
Recall our test data - we had two different "Hans": User:
We decided that these two entities are different entities - simply because they live in different cities. This choice seems arbitrary (and it is): Just because two User live in separate cities it does not follow they are different people in the real world. What would we do if "Hans" in "Berlin" moved to "Aarhus"?
We need another attribute of User that uniquely identifies a specific entity. Attributes that have the property of being unique to an entity are called: Keys.
Picking the right attribute of an entity type that unique identifies each entity is tricky. This isn't a database or IT specific problem - civilisations have struggled with this all through history.
Consider the problem of uniquely identifying humans. Roman legionaries would carry a signaculum to uniquely identify individuals, similar to the dog tags used by modern soldiers.
Today, we have better keys.
You could use my DNA sequence to identify me. But my full DNA string is several GB large. If I had an identical twin there would be two people out there with the same key.
Alternatively, we could use my fingerprint. Even if I had an identical twin brother, he would have different fingerprints. But it would not be practical if you had to store my fingerprint in every system that has a record about me. Collecting my fingerprint is a time consuming process requiring specialised equipment. We need a more compact and practical way to identify humans.
Governments and standard committees often take on the task of creating special, easy to remember, numbers that identify their citizens.
Consider the Danish system for assigning keys to citizens.
Danish citizens have a unique key called a "CPR number".
A Danish CPR number has the format: DDMMYY-SSSS.
DDMMYY encodes the date of birthThe Danish CPR number is an example of a "smart key". The structure of the key carries information about the entity it identifies.
At this point, you might have noticed something off about the Danish CPR system:
YY is only two digits, so we need extra interpretation rules to distinguish century.SSSS part) to identify the century are complex and reduce the amount of unique keys available.Smart keys are generally problematic, because they contain information about the entity that may change over time or may simply not be known when they are assigned.
It is hard to design a smart key which carries information which does not change. Because it isn't always clear what information may change.
The US Social Security Number uses this format: XXX-XX-XXXX. Like the Danish CPR number, the key is small enough for a human to remember and it can easily be stored in database systems.
Since SSN randomisation was introduced in June 2011, newly assigned SSNs no longer follow the older geographic allocation pattern. Unlike the Danish CPR number, a modern US SSN does not encode additional facts about the individual it refers to.
The SSN key is primarily just a value. This is an example of a "dumb key" - it carries no additional payload.
We are often better off using dumb keys to identify entities, and let the information that needs to be carried be part of other attributes of the entity, instead of using the key to encode it.
This is another case of the "rules vs structure" argument.
Compare the US Person entity using SSN type with the Danish Person tracking using CPR number:
Data model: Danish citizens
Data Model: US citizens
Unfortunately, dumb keys can introduce issues too.
Going back to Denmark, the system for assigning keys to Vehicles via license plates is an example of using dumb keys that are too dumb.
In the Danish system, license plates have this format: AA XX XXX, where A is a letter and X is a digit. The numbers are assigned sequentially, i.e. FL 00 000 is followed by FL 00 001, and FL 99 999 is followed by FM 00 000.
The total number of possible license plates in Denmark (using the 26 letters in the latin-9 alphabet) is:
Throughout history, more than 67M cars have been owned and registered in Denmark. This forces the reuse of license plates (though only when the old vehicle has been scrapped). But we now have database entries where a dumb key can refer to two completely different Vehicle entities.
The keys we find in the real world are typically called "natural keys". Examples include:
As we have now seen - such keys must be carefully designed to avoid problems with duplication, payload info, privacy and key space.
Sometimes, the same entity can have different keys referring to it. Consider the following case for me, the "Thomas Kejser": User entity. I have unique keys in two different countries:
What do you do, if you are data modelling a system where multiple keys refer to the same individual? Which one do you pick?
Cases exist where we do not have a natural key for an entity. For example: My dog, before he had his chip implanted, was still an individual animal - but he did not have a key (or even a name) yet.
As we have seen, humans are also quite poor at designing key systems in the first place.
Instead of trusting natural keys to exist and behave, it is often better to invent an entirely different key to represent entities in database systems you are designing. Such invented keys are called "Surrogate keys", and they have a lot of advantages:
Because of this, it is useful to assume that your data model should always use surrogate keys - unless you have a very good reason not to use them.
How do we model Entity Type User with surrogate keys if we also want to track the natural keys referring to the same entity?
We could have one column per natural key the user has. That would lead us to a design like this:
Table: User
| user_id | dk_cpr | ee_isikukood | user | dob |
|---|---|---|---|---|
| 42 | 221274-***1 | 3741222***6 | Thomas Kejser | 1974-12-22 |
The problem with this design is: What happens if a user does not have a key (isikukood) in "Estonia"? For example, my colleague "Torben": User lives in "Denmark" and has a User.cpr number there (which I don't know). But he has no User.isikukood in "Estonia".
Ex:
Table: User (columns for keys)
| user_id | dk_cpr | ee_isikukood | user | dob |
|---|---|---|---|---|
| 42 | 221274-***1 | 3741222***6 | Thomas Kejser | 1974-12-22 |
| 43 | 101076-***1 | What goes here? | Torben Mathiassen | 1976-10-10 |
What goes in User.isikukood for "Torben"? This is where databases typically sneak in the problem of NULL values.
Alternatively, if we know that keys are unique inside countries, we could create an alternative data model. We could say: "Natural keys referring to users are themselves an entity types". That would lead us to a model like this:
Table: User (one row per user)
| user_id | user | dob |
|---|---|---|
| 42 | Thomas Kejser | 1974-12-22 |
| 43 | Torben Mathiassen | 1976-10-10 |
Table: UserKey (one row per key that belongs to a user)
| user*id | country_id | key |
|---|---|---|
| 42 | 'dk' | 221274-***_1 |
| 42 | 'ee' | 3741222***6 |
| 43 | 'dk' | 101076-****\ |
We could further enhance this model by declaring: "The combination of UserKey.user_id and UserKey.country_id is unique". This adds additional meaning to the structure of the model.
Notice how these two different data models represent the same information, with different tradeoffs:
How do we pick between these two alternatives?
The model capturing the most semantic information is typically preferable. It is easy to transform from a rich semantic model (natural keys are rows) to one that has a less structure (natural keys as columns).
Consider this statement about the world:
If you agree, then the meaning is best captured in a model which has a UserKey table - not one that uses a column for each key.
Data modelers often fall into a trap: creating highly generic models capturing theoretical states of the world that simply don't exist - even though we could imagine they exist!
Such models tend to lose their semantic structure and fall back to rules for interpreting the data.
For example, assume we wanted to capture the height of User. Two standards exist for measuring height:
Here, we are again presented with two different model choices.
We could capture the height systems as columns:
Table: User
| user_id | user | dob | height_m | height_inches |
|---|---|---|---|---|
| 42 | Thomas Kejser | 1974-12-22 | 1.83 | 72 |
| 43 | Torben Mathiassen | 1976-10-10 | 1.88 | 74 |
But, a more "imaginative" architect and data modeler may say:
That leads to this model:
Table: User
| user_id | user | dob |
|---|---|---|
| 42 | Thomas Kejser | 1974-12-22 |
| 43 | Torben Mathiassen | 1976-10-10 |
Table: UserHeight
| user_id | height | unit |
|---|---|---|
| 42 | 1.83 | m |
| 42 | 72 | inches |
| 43 | 1.88 | m |
| 43 | 74 | inches |
Does this second model meaningfully capture information about the world worth tracking? Or: can we safely assume that, at least for now, there are only two ways to measure height worth modeling? (Japanese readers may now disagree and mention Shaku).
In today's data modeling blog, we talked about keys.
Some key insights to take away:
See you in the next blog, where we will dive into composite keys