Previously in this series, we learned about Entities and Entity Types. But, I left a question open: How do we tell one entity apart from another?
In today's entry - we need to talk about keys and how they help us identify individual entities.
Telling similar things apart
Recall our test data - we had two different "Hans": User:
- One "Hans" lived in "Berlin"
- The other "Hans" lives in "Aarhus" (In "Denmark").
We decided that these two entities are different entities - simply because they live in different cities. This choice seems arbitrary (and it is): Just because two User live in separate cities it does not follow they are different people in the real world. What would we do if "Hans" in "Berlin" moved to "Aarhus"?
We need another attribute of User that uniquely identifies a specific entity. Attributes that have the property of being unique to an entity are called: Keys.
Keys are not trivial
Picking the right attribute of an entity type that unique identifies each entity is tricky. This isn't a database or IT specific problem - civilisations have struggled with this all through history.
Consider the problem of uniquely identifying humans. Roman legionaries would carry a signaculum to uniquely identify individuals, similar to the dog tags used by modern soldiers.
Today, we have better keys.
Super Keys: Biological Markers
You could use my DNA sequence to identify me. But my full DNA string is several GB large. If I had an identical twin there would be two people out there with the same key.
Alternatively, we could use my fingerprint. Even if I had an identical twin brother, he would have different fingerprints. But it would not be practical if you had to store my fingerprint in every system that has a record about me. Collecting my fingerprint is a time consuming process requiring specialised equipment. We need a more compact and practical way to identify humans.
Governments and standard committees often take on the task of creating special, easy to remember, numbers that identify their citizens.
Smart Keys
Consider the Danish system for assigning keys to citizens.
Danish citizens have a unique key called a "CPR number".
A Danish CPR number has the format: DDMMYY-SSSS.
DDMMYYencodes the date of birth- The last four digits form a running number
- The final digit in the running number also indicates the gender marker at birth (even = female, odd = male)
- The running number also helps distinguish century for people born more than 100 years apart
The Danish CPR number is an example of a "smart key". The structure of the key carries information about the entity it identifies.
The problem with Smart Keys
At this point, you might have noticed something off about the Danish CPR system:
- What happens if too many people need identifiers with the same birth date?
- For people who live more than 100 years - how do we know which year they are born?
YYis only two digits, so we need extra interpretation rules to distinguish century.- The rules for using the running number (the
SSSSpart) to identify the century are complex and reduce the amount of unique keys available.
- If a person has gender reassignment surgery - what does the last digit now mean? (in practise, you get a new CPR number in this case)
- Is it really your business to know how old I am to unique identify me in a database system?
- What to do about people who for various reasons don't actually know their exact birthday and still need a CPR number assigned?
Smart keys are generally problematic, because they contain information about the entity that may change over time or may simply not be known when they are assigned.
It is hard to design a smart key which carries information which does not change. Because it isn't always clear what information may change.
Dumb Keys
The US Social Security Number uses this format: XXX-XX-XXXX. Like the Danish CPR number, the key is small enough for a human to remember and it can easily be stored in database systems.
Since SSN randomisation was introduced in June 2011, newly assigned SSNs no longer follow the older geographic allocation pattern. Unlike the Danish CPR number, a modern US SSN does not encode additional facts about the individual it refers to.
The SSN key is primarily just a value. This is an example of a "dumb key" - it carries no additional payload.
Dumb keys are better
We are often better off using dumb keys to identify entities, and let the information that needs to be carried be part of other attributes of the entity, instead of using the key to encode it.
This is another case of the "rules vs structure" argument.
Compare the US Person entity using SSN type with the Danish Person tracking using CPR number:
Data model: Danish citizens
- Person.cpr (Key)
- Rule for dob: DOB found by taking leftmost 6 digits of Person.CPR and using 7th - 10th digit to determine century
- Rule for gender: If last digit is even: female, else Male
Data Model: US citizens
- Person.ssn (Key)
- Person.dob
- Person.gender
Dumb keys can be too dumb
Unfortunately, dumb keys can introduce issues too.
Going back to Denmark, the system for assigning keys to Vehicles via license plates is an example of using dumb keys that are too dumb.
In the Danish system, license plates have this format: AA XX XXX, where A is a letter and X is a digit. The numbers are assigned sequentially, i.e. FL 00 000 is followed by FL 00 001, and FL 99 999 is followed by FM 00 000.
The total number of possible license plates in Denmark (using the 26 letters in the latin-9 alphabet) is:
- 100000 * 26 * 26 ~ 67M.
Throughout history, more than 67M cars have been owned and registered in Denmark. This forces the reuse of license plates (though only when the old vehicle has been scrapped). But we now have database entries where a dumb key can refer to two completely different Vehicle entities.
Natural Keys
The keys we find in the real world are typically called "natural keys". Examples include:
- SSN / CPR and other national ID systems
- Licence plates
- SKU
- Serial Numbers
- IMEI numbers of mobile devices
- Country Codes
- Currency Codes
As we have now seen - such keys must be carefully designed to avoid problems with duplication, payload info, privacy and key space.
Surrogate Keys
Sometimes, the same entity can have different keys referring to it. Consider the following case for me, the "Thomas Kejser": User entity. I have unique keys in two different countries:
- In Denmark I have a CPR number:
- User.CPR: "221274-***1"
- In Estonia (where I live) I am identified by a key the Estonians call my Isikukood:
- User.Isikukood: "3741222***6"
What do you do, if you are data modelling a system where multiple keys refer to the same individual? Which one do you pick?
Cases exist where we do not have a natural key for an entity. For example: My dog, before he had his chip implanted, was still an individual animal - but he did not have a key (or even a name) yet.
As we have seen, humans are also quite poor at designing key systems in the first place.
Instead of trusting natural keys to exist and behave, it is often better to invent an entirely different key to represent entities in database systems you are designing. Such invented keys are called "Surrogate keys", and they have a lot of advantages:
- You can pick a data type for the key that is compact (for example, a 64 bit integer)
- This speeds up join operations in the data model
- There are a lot of 64-bit integers
- The keys only need to be read by machines, so we are not under any obligation to have humans remember them
- The key can be made in such a way that it can be generated independently by many agents without coordination
- Examples: UUID or Snowflake ID.
- If designed properly, surrogate keys will never change and never be reused to represent another entity
- Saving your trouble of updating historical data
- Allowing you to handle cases where either smart keys or dumb keys change over time or get reused
- You always have something that identifies an entity, even if you don't yet have another "real world" key identifying the entity
- You can secure or mask personally identifiable data (such as birthdays) that may be present in smart keys
- while still exposing a key that can be used for joining and filtering when users consume your data model
Because of this, it is useful to assume that your data model should always use surrogate keys - unless you have a very good reason not to use them.
Example: Surrogate Key and Natural Key Modeling
How do we model Entity Type User with surrogate keys if we also want to track the natural keys referring to the same entity?
We could have one column per natural key the user has. That would lead us to a design like this:
Table: User
| user_id | dk_cpr | ee_isikukood | user | dob |
|---|---|---|---|---|
| 42 | 221274-***1 | 3741222***6 | Thomas Kejser | 1974-12-22 |
The problem with this design is: What happens if a user does not have a key (isikukood) in "Estonia"? For example, my colleague "Torben": User lives in "Denmark" and has a User.cpr number there (which I don't know). But he has no User.isikukood in "Estonia".
Ex:
Table: User (columns for keys)
| user_id | dk_cpr | ee_isikukood | user | dob |
|---|---|---|---|---|
| 42 | 221274-***1 | 3741222***6 | Thomas Kejser | 1974-12-22 |
| 43 | 101076-***1 | What goes here? | Torben Mathiassen | 1976-10-10 |
What goes in User.isikukood for "Torben"? This is where databases typically sneak in the problem of NULL values.
Alternatively, if we know that keys are unique inside countries, we could create an alternative data model. We could say: "Natural keys referring to users are themselves an entity types". That would lead us to a model like this:
Table: User (one row per user)
| user_id | user | dob |
|---|---|---|
| 42 | Thomas Kejser | 1974-12-22 |
| 43 | Torben Mathiassen | 1976-10-10 |
Table: UserKey (one row per key that belongs to a user)
| user*id | country_id | key |
|---|---|---|
| 42 | 'dk' | 221274-***_1 |
| 42 | 'ee' | 3741222***6 |
| 43 | 'dk' | 101076-****\ |
We could further enhance this model by declaring: "The combination of UserKey.user_id and UserKey.country_id is unique". This adds additional meaning to the structure of the model.
Notice how these two different data models represent the same information, with different tradeoffs:
- Represent id values from different countries as individual columns
- This introduces the NULL problem (what goes in the column if there is no country specific key?)
- For some people, it makes the model easier to query (it is just one table)
- Represent id values as new rows, one per country.
- This introduces another table in the data model - making queries more complex
- But it captures important semantics in the model itself (ex: Keys are unique per country)
- Adding new country key systems is now trivial: just add more rows
How do we pick between these two alternatives?
The model capturing the most semantic information is typically preferable. It is easy to transform from a rich semantic model (natural keys are rows) to one that has a less structure (natural keys as columns).
Consider this statement about the world:
- "Each country has their own ID assignment system and not all users have ID values in all countries"
If you agree, then the meaning is best captured in a model which has a UserKey table - not one that uses a column for each key.
Counterexample - Height of the User
Data modelers often fall into a trap: creating highly generic models capturing theoretical states of the world that simply don't exist - even though we could imagine they exist!
Such models tend to lose their semantic structure and fall back to rules for interpreting the data.
For example, assume we wanted to capture the height of User. Two standards exist for measuring height:
- SI system (meters)
- Imperial units (inches)
Here, we are again presented with two different model choices.
We could capture the height systems as columns:
Table: User
| user_id | user | dob | height_m | height_inches |
|---|---|---|---|---|
| 42 | Thomas Kejser | 1974-12-22 | 1.83 | 72 |
| 43 | Torben Mathiassen | 1976-10-10 | 1.88 | 74 |
But, a more "imaginative" architect and data modeler may say:
- "Inches and Meters are just one example of units, there could be others"
That leads to this model:
Table: User
| user_id | user | dob |
|---|---|---|
| 42 | Thomas Kejser | 1974-12-22 |
| 43 | Torben Mathiassen | 1976-10-10 |
Table: UserHeight
| user_id | height | unit |
|---|---|---|
| 42 | 1.83 | m |
| 42 | 72 | inches |
| 43 | 1.88 | m |
| 43 | 74 | inches |
Does this second model meaningfully capture information about the world worth tracking? Or: can we safely assume that, at least for now, there are only two ways to measure height worth modeling? (Japanese readers may now disagree and mention Shaku).
Conclusion - Part 3: Keys
In today's data modeling blog, we talked about keys.
Some key insights to take away:
- Dumb keys contain no information about the entity they refer to
- Smart keys contain additional data about the entity referred
- Natural keys are those designed by humans - and we are very poor at coming up with systems for generating natural keys that don't break
- Surrogate keys are keys we invent for the systems we model - they are often preferable to natural keys
- When modeling data - model that which is, not that which might be
- State what you believe to be true about the world right now, then structure the model to reflect that
- This creates richer model which more accurately reflect the reality you are currently in, not some theoretical reality that might exist in the future
See you in the next blog, where we will dive into composite keys