In defense of the Natural Keys

Originally Posted on 1/21/2010 

Whenever I use the term "Natural key" in a tweet or on Facebook, someone always DM me to remind me how natural keys are worthless. I should use the brave and muscular surrogate keys for each and every task. 

Maybe is a frame of mind, but whenever there are strong technical debates I feel naturally inclined to the minority. 

In this case my point is that natural keys are perfectly adequate for many Business Intelligence tasks, and should be replaced by surrogate keys only when necessary.

If you are still with me, let's go. A bar counter definition of natural key is: "the code which uniquely identifies your record". Every transactional system assigns codes to its entities, customers, salesmen, products etc. The policy used to assign this code, is up to the system and I could not care less. Given that natural keys can a) change over time b) have a business meaning that can change over time as well c) not be the ideal field format for a join, often a surrogate key is introduced. A surrogate key is a key, usually numeric, unrelated with the natural, that uniquely identifies the record. That is, it does exactly the job already done by the natural key. (if you want a less mundane description of this subject you can find a good article here). This consideration alone should drive to the conclusion that the widespread use of surrogate keys in a datawarehouse is often overkill. I made literally tenths of systems with no surrogate keys which worked fine for years.

Often we think to DWs as multi terabytes systems which integrate data from hundredths of systems, but these are only a fraction of the total. There's also a midsize business market which needs the bulk of the abilities provided by a datawarehouse with far less data and far less systems to be integrated. In these cases, a simple approach, based on natural keys, may be perfectly suited. Building surrogate keys, is often a rather complex task, which slows development, hinders the long and delicate subsequent trimming work and slows loading.

More subtly, they're an entirely technical subject, hard to be understood by non technical project sponsors thus hard to justify on a project schedule. Surrogate Keys should be introduced when they are necessary and there are, indeed, cases where surrogate keys are highly recommended. In those multi terabytes DWs, where the smallest fact table has half billion records, performance becomes such an issue that the use of small numeric keys is practically mandatory. If many systems are integrated together in the DW, chances are that codes overlap, so an unrelated key is a good solution to that. Surrogate keys also shield DWs from changes of the naturals, whose propagation in a very large DW is an awful job. Also Surrogate Keys are a good artifact to factor in slowly changing dimensions. Etc. Etc. etc. So, the next time I tweet the term "Natural Key", please do not flood my account with "Surrogate Keys"!


# 1

No argument that SK should be used where necessary - e.g. for future-proofing and increasingly for confidentiality. Analysts need to list "candidates" and not identify them. Efficient production and maintenance of SK should be part of the architecture - there is no reason for it to cause inefficiency.

From  Kennth Hansen  (submitted  1/21/2010 @ 18:48:03)