GlossaryData modeling

Surrogate key

A warehouse-generated unique identifier for a dimension row — independent of any source system ID. It keeps joins stable when natural keys are messy, reused, or change.

A surrogate key is an artificial primary key you generate inside the warehouse, typically by hashing the natural key (and, for slowly changing dimensions, the valid-from date). It has no business meaning — its only job is to uniquely and stably identify a dimension row.

You use them because natural keys from source systems are unreliable: they get reused, they change format, they collide across systems, and for Type 2 dimensions a single natural key maps to multiple historical rows. A surrogate key sidesteps all of that and gives the fact table one clean column to join on.

In dbt the common approach is `dbt_utils.generate_surrogate_key([...])`, which produces a deterministic hash from the columns you specify.

Why it matters

Surrogate keys are what make Type 2 history and multi-source integration possible — they're the stable hinge every fact-to-dimension join swings on.

They also protect you from a whole class of production incidents where a source system 'helpfully' reuses or reformats an ID and quietly corrupts your joins.

Common mistakes
  • Joining facts to dimensions on natural keys, which breaks the moment a source reuses or changes them.
  • Generating non-deterministic surrogate keys (e.g. random UUIDs) so rebuilds produce different keys and break downstream references.
FAQ
Surrogate key vs natural key — what's the difference?
A natural key comes from the source system and carries business meaning (e.g. an email or order number). A surrogate key is a warehouse-generated, meaningless, stable identifier. Use surrogate keys for joins; keep natural keys as attributes.

Learn this by building, not memorizing.

Definitions get you the vocabulary. The platform gets you the skill — graded exercises, real projects, and a portfolio capstone.