
Idempotency

The property that running a pipeline (or rebuilding a model) multiple times produces the same result as running it once — no duplicates, no drift. A core requirement for reliable data.

A pipeline is idempotent if re-running it doesn't change the outcome. Reload yesterday's data twice and you should get one copy, not two. This matters because in the real world jobs fail halfway, get retried, and backfill overlapping windows — and none of that should corrupt your tables.

The usual enemy of idempotency is naive `INSERT` logic that appends rows on every run. The fixes: build full-refresh tables from a deterministic SELECT, or use merge/upsert on a unique key for incremental models so that re-processing a window replaces rows rather than duplicating them.
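
As a minimal sketch of the difference, the snippet below loads the same batch twice into two tables, once with a plain `INSERT` and once with an upsert keyed on a unique column. It uses Python's built-in `sqlite3` module and assumes a SQLite build that supports `ON CONFLICT ... DO UPDATE` (3.24 or later). The table names, column names, and the `load_append` / `load_upsert` helpers are hypothetical, chosen only for illustration.

```python
import sqlite3


def load_append(conn, rows):
    # Naive load: every run appends, so a retry duplicates rows.
    conn.executemany(
        "INSERT INTO orders_append (order_id, amount) VALUES (?, ?)", rows
    )
    conn.commit()


def load_upsert(conn, rows):
    # Idempotent load: the unique key decides whether a row is inserted
    # or overwritten, so replaying the batch leaves one copy per key.
    conn.executemany(
        """
        INSERT INTO orders_upsert (order_id, amount) VALUES (?, ?)
        ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_append (order_id INTEGER, amount REAL)")
conn.execute(
    "CREATE TABLE orders_upsert (order_id INTEGER PRIMARY KEY, amount REAL)"
)

batch = [(1, 10.0), (2, 25.5)]
for _ in range(2):  # simulate a retry of the same batch
    load_append(conn, batch)
    load_upsert(conn, batch)

append_count = conn.execute("SELECT COUNT(*) FROM orders_append").fetchone()[0]
upsert_count = conn.execute("SELECT COUNT(*) FROM orders_upsert").fetchone()[0]
print(append_count, upsert_count)  # prints: 4 2
```

The append table ends up with four rows after the retry, while the upsert table still holds two. Warehouses usually express the same idea with a `MERGE` statement; the exact syntax varies by engine.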

Designing for idempotency is what lets you sleep through a 3 a.m. retry. It's also a favorite interview question because it separates people who've run production pipelines from people who haven't.

Why it matters

Production pipelines fail and retry constantly. Idempotency is what makes those retries safe instead of a source of silent duplication and drift. It is the difference between robust and fragile data infrastructure.

It's a strong interview signal: candidates who design for idempotency have clearly operated real pipelines.

Common mistakes
  • Append-only INSERT logic that duplicates rows on every re-run or backfill.
  • Non-deterministic transformations (e.g. relying on current_timestamp, or on row order that isn't guaranteed) that produce different output each run.
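
The second mistake can be sketched in a few lines. The function below avoids both sources of non-determinism: it stamps rows with a logical run date passed in by the caller instead of reading the clock, and it breaks ties with an explicit sort key instead of trusting input order. The field names (`user_id`, `event_id`, `updated_at`) and the `dedupe_latest` helper are assumptions made for this example, not part of any particular framework.

```python
from datetime import date


def dedupe_latest(rows, logical_date):
    """Keep one row per user_id, stamped with the run's logical date."""
    latest = {}
    # Sort on (updated_at, event_id) so ties break the same way every run,
    # regardless of the order in which the rows arrived.
    for row in sorted(rows, key=lambda r: (r["updated_at"], r["event_id"])):
        latest[row["user_id"]] = row  # later rows overwrite earlier ones
    # Stamp with the logical date supplied by the caller, not the wall clock.
    return [
        {**latest[uid], "batch_date": logical_date.isoformat()}
        for uid in sorted(latest)
    ]


rows = [
    {"user_id": 1, "event_id": "a", "updated_at": "2024-01-01T10:00:00"},
    {"user_id": 1, "event_id": "b", "updated_at": "2024-01-01T10:00:00"},  # tie
    {"user_id": 2, "event_id": "c", "updated_at": "2024-01-01T09:00:00"},
]
run_date = date(2024, 1, 1)

first = dedupe_latest(rows, run_date)
second = dedupe_latest(list(reversed(rows)), run_date)  # same data, other order
print(first == second)  # prints: True
```

Had the function stamped rows with the current time, or kept whichever tied row happened to arrive last, the two runs would not be guaranteed to match.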

FAQ

How do I make a data pipeline idempotent?
Build from deterministic SELECTs and use merge/upsert on a unique key (rather than blind INSERT), so re-processing the same window replaces rows instead of duplicating them.
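
For the "build from deterministic SELECTs" half of that answer, a full-refresh rebuild might look like the sketch below. It uses Python's built-in `sqlite3` module; the `orders` and `daily_revenue` tables and the `rebuild_daily_revenue` helper are hypothetical. The target is emptied and repopulated in a single transaction, so running it again yields the same table.

```python
import sqlite3


def rebuild_daily_revenue(conn):
    # Full refresh: empty the target and repopulate it from a deterministic
    # SELECT. `with conn` commits both statements together, or rolls back
    # if either one fails.
    with conn:
        conn.execute("DELETE FROM daily_revenue")
        conn.execute(
            """
            INSERT INTO daily_revenue (order_date, revenue)
            SELECT order_date, SUM(amount)
            FROM orders
            GROUP BY order_date
            """
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.execute(
    "CREATE TABLE daily_revenue (order_date TEXT PRIMARY KEY, revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.5)],
)
conn.commit()

rebuild_daily_revenue(conn)
first = conn.execute("SELECT * FROM daily_revenue ORDER BY order_date").fetchall()
rebuild_daily_revenue(conn)  # simulate a retry
second = conn.execute("SELECT * FROM daily_revenue ORDER BY order_date").fetchall()
print(first == second, len(second))  # prints: True 2
```

A full refresh trades compute for simplicity: it is idempotent by construction, but for large tables an incremental merge on a unique key is usually the cheaper option.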
