1. Model the project as a journey from source-conformed to business-conformed data

Rule: Your dbt DAG should intentionally move from raw, source-shaped tables to clean business concepts. The core structure should be staging → intermediate → marts, not a flat collection of SQL files.

Why it matters: If the project does not encode this progression, every downstream BI dashboard, metric, and LLM-generated query has to guess whether a model is raw-ish, cleaned, reusable, or consumer-ready. That creates inconsistent joins, duplicated logic, and ambiguous definitions. dbt structure should be deliberately framed as the movement from many narrow source-conformed models toward fewer, wider business-conformed models.

Positive example:

models/
  staging/
    stripe/
      stg_stripe__payments.sql
  intermediate/
    finance/
      int_payments_pivoted_to_orders.sql
  marts/
    finance/
      orders.sql

Negative example:

models/
  orders_cleaned.sql
  orders_final.sql
  orders_dashboard.sql
  stripe_logic.sql

Warning: The folder structure is not cosmetic. It becomes a selection interface, a governance boundary, and a mental model for contributors.


2. Group staging models by source system, not by business team

Rule: In models/staging, subdirectories should represent source systems such as stripe, salesforce, jaffle_shop, or google_analytics.

Why it matters: Staging is where source data is standardized. Grouping staging by business function creates duplicate “atoms” before the project has established a single source of truth. Avoid marketing or finance folders in staging because that invites competing definitions of basic objects like orders or customers.

Positive example:

models/staging/stripe/stg_stripe__payments.sql
models/staging/jaffle_shop/stg_jaffle_shop__orders.sql

Negative example:

models/staging/finance/orders.sql
models/staging/marketing/orders.sql

Warning: This is one of the fastest ways to break semantic consistency. Once two teams stage “orders” differently, every metric built on top becomes politically and technically fragile.


3. Keep staging models boring: rename, recast, clean, categorize — but do not join or aggregate

Rule: A staging model should normally select from one source table and perform only reusable transformations: renaming, type casting, simple calculations, and basic categorizations.

Why it matters: Staging models are the reusable atomic layer. If you join or aggregate here, you silently change grain too early and remove optionality from downstream models. The recommended pattern is to select from one source, rename fields, recast types, and then build all later models on top of those standardized models.

Positive example:

select
    id as payment_id,
    orderid as order_id,
    amount / 100.0 as amount,
    created::timestamp as created_at
from {{ source('stripe', 'payment') }}

Negative example:

select
    orders.customer_id,
    sum(payments.amount) as lifetime_value
from {{ source('stripe', 'payment') }} as payments
join {{ source('app', 'orders') }} as orders
    on payments.order_id = orders.id
group by 1

Warning: Aggregation in staging is a grain violation. Joins in staging are usually a fanout risk. Use a base sublayer only when staging genuinely requires a source-level cleanup pattern, such as delete tables or unioning identical source structures.


4. Use source() only at the raw boundary and ref() everywhere else

Rule: Raw data should be declared as dbt sources and referenced through source(). Models should reference other models only through ref().

Why it matters: ref() lets dbt infer dependency order and keeps development and production environments isolated correctly. Direct database references bypass lineage, break environment portability, and make model dependencies invisible. Use ref() for all model-to-model references, and reference each raw table through source() in exactly one staging model.

Positive example:

from {{ ref('stg_stripe__payments') }}

Negative example:

from analytics_prod.staging.stg_stripe__payments

Warning: Hard-coded relations are especially dangerous in CI, dev schemas, dbt Mesh, and downstream package consumption because the DAG no longer reflects the actual dependency.
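
For completeness, the raw side of this boundary is declared once in a sources YAML file. A minimal sketch, assuming the raw schema is also named stripe:

sources:
  - name: stripe
    schema: stripe
    tables:
      - name: payment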


5. Break complex transformations into intermediate models when they change grain, repeat logic, or hide fanout

Rule: Intermediate models should isolate meaningful transformation steps, especially joins, pivots, fanouts, aggregations, and reusable CTEs.

Why it matters: Complex models should be broken apart when CTEs are duplicated or when a CTE changes the grain of the data. Grain-changing logic deserves its own model because it should be testable and inspectable before it gets mixed into a mart.

Positive example:

int_payments_pivoted_to_orders.sql

This name tells you the input concept, the operation, and the output grain.
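
A minimal sketch of what such a model might contain (the payment_method values are hypothetical):

select
    order_id,
    sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount,
    sum(case when payment_method = 'bank_transfer' then amount else 0 end) as bank_transfer_amount,
    sum(amount) as total_amount
from {{ ref('stg_stripe__payments') }}
group by 1

The grain moves from one row per payment to one row per order, exactly as the name promises.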

Negative example:

int_finance_logic.sql

This hides the transformation and gives no clue whether the model aggregates, joins, deduplicates, or fans out rows.

Warning: LLM-generated analytics queries are very sensitive to ambiguous grain. A model named customer_metrics could mean one row per customer, customer-day, customer-month, or customer-product unless the dbt layer makes grain explicit.


6. Name intermediate models with verbs, not vague nouns

Rule: Intermediate models should describe the transformation they perform: pivoted, aggregated_to, joined, fanned_out_by, deduplicated, ranked, sessionized.

Why it matters: Intermediate models should be named around verbs because this layer exists to perform purpose-built transformations on the way to marts. A good intermediate model name tells reviewers what changed without making them read the SQL first.

Positive examples:

int_payments_pivoted_to_orders.sql
int_orders_aggregated_to_customers.sql
int_events_sessionized.sql

Negative examples:

int_orders.sql
int_model_2.sql
finance_intermediate.sql

Warning: If an intermediate model cannot be named with a clear verb, it probably has too many responsibilities.


7. Build marts as business entities with one clear grain

Rule: A mart should represent a business entity or concept, such as orders, customers, payments, or accounts, with one row representing one instance of that entity.

Why it matters: Marts should be treated as business-defined entities; time-based rollups like orders_per_day are usually metrics, not pure marts. Avoid departmental duplicates like finance_orders and marketing_orders unless they are genuinely separate business concepts.

Positive example:

models/marts/finance/orders.sql
models/marts/marketing/customers.sql

Negative example:

finance_orders.sql
marketing_orders.sql
sales_orders.sql

Warning: Department-prefixed marts often encode political ownership rather than semantic difference. If finance revenue and product revenue differ, name the concepts precisely, such as tax_revenue versus recognized_revenue.


8. Denormalize marts only when you are not relying on the dbt Semantic Layer

Rule: Without the dbt Semantic Layer, wide denormalized marts are useful because BI users can query fewer joins. With the Semantic Layer, keep marts more normalized so MetricFlow has flexible dimensions and joins.

Why it matters: dbt projects often build highly denormalized datasets, but this limits the dimensionality available to MetricFlow. A more normalized approach is recommended when using the Semantic Layer.

Positive without Semantic Layer:

customers
  customer_id
  first_order_date
  lifetime_value
  number_of_orders

Positive with Semantic Layer:

customers
orders
payments

Then define semantic models and metrics on top.
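
A minimal sketch of one such semantic model, assuming the orders mart exposes ordered_at and order_total columns:

semantic_models:
  - name: orders
    model: ref('orders')
    defaults:
      agg_time_dimension: ordered_at
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum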

Negative with Semantic Layer:

customer_metrics_obt

A single big rollup may be convenient for one dashboard but weak for reusable metrics.

Warning: This is a real architectural fork. Denormalized marts optimize human SQL and dashboard speed. Normalized semantic models optimize metric reuse and query generation flexibility.


9. Choose materializations by access pattern, not habit

Rule: Materialization should follow how the model is used:

staging      → view
intermediate → ephemeral or restricted view
marts        → table or incremental

Why it matters: Staging models should default to views because they are rarely queried directly and should stay fresh; intermediate models to ephemeral because they are implementation details; and marts to tables or incremental models because they are directly consumed and need performance.

Positive example:

models:
  jaffle_shop:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table

Negative example:

models:
  jaffle_shop:
    staging:
      +materialized: table
    marts:
      +materialized: view

Warning: Ephemeral models keep the warehouse clean but make debugging harder because they are compiled into downstream SQL. For complex intermediate logic, restricted views are often the better operational choice.


10. Configure defaults at the folder level and override only exceptions

Rule: Put default materializations, schemas, and other common configs in dbt_project.yml by directory. Use model-level config only for exceptions.

Why it matters: Project-level configuration is preferred because configs cascade, with more specific scopes overriding broader ones. This keeps the project DRY and avoids copy-pasted config blocks.

Positive example:

models:
  jaffle_shop:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      finance:
        +schema: finance

Negative example:

{{ config(materialized='table') }}

copied into every mart model.

Warning: Overusing tags to compensate for weak folder structure creates maintenance debt. Folders should be the primary grouping and selection mechanism; tags should represent exceptions.


11. Treat names as semantic contracts

Rule: Naming conventions should encode source, layer, entity, grain, and field meaning.

Recommended conventions:

stg_[source]__[entity]s.sql
int_[entity]s_[verb]s.sql
marts: [business_entity].sql
primary key: [object]_id
boolean: is_ / has_
timestamp: [event]_at
date: [event]_date

Also recommended: plural model names, snake_case, no abbreviations, no reserved words, and consistent key names across models.

Positive example:

customer_id,
is_completed_payment,
created_at,
created_date

Negative example:

id,
completed,
created,
cust_id

Warning: Bad names are not just a readability issue. They degrade BI self-service and LLM-based analytics because query generators infer meaning from names, joins, and grain.


12. Use YAML per folder, not one giant project file

Rule: Keep model and source YAML close to the models they describe. Use _[directory]__models.yml and, for staging folders, _[directory]__sources.yml.

Why it matters: A single monolithic YAML file centralizes config but makes specific tests, descriptions, and source definitions hard to find. One YAML file per model is searchable but creates too much file churn. Folder-level YAML is the best default balance.

Positive example:

models/staging/stripe/_stripe__sources.yml
models/staging/stripe/_stripe__models.yml
models/marts/finance/_finance__models.yml

Negative example:

models/schema.yml

containing every model, source, test, and description.

Warning: YAML structure affects test discoverability. If contributors cannot quickly find where a model is documented and tested, they will skip both.


13. Put tests where the modeling risk is, especially around keys and grain changes

Rule: Every important model should have a primary key, and transformations that change grain should be tested aggressively.

Why it matters: Each model should have a primary key, and CTEs that change grain should be broken out because those transformations are useful to test. Singular tests are valuable when testing interactions across specific models.

Positive examples:

models:
  - name: orders
    columns:
      - name: order_id
        data_tests:
          - unique
          - not_null

A fanout model should also test expected row counts or uniqueness at the new grain.
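
One way to pin a composite grain is the unique_combination_of_columns test from the dbt_utils package (the model and column names here are illustrative):

models:
  - name: order_items
    data_tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - order_item_id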

Negative example:

select *
from orders
join order_items using (order_id)

with no test proving whether the output is one row per order or one row per order item.

Warning: Most metric bugs are not caused by obviously wrong SQL. They come from unnoticed grain changes, duplicate joins, and null-handling assumptions.


14. Use seeds only for small static lookup data, not ingestion

Rule: Seeds are for small CSV-backed lookup tables that do not exist in source systems. They are not a replacement for an EL tool.

Why it matters: Seeds are intended for lookup tables, not for loading source data into the warehouse.

Positive example:

seeds/utm_campaign_mappings.csv
seeds/zip_code_to_state.csv

Negative example:

seeds/orders.csv
seeds/payments.csv

used as recurring source ingestion.

Warning: Using seeds as ingestion hides lineage and operational freshness. dbt should transform data already in the warehouse, not act as your data loader.


15. Use Mesh only when ownership boundaries are real

Rule: Split dbt projects by durable domain ownership, governance needs, or scale — not merely by use case like “BI versus ML.”

Why it matters: dbt Mesh is useful for managing complexity across multiple interconnected projects, but interfaces should be defined based on teams, jobs, lineage, and selectors before splitting. Use groups and access controls first, with the most private access that preserves functionality.

Positive example:

customer_domain_project
finance_domain_project
foundational_sources_project

with public/protected interfaces.
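
Downstream projects then consume those interfaces through two-argument ref() calls, for example:

select * from {{ ref('finance_domain_project', 'orders') }}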

Negative example:

bi_project
ml_project
executive_dashboard_project

all rebuilding the same customer and revenue concepts differently.

Warning: Mesh introduces intentional friction. That is good for governed interfaces, but wasteful if your team is just trying to organize a small project.


16. Use model contracts, access, and versions for stable cross-team interfaces

Rule: Once other teams or projects depend on a model, treat that model as an interface: define ownership, restrict access, use contracts, and version breaking changes.

Why it matters: Model contracts enforce structure and types, model versions manage breaking changes and migration windows, and access modifiers control whether models are private, protected, or public.

Positive example:

models:
  - name: orders
    config:
      access: public
      contract:
        enforced: true
    columns:
      # an enforced contract requires name and data_type for every column
      - name: order_id
        data_type: integer

Negative example:

models:
  - name: int_orders_joined_to_payments
    config:
      access: public

exposing implementation logic as a cross-team dependency.

Warning: Public models should be boring, stable, and documented. Do not publish intermediate implementation details just because another team asked for them once.


17. Keep semantic models and metrics in an intentional file structure

Rule: For the dbt Semantic Layer, either co-locate semantic YAML with each mart or create a dedicated models/semantic_models/ folder. For existing large projects, default to the dedicated folder.

Why it matters: Both options are valid; a dedicated subfolder is preferable when migrating existing projects because it makes it easier to see which marts have been codified into the Semantic Layer. Semantic models and metrics must live under model-paths, or the semantic manifest will be empty.

Positive example:

models/semantic_models/sem_orders.yml
models/semantic_models/sem_customers.yml

Negative example:

semantic_layer/orders.yml

outside model-paths.

Warning: This is not just organization. A misplaced semantic YAML file can mean your metrics are not recognized at all.


18. Use SQL style rules to protect review quality, not aesthetics

Rule: Adopt a style guide and automate it with SQLFluff or a formatter. The point is not prettiness; it is reducing review noise and making semantic mistakes visible.

Why it matters: A codified style, enforced with SQLFluff, removes formatting debates from review: no abbreviations, explicit aliases, prefixed columns when joining, and explicit join types all make semantic mistakes easier to spot.

Positive example:

select
    orders.order_id,
    customers.customer_id

from orders

left join customers
    on orders.customer_id = customers.customer_id

Negative example:

select
    o.id,
    c.id
from orders o
join customers c
on o.customer = c.id

Warning: Ambiguous aliases and implicit joins make code review weaker. They also make it harder to spot fanout, wrong join direction, or accidental inner joins.
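
A minimal .sqlfluff sketch; the dialect is an assumption, and the dbt templater requires the sqlfluff-templater-dbt package:

[sqlfluff]
dialect = snowflake
templater = dbt

[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = lower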


19. Separate development and production environments

Rule: Developers should run against a development target; production jobs should run against a production target.

Why it matters: Separate development and production environments through dbt targets, combined with version control and pull requests before merging to the production branch, keep experimentation isolated from the relations consumers trust.

Positive example:

local development → dev target
scheduled job     → prod target
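
In dbt Core, this split is expressed as targets in profiles.yml. A sketch with hypothetical Postgres connection details:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: alice
      password: "{{ env_var('DBT_DEV_PASSWORD') }}"
      dbname: analytics
      schema: dbt_alice
      threads: 4
    prod:
      type: postgres
      host: warehouse.internal
      port: 5432
      user: dbt_prod
      password: "{{ env_var('DBT_PROD_PASSWORD') }}"
      dbname: analytics
      schema: analytics
      threads: 8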

Negative example:

analyst laptop writes directly to production schemas

Warning: Environment discipline is part of semantic reliability. A trusted metric is not trusted if anyone can overwrite its production relation while experimenting.


20. Document deviations from the default conventions

Rule: Consistency matters more than the exact convention, but deviations must be explicit and documented.

Why it matters: The source repeatedly frames its conventions as strong defaults, not universal law. The important part is that teams understand why a structure exists and when it is safe to deviate.

Positive example:

We materialize selected intermediate models as views because they are used for
debugging complex grain-changing joins. They live in a restricted schema and
are not available to end users.

Negative example:

Some intermediate models are tables because the first developer preferred that.

Warning: Undocumented exceptions become fake patterns. New contributors will copy them, and the project will drift into inconsistency.

21. Design the mart first, then refactor upstream

Rule: Do not build dbt models strictly from left to right through the DAG. Start with the business output you need, prove the shape, then refactor logic back into staging and intermediate layers.

Why this matters: dbt project structure follows DAG order, but development usually should not. The recommended workflow is to mock the target output first, write the SQL needed to produce it, identify dependencies, then stage missing source tables and extract reusable logic upstream.

Good pattern:

1. Mock the target dashboard table or metric in a spreadsheet.
2. Write ugly-but-correct SQL to prove the output.
3. Identify source tables and grain.
4. Create missing staging models.
5. Move repeated or grain-changing logic into intermediate models.
6. Leave the mart clean and entity-grained.

Bad pattern:

1. Stage every source table.
2. Build many intermediate models speculatively.
3. Hope the final mart satisfies the stakeholder.

Warning: Premature DAG design creates beautiful unused models. The mart should justify the upstream abstractions.


22. Treat CTEs as miniature models

Rule: Inside a model, CTEs should tell the same story that the DAG tells across models: import, transform, join, aggregate, output.

Why this matters: Descriptive CTE names matter. A CTE should usually perform one logical unit of work, especially when the transformation affects grain or join behavior.

Good example:

with

orders as (

    select * from {{ ref('stg_jaffle_shop__orders') }}

),

payments_pivoted_to_order_grain as (

    select
        order_id,
        sum(amount) as total_amount

    from {{ ref('stg_stripe__payments') }}

    group by 1

),

orders_joined_to_payments as (

    select
        orders.order_id,
        orders.customer_id,
        payments_pivoted_to_order_grain.total_amount

    from orders

    left join payments_pivoted_to_order_grain
        on orders.order_id = payments_pivoted_to_order_grain.order_id

)

select * from orders_joined_to_payments

Bad example:

with x as (...),
y as (...),
final as (...)
select * from final

Warning: If a CTE name cannot explain what changed, the model probably hides business logic.


23. Pull duplicated CTEs into intermediate models

Rule: If the same CTE appears in multiple models, it should usually become an intermediate model.

Why this matters: Repeated CTEs are semantic drift waiting to happen. One analyst changes the logic in one mart but not another, and suddenly “completed payments” or “active customers” means two different things.

Good pattern:

int_payments_filtered_to_completed.sql

used by:

orders.sql
customers.sql
revenue.sql
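
The extracted model can be as small as the CTE it replaces. A sketch, reusing the status values from the bad pattern below:

-- int_payments_filtered_to_completed.sql
select *
from {{ ref('stg_stripe__payments') }}
where status in ('success', 'completed')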

Bad pattern:

where status in ('success', 'completed')

copy-pasted into five marts with small variations.

Warning: Copy-pasted CTEs are worse than copy-pasted SQL style violations. They duplicate definitions.


24. Aggregate before joining when possible

Rule: When joining fact-like tables, aggregate to the intended join grain before the join.

Why this matters: Aggregate on the smallest useful dataset before joining. This improves performance and reduces fanout risk.

Good example:

payments_aggregated_to_order_grain as (

    select
        order_id,
        sum(amount) as total_amount

    from payments

    group by 1

)

select
    orders.order_id,
    payments_aggregated_to_order_grain.total_amount

from orders

left join payments_aggregated_to_order_grain
    on orders.order_id = payments_aggregated_to_order_grain.order_id

Bad example:

select
    orders.order_id,
    sum(payments.amount) as total_amount

from orders

left join payments
    on orders.order_id = payments.order_id

left join order_items
    on orders.order_id = order_items.order_id

group by 1

Warning: Joining two one-to-many relationships before aggregation is the classic fanout trap. It produces inflated metrics that look plausible.


25. Always make join direction obvious

Rule: Prefer left-to-right join logic. Start from the grain-defining table, then join supporting data onto it.

Why this matters: Avoid right join and use explicit join types. The table in the from clause should usually define the output grain.

Good example:

from orders

left join customers
    on orders.customer_id = customers.customer_id

This says: “one row per order, with customer attributes added.”

Bad example:

from customers

right join orders
    on customers.customer_id = orders.customer_id

The same result may be possible, but the model is harder to reason about.

Warning: LLM query generation and human reviewers both infer grain from the from table. Make that inference easy.


26. Prefix columns whenever a model joins tables

Rule: If a model joins two or more relations, every selected column should be prefixed with its source relation or CTE name.

Why this matters: Prefix columns in joined models. This prevents ambiguous references and makes lineage visible inside the SQL.

Good example:

select
    orders.order_id,
    orders.customer_id,
    customers.customer_name

from orders

left join customers
    on orders.customer_id = customers.customer_id

Bad example:

select
    order_id,
    customer_id,
    customer_name

after joining several CTEs.

Warning: Unprefixed columns make it harder to detect accidental column shadowing, especially when source systems reuse generic names like id, status, created_at, or type.


27. Avoid table aliases unless the alias is genuinely clearer

Rule: Do not use cryptic aliases like o, c, p, or fct.

Why this matters: Aliases reduce readability and should be avoided. In dbt, clarity usually beats short SQL.

Good example:

from orders

left join customers
    on orders.customer_id = customers.customer_id

Bad example:

from orders o

left join customers c
    on o.customer_id = c.customer_id

Trade-off: Short aliases may be acceptable in very long SQL expressions, but they should be meaningful. current_orders is better than co; customers is better than c.


28. Use union all by default

Rule: Use union all unless you explicitly need deduplication.

Why this matters: Prefer union all over union. union performs duplicate removal, which adds compute and can hide upstream data-quality issues.

Good example:

select * from shopify_us_orders

union all

select * from shopify_eu_orders

Bad example:

select * from shopify_us_orders

union

select * from shopify_eu_orders

Warning: If duplicate rows appear, fix or test the duplicate condition explicitly. Do not let union silently erase evidence.
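
If overlap between the unioned relations is legitimate, deduplicate visibly instead. A sketch assuming a loaded_at column exists:

with unioned as (

    select * from shopify_us_orders

    union all

    select * from shopify_eu_orders

),

deduplicated as (

    select
        *,
        row_number() over (
            partition by order_id
            order by loaded_at desc
        ) as row_num
    from unioned

)

-- keep the most recently loaded copy of each order
select * from deduplicated
where row_num = 1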


29. Use import CTEs to limit scan size

Rule: Top-level import CTEs should not blindly select * from large upstream models once the model is mature. Select only the columns and rows needed.

Why this matters: Place import CTEs at the top of the file and limit scanned data where possible.

Good example:

orders as (

    select
        order_id,
        customer_id,
        order_date,
        order_total

    from {{ ref('orders') }}

    where order_date >= '2024-01-01'

)

Bad example:

orders as (

    select * from {{ ref('orders') }}

)

in a heavy mart that needs only four columns.

Trade-off: select * is acceptable during early development or in simple staging models. In expensive downstream transformations, it becomes wasteful.


30. Use the final select * from final_cte pattern while developing

Rule: End each model with a simple select * from the final output CTE.

Why this matters: This pattern makes debugging easier. You can temporarily change the last line to inspect an earlier CTE without rewriting the model.

Good example:

select * from orders_joined_to_payments

Debugging move:

select * from payments_aggregated_to_order_grain

Bad example:

select
    order_id,
    customer_id,
    total_amount
from (
    ...
)

Warning: Deeply nested SQL makes dbt models harder to audit. Prefer named CTEs that expose each transformation step.


31. Materialize intermediate models as views when debugging complexity

Rule: Ephemeral intermediate models are a good default, but complex or suspicious intermediate logic should be materialized as a restricted view during development or permanently if debugging value outweighs clutter.

Why this matters: Ephemerals keep the warehouse clean but make troubleshooting harder because they are interpolated into downstream SQL.

Good use of ephemeral:

Small reusable transformation
No need to inspect output
Low complexity

Good use of restricted view:

Changes grain
Contains window functions
Performs deduplication
Feeds several marts
Often breaks during development
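
The override for one such model is a single line in its file (the custom schema name is an assumption):

{{ config(materialized='view', schema='intermediate') }}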

Warning: A tidy warehouse is not worth opaque logic. If a model is critical to metric correctness, being able to inspect it may matter more than hiding it.


32. Temporarily build a failing model chain as tables to locate warehouse errors

Rule: When an error appears in a downstream mart but likely originates upstream, temporarily materialize the dependency chain as tables.

Why this matters: Troubleshoot by building specific model chains as tables so the warehouse throws the error where it actually occurs.

Good debugging move:

stg_orders       → table temporarily
int_order_items  → table temporarily
orders           → table temporarily
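
With the chain's materializations temporarily switched to table, one selector rebuilds the model and all of its ancestors:

dbt run --select "+orders"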

Bad debugging move:

Only inspect the final mart SQL and assume the error originates there.

Warning: View stacks and ephemerals can make database errors misleading. The visible failure point is not always the logical failure point.


33. Keep warehouse UX clean

Rule: The warehouse output is part of the user experience. Do not expose implementation models in schemas used by analysts, BI tools, or LLM agents.

Why this matters: The warehouse is one of the interfaces to the knowledge graph encoded in dbt. Intermediate models should not be exposed in the main production schema.

Good pattern:

analytics.finance.orders
analytics.marketing.customers
analytics_intermediate.finance.int_payments_pivoted_to_orders

Bad pattern:

analytics.orders
analytics.customers
analytics.int_payments_pivoted_to_orders
analytics.stg_stripe__payments
analytics.tmp_joined_orders_v2

Warning: BI users and LLM agents do not reliably know which tables are safe. If you expose staging and intermediate models in the same schema as marts, people will query the wrong thing.


34. Use analyses for audit and migration SQL, not production logic

Rule: Store version-controlled, Jinja-enabled audit queries in analyses, especially during migrations or reconciliations.

Why this matters: Use analyses for queries that should be version-controlled but not built into the warehouse. Audit-helper style workflows are a typical fit.

Good use:

analyses/compare_legacy_orders_to_dbt_orders.sql
analyses/audit_revenue_migration.sql
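
A typical sketch in the style of the dbt-audit-helper README; the legacy database and schema names are hypothetical:

-- analyses/compare_legacy_orders_to_dbt_orders.sql
{% set legacy_orders = adapter.get_relation(
    database='analytics',
    schema='legacy',
    identifier='orders'
) %}

{{ audit_helper.compare_relations(
    a_relation=legacy_orders,
    b_relation=ref('orders'),
    primary_key='order_id'
) }}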

Bad use:

analyses/customer_lifetime_value.sql

where the query is actually needed by dashboards.

Warning: If an analysis becomes a recurring dependency, promote it to a model and test it.


35. Document macros once they become shared infrastructure

Rule: Any macro used by multiple models should have documented purpose and arguments.

Why this matters: A shared macro is hidden logic. Documenting it in a _macros.yml file once it is ready for use gives its purpose and arguments a discoverable home before other models start depending on it.

Good example:

macros:
  - name: cents_to_dollars
    description: Converts integer cents into decimal currency.
    arguments:
      - name: column_name
        type: column
        description: Column containing the amount in cents.
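
The macro body itself stays small. A sketch; the rounding behavior here is an assumption that the description should spell out:

{% macro cents_to_dollars(column_name) %}
    -- assumes integer cents; nulls pass through unchanged
    round({{ column_name }} / 100.0, 2)
{% endmacro %}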

Bad example:

{{ cents_to_dollars('amount') }}

with no explanation of rounding, null behavior, currency assumptions, or type returned.

Warning: Macros hide logic. Hidden logic without documentation becomes a semantic black box.


36. Prefer packages for common tests and utilities before writing custom logic

Rule: Before writing a singular test or custom macro, check whether a mature dbt package already covers the use case.

Why this matters: Packages such as dbt-utils, dbt-expectations, dbt-codegen, and dbt-audit-helper cover common project needs.

Good pattern:

Use dbt_utils.unique_combination_of_columns
instead of writing a one-off uniqueness test for composite grain.
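
Packages are declared in packages.yml and installed with dbt deps. A sketch; pin versions to ranges your project has actually tested:

packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]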

Bad pattern:

Every team writes its own slightly different accepted-values, recency, or relationship test.

Warning: Custom tests are useful, but they create maintenance obligations. Use them when the business rule is genuinely specific.


37. Use Codegen after you understand staging manually

Rule: Learn to write staging models by hand, then automate boilerplate staging and source YAML with Codegen.

Why this matters: Codegen is recommended for every project, but only after developers understand the staging pattern.

Good pattern:

First few source tables: write manually.
After pattern is clear: generate staging boilerplate.
Then customize exceptions.
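
Codegen runs as an operation. For example, generating staging boilerplate for the stripe payment table used earlier:

dbt run-operation generate_base_model \
  --args '{"source_name": "stripe", "table_name": "payment"}'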

Bad pattern:

Generate 200 staging models no one understands.

Warning: Codegen accelerates conventions. It does not design them for you.


38. Use snapshots only for true Type 2 history

Rule: Use snapshots when source data is destructively updated and you need historical versions.

Why this matters: Snapshots are the tool for creating Type 2 slowly changing dimension records from Type 1 source data.

Good use:

Track customer tier changes over time.
Track account owner changes over time.
Track subscription status history when the source overwrites status.
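
A sketch of the first case using the check strategy; the source and column names are hypothetical:

{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='check',
        check_cols=['customer_tier']
    )
}}

select * from {{ source('app', 'customers') }}

{% endsnapshot %}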

Bad use:

Snapshot every table just in case.

Warning: Snapshots add storage, complexity, and interpretation burden. If the source already provides historical events, model those events instead.


39. Use groups and access modifiers to encode ownership

Rule: As a project grows, define groups and set model access intentionally: private for implementation details, protected for cross-team reuse inside a project boundary, public for stable interfaces.

Why this matters: Groups, private models, and explicit cross-team collaboration matter most in dbt Mesh or multi-team projects.

Good pattern:

models:
  - name: int_payments_pivoted_to_orders
    config:
      group: finance
      access: private

  - name: orders
    config:
      group: finance
      access: public
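
The group itself is declared once, with an owner, which makes the accountability explicit:

groups:
  - name: finance
    owner:
      name: Finance Data Team
      email: finance-data@example.com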

Bad pattern:

Everything is public because someone might need it someday.

Warning: Public access is a promise. Do not make implementation details public unless you are ready to support them.


40. Version models only for breaking interface changes

Rule: Use model versions when a model’s structure, meaning, or contract changes in a way that would break consumers.

Why this matters: Versioned models and deprecation dates are part of managing model evolution.

Good use of versioning:

orders_v1: existing revenue logic
orders_v2: revised revenue recognition logic
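
A sketch of how this maps onto dbt's model versions YAML; the deprecation date is hypothetical:

models:
  - name: orders
    latest_version: 2
    versions:
      - v: 2
      - v: 1
        deprecation_date: 2025-06-01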

Bad use of versioning:

orders_v2 because the SQL was refactored internally but output is unchanged

Warning: Versioning is not a substitute for clean migration planning. Add deprecation dates and communicate which version consumers should move to.


41. Do not split projects by consumption mode

Rule: Avoid splitting dbt projects into “BI project,” “ML project,” and “executive reporting project” if they reuse the same core concepts.

Why this matters: Avoid splitting by ML versus reporting use cases because it undermines the single source of truth.

Good split:

finance-owned project
customer-domain project
platform/foundational project

Bad split:

dashboard_project
ml_features_project
ad_hoc_analysis_project

Warning: ML features and BI metrics should often come from the same governed marts or semantic definitions. Otherwise the business will get two versions of “customer,” “revenue,” and “active user.”


42. Keep rollups out of marts when they are really metrics

Rule: Do not create marts like orders_per_day, revenue_by_month, or customers_by_region unless they are deliberate aggregate tables. Prefer entity-grained marts plus metrics.

Why this matters: Pure marts should not usually include time-based rollups; grouped outputs move past marts into metrics.

Good pattern:

orders            → one row per order
customers         → one row per customer
metric: revenue   → aggregated by requested dimensions
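
On the Semantic Layer side, the metric is a thin definition over a measure, assuming an order_total measure exists in the orders semantic model:

metrics:
  - name: revenue
    label: Revenue
    type: simple
    type_params:
      measure: order_total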

Bad pattern:

orders_per_day
orders_per_week
orders_by_region
orders_by_channel

Warning: Pre-aggregated marts multiply definitions. They also restrict future slicing because the grain has already been collapsed.


43. Make null handling explicit in marts

Rule: When joining optional downstream data into marts, use explicit coalesce only where the business meaning is clear.

Why this matters: Typical mart examples coalesce missing payment amounts and order counts to zero. That is correct when “no related rows” means zero, not unknown.

Good example:

coalesce(order_payments.total_amount, 0) as total_amount

when no payment rows means no successful payments.

Risky example:

coalesce(customer_orders.lifetime_value, 0) as lifetime_value

if missing orders could mean ingestion failure, privacy filtering, or an unresolved join key.

Warning: Nulls are semantic information. Replacing null with zero can make data quality problems invisible.


44. Use business terminology at the semantic boundary

Rule: Rename source terms into business terms as early as staging when the mapping is stable.

Why this matters: Use business terminology rather than source terminology, for example customer_id instead of user_id if the business calls them customers.

Good example:

id as customer_id

Bad example:

user_id

in some models and:

customer_id

in others for the same entity.

Warning: Inconsistent terminology breaks metric discoverability. It also confuses LLM-based querying because the same concept appears under multiple names.


45. Optimize for boring consistency, not cleverness

Rule: Prefer repetitive, predictable patterns over clever SQL structures.

Why this matters: Consistency is a functional requirement. dbt projects are collaborative semantic systems, not personal SQL notebooks.

Good pattern:

source CTE
renamed/transformed CTE
final select

used consistently across staging models.
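
The skeleton, using the staging columns from rule 3:

with

source as (

    select * from {{ source('stripe', 'payment') }}

),

renamed as (

    select
        id as payment_id,
        orderid as order_id,
        amount / 100.0 as amount,
        created::timestamp as created_at

    from source

)

select * from renamed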

Bad pattern:

Every model has a different style, CTE order, aliasing convention, and config location.

Warning: Cleverness compounds. A clever staging model becomes a confusing intermediate model, then an unreliable mart, then a disputed metric.