If you work in data and haven’t used dbt yet, you’ve almost certainly heard someone on your team talk about it. dbt (data build tool) has gone from a niche open-source project to the standard transformation layer in modern data stacks. Over 40,000 companies now use it, including JetBlue, HubSpot, and GitLab. But what is dbt as a data tool, and why has it become so central to how data teams operate? Here’s what you actually need to know.
What Is dbt and Why Does It Matter?
dbt stands for “data build tool.” It’s an open-source command-line tool that lets data analysts and engineers transform raw data inside your warehouse using SQL. That’s it. No new language to learn, no complex orchestration engine, no drag-and-drop GUI pretending to be code.
Before dbt, transformation logic lived in messy stored procedures, undocumented Python scripts, or worse, inside BI tool calculated fields that nobody could audit. dbt changed that by treating SQL transformations like software: version-controlled, tested, and documented.
The core idea is simple. You write SELECT statements. dbt handles the DDL and DML (the CREATE TABLE, INSERT, MERGE operations) for you. You define models as .sql files, dbt compiles them, determines the dependency order using a DAG (directed acyclic graph), and runs them against your warehouse.
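To make that concrete, here is a minimal sketch of a dbt model. It is just a SELECT statement saved as a .sql file; the source and column names are invented for illustration:

```sql
-- models/staging/stg_customers.sql
-- dbt compiles this SELECT and wraps it in the CREATE TABLE AS
-- (or CREATE VIEW AS) statement appropriate for your warehouse.
select
    id as customer_id,
    lower(email) as email,
    created_at
from raw.crm.customers
```

There is no boilerplate around it: where the output lands and whether it becomes a table or a view is configuration, not code you write by hand.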
How dbt Fits Into the Modern Data Stack
To understand dbt’s role, think about the three layers of a modern data pipeline:
- Extract and Load (EL): Tools like Fivetran, Airbyte, or Stitch pull raw data from source systems into your warehouse.
- Transform (T): This is where dbt sits. It takes raw, messy data and turns it into clean, reliable models analysts can query.
- Analytics and BI: Tools like Looker, Tableau, or Power BI sit on top, querying the transformed models.
This ELT pattern (extract, load, then transform) replaced the older ETL approach because cloud warehouses like Snowflake, BigQuery, and Databricks are cheap and fast enough to store raw data and transform it in place. dbt is the tool that made the “T” in ELT practical for teams of all sizes.
If you’re evaluating platforms for the warehouse layer, our comparison of data lakehouse architectures covers how these compute engines differ.
What dbt Actually Does: Core Features
SQL-Based Transformations
Every dbt model is a SELECT statement stored as a .sql file. You reference other models using the ref() function, and dbt builds the dependency graph automatically. If model B depends on model A, dbt runs A first. No manual orchestration needed.
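For example, a downstream model pulls in upstream models with ref(), and dbt infers the run order from those references (the model names here are hypothetical):

```sql
-- models/marts/customer_orders.sql
-- Because this model calls ref('stg_customers') and ref('stg_orders'),
-- dbt builds both staging models before this one.
select
    c.customer_id,
    count(o.order_id) as order_count
from {{ ref('stg_customers') }} as c
left join {{ ref('stg_orders') }} as o
    on o.customer_id = c.customer_id
group by c.customer_id
```

Swapping a hard-coded table name for ref() is what turns a pile of scripts into a DAG dbt can reason about.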
Testing and Data Quality
dbt has built-in testing. You can add schema tests (unique, not_null, accepted_values, relationships) with a single line of YAML. Custom tests are just SQL queries that return failing rows. I’ve seen teams cut data quality incidents by 60-70% within three months of adding dbt tests to their pipelines.
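As a sketch of what "a single line of YAML" means in practice (model and column names are made up):

```yaml
# models/staging/schema.yml
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:          # each entry compiles to a SQL query returning failing rows
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Running `dbt test` executes every one of these against the warehouse, so data quality checks live in the same repo and CI pipeline as the transformations themselves.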
Documentation That Stays Current
One of dbt’s best features is auto-generated documentation. You add descriptions in YAML files alongside your models, and dbt produces a browsable website showing every model, its columns, descriptions, and the full lineage graph. Because the docs live next to the code, they actually get updated (unlike that Confluence page from 2022 nobody maintains).
Jinja Templating and Macros
dbt uses Jinja (a Python templating language) inside SQL, letting you write DRY (Don’t Repeat Yourself) code. Common patterns like surrogate key generation, pivot tables, or SCD Type 2 logic can be abstracted into reusable macros. The dbt community has published hundreds of these in open-source packages.
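A rough sketch of a reusable macro, using an invented cents-to-dollars example (the macro name and columns are illustrative, not from a published package):

```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, decimals=2) %}
    round({{ column_name }} / 100.0, {{ decimals }})
{% endmacro %}

-- Usage inside any model:
-- select {{ cents_to_dollars('amount_cents') }} as amount_usd
-- from {{ ref('stg_payments') }}
```

At compile time dbt expands the macro into plain SQL, so the warehouse never sees the Jinja, only the resulting expression.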
Incremental Models
For large tables, dbt supports incremental materialisation. Instead of rebuilding a full table every run, it processes only new or changed rows. On a 500-million-row fact table, this can reduce run times from 45 minutes to under 3 minutes.
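A simplified incremental model might look like the following, assuming an events table with an updated_at timestamp (table and column names are hypothetical):

```sql
-- models/marts/fct_events.sql
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    updated_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what is
  -- already in the target table ({{ this }}); dbt merges them
  -- in using the unique_key above.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

On the first run the `is_incremental()` block is skipped and the full table is built; every run after that touches only the new or changed rows.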
dbt Core vs dbt Cloud: Which Should You Use?
dbt comes in two flavours:
| Feature | dbt Core (Open Source) | dbt Cloud (SaaS) |
|---|---|---|
| Cost | Free | From $100/seat/month (Team plan) |
| Hosting | Self-managed (CLI) | Managed web IDE + scheduler |
| Scheduling | Bring your own (Airflow, cron, etc.) | Built-in job scheduler |
| IDE | VS Code or any editor | Browser-based IDE |
| CI/CD | You configure it | Slim CI built in |
| Semantic Layer | Not available | Included (MetricFlow) |
| Best for | Teams with strong engineering ops | Teams wanting faster setup |
In my experience, teams with fewer than five analysts do fine with dbt Core and Airflow. Once you hit 10 or more people editing models, the collaboration features in dbt Cloud (especially the IDE and CI/CD) start paying for themselves quickly.
Who Uses dbt on a Data Team?
This is where dbt shifted the industry. Before dbt, transformation work was exclusively an engineering task. dbt made it accessible to analysts who know SQL but not Python or Spark.
On a typical team:
- Analytics engineers own the dbt project day-to-day. They build and maintain the models that power dashboards and reports.
- Data engineers handle the infrastructure: the warehouse, the EL pipelines, and the orchestration layer that triggers dbt runs. For more on this role, see our guide on data engineering careers.
- Data analysts consume the transformed models and sometimes contribute simpler models or tests.
- Data leaders benefit from the lineage and documentation when making decisions about data quality or platform investments.
The “analytics engineer” title barely existed before dbt. Now it’s one of the fastest-growing roles in data, with median salaries around $130,000-$160,000 in the US according to recent LinkedIn data.
Getting Started with dbt: A Realistic Timeline
If you’re considering adopting dbt, here’s what a realistic rollout looks like based on what I’ve seen work across mid-size teams:
Weeks 1-2: Install dbt Core, connect to your warehouse, and build 3-5 staging models from your most important source tables.
Weeks 3-4: Add schema tests and documentation. Build your first “mart” model (a clean, business-ready table for a specific use case like revenue reporting).
Month 2: Migrate existing transformation logic from stored procedures or scripts into dbt. Set up CI/CD with GitHub or GitLab.
Month 3: Roll out to the broader team. Establish naming conventions, a PR review process, and a style guide.
The biggest mistake I see is trying to migrate everything at once. Start with one business domain, prove the value, then expand. Teams that try to boil the ocean end up with a half-finished dbt project and a half-maintained legacy pipeline, which is worse than either approach alone.
Common Pitfalls When Adopting dbt
After seeing dozens of dbt implementations, these are the patterns that cause the most pain:
- Too many models, no naming convention: Without a clear structure (staging, intermediate, mart layers), projects become impossible to maintain past 100 models.
- Skipping tests: dbt’s testing is its superpower. Teams that skip it end up with the same data quality problems they had before, just written in fancier SQL.
- Over-engineering with Jinja: Just because you can write complex Jinja macros doesn’t mean you should. If a colleague can’t read the compiled SQL, you’ve gone too far.
- Ignoring materialisation strategy: Running everything as table materialisations when views would suffice (or vice versa) either wastes compute or creates slow queries.
If you’re building the broader data engineering toolchain around dbt, getting these fundamentals right early saves significant rework later.
Frequently Asked Questions
Is dbt free to use?
dbt Core is completely free and open source under the Apache 2.0 licence. You can install it via pip and run it against any supported warehouse (Snowflake, BigQuery, Redshift, Databricks, Postgres, and others). dbt Cloud, the managed SaaS product, has a free developer tier for individuals and paid plans starting at $100 per seat per month for teams.
Do I need to know Python to use dbt?
No. dbt’s primary interface is SQL. If you can write SELECT statements, you can use dbt. Some advanced features use Jinja templating (which has a Python-like syntax), but most teams get 90% of the value using straightforward SQL with basic Jinja ref() and source() functions. dbt does support Python models for machine learning or complex transformations, but that’s optional.
How is dbt different from an ETL tool like Informatica or Talend?
Traditional ETL tools extract data, transform it in a separate processing engine, then load the results into a warehouse. dbt only handles the transformation step, and it runs inside your existing warehouse. It assumes data is already loaded (via Fivetran, Airbyte, or custom scripts). This “ELT” approach is simpler, cheaper, and takes advantage of your warehouse’s compute power rather than requiring a separate transformation server.
Can dbt replace Apache Airflow or other orchestrators?
Not exactly. dbt handles the dependency ordering of SQL transformations within a run, but it doesn’t handle cross-system orchestration. Most teams use Airflow, Dagster, or Prefect to trigger dbt runs as part of a larger pipeline (for example, “run Fivetran sync, then run dbt, then refresh the Looker cache”). dbt Cloud includes a built-in scheduler, which works for simpler setups, but complex pipelines still benefit from a dedicated orchestrator.
Ben is a full-time data leadership professional and a part-time blogger.
When he’s not writing articles for Data Driven Daily, Ben is a Head of Data Strategy at a large financial institution.
He has over 14 years’ experience in Banking and Financial Services, during which he has led large data engineering and business intelligence teams, managed cloud migration programs, and spearheaded regulatory change initiatives.