Clean up your DBT DAG’s with hidden ref() macro.

Daniel Wilczak
3 min readMay 31, 2023

--

Simplify denormalized linage by removing unneeded ref() on your DAG.

Problem we’re trying to solve:

There are multiple data modeling design patterns, of which the most common approaches involve denormalized and normalized tables. In a denormalized table, all of the relevant dimensional data (columns/attributes) are stored directly in the table. In a normalized table, only a reference key is stored in the fact table, which is used as a foreign key to other tables which store dimensional data. Regardless of which approach is used, it is generally considered best practice to do calculations using these keys where possible, because the keys contain all of the relevant information needed, tends to be more performant, etc.

Before using hidden ref() macro. Image by authors.

In DBT however, nodes are connected based on references, and there isn’t a good way to represent the fact that a key is equivalent to a dimensional model. For denormalized design patterns, this results in having all of the dimensional models connect to the final fact model when it finally extracts the dimensional data out of the key. This often shows up as a bunch of staging models reaching all the way across your DAG (Directed Acyclic Graph) to the final fact model, which makes the DAG unnecessarily confusing. An example of this can be seen above.

To simplify the DAG, the hidden ref macro was created, which allows DBT models to reference other models without having them show up on the DAG.

After using hidden ref() macro. Image by authors.

It’s simple to implement once the macro is created.

Code written by authors.

Macro Code:

The standard schema approach:

For a standard database, schema, and table alias, the ref function can be replicated simply using “database.schema.table”. Simply create a file named ref.sql in your macros folder and add the code below:

Code written by authors.

Custom schema approach:

To use hidden ref, it manually implements what the normal ref macro does without calling it at all, since calling it triggers Python code to add it to the DAG and the model’s dependencies (more on this later). We used a straightforward approach with brute forced yaml:

Code written by authors.

Usually only a few models will ever need to be added to this hardcoded yaml, since there are only so many dimensional models that are used, and in such a way that using hidden ref becomes beneficial.

Dev, test, prod approach:

Hiding references on the DAG is only useful when looking at the DAG in dev. By setting “target.name” for different environments, you can ensure certain environments use the original ref function instead of the custom hidden ref implementation. This ensures that even if a mistake is made with the hidden ref function in dev, your code should continue to work in test or production.

Code written by authors.

Contributions:

If you would like to add to the discussion or ask dbt to implement the feature. Join the dbt core discussion thread:

https://github.com/dbt-labs/dbt-core/discussions/7733

--

--

Daniel Wilczak
Daniel Wilczak

Written by Daniel Wilczak

Data engineer with a passion for machine learning.

No responses yet