Clean up your DBT DAG’s with hidden ref() macro.
Simplify denormalized linage by removing unneeded ref() on your DAG.
Problem we’re trying to solve:
There are multiple data modeling design patterns, of which the most common approaches involve denormalized and normalized tables. In a denormalized table, all of the relevant dimensional data (columns/attributes) are stored directly in the table. In a normalized table, only a reference key is stored in the fact table, which is used as a foreign key to other tables which store dimensional data. Regardless of which approach is used, it is generally considered best practice to do calculations using these keys where possible, because the keys contain all of the relevant information needed, tends to be more performant, etc.
In DBT however, nodes are connected based on references, and there isn’t a good way to represent the fact that a key is equivalent to a dimensional model. For denormalized design patterns, this results in having all of the dimensional models connect to the final fact model when it finally extracts the dimensional data out of the key. This often shows up as a bunch of staging models reaching all the way across your DAG (Directed Acyclic Graph) to the final fact model, which makes the DAG unnecessarily confusing. An example of this can be seen above.
To simplify the DAG, the hidden ref macro was created, which allows DBT models to reference other models without having them show up on the DAG.
It’s simple to implement once the macro is created.
Macro Code:
The standard schema approach:
For a standard database, schema, and table alias, the ref function can be replicated simply using “database.schema.table”. Simply create a file named ref.sql in your macros folder and add the code below:
Custom schema approach:
To use hidden ref, it manually implements what the normal ref macro does without calling it at all, since calling it triggers Python code to add it to the DAG and the model’s dependencies (more on this later). We used a straightforward approach with brute forced yaml:
Usually only a few models will ever need to be added to this hardcoded yaml, since there are only so many dimensional models that are used, and in such a way that using hidden ref becomes beneficial.
Dev, test, prod approach:
Hiding references on the DAG is only useful when looking at the DAG in dev. By setting “target.name” for different environments, you can ensure certain environments use the original ref function instead of the custom hidden ref implementation. This ensures that even if a mistake is made with the hidden ref function in dev, your code should continue to work in test or production.
Contributions:
- Jack Nguyen — Co-Author of the article.
- Michael Johnson (Dbt Labs)— For inspiring us to share the macro with the community.
If you would like to add to the discussion or ask dbt to implement the feature. Join the dbt core discussion thread: