Skip to content

Usage

Identify and Automate Data Checks on Critical dbt Models

Do you know which are the critical models in your data project?

I’m sure the answer is yes. Even if you don’t rank models, you can definitely point to which models you should tread carefully around.

Do you check these critical models for data impact with every pull request?

Maybe some, but it’s probably on a more ad-hoc basis. If they really are critical models, you need to be aware of unintended impact. The last thing you want to do is mistakenly change historical metrics, or lose data.

Every dbt project has critical models
Impacted Lineage DAG from Recce showing modified and impacted models on the California Integrated Travel Project dbt project

Identifying critical models

Knowing the critical models in your project comes from your domain knowledge. You know these models have:

Use Histogram Overlay and Top-K Charts to Understand Data Change in dbt

Data profiling stats are a really efficient way to get an understanding of the distribution of data in a dbt model. You can immediately see skewed data and spot data outliers, something which is difficult to do when checking data at the row level. Here's how Recce can help you make the most of these high-level data stats:

Visualize data change with histogram and top-k charts

Profiling stats become even more useful when applied to data change validation. Let’s say you’ve updated a data model in dbt and changed the calculation logic for a column — how can you get an overview of how the data was changed or impacted? This is where checking the top-k values, or the histogram, of before-and-after you made the changes, comes in handy — But there’s one major issue...

The best way to visualize data change in a histogram chart
The best way to visualize data change in a histogram chart

Something’s not right

If you generate a histogram graph from prod data, then do the same for your dev branch, you’ve got two distinct graphs. The axes don’t match, and it’s difficult to compare: