Skip to content

Features

Use Histogram Overlay and Top-K Charts to Understand Data Change in dbt

Data profiling stats are a really efficient way to get an understanding of the distribution of data in a dbt model. You can immediately see skewed data and spot data outliers, something which is difficult to do when checking data at the row level. Here's how Recce can help you make the most of these high-level data stats:

Visualize data change with histogram and top-k charts

Profiling stats become even more useful when applied to data change validation. Let’s say you’ve updated a data model in dbt and changed the calculation logic for a column — how can you get an overview of how the data was changed or impacted? This is where checking the top-k values, or the histogram, of before-and-after you made the changes, comes in handy — But there’s one major issue...

The best way to visualize data change in a histogram chart
The best way to visualize data change in a histogram chart

Something’s not right

If you generate a histogram graph from prod data, then do the same for your dev branch, you’ve got two distinct graphs. The axes don’t match, and it’s difficult to compare:

Hands-On Data Impact Analysis for dbt Data Projects with Recce

dbt data projects aren’t getting any smaller and, with the increasing complexity of DAGs, properly validating your data modeling changes has become a difficult task. The adoption of best practices such as data project pull request templates, and other ‘pull request guard rails’ has increased merge times and prolonged the QA process for pull requests.

Validate data modeling changes in dbt projects by comparing two environments with Recce
Validate data modeling changes in dbt projects by comparing two environments with Recce

The difficulty comes from your responsibility to check not only the model SQL code, but also the data, which is a product of your code. Even when code looks right, silent errors and hard to notice bugs can make their way into the data. A proper pull request review is not complete with data validation.

Next-Level Data Validation Toolkit for dbt Data Projects — Introducing Recce

Build the ultimate PR comment to validate your data modeling changes
Recce: Data Validation Toolkit for dbt

Validating data modeling changes and reviewing pull requests for dbt projects can be a challenging task. The difficulty of performing a proper ‘code review’ for data projects, due to both the code and data needing review, means the data validation stage is often omitted, poorly implemented, or drastically slows down time-to-merge for your time sensitive data updates.

How can you maintain data best practices, but speed up the validation and review process?