Skip to content

Features

Why Column-Level Lineage Matters for dbt: Comparing the Options

Column-level lineage was a hot topic during 2024, with many data platforms and tools adding this feature to their software and boasting the benefits of column-level lineage for data workflows.

In this article, I’ll take a look at how column-level lineage works in dbt, in both open-source and dbt Cloud, and through SQLMesh; and see how the usage and workflow differs between these platforms.

Column-level lineage in dbt Explorer showing columns as nodes
Column-level lineage in dbt Explorer showing columns as nodes

Seeing lineage in your transformation platform, especially during data modeling updates, is particularly important because it’s at this stage that you most need to understand the impact of your actions.

Use Histogram Overlay and Top-K Charts to Understand Data Change in dbt

Data profiling stats are a really efficient way to get an understanding of the distribution of data in a dbt model. You can immediately see skewed data and spot data outliers, something which is difficult to do when checking data at the row level. Here's how Recce can help you make the most of these high-level data stats:

Visualize data change with histogram and top-k charts

Profiling stats become even more useful when applied to data change validation. Let’s say you’ve updated a data model in dbt and changed the calculation logic for a column — how can you get an overview of how the data was changed or impacted? This is where checking the top-k values, or the histogram, of before-and-after you made the changes, comes in handy — But there’s one major issue...

The best way to visualize data change in a histogram chart
The best way to visualize data change in a histogram chart

Something’s not right

If you generate a histogram graph from prod data, then do the same for your dev branch, you’ve got two distinct graphs. The axes don’t match, and it’s difficult to compare:

Hands-On Data Impact Analysis for dbt Data Projects with Recce

dbt data projects aren’t getting any smaller and, with the increasing complexity of DAGs, properly validating your data modeling changes has become a difficult task. The adoption of best practices such as data project pull request templates, and other ‘pull request guard rails’ has increased merge times and prolonged the QA process for pull requests.

Validate data modeling changes in dbt projects by comparing two environments with Recce
Validate data modeling changes in dbt projects by comparing two environments with Recce

The difficulty comes from your responsibility to check not only the model SQL code, but also the data, which is a product of your code. Even when code looks right, silent errors and hard to notice bugs can make their way into the data. A proper pull request review is not complete with data validation.

Next-Level Data Validation Toolkit for dbt Data Projects — Introducing Recce

Build the ultimate PR comment to validate your data modeling changes
Recce: Data Validation Toolkit for dbt

Validating data modeling changes and reviewing pull requests for dbt projects can be a challenging task. The difficulty of performing a proper ‘code review’ for data projects, due to both the code and data needing review, means the data validation stage is often omitted, poorly implemented, or drastically slows down time-to-merge for your time sensitive data updates.

How can you maintain data best practices, but speed up the validation and review process?