
2024

How to support self-serve data and analytics with comprehensive PR review

dbt, the platform that popularized ELT, has revolutionized the way data teams create and maintain data pipelines. The key is in the ‘T’ of ELT. Rather than transforming data before it hits the data warehouse, as in traditional ETL, dbt flips this and promotes loading raw data into your data warehouse and transforming it there, thus ELT.
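As a minimal sketch of what that ‘T’ looks like in practice, here is a dbt staging model that transforms raw data already loaded into the warehouse; the source and column names (shop.raw_orders, order_id, and so on) are hypothetical:

```sql
-- models/staging/stg_orders.sql
-- A minimal dbt staging model: the raw table was loaded into the warehouse
-- as-is, and the transformation runs inside the warehouse when dbt builds
-- the model. The 'shop' source and its columns are hypothetical.
select
    order_id,
    customer_id,
    cast(amount as numeric) as amount_usd,
    date_trunc('day', ordered_at) as order_date
from {{ source('shop', 'raw_orders') }}
where status != 'cancelled'
```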

This, along with bringing analytics inside the data project, makes dbt an interesting solution for data teams looking to maintain a single source of truth (SSoT) for their data.

Move Fast and DON'T Break Prod

From DevOps to DataOps: A Fireside Chat on Practical Strategies for Effective Data Productivity

Top priorities for data-driven organizations are data productivity, cost reduction, and error prevention. The four strategies to improve DataOps are:

  1. start with small, manageable improvements,
  2. follow a clear blueprint,
  3. conduct regular data reviews, and
  4. gradually introduce best practices across the team.

In a recent fireside chat, CL Kao, founder of Recce, and Noel Gomez, co-founder of Datacoves, shared their combined experience of over two decades in the data and software industry. They discussed practical strategies to tackle these challenges, the evolution from DevOps to DataOps, and the need for companies to focus on data quality to avoid costly mistakes.

Fireside chat banner: Data Productivity - Beyond DevOps & dbt

Identify and Automate Data Checks on Critical dbt Models

Do you know which models in your data project are critical?

I’m sure the answer is yes. Even if you don’t rank your models, you can definitely point to the ones you should tread carefully around.

Do you check these critical models for data impact with every pull request?

Maybe some, but it’s probably on a more ad-hoc basis. If they really are critical models, you need to be aware of unintended impact. The last thing you want is to mistakenly change historical metrics or lose data.
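As an illustration of the kind of ad-hoc check this replaces, here is a simple SQL spot check comparing a critical model across environments; the schema and model names (analytics_prod, analytics_dev, fct_revenue) and the revenue column are hypothetical:

```sql
-- Spot check a critical model across environments: the row count and a key
-- aggregate should only move in ways the PR intends.
-- Schema and model names are hypothetical.
select 'prod' as env, count(*) as row_count, sum(revenue) as total_revenue
from analytics_prod.fct_revenue
union all
select 'dev' as env, count(*) as row_count, sum(revenue) as total_revenue
from analytics_dev.fct_revenue;
```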

Every dbt project has critical models
Impacted Lineage DAG from Recce showing modified and impacted models on the California Integrated Travel Project dbt project

Identifying critical models

Knowing the critical models in your project comes from your domain knowledge. You know these models have:

Use Histogram Overlay and Top-K Charts to Understand Data Change in dbt

Data profiling stats are an efficient way to understand the distribution of data in a dbt model. You can immediately see skewed data and spot outliers, something that is difficult to do when checking data at the row level. Here's how Recce can help you make the most of these high-level data stats:

Visualize data change with histogram and top-k charts

Profiling stats become even more useful when applied to data change validation. Let’s say you’ve updated a data model in dbt and changed the calculation logic for a column: how can you get an overview of how the data was changed or impacted? This is where comparing the top-k values, or the histogram, from before and after your change comes in handy. But there’s one major issue...
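A before-and-after top-k check is, at heart, just a grouped count run against each environment; a minimal sketch with hypothetical schema, model, and column names (analytics_prod, analytics_dev, fct_orders, payment_method):

```sql
-- Top-k (k = 10) values of a column before the change (production)...
select payment_method, count(*) as occurrences
from analytics_prod.fct_orders
group by payment_method
order by occurrences desc
limit 10;

-- ...and after the change (your dev/PR schema).
select payment_method, count(*) as occurrences
from analytics_dev.fct_orders
group by payment_method
order by occurrences desc
limit 10;
```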

The best way to visualize data change in a histogram chart

Something’s not right

If you generate a histogram graph from prod data, then do the same for your dev branch, you’ve got two distinct graphs. The axes don’t match, and it’s difficult to compare:
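One way to make the two charts comparable is to derive a single bin range from both environments and bucket each side with the same edges, which is essentially what an overlay chart does. A sketch, assuming width_bucket (available in Postgres and Snowflake, among others) and hypothetical schema, model, and column names:

```sql
-- Compute one shared bin range across prod and dev, then bucket both sides
-- with the same edges so the histograms share an axis.
-- Schema, model, and column names are hypothetical.
with bounds as (
    select min(amount_usd) as lo, max(amount_usd) as hi
    from (
        select amount_usd from analytics_prod.fct_orders
        union all
        select amount_usd from analytics_dev.fct_orders
    ) combined
)
select
    env,
    width_bucket(amount_usd, bounds.lo, bounds.hi, 20) as bin,
    count(*) as bin_count
from (
    select 'prod' as env, amount_usd from analytics_prod.fct_orders
    union all
    select 'dev' as env, amount_usd from analytics_dev.fct_orders
) per_env
cross join bounds
group by env, bin
order by env, bin;
```

With shared edges, the two distributions can be plotted on one axis and compared bin by bin.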

Hands-On Data Impact Analysis for dbt Data Projects with Recce

dbt data projects aren’t getting any smaller and, with the increasing complexity of DAGs, properly validating your data modeling changes has become a difficult task. The adoption of best practices, such as data project pull request templates and other ‘pull request guard rails’, has increased merge times and prolonged the QA process for pull requests.

Validate data modeling changes in dbt projects by comparing two environments with Recce

The difficulty comes from your responsibility to check not only the model SQL code, but also the data, which is a product of your code. Even when the code looks right, silent errors and hard-to-notice bugs can make their way into the data. A proper pull request review is not complete without data validation.
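One way to surface those silent errors is a symmetric row-level diff of a model between the two environments; a sketch using standard EXCEPT, with hypothetical schema and model names:

```sql
-- Symmetric row-level diff between prod and dev: any rows returned exist on
-- only one side, flagging data changes that code review alone would miss.
-- Schema and model names are hypothetical.
(
    select * from analytics_prod.dim_customers
    except
    select * from analytics_dev.dim_customers
)
union all
(
    select * from analytics_dev.dim_customers
    except
    select * from analytics_prod.dim_customers
);
```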

Next-Level Data Validation Toolkit for dbt Data Projects — Introducing Recce

Build the ultimate PR comment to validate your data modeling changes
Recce: Data Validation Toolkit for dbt

Validating data modeling changes and reviewing pull requests for dbt projects can be a challenging task. Because both the code and the data need review, performing a proper ‘code review’ for a data project is difficult; as a result, the data validation stage is often omitted, poorly implemented, or drastically slows down time-to-merge for your time-sensitive data updates.

How can you maintain data best practices, but speed up the validation and review process?