
2024

From DevOps to DataOps: A Fireside Chat on Practical Strategies for Effective Data Productivity

Top priorities for data-driven organizations are data productivity, cost reduction, and error prevention. The four strategies to improve DataOps are:

  1. start with small, manageable improvements,
  2. follow a clear blueprint,
  3. conduct regular data reviews, and
  4. gradually introduce best practices across the team.

In a recent fireside chat, CL Kao, founder of Recce, and Noel Gomez, co-founder of Datacoves, shared their combined experience of over two decades in the data and software industry. They discussed practical strategies to tackle these challenges, the evolution from DevOps to DataOps, and the need for companies to focus on data quality to avoid costly mistakes.

Fireside chat banner: Data Productivity - Beyond DevOps & dbt

Identify and Automate Data Checks on Critical dbt Models

Do you know which models in your data project are critical?

I’m sure the answer is yes. Even if you don’t rank models, you can certainly point to the models you should tread carefully around.

Do you check these critical models for data impact with every pull request?

Maybe some, but probably only on an ad-hoc basis. If these models really are critical, you need to be aware of unintended impact: the last thing you want is to mistakenly change historical metrics or lose data.
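If that check isn’t automated yet, even a manual spot check per pull request helps. A minimal sketch, assuming prod is built in an analytics schema and the PR build lands in analytics_dev (both schema names, and the fct_orders model, are illustrative):

```sql
-- Hypothetical spot check for a critical model: compare headline
-- aggregates between the prod schema and the dev (PR) schema.
select
    'prod' as env,
    count(*) as row_count,
    count(distinct order_id) as distinct_orders,
    sum(order_total) as total_revenue
from analytics.fct_orders

union all

select
    'dev' as env,
    count(*),
    count(distinct order_id),
    sum(order_total)
from analytics_dev.fct_orders;
```

If the two rows diverge in a way the pull request can’t explain, you’ve caught an unintended impact before merge.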

Every dbt project has critical models
Impacted lineage DAG from Recce, showing modified and impacted models in the California Integrated Travel Project dbt project

Identifying critical models

Knowing the critical models in your project comes from your domain knowledge. You know these models have:

Self-Serve Review for Self-Serve Data

dbt has been revolutionary for data teams. ELT in general has changed the way we create and maintain data pipelines, but the biggest change has come in the form of ‘self-serve data’.

A self-serve data platform… supports creating new data products without the need for custom tooling or specialized knowledge. — dbt

The concept of self-serve data opened up access to the wider data team. You’ve probably found that every data role on your team, from data engineer to data analyst, can have a hand in modifying and managing data, which brings new challenges for maintaining data stability.

Side-by-side histograms: bring self-service review to self-service data

Data Incidents still happen

Even with all of the benefits that dbt brings, bad merges still happen. Your code is version controlled, and you can track and review code changes more easily than ever, but data projects bring a unique challenge: you need to review both the code and the data.
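Reviewing the data can start with a simple row-level diff between environments. A minimal sketch, assuming a prod build in an analytics schema and a PR build in analytics_dev, with an illustrative dim_customers model; except is ANSI SQL, spelled minus in some warehouses:

```sql
-- Rows that exist in prod but are missing or changed in the dev build.
-- Run it in the other direction as well to see added rows.
select * from analytics.dim_customers
except
select * from analytics_dev.dim_customers;
```

An empty result in both directions means the change was a pure refactor; anything else is exactly what the review should discuss.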

Use Histogram Overlay and Top-K Charts to Understand Data Change in dbt

Data profiling stats are an efficient way to understand the distribution of data in a dbt model. You can immediately see skewed data and spot outliers, something that is difficult to do when checking data at the row level. Here’s how Recce can help you make the most of these high-level data stats:

Visualize data change with histogram and top-k charts

Profiling stats become even more useful when applied to data change validation. Let’s say you’ve updated a data model in dbt and changed the calculation logic for a column: how can you get an overview of how the data was changed or impacted? This is where checking the top-k values, or the histogram, from before and after you made the changes comes in handy. But there’s one major issue...
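The top-k half of that comparison is straightforward to express in plain SQL. A minimal sketch, assuming a changed fct_orders model with a status column, built in an analytics (prod) schema and an analytics_dev (PR) schema; all names are illustrative:

```sql
-- Top-k values of a column, compared across environments.
with prod_k as (
    select status, count(*) as n
    from analytics.fct_orders
    group by status
),
dev_k as (
    select status, count(*) as n
    from analytics_dev.fct_orders
    group by status
)
select
    coalesce(p.status, d.status) as status,
    p.n as prod_count,
    d.n as dev_count
from prod_k p
full outer join dev_k d on p.status = d.status
order by coalesce(p.n, d.n) desc
limit 10;
```

A value that appears on only one side, or whose count shifts sharply, points straight at the impact of the change.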

The best way to visualize data change in a histogram chart

Something’s not right

If you generate a histogram from prod data, then do the same for your dev branch, you get two distinct charts. The axes don’t match, and the two are difficult to compare.
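One way to make the charts comparable is to compute a single set of bin edges across both environments and bucket each side with the same edges, which is essentially what an overlaid histogram does. A minimal sketch, assuming the warehouse provides width_bucket (Postgres, Snowflake, and Redshift do) and the same illustrative names as above:

```sql
-- Bucket prod and dev values with shared bin edges so the two
-- histograms share an x-axis and can be overlaid.
with bounds as (
    select min(order_total) as lo, max(order_total) as hi
    from (
        select order_total from analytics.fct_orders
        union all
        select order_total from analytics_dev.fct_orders
    ) as both_envs
)
select 'prod' as env, width_bucket(o.order_total, b.lo, b.hi, 20) as bin, count(*) as n
from analytics.fct_orders o cross join bounds b
group by 1, 2

union all

select 'dev' as env, width_bucket(o.order_total, b.lo, b.hi, 20) as bin, count(*) as n
from analytics_dev.fct_orders o cross join bounds b
group by 1, 2
order by env, bin;
```

Because both sides use the same bounds and bucket count, bin 7 in prod is directly comparable to bin 7 in dev.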

Hands-On Data Impact Analysis for dbt Data Projects with Recce

dbt data projects aren’t getting any smaller and, with the increasing complexity of DAGs, properly validating your data modeling changes has become a difficult task. The adoption of best practices such as data project pull request templates and other ‘pull request guard rails’ has increased merge times and prolonged the QA process for pull requests.

Validate data modeling changes in dbt projects by comparing two environments with Recce

The difficulty comes from your responsibility to check not only the model SQL code, but also the data, which is a product of your code. Even when the code looks right, silent errors and hard-to-notice bugs can make their way into the data. A proper pull request review is not complete without data validation.
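A few cheap assertions catch many of those silent errors. A minimal sketch, reusing the illustrative analytics (prod) and analytics_dev (PR) schemas and the fct_orders model from the examples above:

```sql
-- Three quick sanity checks on the dev build of a changed model:
-- duplicated keys, unexpected nulls, and rows lost relative to prod.
select
    (select count(*) - count(distinct order_id)
     from analytics_dev.fct_orders) as duplicate_keys,
    (select count(*)
     from analytics_dev.fct_orders
     where order_total is null) as null_order_totals,
    (select count(*) from analytics.fct_orders)
        - (select count(*) from analytics_dev.fct_orders) as rows_lost;
```

Non-zero results aren’t automatically wrong, but each one deserves an explanation in the pull request before merge.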

Next-Level Data Validation Toolkit for dbt Data Projects — Introducing Recce

Build the ultimate PR comment to validate your data modeling changes
Recce: Data Validation Toolkit for dbt

Validating data modeling changes and reviewing pull requests for dbt projects can be a challenging task. Because both the code and the data need review, a proper ‘code review’ for data projects is difficult, and the data validation stage is often omitted, poorly implemented, or drastically slows down time-to-merge for your time-sensitive data updates.

How can you maintain data best practices while speeding up the validation and review process?