How Recce Performs Column-Level Lineage: Our Approach to SQL Transformations
Understanding how each column in a data model is derived is a critical part of building trustworthy data pipelines. Column-level lineage (CLL) provides visibility into how a model’s columns are constructed, which upstream columns they depend on, and how those columns are transformed along the way.
Column lineage is particularly useful for the following data engineering tasks:
- Source Exploration: During development, column-level lineage helps you understand how a column is derived.
- Impact Analysis: When modifying the logic of a column, column-level lineage enables you to assess the potential impact across the entire DAG.
- Root Cause Analysis: When tracing errors, column-level lineage helps identify the possible source of errors.
In this article, we’ll take a closer look at how Recce builds column-level lineage under the hood, how we classify transformations, and other challenges we encountered when building this feature.
How Recce Constructs the CLL graph
In Recce, when user selects a column in a model, the lineage DAG highlights the upstream and downstream model nodes that are connected through the dependencies of their related columns.
To enable this, Recce builds column-level dependencies by tracking the origin of every column in the model, and shows exactly which upstream columns contribute to the output, and how.
For example, given the following SQL:
SELECT
user_id,
order_total * 1.05 AS order_total_with_tax
FROM {{ ref("orders") }}
Recce would classify:
user_id
as a pass-through column, because it’s selected directly without modification.order_total_with_tax
as a derived column, because it’s computed using an arithmetic expression.
Pass-through and derived are just two of he column transformation types detected in Recce. See below for the full list.
Classifying Column Transformations
Recce parses each model’s SQL using an abstract syntax tree (AST) and resolves column expressions to their corresponding upstream inputs. During this process, we map each output column to its source(s) and classify the transformation using a deterministic set of rules.
Transformation Types
Understanding the transformation types will help you determine which columns to focus on when inspecting data flow through the project. All examples below are shown using dbt syntax, which includes references that use Jinja templating syntax. For example {{ ref("model_name") }}
.
Each column is classified as one of the following transformation types:
1. Pass-through
A pass-through column is selected exactly as-is from upstream. There are no renames, calculations, or expressions involved.
Example:
SELECT user_id FROM {{ ref("users") }}
In this example, user_id
is directly passed through from the users
table.
2. Renamed
A renamed column is sourced from a single upstream column, but the output name is changed using an alias. There’s no transformation beyond renaming.
Example:
SELECT user_id AS id FROM {{ ref("users") }}
id
is just a renamed version of user_id
, with no computation applied.
3. Derived
A derived column is the result of an expression involving one or more upstream columns. This includes arithmetic operations, logical expressions, CASE WHEN
, function calls, or aggregation functions such as SUM()
, AVG()
, or COUNT()
.
Derived columns usually deserve closer attention because they introduce business logic into the model. By understanding how a column is calculated from upstream inputs, you can better assess its potential impact on downstream derived columns.
Example:
SELECT price * quantity AS total_amount FROM {{ ref("sales") }}
In this example, total_amount
is derived from both price
and quantity
.
4. Source
A source column is not based on any upstream column. Instead, it’s produced from a literal value, a constant function (like CURRENT_TIMESTAMP
), or other expressions with no column input.
Example:
SELECT CURRENT_TIMESTAMP AS created_at FROM {{ ref("some_model") }}
In this example, the output column is not dependent on upstream data.
5. Unknown
A column is classified as an unknown transformation type when we fail to parse the column logic, or if the logic involves non-SQL components like Python, unsupported Jinja macros, or ambiguous expressions.
Example:
SELECT a, b
FROM T1
JOIN T2 ON T1.id = T2.id
In this example, without table schemas, it’s unclear whether a and b come from T1, T2, or both. This ambiguity prevents Recce from accurately resolving the column’s origin, so the transformation type is marked as unknown.
Implementation Notes
Recce uses a SQL parser–based approach to analyze dbt models and extract column-level lineage information. Under the hood, SQLGlot is used to parse SQL into an Abstract Syntax Tree (AST), which we traverse to resolve column dependencies and transformation logic. This allows us to understand how each output column is constructed from upstream inputs.
The core of our implementation can be understood in two parts: traversing scopes and analyzing columns.
Traversing scopes
Recce builds column-level lineage by walking through each scope in a SQL model, including CTEs, subqueries, and the final SELECT. This is done using SQLGlot’s traverse_scope() utility.
In SQLGlot, a scope represents a logical block of SQL, such as a CTE, subquery, or root query, along with the locally available columns and expressions.
The example below shows how we parse a model, walk through each scope, and iterate over the selected columns:
import sqlglot
from sqlglot.optimizer.scope import traverse_scope
from sqlglot.expressions import Column
sql = """
WITH base AS (
SELECT user_id, revenue, email FROM raw_users
)
SELECT
user_id,
revenue * 1.1 AS adjusted_revenue,
LOWER(email) AS email_lower
FROM base
"""
# Traverse all scopes (including CTEs, subqueries, root query)
for scope in traverse_scope(sqlglot.parse_one(sql)):
# Iterate through each projected column in SELECT
for proj in scope.expression.selects:
# Analyze column dependency here...
Analyzing columns
For each selected column in a scope, Recce performs the following steps to build its lineage and assign a transformation type:
-
Extract column references from the column’s AST.
Recce identifies all referenced columns used in the expression.
-
Resolve dependencies for each reference:
- If the reference points to an external model (for example, from a ref()), Recce links it to the corresponding upstream column.
- If the reference comes from another scope within the same model (such as a CTE), Recce inherits the dependencies and transformation type from that previously analyzed column.
-
Classify the transformation type based on the structure of the expression:
- If the column is a pass-through, it inherits the upstream transformation type unchanged.
- If it is renamed, it inherits the transformation type from the upstream column, but is tagged as renamed only if the upstream column is a pass-through.
- Otherwise, Recce classifies the column as derived, source, or unknown, depending on whether the expression includes calculations, literals, or unsupported logic.
Key challenges
Here are a few key challenges we’ve tackled while building this feature:
- SELECT * Expansion: When encountering
SELECT *
, Recce performs column expansion by consulting the schema of the source model. This allows us to resolve into the explicit list of columns at analysis time, which is critical for downstream lineage accuracy. - Name Collisions: When columns share the same name across different scopes, we prioritize based on scope resolution.
- Macro Expansion: We preprocess each model with dbt to expand macros before analyzing the SQL. This allows us to evaluate the resulting SQL with full context and perform accurate semantic analysis.
Limitations
Recce currently focuses on column-to-column dependencies, meaning it tracks how individual columns are passed or transformed between models.
Recce does not capture model-to-column relationships. For example, when a column is used in a downstream model’s WHERE, GROUP BY, or ORDER BY clauses. These are semantic dependencies that affect model logic, even if the column isn’t directly selected or transformed.
In the below example, the orders
model depends on stg_orders.order_status
. Recce does not yet visualize this usage in the column-level lineage graph.
# models/marts/orders.sql
SELECT
user_id,
order_total * 1.05 AS order_total_with_tax,
order_status
FROM {{ ref("stg_orders") }}
WHERE order_status != 'removed'
Support for this is coming soon. This will provide a more complete view of how columns influence the behavior and logic of downstream models.
Closing Thoughts
Column-level lineage might sound like a niche feature, but it’s foundational for impact analysis, data quality checks, and debugging complex models.
We’re continuing to expand our support for more SQL dialects, macros, and edge cases. If you’re interested in contributing or have feedback, we’d love to hear from you.