
Build with complete SQL lineage

clpipe (Column-Level Pipeline) delivers complete lineage by default. It parses your SQL once and automatically builds the full graph - tables, columns, transformations, and every dependency.

Wish you had column-level lineage? It's already built in.


Your SQL Already Contains Everything

You write SQL files. Tables, columns, transformations, joins - it's all there in your code.

We parse it once. You get the complete graph.

from clpipe import Pipeline

# Point to your SQL files
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# That's it. Complete graph built automatically.

What's in the graph:

  • Tables and their dependencies
  • Columns and how they flow
  • Transformations (SUM, JOIN, CASE, COALESCE)
  • Expressions and calculations
  • Metadata from inline comments (descriptions, PII, ownership)
  • Everything your SQL already describes

Column lineage built-in. No upgrades. No extra cost.
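Conceptually, the graph is just columns as nodes plus labeled, directed edges between them. A minimal Python sketch of that idea (illustrative only - the `ColumnGraph` class and names here are hypothetical, not clpipe's internal representation):

```python
from dataclasses import dataclass, field

@dataclass
class ColumnGraph:
    # each edge: (source_column, target_column, transformation label)
    edges: list = field(default_factory=list)

    def add_edge(self, src, dst, transform="direct"):
        self.edges.append((src, dst, transform))

    def columns(self):
        # every column mentioned on either side of an edge
        return {c for src, dst, _ in self.edges for c in (src, dst)}

graph = ColumnGraph()
graph.add_edge("raw.orders.amount", "staging.orders.amount")
graph.add_edge("staging.orders.amount", "analytics.revenue.total", "SUM(amount)")

print(len(graph.columns()))  # 3
```

Everything else - tracing, impact analysis, metadata propagation - is a traversal or annotation over a structure like this.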


Your Lineage, Your Control

Other tools lock your lineage in their platform. We give it to you.

from clpipe import Pipeline

# Your lineage lives in your code
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# It's yours - use it however you want
lineage_json = pipeline.to_json()      # Export anywhere
lineage_df = pipeline.to_dataframe()   # Analyze in pandas
metadata = pipeline.columns             # Query directly

# Integrate with any tool you choose
dag = pipeline.to_airflow_dag(...)                   # Airflow
# Custom integration with dbt, Dagster, or your own tools

No SaaS lock-in. No forced subscriptions. No vendor dependency.

You choose your tools:

  • Airflow - Built-in support
  • Your orchestrator - Bring your own (dbt, Dagster, Prefect)
  • Your data catalog - Export and integrate
  • Your custom tooling - Full API access

Your lineage stays with you, not locked in our platform.


What You Get Automatically

Without clpipe:

Question: "Where does dashboard.metrics.total_revenue come from?"

What you know:

dashboard.metrics ← analytics.revenue + dim.customers
analytics.revenue ← staging.orders
staging.orders ← raw.orders

What you DON'T know:

  • Which columns flow through?
  • What transformations happened?
  • Where did PII originate?
  • What breaks if I change a column?

Time to find answers: Hours of manual SQL reading

With clpipe:

Question: "Where does dashboard.metrics.total_revenue come from?"

What you know:

from clpipe import Pipeline

pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# Trace any column instantly
sources = pipeline.trace_column_backward(
    "dashboard.metrics", "total_revenue"
)

# Output:
# raw.orders.amount → SUM(amount) → analytics.revenue.total → dashboard.metrics.total_revenue

What you ALSO know (built-in):

  • ✅ Ultimate source: raw.orders.amount
  • ✅ Transformation: SUM(amount)
  • ✅ All intermediate steps
  • ✅ PII propagation path
  • ✅ Exact impact of changes

Time: 0.003 seconds
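The backward trace above amounts to walking the lineage edges in reverse from the target column. A minimal sketch of that traversal, using hypothetical edges and names (not clpipe's implementation):

```python
from collections import deque

# each edge: (source_column, target_column, transformation) - illustrative data
edges = [
    ("raw.orders.amount", "analytics.revenue.total", "SUM(amount)"),
    ("analytics.revenue.total", "dashboard.metrics.total_revenue", "direct"),
]

def trace_backward(target):
    # index edges by their target so we can walk upstream
    parents = {}
    for src, dst, t in edges:
        parents.setdefault(dst, []).append((src, t))
    steps, queue, seen = [], deque([target]), {target}
    while queue:
        col = queue.popleft()
        for src, t in parents.get(col, []):
            steps.append((src, t, col))
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return steps

for src, t, dst in trace_backward("dashboard.metrics.total_revenue"):
    print(f"{src} --{t}--> {dst}")
```

A breadth-first walk like this finds every upstream source and the transformation on each hop, which is why the answer takes milliseconds instead of hours of SQL reading.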


Beyond Traditional Lineage

The complete SQL graph unlocks capabilities far beyond just tracing columns:

Automatic Metadata Propagation

Document where you define. Extract automatically.

# Add metadata in your SQL comments
sql = """
SELECT
  user_id,  -- User ID [pii: false]
  email,    -- Email [pii: true, owner: data-team]
  SUM(revenue) as total  /* Total revenue [tags: metric] */
FROM users
GROUP BY user_id, email
"""

pipeline = Pipeline([("query", sql)], dialect="bigquery")

# Metadata extracted automatically
email = pipeline.columns["query.email"]
print(email.description)  # "Email"
print(email.pii)          # True
print(email.owner)        # "data-team"

# Or set programmatically and propagate
pipeline.columns["raw.users.phone"].pii = True
pipeline.propagate_all_metadata()

# Query anywhere
pii_columns = pipeline.get_pii_columns()
metrics = pipeline.get_columns_by_tag("metric")

Governance that scales with your pipeline, not against it.
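The bracketed annotations above (`[pii: true, owner: data-team]`) are simple enough to extract with a regular expression. A sketch of how such a comment could be split into a description and key/value metadata (illustrative only, not clpipe's actual parser):

```python
import re

def parse_column_comment(comment):
    """Split 'Email [pii: true, owner: data-team]' into description + metadata."""
    meta = {}
    m = re.search(r"\[([^\]]+)\]\s*$", comment)
    description = comment
    if m:
        description = comment[: m.start()].strip()
        for pair in m.group(1).split(","):
            key, _, value = pair.partition(":")
            value = value.strip()
            # coerce booleans, keep everything else as a string
            meta[key.strip()] = {"true": True, "false": False}.get(value.lower(), value)
    return description, meta

desc, meta = parse_column_comment("Email [pii: true, owner: data-team]")
print(desc, meta)  # Email {'pii': True, 'owner': 'data-team'}
```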

Precise Impact Analysis

Know exactly what breaks. Not "probably everything".

# Trace forward from ANY column
affected = pipeline.trace_column_forward(
    "raw.orders", "amount"
)

# Returns exact downstream impact:
# - staging.orders.total_amount (SUM aggregation)
# - analytics.revenue.total_revenue (direct flow)
# - dashboard.metrics.avg_order_value (calculation input)

Change with confidence.
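Forward impact analysis is the same graph walk in the other direction: collect the full downstream closure of a column. A sketch with illustrative names (not clpipe's implementation):

```python
# downstream adjacency: column -> columns that consume it (illustrative data)
edges = {
    "raw.orders.amount": ["staging.orders.total_amount"],
    "staging.orders.total_amount": ["analytics.revenue.total_revenue"],
    "analytics.revenue.total_revenue": ["dashboard.metrics.avg_order_value"],
}

def downstream(column):
    # depth-first walk collecting everything reachable from `column`
    affected, stack = set(), [column]
    while stack:
        for child in edges.get(stack.pop(), []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected

print(sorted(downstream("raw.orders.amount")))
# ['analytics.revenue.total_revenue', 'dashboard.metrics.avg_order_value', 'staging.orders.total_amount']
```

The closure is exact: if a column is not in the returned set, changing the source cannot break it.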

Context-Aware LLM Documentation

The graph provides context. LLMs provide clarity.

from langchain_openai import ChatOpenAI

pipeline.llm = ChatOpenAI(model="gpt-4")
pipeline.generate_all_descriptions()

# LLM sees:
# - Source columns and types
# - Transformations (SUM, JOIN, CASE)
# - Filter conditions (WHERE clauses)
# - Aggregation logic (GROUP BY)

Documentation that understands your data flow.

Text-to-SQL Without Hallucination

The graph provides real context. No more made-up table names.

# Your graph knows:
# - Actual table names
# - Actual column names
# - Actual relationships
# - Actual transformations

# LLM query with context
query = "Show me total revenue by customer"

# Graph provides:
# - Tables: raw.orders, staging.orders, analytics.customer_metrics
# - Columns: customer_id, amount, total_amount
# - Relationships: customer_id joins, amount → SUM(amount)

# Result: Accurate SQL, no hallucinated columns

Context-aware queries. Zero hallucination.
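One way this works in practice: serialize the graph's real tables and columns into a context block that constrains the LLM prompt. A sketch with a hypothetical schema (not clpipe's API):

```python
# table -> columns, as a lineage graph could report them (illustrative names)
schema = {
    "raw.orders": ["customer_id", "amount"],
    "analytics.customer_metrics": ["customer_id", "total_amount"],
}

def schema_context(schema):
    # render the known tables/columns as a constraint for the prompt
    lines = ["You may ONLY use these tables and columns:"]
    for table, cols in sorted(schema.items()):
        lines.append(f"- {table}({', '.join(cols)})")
    return "\n".join(lines)

prompt = schema_context(schema) + "\n\nQuestion: Show me total revenue by customer"
print(prompt)
```

Because the model only ever sees names that actually exist in the graph, it has nothing to hallucinate from.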

Pipeline Execution

Graph → DAG. Deploy anywhere.

# Synchronous execution
results = pipeline.run(executor=my_executor, max_workers=4)

# Async execution
results = await pipeline.async_run(executor=my_async_executor)

# Airflow DAG
dag = pipeline.to_airflow_dag(
    executor=my_executor,
    dag_id="my_pipeline",
    schedule="@daily"
)

Write once. Execute everywhere.
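Under the hood, turning the graph into an executable DAG means ordering tables so every upstream runs first - i.e. a topological sort. A sketch using Kahn's algorithm with illustrative dependencies (not clpipe's implementation):

```python
from collections import deque

deps = {  # table -> upstream tables it reads from (illustrative names)
    "staging.orders": ["raw.orders"],
    "analytics.revenue": ["staging.orders"],
    "dashboard.metrics": ["analytics.revenue", "dim.customers"],
}

def execution_order(deps):
    tables = set(deps) | {u for ups in deps.values() for u in ups}
    indegree = {t: 0 for t in tables}
    children = {t: [] for t in tables}
    for table, ups in deps.items():
        indegree[table] = len(ups)
        for u in ups:
            children[u].append(table)
    # repeatedly run whatever has no unmet dependencies
    order, ready = [], deque(sorted(t for t in tables if indegree[t] == 0))
    while ready:
        t = ready.popleft()
        order.append(t)
        for child in children[t]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order

print(execution_order(deps))
```

Any executor - threads, asyncio, or an Airflow DAG - just needs to respect this order (and can run independent tables in parallel).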


Real-World Impact

  • 90% Time Savings: Hours → Seconds for tracing data issues
  • 10x Faster Documentation: LLM-powered generation with lineage context
  • 100% Compliance: Automatic PII propagation
  • ~0% Additional Cost: Point to your existing SQL, get everything


Get Started

# Basic installation
pip install clpipe

# With LLM support
pip install "clpipe[llm]"   # quotes keep your shell from globbing the brackets

from clpipe import Pipeline

# Parse your SQL pipeline
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# The complete graph is built automatically
# - Table lineage: pipeline.table_graph
# - Column lineage: pipeline.columns, pipeline.edges

# Trace any column
sources = pipeline.trace_column_backward("final_table", "metric")
affected = pipeline.trace_column_forward("source_table", "field")
pii_columns = pipeline.get_pii_columns()



What Makes clpipe Different?

| Feature                 | Traditional Tools | clpipe        |
| ----------------------- | ----------------- | ------------- |
| Table lineage           | ✅                | ✅            |
| Column lineage          | ❌ or Premium     | ✅ By Default |
| Transformation tracking | ❌                | ✅            |
| Metadata propagation    | ❌                | ✅            |
| PII tracking            | Manual            | Automatic     |
| Impact analysis         | Vague             | Exact         |
| LLM documentation       | ❌                | ✅            |
| Additional cost         | High              | ~0%           |

Built by data engineers, for data engineers.

View on GitHub