
Build with complete SQL lineage

clpipe (Column-Level Pipeline) delivers complete lineage by default. It parses your SQL once and automatically builds the full graph - tables, columns, transformations, and every dependency.

Wish you had column-level lineage? It's already built in.


Your SQL Already Contains Everything

You write SQL files. Tables, columns, transformations, joins - it's all there in your code.

We parse it once. You get the complete graph.

from clpipe import Pipeline

# Point to your SQL files
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# That's it. Complete graph built automatically.

What's in the graph:

  • Tables and their dependencies
  • Columns and how they flow
  • Transformations (SUM, JOIN, CASE, COALESCE)
  • Expressions and calculations
  • Metadata from inline comments (descriptions, PII, ownership)
  • Everything your SQL already describes

Column lineage built-in. No upgrades. No extra cost.
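Conceptually, the graph is just columns as nodes plus labeled, directed edges between them. A minimal Python sketch of that idea (illustrative only - the `ColumnGraph` class and names here are hypothetical, not clpipe's internal representation):

```python
from dataclasses import dataclass, field

@dataclass
class ColumnGraph:
    # each edge: (source_column, target_column, transformation label)
    edges: list = field(default_factory=list)

    def add_edge(self, src, dst, transform="direct"):
        self.edges.append((src, dst, transform))

    def columns(self):
        # every column mentioned on either side of an edge
        return {c for src, dst, _ in self.edges for c in (src, dst)}

graph = ColumnGraph()
graph.add_edge("raw.orders.amount", "staging.orders.amount")
graph.add_edge("staging.orders.amount", "analytics.revenue.total", "SUM(amount)")

print(len(graph.columns()))  # 3
```

Everything else - tracing, impact analysis, metadata propagation - is a traversal or annotation over a structure like this.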


Your Lineage, Your Control

Other tools lock your lineage in their platform. We give it to you.

from clpipe import Pipeline

# Your lineage lives in your code
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# It's yours - use it however you want
lineage_json = pipeline.to_json()      # Export anywhere
lineage_df = pipeline.to_dataframe()   # Analyze in pandas
metadata = pipeline.columns             # Query directly

# Integrate with any tool you choose
dag = pipeline.to_airflow_dag(...)                   # Airflow
# Custom integration with dbt, Dagster, or your own tools

No SaaS lock-in. No forced subscriptions. No vendor dependency.

You choose your tools:

  • Airflow - Built-in support
  • Your orchestrator - Bring your own (dbt, Dagster, Prefect)
  • Your data catalog - Export and integrate
  • Your custom tooling - Full API access

Your lineage stays with you, not locked in our platform.


What You Get Automatically

Without clpipe:

Question: "Where does dashboard.metrics.total_revenue come from?"

What you know:

dashboard.metrics ← analytics.revenue + dim.customers
analytics.revenue ← staging.orders
staging.orders ← raw.orders

What you DON'T know:

  • Which columns flow through?
  • What transformations happened?
  • Where did PII originate?
  • What breaks if I change a column?

Time to find answers: Hours of manual SQL reading

With clpipe:

Question: "Where does dashboard.metrics.total_revenue come from?"

What you know:

from clpipe import Pipeline

pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# Trace any column instantly
sources = pipeline.trace_column_backward(
    "dashboard.metrics", "total_revenue"
)

# Output:
# raw.orders.amount → SUM(amount) → analytics.revenue.total → dashboard.metrics.total_revenue

What you ALSO know (built-in):

  • ✅ Ultimate source: raw.orders.amount
  • ✅ Transformation: SUM(amount)
  • ✅ All intermediate steps
  • ✅ PII propagation path
  • ✅ Exact impact of changes

Time: 0.003 seconds
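The backward trace above amounts to walking the lineage edges in reverse from the target column. A minimal sketch of that traversal, using hypothetical edges and names (not clpipe's implementation):

```python
from collections import deque

# each edge: (source_column, target_column, transformation) - illustrative data
edges = [
    ("raw.orders.amount", "analytics.revenue.total", "SUM(amount)"),
    ("analytics.revenue.total", "dashboard.metrics.total_revenue", "direct"),
]

def trace_backward(target):
    # index edges by their target so we can walk upstream
    parents = {}
    for src, dst, t in edges:
        parents.setdefault(dst, []).append((src, t))
    steps, queue, seen = [], deque([target]), {target}
    while queue:
        col = queue.popleft()
        for src, t in parents.get(col, []):
            steps.append((src, t, col))
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return steps

for src, t, dst in trace_backward("dashboard.metrics.total_revenue"):
    print(f"{src} --{t}--> {dst}")
```

A breadth-first walk like this finds every upstream source and the transformation on each hop, which is why the answer takes milliseconds instead of hours of SQL reading.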


Beyond Traditional Lineage

The complete SQL graph unlocks capabilities far beyond just tracing columns:

Automatic Metadata Propagation

Document where you define. Extract automatically.

# Add metadata in your SQL comments
sql = """
SELECT
  user_id,  -- User ID [pii: false]
  email,    -- Email [pii: true, owner: data-team]
  SUM(revenue) as total  /* Total revenue [tags: metric] */
FROM users
GROUP BY user_id, email
"""

pipeline = Pipeline([("query", sql)], dialect="bigquery")

# Metadata extracted automatically
email = pipeline.columns["query.email"]
print(email.description)  # "Email"
print(email.pii)          # True
print(email.owner)        # "data-team"

# Or set programmatically and propagate
pipeline.columns["raw.users.phone"].pii = True
pipeline.propagate_all_metadata()

# Query anywhere
pii_columns = pipeline.get_pii_columns()
metrics = pipeline.get_columns_by_tag("metric")

Governance that scales with your pipeline, not against it.
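The bracketed annotations above (`[pii: true, owner: data-team]`) are simple enough to extract with a regular expression. A sketch of how such a comment could be split into a description and key/value metadata (illustrative only, not clpipe's actual parser):

```python
import re

def parse_column_comment(comment):
    """Split 'Email [pii: true, owner: data-team]' into description + metadata."""
    meta = {}
    m = re.search(r"\[([^\]]+)\]\s*$", comment)
    description = comment
    if m:
        description = comment[: m.start()].strip()
        for pair in m.group(1).split(","):
            key, _, value = pair.partition(":")
            value = value.strip()
            # coerce booleans, keep everything else as a string
            meta[key.strip()] = {"true": True, "false": False}.get(value.lower(), value)
    return description, meta

desc, meta = parse_column_comment("Email [pii: true, owner: data-team]")
print(desc, meta)  # Email {'pii': True, 'owner': 'data-team'}
```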

Precise Impact Analysis

Know exactly what breaks. Not "probably everything".

# Trace forward from ANY column
affected = pipeline.trace_column_forward(
    "raw.orders", "amount"
)

# Returns exact downstream impact:
# - staging.orders.total_amount (SUM aggregation)
# - analytics.revenue.total_revenue (direct flow)
# - dashboard.metrics.avg_order_value (calculation input)

Change with confidence.
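Forward impact analysis is the same graph walk in the other direction: collect the full downstream closure of a column. A sketch with illustrative names (not clpipe's implementation):

```python
# downstream adjacency: column -> columns that consume it (illustrative data)
edges = {
    "raw.orders.amount": ["staging.orders.total_amount"],
    "staging.orders.total_amount": ["analytics.revenue.total_revenue"],
    "analytics.revenue.total_revenue": ["dashboard.metrics.avg_order_value"],
}

def downstream(column):
    # depth-first walk collecting everything reachable from `column`
    affected, stack = set(), [column]
    while stack:
        for child in edges.get(stack.pop(), []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected

print(sorted(downstream("raw.orders.amount")))
# ['analytics.revenue.total_revenue', 'dashboard.metrics.avg_order_value', 'staging.orders.total_amount']
```

The closure is exact: if a column is not in the returned set, changing the source cannot break it.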

Context-Aware LLM Documentation

The graph provides context. LLMs provide clarity.

from langchain_openai import ChatOpenAI

pipeline.llm = ChatOpenAI(model="gpt-4")
pipeline.generate_all_descriptions()

# LLM sees:
# - Source columns and types
# - Transformations (SUM, JOIN, CASE)
# - Filter conditions (WHERE clauses)
# - Aggregation logic (GROUP BY)

Documentation that understands your data flow.

Text-to-SQL Without Hallucination

The graph provides real context. No more made-up table names.

# Your graph knows:
# - Actual table names
# - Actual column names
# - Actual relationships
# - Actual transformations

# LLM query with context
query = "Show me total revenue by customer"

# Graph provides:
# - Tables: raw.orders, staging.orders, analytics.customer_metrics
# - Columns: customer_id, amount, total_amount
# - Relationships: customer_id joins, amount → SUM(amount)

# Result: Accurate SQL, no hallucinated columns

Context-aware queries. Zero hallucination.
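One way this works in practice: serialize the graph's real tables and columns into a context block that constrains the LLM prompt. A sketch with a hypothetical schema (not clpipe's API):

```python
# table -> columns, as a lineage graph could report them (illustrative names)
schema = {
    "raw.orders": ["customer_id", "amount"],
    "analytics.customer_metrics": ["customer_id", "total_amount"],
}

def schema_context(schema):
    # render the known tables/columns as a constraint for the prompt
    lines = ["You may ONLY use these tables and columns:"]
    for table, cols in sorted(schema.items()):
        lines.append(f"- {table}({', '.join(cols)})")
    return "\n".join(lines)

prompt = schema_context(schema) + "\n\nQuestion: Show me total revenue by customer"
print(prompt)
```

Because the model only ever sees names that actually exist in the graph, it has nothing to hallucinate from.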

Pipeline Execution

Graph → DAG. Deploy anywhere.

# Synchronous execution
results = pipeline.run(executor=my_executor, max_workers=4)

# Async execution
results = await pipeline.async_run(executor=my_async_executor)

# Airflow DAG
dag = pipeline.to_airflow_dag(
    executor=my_executor,
    dag_id="my_pipeline",
    schedule="@daily"
)

Write once. Execute everywhere.
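Under the hood, turning the graph into an executable DAG means ordering tables so every upstream runs first - i.e. a topological sort. A sketch using Kahn's algorithm with illustrative dependencies (not clpipe's implementation):

```python
from collections import deque

deps = {  # table -> upstream tables it reads from (illustrative names)
    "staging.orders": ["raw.orders"],
    "analytics.revenue": ["staging.orders"],
    "dashboard.metrics": ["analytics.revenue", "dim.customers"],
}

def execution_order(deps):
    tables = set(deps) | {u for ups in deps.values() for u in ups}
    indegree = {t: 0 for t in tables}
    children = {t: [] for t in tables}
    for table, ups in deps.items():
        indegree[table] = len(ups)
        for u in ups:
            children[u].append(table)
    # repeatedly run whatever has no unmet dependencies
    order, ready = [], deque(sorted(t for t in tables if indegree[t] == 0))
    while ready:
        t = ready.popleft()
        order.append(t)
        for child in children[t]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order

print(execution_order(deps))
```

Any executor - threads, asyncio, or an Airflow DAG - just needs to respect this order (and can run independent tables in parallel).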


Real-World Impact

  • 90% Time Savings: Hours → Seconds for tracing data issues
  • 10x Faster Documentation: LLM-powered generation with lineage context
  • 100% Compliance: Automatic PII propagation
  • ~0% Additional Cost: Point to your existing SQL, get everything


Get Started

# Basic installation
pip install clpipe

# With LLM support
pip install "clpipe[llm]"   # quotes keep your shell from globbing the brackets

from clpipe import Pipeline

# Parse your SQL pipeline
pipeline = Pipeline.from_sql_files("queries/", dialect="bigquery")

# The complete graph is built automatically
# - Table lineage: pipeline.table_graph
# - Column lineage: pipeline.columns, pipeline.edges

# Trace any column
sources = pipeline.trace_column_backward("final_table", "metric")
affected = pipeline.trace_column_forward("source_table", "field")
pii_columns = pipeline.get_pii_columns()



What Makes clpipe Different?

| Feature                 | Traditional Tools | clpipe        |
| ----------------------- | ----------------- | ------------- |
| Table lineage           | ✅                | ✅            |
| Column lineage          | ❌ or Premium     | ✅ By Default |
| Transformation tracking | ❌                | ✅            |
| Metadata propagation    | ❌                | ✅            |
| PII tracking            | Manual            | Automatic     |
| Impact analysis         | Vague             | Exact         |
| LLM documentation       | ❌                | ✅            |
| Additional cost         | High              | ~0%           |

Built by data engineers, for data engineers.

View on GitHub