How Xorq works
Imagine writing Python code that doesn’t run immediately but instead builds a blueprint of your computation. Xorq captures this blueprint as a versioned manifest that can execute on any supported engine. This architecture supports optimization, caching, and reuse across different backends without rewriting code for each engine.
What you’ll understand
After reading this page, you’ll understand:
- How Xorq processes computations from code to results
- What happens during expression building versus execution
- How manifests enable portability and versioning
- Where optimization and caching occur in the pipeline
What is Xorq’s architecture?
Xorq consists of six core components that work together to process computations from Python code to results:
- Expression Graph: In-memory representation of your computation as a directed acyclic graph of operations. Built in Python, engine-independent.
- Compiler: Transforms expression graphs into portable YAML manifests. Generates content hashes and produces four artifacts: expr.yaml, profiles.yaml, deferred_reads.yaml, and metadata.json.
- Manifest: Declarative, engine-independent representation of your computation stored as YAML. This is what gets versioned, cached, and executed.
- Catalog: Registry that stores and versions manifests with human-readable aliases. Supports discovery and reuse across teams.
- Executor: Loads manifests, compiles them to backend-specific SQL, checks caches, and runs queries on target engines.
- Backend Engines: The actual compute engines where SQL executes and data lives. Examples include DuckDB, Snowflake, and PostgreSQL.
The key architectural principle: Xorq separates what you want to compute, captured in the manifest, from how to compute it, determined by the executor for each backend.
How Xorq processes computations
Xorq operates as a four-stage pipeline: expression building, manifest compilation, catalog registration, and execution. Each stage transforms your code step by step from Python expressions into executable results.
Where optimization happens
Xorq optimizes at three points in the pipeline:
- During graph optimization, when `.execute()` is called: Operations fuse automatically. For example, consecutive filters merge into one.
- During manifest compilation: The compiler eliminates dead code and simplifies expressions.
- During execution: The target engine’s optimizer handles SQL-level optimizations.
Xorq optimizes at both the logical level through the expression graph and at the physical level through the SQL execution plan.
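To make the filter-fusion idea concrete, here is a minimal, hypothetical sketch (not Xorq's actual optimizer) that merges consecutive filter nodes in a toy expression graph into a single node with a combined predicate:

```python
from dataclasses import dataclass

# Hypothetical mini-IR: a table scan with filter nodes chained on top of it.
@dataclass
class Scan:
    name: str

@dataclass
class Filter:
    child: object
    predicate: str

def fuse_filters(node):
    """Merge runs of consecutive Filter nodes into one AND-ed predicate."""
    if isinstance(node, Filter):
        child = fuse_filters(node.child)
        if isinstance(child, Filter):
            # Two stacked filters become one filter with a conjunction.
            return Filter(child.child, f"({child.predicate}) AND ({node.predicate})")
        return Filter(child, node.predicate)
    return node

expr = Filter(Filter(Scan("transactions"), "amount > 100"), "category = 'A'")
fused = fuse_filters(expr)
print(fused.predicate)  # (amount > 100) AND (category = 'A')
```

The engine then sees one filter instead of two passes over the data; a SQL optimizer would do the same merge, but doing it at the graph level works for every backend.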
Xorq uses Apache Arrow for zero-copy data transfers between engines. When you move data from DuckDB to PostgreSQL, Arrow’s columnar format avoids serialization overhead.
Stage 1: Expression building
When you write Xorq code, operations don’t execute immediately. Instead, Xorq builds an expression graph that represents your computation.
What happens: Each operation creates a node in the graph. Filter, join, and aggregate operations all become nodes. These nodes track dependencies, schemas, and backend information. No data moves, no queries run.
Why this matters: Deferred execution gives Xorq visibility into your entire pipeline before running anything. This allows optimization that’s impossible when operations execute immediately.
```python
import xorq.api as xo

# Create sample data (no external files needed)
con = xo.connect()
data = xo.memtable({
    "amount": [50, 150, 200, 75, 300],
    "category": ["A", "B", "A", "B", "A"]
}, name="transactions")

# These operations build a graph, but they don't execute.
filtered = data.filter(xo._.amount > 100)
result = filtered.group_by("category").agg(total=xo._.amount.sum())

# Still no execution — just a graph in memory.
print(type(result))  # Table expression type
```

Here's the key insight: The expression graph is an intermediate representation (IR) that's independent of any specific engine. This IR is what makes multi-engine execution possible.
Expression building is fast because no computation happens. You’re just creating a data structure that describes what to compute.
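As an illustration of that point, a deferred operation can be modeled as a small record that remembers only its inputs and output schema. This is a toy model with invented names, not Xorq's internal classes:

```python
# Toy deferred-expression node: building it records *what* to compute,
# but never touches data. (Illustrative only; Xorq's real IR differs.)
class Node:
    def __init__(self, op, schema, inputs=()):
        self.op = op            # operation name, e.g. "filter"
        self.schema = schema    # column name -> type
        self.inputs = list(inputs)

    def filter(self, predicate):
        # A filter leaves the schema unchanged; no rows are scanned here.
        return Node("filter", self.schema, [self])

table = Node("scan", {"amount": "int64", "category": "string"})
filtered = table.filter("amount > 100")

# Building was instant: we only allocated two small objects.
print(filtered.op, filtered.schema)
```

Because each node records its dependencies, a later stage can walk the whole graph before anything runs.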
Stage 2: Manifest compilation
When you run `xorq build`, Xorq compiles your expression graph into a YAML manifest. This manifest is a declarative, engine-independent representation of your computation.
What happens: The compiler walks your expression graph and generates YAML that captures operations, dependencies, schemas, and metadata. Each node gets a content hash based on its computation logic.
Why this matters: The manifest is the source of truth. It’s what gets versioned, cached, and executed. Two developers building the same expression get identical manifests with the same hash, which allows automatic reuse.
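The content-addressing idea can be sketched with the standard library. This is a simplified stand-in for Xorq's actual hashing scheme: serialize the operation canonically, then hash the bytes, so identical logic always yields an identical identifier:

```python
import hashlib
import json

def content_hash(op: dict) -> str:
    # Canonical serialization: sorted keys ensure the same logic always
    # produces the same bytes, regardless of dict insertion order.
    canonical = json.dumps(op, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

# Two developers describing the same filter get the same hash...
a = content_hash({"op": "Filter", "predicate": {"left": "amount", "right": 100}})
b = content_hash({"predicate": {"right": 100, "left": "amount"}, "op": "Filter"})
assert a == b

# ...while any change to the logic changes the hash.
c = content_hash({"op": "Filter", "predicate": {"left": "amount", "right": 200}})
assert a != c
```

The same-logic-same-hash property is what makes cache sharing between developers safe.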
```yaml
# Simplified manifest snippet
filtered_data:
  op: Filter
  kwargs:
    table: source_data
    predicates:
      - op: Greater
        left: amount
        right: 100
  schema:
    amount: int64
    category: string
  hash: a3f5c9d2e1b4...
```

The manifest includes four critical artifacts:
- `expr.yaml`: Complete expression definition with all operations and schemas
- `profiles.yaml`: Backend connection configurations showing which engines to use
- `deferred_reads.yaml`: Information about data sources that load at execution time
- `metadata.json`: Build timestamp, Xorq version, and dependency information
Think of it this way: The manifest is like a recipe. It tells you what ingredients you need, what steps to follow, and what you’ll get. But it doesn’t cook the meal.
Stage 3: Catalog registration
After building a manifest, you can register it in the catalog with a human-readable alias. The catalog is your team’s shared ledger of computations.
What happens: You run `xorq catalog add builds/<hash> --alias feature-pipeline`. The catalog stores the mapping between aliases and build hashes, supporting discovery and reuse.
Why this matters: Without the catalog, you’d need to remember or share long content hashes. With it, you reference computations by name and let Xorq handle versioning.
```shell
# Register a build
xorq catalog add builds/a3f5c9d2 --alias fraud-features

# Discover what exists
xorq catalog ls
# Output:
# Aliases:
#   fraud-features     a3f5c9d2  r2
#   customer-features  b1e4d7a9  r1
```

The catalog tracks three things:
- Aliases: Human-readable names for builds, for example `fraud-features`
- Build hashes: Content-addressed identifiers for exact computations
- Revisions: Version numbers like r1, r2, or r3 when you update an alias
This supports powerful workflows. If someone on your team already computed fraud-features, you can reuse their cached results automatically. The hash ensures you’re getting exactly the same computation.
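A minimal sketch of this alias-to-hash mapping (hypothetical structure, not Xorq's storage format): registering the same alias again bumps the revision, while every revision's hash still pins an exact computation:

```python
class Catalog:
    """Toy catalog: maps each alias to a list of build hashes, one per revision."""
    def __init__(self):
        self.entries = {}

    def add(self, alias, build_hash):
        revisions = self.entries.setdefault(alias, [])
        revisions.append(build_hash)
        return f"r{len(revisions)}"

    def resolve(self, alias):
        # The latest revision wins; the hash still identifies the exact build.
        revisions = self.entries[alias]
        return revisions[-1], f"r{len(revisions)}"

catalog = Catalog()
catalog.add("fraud-features", "a3f5c9d2")   # r1
catalog.add("fraud-features", "e7b2c1f0")   # r2: alias updated
print(catalog.resolve("fraud-features"))    # ('e7b2c1f0', 'r2')
```

Separating the stable name (alias) from the immutable identity (hash) is what lets teams update pipelines without breaking reproducibility.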
Stage 4: Execution
When you run `xorq run builds/<hash>` or call `.execute()` in Python, Xorq executes the manifest by compiling it to backend-specific SQL and running it on your target engine.
What happens: The executor reads the YAML manifest, generates optimized SQL for your target backend, checks the cache for existing results, and runs queries only when needed. DuckDB SQL differs from Snowflake SQL, so Xorq generates the appropriate dialect for each backend.
What we’re executing: The manifest stored as YAML files, not the original Python code. The manifest gets compiled to SQL, which executes on the backend engine and produces query results as data.
Why this matters: This is where backend-specific optimization happens. Xorq can push operations to the engine, eliminate unnecessary steps, and reuse cached results. Your Python code is long gone — only the manifest matters now.
The execution stage involves four steps:
1. Manifest loading: Read YAML files and reconstruct the expression graph.
2. Cache checking: Look for cached results based on content hash.
3. SQL compilation: Generate engine-specific SQL for the target backend.
4. Query execution: Run SQL on the target backend and return results.
Here’s the key insight: Because the manifest captures computation logic rather than data, the same manifest can execute on different engines. Xorq generates different SQL for each backend.
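The "same manifest, different SQL" step can be illustrated with a toy compiler. The names and templates below are invented for illustration; real dialect differences are far richer than a single LIMIT clause:

```python
# Toy dialect-aware SQL generation: one logical op, per-backend rendering.
LIMIT_SYNTAX = {
    "duckdb": "SELECT {cols} FROM {table} LIMIT {n}",
    # T-SQL-style dialects use TOP instead of LIMIT.
    "tsql": "SELECT TOP {n} {cols} FROM {table}",
}

def compile_limit(backend, table, cols, n):
    """Render the same logical 'take n rows' op in a backend's dialect."""
    return LIMIT_SYNTAX[backend].format(cols=", ".join(cols), table=table, n=n)

print(compile_limit("duckdb", "transactions", ["amount"], 10))
# SELECT amount FROM transactions LIMIT 10
print(compile_limit("tsql", "transactions", ["amount"], 10))
# SELECT TOP 10 amount FROM transactions
```

The logical operation is fixed by the manifest; only the rendering varies per backend, which is why no Python code needs to change when you switch engines.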
How the stages connect
The four stages form a pipeline where each stage’s output feeds the next:
- Expression building → Manifest compilation: Python code becomes YAML artifacts.
- Manifest compilation → Catalog registration: YAML artifacts get human-readable names.
- Catalog registration → Execution: Named computations run on demand with caching.
This pipeline provides three critical capabilities:
- Portability: The manifest is engine-independent, so you can switch backends without changing code.
- Versioning: Content hashes identify exact computations, supporting precise version control.
- Reuse: If anyone computed this before with the same hash, you get cached results automatically.
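The reuse capability is essentially memoization keyed by content hash. A minimal sketch with a hypothetical API (not Xorq's cache interface):

```python
def make_cached_executor(run_query):
    """Wrap an execution function with a content-hash-keyed result cache."""
    cache = {}

    def execute(manifest_hash):
        if manifest_hash in cache:
            return cache[manifest_hash]      # cache hit: no query runs
        result = run_query(manifest_hash)    # cache miss: run and store
        cache[manifest_hash] = result
        return result

    return execute

calls = []
executor = make_cached_executor(lambda h: calls.append(h) or f"rows-for-{h}")
executor("a3f5c9d2")   # computes
executor("a3f5c9d2")   # served from cache
print(len(calls))      # 1 -- the query ran only once
```

Because the cache key is the content hash rather than a filename or alias, anyone who builds the identical expression hits the same cache entry.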
Learning more
If you’re new to Xorq, start with Overview to learn about the high-level architecture and components. Why deferred execution explains how lazy evaluation works in Xorq.
Expression format covers the detailed specifications of YAML manifest structure. Build system explains how xorq build works internally. Compute catalog details how the catalog supports discovery and reuse across teams.
Content-addressed hashing explains how Xorq generates content hashes. Multi-engine execution covers how one manifest runs on multiple backends.