Intelligent caching system

Understand how Xorq’s caching system optimizes performance and enables automatic reuse

You run the same expensive aggregation repeatedly during development while the input data hasn’t changed at all. Recomputing everything from scratch each time wastes minutes or hours that could be spent on actual development. Xorq’s caching system detects when data is unchanged and skips the recomputation automatically, returning cached results instantly.

What you’ll understand

This page explains the following concepts:

  • How Xorq’s caching detects source data changes through modification time tracking for automatic invalidation
  • When automatic invalidation saves development time versus when snapshots prevent invalidation confusion
  • What you gain in instant reuse and automatic change detection versus what you lose in storage overhead and complexity
  • How to choose between SourceCache, ParquetCache, SourceSnapshotCache, and ParquetSnapshotCache based on your invalidation and persistence needs

What is the intelligent caching system?

Xorq’s caching system stores intermediate results from computations so you don’t recompute expensive operations on every execution. When you call .cache() on an expression, Xorq marks it for caching and saves results to reuse them automatically. The system is intelligent because it knows when to invalidate the cache automatically based on source data changes.

If source data changes, Xorq detects this and recomputes with updated data rather than serving stale results. If nothing changed, you get cached results instantly, with no recomputation overhead or query-execution latency.

import xorq.api as xo
from xorq.caching import ParquetCache

con = xo.connect()
data = con.read_parquet("large_file.parquet")

# Expensive operation: filter and aggregate 100GB
result = (
    data
    .filter(xo._.amount > 1000)
    .group_by("category")
    .agg(total=xo._.amount.sum())
    .cache(cache=ParquetCache.from_kwargs(source=con))  # Cache results
)

# First run: computes and caches (slow)
result.execute()

# Second run: uses cache (instant)
result.execute()

Why recomputation wastes development time

Recomputing everything on every run means you pay the full execution cost each time you iterate. If you’re iterating on a feature engineering pipeline, then you might run the same expensive join ten times while tweaking logic. This approach creates three critical problems that waste computational resources and slow down development velocity.

Wasted compute costs money

Running the same query repeatedly costs money and time without any benefit when data hasn’t changed. A 10-minute aggregation running hourly wastes the full execution time on every run, even if input data remains constant. That’s 240 minutes daily when 10 minutes would suffice for the entire day with caching.
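The arithmetic above is easy to check directly. The numbers below are the document’s illustrative figures, not measurements:

```python
# Illustrative figures from the text: a 10-minute aggregation rerun every hour.
minutes_per_run = 10
runs_per_day = 24

without_cache = minutes_per_run * runs_per_day  # recompute on every run: 240 minutes
with_cache = minutes_per_run                    # compute once, then serve cache hits: 10 minutes

print(without_cache, with_cache)
```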

Iteration grinds to a halt

Every change requires recomputing from scratch, which slows feedback loops and kills development velocity. Data scientists wait minutes for results they’ve already computed in previous iterations of the same analysis, making productive exploration impossible.

Manual cache management fails

Manual caching requires you to remember when to invalidate, which leads to serving stale data or wasting compute. Forget to invalidate and you serve stale results from outdated logic or source data; invalidate too aggressively and you waste compute rerunning queries that didn’t need it.

Intelligent caching solves these problems by automatically detecting when source data changes and invalidating only affected entries.

How intelligent caching works

Xorq’s caching operates through four sequential stages that transform expensive operations into instant cache hits on subsequent runs.

When you call .cache(), the four stages are:

  • Key generation: Xorq generates a cache key from the computation logic (operations, filters, and joins) and, optionally, source data modification times.
  • Lookup: Before executing, Xorq checks whether a valid cache entry exists.
  • Execution: On a cache hit, results return instantly. On a miss or an invalid entry, Xorq executes the query and stores the results in the configured storage.
  • Invalidation check: On subsequent runs, Xorq checks whether source data changed. If it did, the cache invalidates; if not, cached results return.

Cache keys are derived from computation logic and source data modification times, which gives correct invalidation semantics. Two different queries on the same data produce different cache keys because the logic differs. The same query on changed data produces a different key because modification times are included in the hash, triggering automatic invalidation.
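The keying idea can be sketched in plain Python. This is not Xorq’s actual key derivation, just a minimal illustration: hash the computation logic together with the source’s modification time, so changing either one yields a new key.

```python
import hashlib

def cache_key(logic: str, source_mtime: float) -> str:
    """Illustrative sketch (not Xorq's implementation): combine the
    computation logic and the source's modification time into one hash,
    so a change to either produces a different cache key."""
    payload = f"{logic}|{source_mtime}".encode()
    return hashlib.sha256(payload).hexdigest()[:16]

k_same = cache_key("filter amount > 1000; sum by category", 1700000000.0)
k_new_logic = cache_key("filter amount > 500; sum by category", 1700000000.0)
k_new_data = cache_key("filter amount > 1000; sum by category", 1700009999.0)

assert k_new_logic != k_same  # different logic -> different key
assert k_new_data != k_same   # same logic, changed source -> different key
```

Because the modification time participates in the hash, a changed source file naturally “misses” the old entry, which is what makes invalidation automatic rather than a separate bookkeeping step.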

Tip

Xorq’s caching is lazy rather than eager. Calling .cache() doesn’t execute anything; it marks the expression for caching. Caching happens when you call .execute() to trigger evaluation.
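The lazy behavior can be sketched with a toy expression class (hypothetical, not Xorq’s API): .cache() only records intent, and both computation and caching happen at .execute() time.

```python
class Expr:
    """Toy sketch of lazy cache marking: .cache() records intent,
    .execute() does the work and stores the result on first evaluation."""

    def __init__(self, compute):
        self._compute = compute
        self._cache_enabled = False
        self._cached = None

    def cache(self):
        self._cache_enabled = True  # mark only; nothing runs yet
        return self

    def execute(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached  # cache hit: no recomputation
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

calls = []
expr = Expr(lambda: calls.append(1) or 42).cache()
assert calls == []           # .cache() did not execute anything
assert expr.execute() == 42  # first execute computes and caches
assert expr.execute() == 42  # second execute is a cache hit
assert calls == [1]          # the computation ran exactly once
```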

Cache types

Xorq provides four cache types, each optimized for different scenarios based on invalidation needs and persistence requirements.

SourceCache

SourceCache automatically invalidates when upstream data changes by tracking source modification times, which ensures results stay current with source data. It stores cached data in the source backend for convenience and backend integration. DuckDB, PostgreSQL, and other backends work with this approach.

Use SourceCache when source data changes unpredictably during development and you need automatic invalidation. Production pipelines also benefit from automatic invalidation, which prevents serving stale data without manual intervention.

SourceCache tracks source data modification times. When a source file or table changes, cache invalidates automatically.

import xorq.api as xo
from xorq.caching import SourceCache

con = xo.connect()
cache = SourceCache.from_kwargs(source=con)

# Cache automatically invalidates if source data changes
cached = data.filter(xo._.amount > 100).cache(cache=cache)

ParquetCache

ParquetCache persists results as Parquet files on disk. It combines automatic invalidation with durable storage that survives process restarts, so results can be shared across sessions.

Use ParquetCache when you want persistent cache across sessions for iterating on pipelines locally with efficiency. Efficient columnar storage makes reading cached results fast while modification time tracking provides automatic invalidation.

ParquetCache writes results to Parquet files in a specified directory while tracking source data modification times.

from pathlib import Path
from xorq.caching import ParquetCache

cache = ParquetCache.from_kwargs(source=con, base_path=Path.cwd() / "cache")

# Results persist as Parquet files
cached = expensive_operation.cache(cache=cache)

SourceSnapshotCache

SourceSnapshotCache stores results without automatic invalidation, so you control when to invalidate manually for reproducibility. Fixed snapshots provide reproducibility guarantees since results never change unless you explicitly delete the cache.

Use SourceSnapshotCache when you want fixed snapshots for reproducibility in one-off analyses or research work. If source data is stable and you want manual control over cache lifecycle, then this approach works well since automatic invalidation would interfere with reproducibility goals.

SourceSnapshotCache stores results in the source backend but doesn’t check modification times for invalidation logic.

from xorq.caching import SourceSnapshotCache

cache = SourceSnapshotCache.from_kwargs(source=con)

# Cache never invalidates automatically
snapshot = data.filter(xo._.year == 2024).cache(cache=cache)

ParquetSnapshotCache

ParquetSnapshotCache combines Parquet persistence with snapshot semantics for durable archives and reproducible research. No automatic invalidation occurs with this cache type.

Use ParquetSnapshotCache when you want durable snapshots for reproducible research so results persist as files. Archiving analysis results benefits from Parquet storage without automatic invalidation, which prevents archived outputs from changing unexpectedly.

ParquetSnapshotCache works like ParquetCache but without modification time tracking. Results persist until you delete them manually.

from pathlib import Path
from xorq.caching import ParquetSnapshotCache

cache = ParquetSnapshotCache.from_kwargs(source=con, base_path=Path.cwd() / "snapshots")

# Durable snapshot that never auto-invalidates
archive = analysis_result.cache(cache=cache)

Choosing the right cache type

Use this decision framework to select the appropriate cache based on invalidation needs and persistence requirements.

When to use each cache type:

  • Need automatic invalidation? Use SourceCache or ParquetCache, which detect source data changes and recompute when needed.
  • Need persistent storage across sessions? Use ParquetCache or ParquetSnapshotCache for durability and sharing.
  • Need fixed, reproducible results? Use SourceSnapshotCache or ParquetSnapshotCache, which never invalidate automatically.

Cache comparison

| Cache type | Auto-invalidation | Persistence | Best for |
| --- | --- | --- | --- |
| SourceCache | Yes | Backend-dependent | Production pipelines with changing data |
| ParquetCache | Yes | Parquet files on disk | Local development with changing data |
| SourceSnapshotCache | No | Backend-dependent | One-off analyses, manual control |
| ParquetSnapshotCache | No | Parquet files on disk | Reproducible research, archiving |

How cache invalidation works

Xorq uses different strategies to determine when cache is still valid based on cache type selection.

SourceCache and ParquetCache track source data modification times. When a source file or table’s last-modified time changes, the cache invalidates automatically, without manual intervention or configuration.
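The modification-time check can be sketched in plain Python. This illustrates the idea behind SourceCache and ParquetCache invalidation, not Xorq’s actual code:

```python
import os
import tempfile

def is_cache_valid(source_path: str, cached_mtime: float) -> bool:
    """Sketch of modification-time invalidation: an entry is valid
    only while the source's mtime matches the one recorded when the
    cache entry was written."""
    return os.path.getmtime(source_path) == cached_mtime

# Create a stand-in source file and record its mtime at caching time.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"v1")
path = f.name
recorded = os.path.getmtime(path)

assert is_cache_valid(path, recorded)          # source unchanged: cache hit

os.utime(path, (recorded + 1, recorded + 1))   # simulate a source update
assert not is_cache_valid(path, recorded)      # source changed: invalidate

os.remove(path)
```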

Snapshot caches provide no automatic invalidation. SourceSnapshotCache and ParquetSnapshotCache keep entries valid indefinitely, until you delete them manually through filesystem operations or explicit cache-clearing commands.

| Storage type | Hash components |
| --- | --- |
| In-memory | Data bytes + schema |
| Disk-based | Query plan + schema + modification time |
| Remote | Table metadata + last-modified time |

Automatic invalidation resembles a smart refrigerator that knows when food expires based on expiration dates. Snapshot caching is more like a freezer, where you decide when to throw things out based on judgment.

Multi-engine caching

Xorq’s caching works across multiple engines, so you can cache results from PostgreSQL in DuckDB.

import xorq.api as xo
from xorq.caching import ParquetCache

# Load from PostgreSQL
pg = xo.postgres.connect_env()
data = pg.table("large_table")

# Transfer to DuckDB and cache there
db = xo.duckdb.connect()
cached = (
    data
    .into_backend(db)
    .filter(xo._.amount > 1000)
    .cache(cache=ParquetCache.from_kwargs(source=db))
)

# Subsequent runs use DuckDB cache, not PostgreSQL
result = cached.execute()

This pattern works well for hybrid workflows since you only load data from slow remote databases once. Cache it locally in DuckDB, then iterate fast on the cached data to avoid hitting remote databases repeatedly.

When intelligent caching matters

Caching isn’t always beneficial. For fast queries or one-off analyses, the overhead exceeds the value. Here’s how to decide when caching makes sense for your workflow.

Use intelligent caching when

  • You’re iterating on pipelines and rerunning the same operations repeatedly during development. Caching eliminates recomputation entirely.
  • Your upstream operations are expensive, with large joins or aggregations taking minutes or hours. Caching provides clear performance wins.
  • Your source data changes infrequently relative to your iteration speed. If data updates daily but you develop hourly, then cache hits outweigh misses significantly.
  • You need to reduce load on production databases during development. Caching keeps development query load off production infrastructure.
  • You want automatic change detection. SourceCache detects changes for you without manual tracking or invalidation logic.
  • Multiple team members run the same expensive transformations. Shared cache eliminates duplicate computation across the team.

Use snapshot caching when

  • You need fixed results for reproducibility in research or compliance contexts. Snapshot caching provides results that never change unless explicitly deleted.
  • Your source data is stable and you want explicit control over invalidation timing. Automatic tracking would interfere with reproducibility.
  • You’re archiving analysis outputs. Snapshots persist indefinitely so source changes won’t invalidate archived results unexpectedly.

Skip caching when

  • Your operations are cheap, completing in under one second. Cache overhead exceeds the benefit from avoided recomputation.
  • Your source data changes constantly in real-time streams. Cache invalidates too frequently to provide any performance benefit.
  • You’re running one-off queries that won’t repeat. Single execution means no reuse opportunity exists.
  • Cache storage costs more than recomputation. In some cloud configurations, compute is cheap but storage is expensive.
  • Your backend has built-in caching like DuckDB temp tables or Snowflake query cache. Additional caching adds redundancy without benefit.
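The storage-versus-compute point above can be checked with a back-of-the-envelope calculation. All figures here are hypothetical, chosen only to show the shape of the comparison:

```python
# Hypothetical break-even check: cache only if expected reuse savings
# exceed the cost of holding the cached data. All prices are made up.
storage_gb = 100
storage_cost_per_gb_month = 0.02     # assumed object-store price per GB-month
compute_cost_per_run = 0.50          # assumed cost of one recomputation
expected_cache_hits_per_month = 3    # how often the cached result is reused

storage_cost = storage_gb * storage_cost_per_gb_month               # 2.0
compute_saved = compute_cost_per_run * expected_cache_hits_per_month  # 1.5

# In this configuration, recomputing is cheaper than caching.
assert storage_cost > compute_saved
```

With higher reuse or a more expensive computation, the inequality flips, which is exactly the “use intelligent caching when” case earlier on this page.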

Understanding trade-offs

Intelligent caching offers significant benefits, but it comes with costs. Here’s what you gain:

  • Faster iteration: Expensive computations that take 30 minutes become instant on cache hits, which dramatically improves productivity during development.
  • Reduced database load: Caching avoids hammering production databases with repeated queries, keeping development load off production systems.
  • Automatic change detection: SourceCache and ParquetCache invalidate when data changes without manual tracking or configuration.
  • Persistent storage: Parquet caches survive across sessions so results can be shared across teams for collaboration on expensive computations.
  • Multi-engine support: Cache results from any backend in any other backend, providing flexible workflows and hybrid execution patterns.

Here’s what you give up:

  • Storage overhead: Cached data consumes disk space or database storage. A 100GB dataset requires 100GB of cache storage.
  • Invalidation complexity: Modification-time tracking can fail on file systems or storage backends that don’t report modification times reliably.
  • Stale data risk: Snapshot caches can serve stale results if you forget to invalidate manually when data changes.
  • Cache management: You must monitor and clean up old cache entries since disk space management becomes necessary over time.
  • Overhead for cheap operations: Sub-second queries might be slower with caching since cache overhead can exceed execution time for very fast queries.

Note

Caching is still lazy execution. Calling .cache() doesn’t execute the query or trigger any computation; it marks the expression for caching. Caching happens when you call .execute() to evaluate results. This differs from Ibis, where .cache() executes eagerly and returns materialized results immediately.

Learning more

Why deferred execution explains how caching works with deferred execution. How Xorq works shows where caching fits in the pipeline.

Content-addressed hashing discusses how cache keys are generated. Multi-engine execution details how caching works across backends.

Explore caching tutorial provides hands-on practice with caching. Optimize pipeline performance guide covers production caching strategies.