Feature serving patterns

Understand patterns for serving ML features in online and offline scenarios

ML systems need features in two contexts: batch training on historical data and real-time inference on live requests. Traditional systems force you to write feature logic twice: once for training in Python or Spark, and again for serving behind a REST API. The training version and the production version drift apart, causing training-serving skew, where models train on different features than they see in production.

Xorq solves this by letting you write feature logic once as an expression that works in both contexts. A feature serving pattern defines where and when this computation happens. Your choice determines latency, freshness, and infrastructure complexity.

The fundamental trade-off

Feature serving requires choosing between computation speed and data freshness, and you cannot optimize both simultaneously. If you precompute features, then you get fast lookups but stale data. If you compute on every request, then you get up-to-date data but slower responses. Patterns represent different positions on this spectrum.

What is a feature serving pattern?

A feature serving pattern determines when feature computation happens relative to prediction requests. This timing decision controls three critical properties: response latency, feature freshness, and computational cost.

Batch precomputation runs feature computation in scheduled jobs before requests arrive. At request time, you look up precomputed values from storage. This gives microsecond lookups, but features might be hours or days old.

On-demand computation runs feature computation when the request arrives. This gives up-to-date features but adds computation time to every request latency.

Hybrid computation precomputes expensive features in batch and computes cheap features on-demand, balancing speed and freshness by splitting the workload between the two approaches.

Why the same expression prevents drift

Traditional ML systems separate training and serving code. Training uses SQL or Spark queries for historical data. Serving uses Python or Java functions for real-time requests. Someone updates the training logic but forgets the serving logic. Aggregation formulas drift, causing models to train on one feature definition but see different features in production.

Xorq solves this through deferred execution, where the expression defines computation without executing it immediately. Training executes the expression on historical data while serving executes it on current data. Same logic everywhere produces zero drift.

This is why patterns matter. If the same expression works in both contexts, then you can choose when to execute it. Batch execution happens before requests, while on-demand execution happens during requests, but the computation logic stays identical regardless of timing.
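The write-once idea can be sketched in plain Python. This is an illustrative stand-in, not the actual Xorq API: a single feature function plays the role of the shared expression, executed on historical data for training and on current data for serving.

```python
import pandas as pd

# Illustrative stand-in for a deferred feature expression (not the
# real Xorq API): one definition, executed in both contexts.
def spend_features(txns: pd.DataFrame) -> pd.DataFrame:
    # Total and average spend per customer -- the single source of truth.
    return (
        txns.groupby("customer_id")["amount"]
        .agg(total_spend="sum", avg_spend="mean")
        .reset_index()
    )

historical = pd.DataFrame(
    {"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]}
)
current = pd.DataFrame({"customer_id": [1], "amount": [40.0]})

training_features = spend_features(historical)  # batch, before requests
serving_features = spend_features(current)      # on demand, during a request
```

Because both paths call the same function, the feature schema and logic cannot drift; only the data and the execution timing differ.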

Pattern 1: Batch precomputation

Batch precomputation separates computation from serving. Scheduled jobs compute features periodically for all entities, and results are written to fast-access storage so requests perform lookups instead of computation.

The mental model

Think of batch precomputation like a phone book where the phone company compiles all numbers once. When you need a number, you look it up instantly, accepting that new numbers take time to appear in the book.

How timing works

Understanding when computation and serving happen clarifies why batch precomputation delivers fast responses.

Before requests: This is the batch phase. Jobs run on a regular schedule, the expression executes on the entire dataset, and results are written to storage, indexed by entity ID. Storage is then ready for fast lookups.

During requests: This is the serving phase. Requests arrive with entity IDs. The system looks up precomputed features by ID and returns cached results immediately. No computation happens during the request.
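The two phases can be sketched with a plain dictionary standing in for the feature store (hypothetical data and names; a real system would use a cache or key-value store):

```python
import pandas as pd

# Hypothetical data: all transactions known at batch time.
transactions = pd.DataFrame(
    {"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]}
)

# Batch phase (before requests): run the expression over ALL entities
# and index the results by entity ID.
feature_store = (
    transactions.groupby("customer_id")["amount"].sum().to_dict()
)

# Serving phase (during requests): pure lookup, no computation.
def serve(customer_id: int) -> float:
    return feature_store[customer_id]
```

Note that an entity added after the batch run raises `KeyError` here, which mirrors the trade-off below: new entities appear only after the next job.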

When to use batch precomputation

Specific scenarios make batch precomputation the optimal choice for feature serving:

  • Features require expensive computation, like multi-table joins or long time windows
  • You can tolerate staleness where hourly, daily, or weekly updates are acceptable
  • You need sub-10ms response latency
  • You serve millions of entities where batch spreads cost efficiently

Real examples:

  • Customer lifetime value updated daily
  • Fraud risk scores recomputed hourly
  • User segments updated weekly
  • 90-day spending aggregations

Why this pattern works

Batch precomputation spreads expensive computation across all entities at once, reducing per-entity cost. Computing features for one million customers in a single batch job costs far less per entity than issuing one million individual computations, so the per-request cost drops to simple lookup overhead.

Storage becomes the serving layer. Fast storage, such as in-memory cache or local SSD, provides consistent microsecond latency, making complex computation irrelevant to serving performance.

Trade-offs

Every pattern involves accepting certain limitations to gain specific benefits.

You gain:

  • Ultra-fast lookups (1-10ms) regardless of feature complexity
  • Predictable latency since computation happened earlier
  • Efficient resource usage through batching

You accept:

  • Stale features that are hours to days old
  • Storage infrastructure requirements
  • Batch job management overhead
  • New entities appear only after the next batch run

Organizational implications

Batch patterns require data engineering capabilities, including skills in scheduling jobs, managing storage, and monitoring batch pipelines. If batch jobs fail, then serving stops working, creating operational dependencies.

Pattern 2: On-demand computation

On-demand computation merges computation and serving. No precomputation occurs, so every request triggers computation using the current data, and results are never cached between requests.

The mental model

Think of on-demand like a restaurant cooking to order, where you wait longer for your meal, but the food is prepared exactly when you need it, and nothing sits waiting to be served.

How timing works

On-demand computation has only one phase, during which everything happens in real time.

During requests: This is the only phase. Requests arrive with entity IDs, and the expression executes immediately against current data, running its filters, joins, and aggregations. Results return directly to the requester, and nothing persists for the next request.

No batch phase exists, and no precomputation happens, so every request pays the full computation cost.
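A minimal sketch of this single phase, again with a pandas DataFrame as a hypothetical stand-in for the live database:

```python
import pandas as pd

# Hypothetical live table; in production this would be a database query.
transactions = pd.DataFrame(
    {"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]}
)

# On-demand: the expression runs at request time against current data;
# nothing is cached between requests.
def serve_on_demand(customer_id: int) -> float:
    current = transactions[transactions.customer_id == customer_id]
    return float(current["amount"].sum())

# A write that landed a moment ago is reflected on the very next request.
transactions = pd.concat(
    [transactions, pd.DataFrame({"customer_id": [1], "amount": [7.0]})],
    ignore_index=True,
)
```

Every call to `serve_on_demand` re-runs the filter and aggregation, which is exactly the full computation cost the text describes.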

When to use on-demand computation

Specific requirements make on-demand computation the right choice for your serving needs:

  • Features must be up-to-date within seconds, like the current cart or the last action
  • Features compute quickly with simple filters on recent data only
  • You tolerate 50-500ms latency
  • You serve thousands of entities, not millions

Real examples:

  • Current shopping cart contents
  • Last 24-hour activity summaries
  • Real-time session features
  • Most recent user action

Why this pattern works

On-demand computation eliminates infrastructure complexity. There are no batch jobs to schedule, no storage to manage, and no cache invalidation logic; the database is the only dependency.

Features are guaranteed to be up-to-date. If data changes one second ago, then features reflect that change immediately with no waiting for batch jobs to catch up.

Trade-offs

On-demand computation offers distinct benefits and limitations compared to batch processing.

You gain:

  • Always up-to-date features reflecting the current database state
  • Simplified infrastructure with no batch jobs or storage
  • Immediate feature updates when data changes

You accept:

  • Higher latency from computation on every request
  • Database load scaling with request volume
  • Unpredictable latency if computation complexity varies
  • Expensive features make all requests slow

Organizational implications

On-demand patterns require backend engineering that relies on fast databases and optimized queries. Database performance directly impacts serving performance, so if queries slow down, then serving slows down.

Pattern 3: Hybrid computation

Hybrid computation splits features by computational cost. Expensive features use batch precomputation while cheap features use on-demand computation, and serving combines both at request time.

The mental model

Think of hybrid computation like a restaurant with prep work: the kitchen prepares expensive components ahead, like stocks, sauces, and slow-cooked items; quick components cook to order, like searing and garnishing; and assembly happens when you order.

How timing works

Hybrid computation operates in two distinct phases that work together to balance speed and freshness.

Before requests: This is the batch phase for expensive features. Scheduled jobs compute only the expensive features, and results are stored for fast lookup.

During requests: This is the combined phase. The system looks up precomputed expensive features, computes cheap features from current data, joins both feature sets, and returns the combined result.
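Both phases together can be sketched as follows (hypothetical data and names; `expensive` stands in for the precomputed store, `session_events` for live data):

```python
import pandas as pd

# Hypothetical data: history feeds the batch phase; session_events is live.
history = pd.DataFrame(
    {"customer_id": [1, 2], "lifetime_spend": [900.0, 120.0]}
)
session_events = pd.DataFrame(
    {"customer_id": [1, 1], "cart_amount": [30.0, 12.0]}
)

# Batch phase: precompute the expensive feature, indexed by entity ID.
expensive = history.set_index("customer_id")["lifetime_spend"].to_dict()

# Combined phase: look up the expensive feature, compute the cheap one
# from current data, and join both feature sets for the request.
def serve_hybrid(customer_id: int) -> dict:
    lifetime = expensive.get(customer_id, 0.0)
    cart = float(
        session_events.loc[
            session_events.customer_id == customer_id, "cart_amount"
        ].sum()
    )
    return {"lifetime_spend": lifetime, "current_cart": cart}
```

The request path pays only for the cheap aggregation plus a dictionary lookup, which is what keeps hybrid latency between the two pure patterns.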

When to use hybrid computation

Hybrid patterns work best when you need to balance multiple competing requirements:

  • You need both speed and up-to-date data
  • Features are split naturally into expensive and cheap categories
  • Your latency budget is 50-200ms
  • You build production ML systems at scale

Real examples:

  • Fraud detection combining lifetime patterns with recent behavior
  • Recommendations combining user preferences with the current session
  • Credit scoring combining payment history with recent applications

Why this pattern works

Hybrid patterns optimize the trade-off curve. You get fast lookups for expensive computation and up-to-date data for cheap computation, achieving better latency than pure on-demand and better freshness than pure batch.

The pattern adapts to your specific features. If 90% of the computation cost comes from three features, then batch those features while computing the remaining features on demand, optimizing where it matters most.
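One hedged way to find that split is to time each feature function against a latency budget; this is a hypothetical harness, not part of Xorq, and the two lambdas are stand-ins for real feature computations:

```python
import time

# Hypothetical cost-based split: time each feature function once and
# batch anything over the latency budget; the rest runs on demand.
def plan_features(feature_fns, budget_s=0.001):
    plan = {}
    for name, fn in feature_fns.items():
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        plan[name] = "batch" if elapsed > budget_s else "on-demand"
    return plan

features = {
    # Stand-in for an expensive multi-table join over a long window.
    "lifetime_spend": lambda: sum(range(2_000_000)),
    # Stand-in for a cheap filter on recent data.
    "current_cart": lambda: sum([1, 2, 3]),
}
plan = plan_features(features)
```

In practice you would time features against production-like data volumes and rerun the analysis as data grows, since a feature that is cheap today can cross the budget later.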

Trade-offs

Hybrid computation balances the benefits of both batch and on-demand patterns while introducing its own complexity.

You gain:

  • Balanced speed from batch and up-to-date data from on-demand
  • Flexibility to tune based on specific requirements
  • Optimal performance for most production systems

You accept:

  • Increased architectural complexity with two computation paths
  • Need to analyze which features should be batch versus on-demand
  • Requirements for both storage infrastructure and database access
  • More components that can fail independently

Organizational implications

Hybrid patterns require both data engineering and backend engineering, where teams need skills for batch jobs, storage, databases, and serving coordination. Complexity increases, but so does optimization potential.

Choosing the correct pattern

Your choice depends on three constraints: latency requirements, feature characteristics, and organizational capabilities. These constraints interact to determine which pattern fits your specific needs.

Decision framework

The following tables help you evaluate which pattern fits your specific constraints.

Latency budget:

Budget       Pattern            Why
Under 10ms   Batch only         No time for computation, must be a lookup
10-50ms      Batch or Hybrid    Can compute simple features
50-200ms     Hybrid             Can compute moderate features
Over 200ms   On-demand viable   Sufficient time for complex computation

Feature characteristics:

Feature type            Time window   Pattern     Why
Lifetime aggregations   All history   Batch       Too expensive for real-time
90-day rolling          3 months      Batch       Large data volume
Daily summaries         24 hours      Hybrid      Medium complexity
Hourly features         1 hour        On-demand   Small data volume
Current session         Minutes       On-demand   Must be up-to-date

Team capabilities:

Capability                 Enables pattern   Why
Data engineering team      Batch             Can manage scheduled jobs and storage
Backend engineering team   On-demand         Can optimize database queries
Both teams                 Hybrid            Can coordinate both approaches
Small team                 On-demand         Simpler with fewer moving parts
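The latency-budget guidance can be encoded as a small helper. This is a rule of thumb mirroring the thresholds above, not a hard rule, and the function name is an illustration:

```python
def suggest_pattern(latency_budget_ms: float) -> str:
    # Thresholds mirror the latency-budget guidance above.
    if latency_budget_ms < 10:
        return "batch"            # no time to compute; must be a lookup
    if latency_budget_ms < 50:
        return "batch or hybrid"  # simple features fit in the budget
    if latency_budget_ms <= 200:
        return "hybrid"           # moderate features fit in the budget
    return "on-demand viable"     # enough time for complex computation
```

Feature characteristics and team capabilities still constrain the final choice; latency alone only rules patterns out, it does not pick one for you.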

Common misunderstandings

These misconceptions can lead to poor architectural decisions when choosing feature serving patterns.

“Online” does not mean “fast”

Online serving means synchronous request-response, not necessarily low latency. Computing expensive features on demand is still online, even if requests take seconds. If you need sub-100ms latency, then you need batch precomputation or straightforward on-demand features.

Batch can be frequent

Running batch jobs every minute provides minute-level freshness. The batch-versus-on-demand distinction concerns where computation occurs, in scheduled jobs versus during the request, not how frequently results update. A batch job running every minute is still a batch job, not on-demand computation.

You do not always need feature stores

Traditional feature stores solve training-serving skew by storing precomputed features. If your expression works in both contexts, then you already solved skew, and storage becomes just caching for batch patterns. Many systems need only expressions and caching, not separate feature store infrastructure.

Preventing training-serving skew

The same expression must define features for both training and serving. Do not reimplement logic in different languages or frameworks, because if training uses one expression and serving reimplements that logic, then drift will occur over time.

The expression is the contract. Training executes it on historical data while serving executes it on current data, guaranteeing identical computation logic.

When patterns do not apply

If your features never repeat across requests, then serving patterns add unnecessary complexity, and you should compute features directly in application code instead.

If each prediction requires unique custom logic, then patterns offer no reuse benefit, and the overhead of expressions and serving infrastructure outweighs the gains.

Patterns work when computation logic repeats. Repetition enables optimization through timing choices, so without repetition, simpler approaches work better.

Learning more

Serving expressions as endpoints explains how expressions become serving endpoints and the architecture that enables this.

Intelligent caching system covers caching strategies that power batch patterns, including time-to-live and invalidation logic.

Point-in-time correctness discusses temporal correctness for features that change over time, preventing data leakage in training.