User-defined functions
Complex business rules often require custom logic that standard SQL operations and built-in functions can't express. Built-in functions handle sums, filters, and joins, but not your specific discount calculation. User-defined functions solve this by wrapping your Python logic so it works in both expressions and SQL queries.
What you’ll understand
- What user-defined functions are and how they extend Xorq built-in operations for custom functionality
- When to use UDFs versus built-in functions based on your performance and functionality requirements for workflows
- How to create UDFs using make_pandas_udf by defining functions and specifying schemas
- The trade-offs between extensibility and performance that UDFs involve in production systems and pipelines
What are user-defined functions?
User-defined functions (UDFs) are custom Python functions that work in both Xorq expressions and SQL queries. You write Python code that operates on pandas DataFrames, and Xorq makes it available as a function that executes consistently across different backends. This lets you extend Xorq’s built-in operations with domain-specific logic for custom business rules, API calls, and complex transformations that aren’t built into the system.
```python
import xorq.api as xo
from xorq.expr.udf import make_pandas_udf
import xorq.expr.datatypes as dt

# Define your Python function
def calculate_discount(df):
    import numpy as np
    return np.where(df['is_member'], df['price'] * 0.9, df['price'])

# Wrap as UDF
discount_udf = make_pandas_udf(
    fn=calculate_discount,
    schema=xo.schema({"price": dt.float64, "is_member": dt.boolean}),
    return_type=dt.float64,
    name="calculate_discount"
)

# Use in expressions
result = customers.mutate(
    final_price=discount_udf.on_expr(customers)
)
```

Why user-defined functions matter
Built-in functions cover common operations like sum, filter, and join. However, every domain has specific logic that doesn’t fit standard operations. Without UDFs, you’d need to extract data to Python, apply your logic, then load it back. This breaks the pipeline and loses optimization opportunities.
This creates three problems:
Pipeline fragmentation slows execution. You can’t express your full computation in one pipeline. You extract data, process in Python, then load back. This requires serializing data to disk or memory, context switching between systems, and potentially losing parallelization. A transformation that could run in-database runs in Python instead, moving gigabytes unnecessarily.
No SQL integration limits collaboration. Your Python logic doesn’t work in SQL queries. Data analysts who prefer SQL can’t use your transformations. They either duplicate your logic in SQL, creating a maintenance burden, or ask you to extract and process data for them, creating bottlenecks.
Lost optimization reduces performance. When you extract data to Python, Xorq can’t optimize the full pipeline. Operations that could push to the database run in Python instead. Filter operations that could eliminate 99% of data happen after extraction, not before. The query optimizer has no visibility into your Python processing.
UDFs solve these by making Python functions first-class operations in Xorq. Your custom logic works in expressions, SQL queries, and across all backends.
How user-defined functions work
UDFs operate in three stages:
Function definition: You write a Python function that takes a pandas DataFrame and returns a Series or scalar. This function contains your custom logic.
UDF wrapping: You wrap the function with make_pandas_udf, specifying the input schema and return type. This creates a UDF constructor.
UDF application: You apply the UDF to expressions using .on_expr(). After registration, UDFs can be used in SQL queries by name. Xorq handles data conversion and execution.
UDFs are portable across different database backends, meaning the same UDF works on DuckDB, PostgreSQL, Snowflake, and other supported systems without modification. Xorq handles all the backend-specific implementation details and data format conversions automatically.
UDFs bridge Python and SQL by letting you write Python functions once and use them in both Python expressions and SQL queries. This enables collaboration between Python-first and SQL-first users without duplicating logic.
Creating UDFs with make_pandas_udf
The make_pandas_udf function creates scalar UDFs, which produce one output value per input row:
```python
import xorq.api as xo
from xorq.expr.udf import make_pandas_udf
import xorq.expr.datatypes as dt

# 1. Define function
def bill_ratio(df):
    return df['bill_length_mm'] / df['bill_depth_mm']

# 2. Create schema
schema = xo.schema({
    "bill_length_mm": dt.float64,
    "bill_depth_mm": dt.float64
})

# 3. Wrap as UDF
bill_ratio_udf = make_pandas_udf(
    fn=bill_ratio,
    schema=schema,
    return_type=dt.float64,
    name="bill_ratio"
)

# 4. Apply to data
result = penguins.mutate(
    ratio=bill_ratio_udf.on_expr(penguins)
)
```

The function receives a pandas DataFrame whose columns correspond to the schema keys. It returns a pandas Series or scalar compatible with the declared return type.
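That contract is easy to check with plain pandas before any Xorq machinery is involved: call the function directly with a DataFrame whose columns match the schema keys and confirm it returns a Series (a quick sketch; the sample values here are illustrative).

```python
import pandas as pd

def bill_ratio(df):
    return df['bill_length_mm'] / df['bill_depth_mm']

# Columns match the schema keys declared above
df = pd.DataFrame({"bill_length_mm": [39.1, 46.5], "bill_depth_mm": [18.7, 15.5]})
out = bill_ratio(df)

assert isinstance(out, pd.Series)
assert abs(out.iloc[1] - 3.0) < 1e-9  # 46.5 / 15.5 == 3.0
```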
UDF types
Xorq supports two main UDF types:
Scalar UDFs
Process data row-by-row, returning one value per row:
```python
def classify_size(df):
    def categorize(mass):
        if mass > 4500:
            return 'Large'
        elif mass < 3500:
            return 'Small'
        else:
            return 'Medium'
    return df['body_mass_g'].apply(categorize)

size_udf = make_pandas_udf(
    fn=classify_size,
    schema=xo.schema({"body_mass_g": dt.float64}),
    return_type=dt.string,
    name="classify_size"
)
```

Expression UDFs
Use pre-computed values like trained models in predictions:
```python
from xorq.expr.udf import make_pandas_expr_udf, agg

# Train model (aggregation)
model_udaf = agg.pandas_df(
    fn=train_model,
    schema=train_schema,
    return_type=dt.binary,
    name="train_model"
)

# Predict using trained model
predict_udf = make_pandas_expr_udf(
    computed_kwargs_expr=model_udaf.on_expr(train_data),
    fn=predict,
    schema=test_schema,
    return_type=dt.string,
    name="predict"
)

# Apply predictions
result = test_data.mutate(
    prediction=predict_udf.on_expr(test_data)
)
```

Expression UDFs enable ML workflows where you train on aggregated data and predict on new data in one pipeline.
Using UDFs in SQL
UDFs become available in SQL queries once registered; registration happens the first time you use the UDF in an expression:
```python
# Create UDF
bill_ratio_udf = make_pandas_udf(
    fn=bill_ratio,
    schema=schema,
    return_type=dt.float64,
    name="bill_ratio"
)

# Register UDF by using it in an expression
_ = penguins.mutate(ratio=bill_ratio_udf.on_expr(penguins)).execute()

# Now use in SQL
result = con.sql("""
    SELECT
        species,
        bill_ratio(bill_length_mm, bill_depth_mm) as ratio
    FROM penguins
    WHERE ratio > 2.0
""")
```

The UDF name you specify becomes a SQL function after registration. Once registered, SQL users can call it just like built-in functions without needing Python knowledge.
When to use UDFs
Deciding when to use UDFs depends on your needs for custom logic and performance requirements.
Use UDFs when:
- You need domain-specific logic not available in built-in functions, like custom business rules or industry calculations.
- You want to share Python logic with SQL users to enable collaboration across teams.
- You’re calling external APIs or services like geocoding, sentiment analysis, or price lookups.
- You need complex transformations that are easier in Python than SQL, like multi-step calculations or conditional logic.
- The operation is specific to your domain and not a general-purpose operation.
Use built-in functions when:
- Xorq has the operation, like sum, filter, join, string operations, or date functions.
- Performance is critical and the operation is on the hot path, executed millions of times.
- The operation is common across many use cases as a standard transformation.
- You can compose built-ins to achieve your goal by combining filter, mutate, and group_by.
Example decision
Summing a column or filtering rows works better with built-in operations. This is because they run natively in the backend with full query planner optimization, making them significantly faster than UDF equivalents.
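To make that comparison concrete, here is a plain-pandas sketch (not Xorq-specific): both paths below compute the same total, but the vectorized call stays in optimized native code, which is what a pushed-down built-in gives you, while the Python-level loop models the per-row Python work a UDF-style path adds.

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, 200.0, 300.0]})

# Built-in style: vectorized, runs in native code
total_vectorized = df["price"].sum()

# UDF style: the same logic executed in Python, row by row
total_python = 0.0
for value in df["price"]:
    total_python += value

assert total_vectorized == total_python == 600.0
```

The results are identical; only where the work happens differs, and at scale that difference dominates runtime.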
UDF best practices
Following these practices helps you write maintainable UDFs that perform well in production.
Keep UDFs focused
Each UDF should do one thing well:
```python
# Good: focused UDF
def calculate_discount(df):
    return df['price'] * 0.9

# Bad: UDF doing too much
def process_order(df):
    # Calculates discount, applies tax, formats output...
    # Too many responsibilities
    ...
```

Handle null values
Always handle missing data explicitly:
```python
import numpy as np

def safe_divide(df):
    # Division by zero yields inf (or NaN for 0/0); normalize both to 0
    result = df['numerator'] / df['denominator']
    return result.replace([np.inf, -np.inf], np.nan).fillna(0)
```

Use type hints
Type hints improve readability and catch errors:
```python
import pandas as pd

def classify_size(df: pd.DataFrame) -> pd.Series:
    return df['body_mass_g'].apply(lambda x: 'Large' if x > 4500 else 'Small')
```

Test UDFs independently
Test your function with pandas DataFrames before wrapping as UDF:
```python
# Test function directly
test_df = pd.DataFrame({"price": [100, 200], "is_member": [True, False]})
result = calculate_discount(test_df)
assert result.tolist() == [90, 200]

# Then wrap as UDF
discount_udf = make_pandas_udf(...)
```

Trade-offs
UDFs provide extensibility for custom logic, but they introduce performance overhead compared to built-in operations.
Benefits:
- Extensibility: Add any logic you need; no waiting for Xorq to implement it.
- Portability: Same UDF works across backends like DuckDB, PostgreSQL, Snowflake, and others.
- SQL integration: Python logic available in SQL, enabling team collaboration.
- Reusability: Define once, and use in multiple pipelines and queries.
- Maintainability: Update logic in one place, and all uses get the update.
Costs:
- Performance: UDFs are slower than built-in operations, typically 2-10x depending on the operation.
- Serialization overhead: Data converts between Arrow and pandas, adding latency.
- Limited optimization: Xorq can’t optimize inside UDFs; it treats them as black boxes.
- Debugging complexity: Errors might be unclear; stack traces span Python and backend.
- Maintenance burden: Custom code requires documentation and testing.
When the trade-off is worth it:
If built-in functions can do the job like filtering, aggregation, or joins, then use them for better performance. Built-in operations run natively in the backend with full query optimization, while UDF operations serialize data to Python, adding overhead.
Learning more
Overview explains Xorq’s architecture and how UDFs fit into the system.
User-defined exchange functions covers UDXFs for distributed processing.
Create your first UDF tutorial provides hands-on UDF practice. UDF API reference covers complete UDF documentation.
See also
- Create your first UDF tutorial - Hands-on UDF practice
- UDF API reference - Complete UDF documentation