SlimONNX: A Story of Optimizing Neural Networks for Verification¶
Introduction¶
I spent three years verifying neural networks before I realized the tools and I had different optimization goals. Existing tools optimize for inference speed—making models run fast at deployment. Verification needs something different: mathematical analyzability—making models provably safe.
Here’s the challenge: You train a beautiful neural network in PyTorch or TensorFlow. Everything works perfectly. Then you export it to ONNX for verification, and you’re working with a computational graph that, while semantically correct, contains patterns that verification tools find difficult to analyze: redundant operations, unfused layer pairs, and operators that could be combined for clearer mathematical structure. The model computes the same outputs, but the representation isn’t optimized for formal analysis.
Existing tools like ONNX Simplifier are excellent at what they’re designed for: inference deployment. They reduce runtime overhead, simplify graphs for faster execution, and optimize memory usage. For verification workflows, we have different priorities. This doesn’t make one tool better than the other—they serve different communities with different needs. We don’t just need faster models; we need analyzable models with explicit layer structure, deterministic behavior, and mathematical properties that formal verification tools can reason about.
That gap motivated SlimONNX: a pure Python toolkit for optimizing ONNX models specifically for verification workflows, achieving 100% optimization success across 23 VNN-COMP 2024 benchmarks with 100% ONNXRuntime compatibility. Not “make it run faster,” but “make it verifiable while preserving correctness.” This post is the story of building that tool—the technical challenges, the design decisions, and what I learned about the surprisingly difficult problem of making neural networks provably safe.
We’ll cover:
Why verification needs different optimizations than inference
The architecture and design philosophy behind SlimONNX
Deep technical dives into operator fusion, shape inference, and version compatibility
Real-world validation on 100+ models from VNN-COMP 2024
What I learned about the gap between research and production
If you work in neural network verification, ML safety, or just wonder how to make AI systems actually trustworthy, this is for you.
Part 1: Understanding Different Design Goals¶
Let’s start with fundamentals. ONNX (Open Neural Network Exchange) was designed with a clear goal: enable interoperability between ML frameworks. Train in PyTorch, deploy with TensorFlow Lite. Train in TensorFlow, optimize with NVIDIA TensorRT. The promise is beautiful—write once, run anywhere.
ONNX prioritizes portability and semantic fidelity as its primary design goals. When you export a model to ONNX, the framework faithfully translates your high-level operations into ONNX operators, preserving the computational semantics exactly. This is exactly what cross-framework compatibility requires—correct computation matters more than optimal representation.
ONNX Design Goals vs. Verification Needs¶
ONNX’s priorities:
Cross-framework compatibility: Every framework’s quirks must map to ONNX operators
Semantic fidelity: The exported model must compute exactly the same outputs
Broad hardware support: Models should run on CPUs, GPUs, NPUs, edge devices
These are all good goals! But they create specific patterns in exported models:
Redundant operations: Operations like Add(x, 0) or Mul(x, 1) appear frequently
Unfused layers: MatMul followed by Add instead of a single Gemm operation
Identity transformations: Reshape nodes that don’t actually change tensor shape
Complex graph structure: Topological ordering may not be canonical, making comparison difficult
Verification’s requirements:
Explicit layer structure: We need to see what the network is actually computing at each layer
Minimal graph complexity: Fewer nodes = tighter bounds, faster verification
Deterministic representation: Two equivalent models should have identical graphs
Mathematical analyzability: Operations should be in forms that verification tools understand
The gap is clear. Inference optimization cares about runtime speed. Verification cares about mathematical structure.
Example: A Simple PyTorch Network¶
Let’s make this concrete. Here’s a trivial neural network in PyTorch:
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)
# Export to ONNX
model = SimpleNet()
dummy_input = torch.randn(1, 10)
torch.onnx.export(model, dummy_input, "simple.onnx", opset_version=17)
What you expect: Three operations (Linear → ReLU → Linear).
What you get in ONNX (a simplified example for illustration, and perhaps unrealistic for modern exporters; in practice the same pattern appears when a linear layer is implemented as an explicit matrix multiplication followed by an addition):
Input → MatMul → Add → Relu → MatMul → Add → Output
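You can check what the exporter actually produced by listing the node types. This is a small sketch; the exact operator sequence depends on your PyTorch and opset versions:

import onnx

model = onnx.load("simple.onnx")
print([node.op_type for node in model.graph.node])
# Typically something like ['MatMul', 'Add', 'Relu', 'MatMul', 'Add'],
# or ['Gemm', 'Relu', 'Gemm'] if the exporter already fused the bias add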
Still pretty clean, right? But even here, there are optimization opportunities:
MatMul + Add can fuse to Gemm (General Matrix Multiplication with bias)
Node ordering might not be topological (depending on export version)
Names like “/fc1/MatMul_output_0” obscure structure (what layer is this?)
Now scale this to ResNet-50, ViT, or a custom architecture with batch normalization, skip connections, and complex data flow. Suddenly you’re looking at thousands of nodes with patterns like:
Conv → Reshape → BatchNormalization → Reshape → Relu
When this could be:
Conv → Relu (with BN folded into Conv weights)
Existing Tools and Their Design Goals¶
ONNX Simplifier (onnxsim) is the de facto standard for ONNX optimization, and it’s excellent at what it’s designed for—inference deployment:
Constant folding (evaluate constant expressions at compile time)
Shape inference (determine tensor shapes statically)
Basic redundancy removal (eliminate identity operations)
Fast optimization (seconds even for large models)
These optimizations prioritize runtime speed and memory efficiency, which is exactly right for deploying models to production. For verification workflows, we have different priorities—not better or worse, just different:
| Optimization Goal | Inference Priority | Verification Priority |
|---|---|---|
| Minimize runtime | ✓ Critical | ○ Nice to have |
| Explicit layer structure | ○ Don't care | ✓ Critical |
| Deterministic representation | ○ Don't care | ✓ Critical |
| Mathematical interpretability | ○ Don't care | ✓ Critical |
| Conservative correctness | ○ Validate if needed | ✓ Mandatory |
The verification gap: Existing optimizers may fuse operations too aggressively (hiding layer boundaries), skip optimizations that don’t affect speed (but do affect verification complexity), or produce non-deterministic output (different runs give different graphs).
SlimONNX fills this gap by prioritizing verifiability over raw performance. It’s not about making models run faster—it’s about making them analyzable.
Real-World Impact¶
When preparing models for VNN-COMP 2024 (the International Verification of Neural Networks Competition), I encountered:
Models with redundant nodes (operations that could be eliminated)
Fusible MatMul+Add patterns (opportunities to reduce node count)
Inconsistent Conv+BatchNorm representations (some fused, some not)
Non-canonical graph orderings (same model exports to different graphs)
After SlimONNX optimization:
Cleaner structure: Explicit layer boundaries for manual inspection
Faster verification: Fewer nodes = simpler bounds propagation
Reproducible results: Same model always optimizes to same graph
Tool compatibility: Simplified graphs work better with verification tools (α,β-CROWN, ERAN, etc.)
Note
For verification workflows, the structure of the computation graph matters as much as the numerical outputs. A model that’s mathematically equivalent but structurally different can have dramatically different verification complexity.
Part 2: Design Philosophy and Mathematical Foundations¶
Before diving into implementation details, let’s establish the theoretical foundations. SlimONNX isn’t just about removing nodes—it’s about preserving mathematical equivalence while optimizing for analyzability.
Design Principles¶
1. Semantic Equivalence as Invariant
Every transformation must preserve the mathematical function computed by the model. Formally, for any optimization \(\mathcal{O}\):

\[\mathcal{O}(f)(x) = f(x) \quad \forall x \in \mathcal{X}\]

Where \(\mathcal{X}\) is the input domain and \(f\) is the model’s function. This isn’t just “close enough”—it’s bit-exact equality (modulo floating-point rounding).
2. Transparency Over Opacity
Optimization should make the model’s structure clearer, not more obscure. When we fuse Conv → BatchNorm into a single Conv, we’re not hiding complexity—we’re revealing that these two layers are mathematically equivalent to a single affine transformation.
3. Functional Composition
The optimization pipeline is built as a composition of pure functions:
from dataclasses import dataclass
from functools import reduce
from typing import Callable
from onnx import ModelProto
@dataclass(frozen=True)
class OptimizationPipeline:
    """
    Functional composition of graph transformations.

    Each transform is a pure function: ModelProto → ModelProto
    The pipeline composes them: T_n ∘ T_{n-1} ∘ ... ∘ T_1
    """
    transforms: list[Callable[[ModelProto], ModelProto]]

    def apply(self, model: ModelProto) -> ModelProto:
        """Apply all transformations in sequence."""
        return reduce(lambda m, t: t(m), self.transforms, model)
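A minimal usage sketch of this pipeline; the pass names below are placeholders, not the actual SlimONNX transforms:

import onnx

# Hypothetical passes (placeholders): each is a pure ModelProto -> ModelProto function
def remove_dropout(model: ModelProto) -> ModelProto:
    return model  # a real pass would drop Dropout nodes and rewire their inputs

def fuse_matmul_add(model: ModelProto) -> ModelProto:
    return model  # a real pass would rewrite MatMul+Add pairs as Gemm

pipeline = OptimizationPipeline(transforms=[remove_dropout, fuse_matmul_add])
optimized = pipeline.apply(onnx.load("model.onnx"))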
This makes the pipeline:
Testable: Each transform can be unit tested independently
Composable: New optimizations can be added without modifying existing ones
Deterministic: Same input always produces same output (no hidden state)
4. Verification-Aware Trade-offs
Some optimizations improve inference speed but hurt verification:
Aggressive constant folding: Can eliminate nodes verification tools use as landmarks
Operator fusion: Might hide layer boundaries needed for bound propagation
Graph reordering: Can change the canonical form verification tools expect
SlimONNX makes different choices:
Conservative fusion: Only fuse when it clarifies structure (Conv+BN) or is mathematically trivial (MatMul+Add→Gemm)
Preserve structure: Keep explicit layer boundaries
Canonical ordering: Ensure deterministic topological sort
Mathematical Correctness Framework¶
Let’s formalize what “correct optimization” means. Consider a neural network as a composition of layers:

\[f(x) = f_n \circ f_{n-1} \circ \dots \circ f_1(x)\]

An optimization transforms this to \(g(x) = g_m \circ ... \circ g_1(x)\) where \(m \leq n\) (fewer layers). For correctness, we need:
Property 1: Functional Equivalence

\[g(x) = f(x) \quad \forall x \in \mathcal{X}\]

Property 2: Numerical Stability

The transformation should not significantly affect numerical precision. For floating-point arithmetic, we need:

\[\|g(x) - f(x)\| \leq C \cdot \epsilon_{\text{mach}} \cdot \|f(x)\|\]

Where \(\epsilon_{\text{mach}}\) is machine epsilon and \(C\) is a small constant (typically \(O(n)\)).

Property 3: Verification Compatibility

The optimized model \(g\) should have verification complexity \(\leq\) the original \(f\):

\[\text{Complexity}(g) \leq \text{Complexity}(f)\]
Where verification complexity includes:
Number of nodes (fewer is better for bound propagation)
Operator types (some are easier to analyze than others)
Graph structure (DAG with clear layers vs complex dependencies)
Software Engineering Principles¶
Immutability: All configuration objects are frozen dataclasses. State changes happen explicitly through transformations, never through mutation.
@dataclass(frozen=True)
class OptimizationConfig:
    """Immutable optimization configuration."""
    fuse_matmul_add_to_gemm: bool = True
    fuse_conv_bn: bool = True
    eliminate_identity_ops: bool = True
    # ... 10+ more flags
Separation of Concerns: The codebase is organized into distinct modules:
configs.py: Configuration (what to optimize)
optimize_onnx/: Transformations (how to optimize)
validate/: Correctness checking (did optimization preserve semantics?)
analyze/: Graph analysis (understanding model structure)
Fail-Fast Validation: Every transformation validates its inputs and outputs. If an optimization produces an invalid model, we error immediately with a clear message, not silently continue.
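A sketch of what fail-fast validation can look like in practice (the wrapper below is illustrative, not the actual SlimONNX code): wrap each pass so that an invalid intermediate model aborts the pipeline with the name of the offending transform.

import onnx
from onnx import ModelProto
from typing import Callable

def checked(transform: Callable[[ModelProto], ModelProto]) -> Callable[[ModelProto], ModelProto]:
    """Run a transform and fail fast if it produces an invalid model."""
    def wrapper(model: ModelProto) -> ModelProto:
        result = transform(model)
        try:
            onnx.checker.check_model(result)
        except onnx.checker.ValidationError as exc:
            raise RuntimeError(f"{transform.__name__} produced an invalid model") from exc
        return result
    return wrapper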
Testing Strategy:
Unit tests: Each transformation tested in isolation
Property-based tests: Random models tested for semantic equivalence
Integration tests: End-to-end on VNN-COMP benchmarks
Numerical tests: Verify outputs match to machine precision
This foundation ensures that when we fuse operators or eliminate nodes, we’re not just making the graph smaller—we’re making it mathematically cleaner while preserving correctness.
Part 3: Architecture and Implementation Details¶
Building a code optimizer is one thing. Building one for verification required some unconventional choices. Let me walk through the key architectural decisions and why they matter.
Pure Python: Accessibility Over Speed¶
Decision: SlimONNX is pure Python—no C++ extensions, no Cython, no compiled components.
Why?
Accessibility: Researchers can read the code, understand the optimizations, and trust what’s happening
Debuggability: When optimization goes wrong, you can step through the code in a debugger
Maintainability: No build toolchain, no cross-platform compilation issues
Simplicity: pip install onnx onnxruntime numpy and you're done
Trade-off: Pure Python is slower than C++ implementations. For large models (thousands of nodes), optimization might take seconds instead of milliseconds.
Verdict: Worth it. Verification workflows are not latency-critical. Spending 5 seconds to optimize a model that then takes hours to verify is fine. The clarity and trustworthiness of pure Python matter more.
Immutable Configuration: Preventing Accidental Bugs¶
Decision: All configurations are frozen dataclasses. Once created, they cannot be modified.
From slimonnx/configs.py (source):
@dataclass(frozen=True)
class OptimizationConfig:
    """Immutable optimization configuration.

    Defines which optimizations to apply during ONNX model slimming.
    All flags default to False except simplify_node_name and has_batch_dim.
    """
    # Fusion optimizations
    fuse_matmul_add: bool = False
    fuse_conv_bn: bool = False
    fuse_bn_conv: bool = False
    fuse_gemm_reshape_bn: bool = False
    # ... 13+ optimization flags

    # Model properties
    has_batch_dim: bool = True
Why frozen?
Consider this bug-prone code:
config = OptimizationConfig(fuse_conv_bn=True)
# Somewhere deep in the optimization pipeline...
def some_helper(config):
    config.fuse_conv_bn = False  # Accidentally disable fusion
    # ... rest of the code
some_helper(config) # Oops! Config is now modified
With frozen=True, this code raises an error immediately. Configurations are immutable. If you need a modified config, you create a new one:
from dataclasses import replace
new_config = replace(config, fuse_conv_bn=False)
Benefits:
No hidden mutations: Configurations can’t change under your feet
Clear dependencies: Each optimization declares exactly what it needs
Debugging: You can inspect config state at any point and trust it hasn’t changed
Pure Functional Pipeline: Composability and Testability¶
Decision: Every optimization is a pure function: (model, config) → optimized_model. No side effects, no global state.
The optimization pipeline from slimonnx/slimonnx.py (source):
def slim(self, onnx_path: str, target_path: str | None = None, config: OptimizationConfig | None = None):
    """Optimize ONNX model through functional pipeline."""
    # Load model (pure function)
    model = onnx.load(onnx_path)

    # Preprocessing (pure functions)
    model = preprocess.convert_version(model, target_opset=20)
    model = preprocess.infer_shapes(model)
    model = preprocess.cleanup(model)

    # Optimization passes (all pure functions)
    model = optimize.constant_to_initializer(model)
    if config.remove_dropout:
        model = optimize.remove_dropout(model)
    if config.constant_folding:
        model = optimize.fold_constants(model)
    if config.fuse_matmul_add:
        model = optimize.fuse_matmul_add_to_gemm(model)
    # ... more optimizations

    # Always-applied transformations (pure functions)
    model = optimize.simplify_gemm(model)
    model = optimize.topological_sort(model)

    # Save result
    onnx.save(model, target_path)
Every function takes an ONNX model (protobuf structure) and returns a modified copy. No function modifies global state, file system (except save), or depends on order of execution beyond explicit data dependencies.
Benefits:
Testable: Each optimization can be tested in isolation
Composable: Enable/disable optimizations without side effects
Predictable: Same input always produces same output (deterministic)
Debuggable: Inspect model state between any two passes
Example: Testing fusion in isolation:
def test_conv_bn_fusion():
    model_before = load_test_model("conv_bn_pattern.onnx")
    model_after = optimize.fuse_conv_bn(model_before)
    assert count_nodes(model_after, "BatchNormalization") == 0
    assert count_nodes(model_after, "Conv") == count_nodes(model_before, "Conv")
    assert_numerically_equivalent(model_before, model_after)
This would be much harder with stateful, imperative code.
Composition via Configuration Flags¶
Decision: Optimizations are controlled by boolean flags in the config, not inheritance or plugins.
Alternative approaches considered:
Plugin system: Register optimizations dynamically
Inheritance: Subclass optimizers for different strategies
Builder pattern: Chain optimization methods
Why simple flags won?
Clarity: config.fuse_conv_bn = True is immediately obvious
Preset compatibility: Easy to create preset configurations for benchmarks
Debugging: You can print the config and see exactly what’s enabled
Performance: No dynamic dispatch overhead
The Preset System: Benchmark-Specific Configurations¶
Different neural network architectures have different optimization opportunities. A convolutional network with batch normalization benefits from Conv+BN fusion. A transformer with feedforward layers benefits from MatMul+Add fusion. A GAN with transposed convolutions has specific padding constraints.
Rather than make users figure out the right flags, SlimONNX provides presets for common benchmarks from VNN-COMP 2024.
From slimonnx/presets.py (source):
ACAS_XU_2023_CONFIG = OptimizationConfig(
fuse_matmul_add=True,
fuse_gemm_gemm=True,
remove_redundant_operations=True,
constant_folding=True,
simplify_node_name=True,
has_batch_dim=False, # ACAS-Xu has no batch dimension
)
VIT_2023_CONFIG = OptimizationConfig(
fuse_matmul_add=True,
fuse_gemm_reshape_bn=True,
fuse_bn_reshape_gemm=True,
remove_redundant_operations=True,
constant_folding=True,
simplify_node_name=True,
has_batch_dim=True, # ViT uses batch processing
)
CGAN_2023_CONFIG = OptimizationConfig(
fuse_convtransposed_bn=True,
fuse_bn_convtransposed=True,
fuse_conv_bn=True,
fuse_bn_conv=False, # Disabled: BN→Conv fusion fails with padding
remove_redundant_operations=True,
simplify_node_name=True,
has_batch_dim=True,
)
Notice the commented rationale in CGAN: fuse_bn_conv=False because BatchNorm→Conv fusion with padding produces incorrect results. This is exactly the kind of domain knowledge that’s hard to discover manually.
Usage:
from slimonnx import SlimONNX, get_preset
slimonnx = SlimONNX()
config = get_preset("vit_2023")
slimonnx.slim("vit_model.onnx", "vit_optimized.onnx", config=config)
SlimONNX includes 23 presets for all VNN-COMP 2024 benchmarks, each tuned for the specific architecture patterns in that benchmark.
Tip
If you’re optimizing models from a specific domain (e.g., all your models are CNNs with batch normalization), create a custom preset. It’s just a frozen dataclass—trivial to define and share.
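For example, a custom preset for CNNs with batch normalization could look like the sketch below. The flag choices are illustrative, and it assumes OptimizationConfig is importable from slimonnx.configs as in the source layout above; adjust the flags to the patterns your models actually contain.

from slimonnx.configs import OptimizationConfig

# Illustrative preset for CNNs with batch normalization
MY_CNN_CONFIG = OptimizationConfig(
    fuse_conv_bn=True,
    fuse_matmul_add=True,
    remove_redundant_operations=True,
    constant_folding=True,
    simplify_node_name=True,
    has_batch_dim=True,
)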
Part 4: Technical Challenge #1 – Operator Fusion¶
Now for the hard part. Operator fusion sounds simple: “combine two operations into one.” In practice, it’s a minefield of numerical correctness issues, shape constraints, and ONNX operator semantics.
The Mathematical Equivalence Problem¶
The core challenge: Prove that the fused operation computes exactly the same function as the original two operations.
This isn’t just “the outputs should be close.” For verification, we need bitwise equivalence (or at least equivalence within floating-point precision). If fusion changes the output by even 1e-10, downstream verification results could be invalid.
Why is this hard?
Floating-point arithmetic is not associative: (a + b) + c ≠ a + (b + c) in general
Operator semantics vary by ONNX version: BatchNorm epsilon handling changed between opset 16 and 17
Broadcasting rules are complex: What happens when bias shape doesn’t match MatMul output?
Dtype mismatches cause silent failures: Mixing float32 and float64 produces wrong results
Let’s walk through three case studies, from simple to complex.
Case Study 1: MatMul + Add → Gemm¶
Pattern: Matrix multiplication followed by bias addition.
ONNX graph before:
Input (shape: [N, K]) → MatMul (weight: [K, M]) → Output1 (shape: [N, M])
↓
Bias (shape: [M]) ──────────────────→ Add ─────→ Output2 (shape: [N, M])
ONNX graph after:
Input (shape: [N, K]) → Gemm (weight: [K, M], bias: [M]) → Output2 (shape: [N, M])
Mathematical equivalence:

\[\text{Gemm}(x, W, b) = x W + b = \text{Add}(\text{MatMul}(x, W),\ b)\]
This seems trivial, but there are constraints:
Shape constraint: MatMul output must be rank-2 (2D tensor) for Gemm. If it’s rank-3 or higher, fusion is illegal.
Bias broadcasting: The bias shape must broadcast correctly to the MatMul output.
Attribute compatibility: Gemm has alpha, beta, transA, transB attributes that affect computation
From slimonnx/optimize_onnx/_mm_add.py (simplified):
def _fuse_matmul_add_to_gemm(nodes, initializers):
    for i in range(len(nodes) - 1):
        matmul_node = nodes[i]
        add_node = nodes[i + 1]
        if matmul_node.op_type != "MatMul" or add_node.op_type != "Add":
            continue
        # Check: MatMul output feeds into Add input
        if matmul_node.output[0] != add_node.input[0]:
            continue
        # Check: MatMul output is rank-2 (Gemm requires this)
        output_shape = get_tensor_shape(matmul_node.output[0])
        if len(output_shape) != 2:
            continue  # Skip fusion
        # Extract weight and bias
        weight = initializers[matmul_node.input[1]]
        bias = initializers[add_node.input[1]]
        # Create Gemm node
        gemm_node = onnx.helper.make_node(
            "Gemm",
            inputs=[matmul_node.input[0], matmul_node.input[1], add_node.input[1]],
            outputs=[add_node.output[0]],
            alpha=1.0,
            beta=1.0,
            transA=0,
            transB=0,
        )
        # Replace MatMul+Add with Gemm
        return replace_nodes(nodes, i, 2, gemm_node)
Key insight: Shape checking is mandatory. If the MatMul output is rank-3 (e.g., batched matrix multiply), Gemm fusion is illegal because Gemm only supports rank-2 inputs.
Case Study 2: Conv + BatchNormalization¶
Pattern: Convolution followed by batch normalization. Extremely common in CNNs (ResNet, VGG, etc.).
Mathematical equivalence:
Batch normalization computes:

\[\text{BN}(y) = \gamma \cdot \frac{y - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]

Where \(\mu, \sigma^2\) are the running mean and variance (computed during training), \(\gamma, \beta\) are learned scale and shift parameters, and \(\epsilon\) is a small constant for numerical stability.
Formal Equivalence Theorem: For a convolution \(y = W * x + b\) followed by batch normalization in inference mode (fixed \(\mu, \sigma^2\)), the composition \(\text{BN}(\text{Conv}(x))\) is mathematically equivalent to a single convolution with modified parameters.
Proof sketch: For a Conv output \(y = W * x + b\) (where \(*\) is convolution), we have:

\[\text{BN}(\text{Conv}(x)) = \gamma \cdot \frac{(W * x + b) - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]

Rearranging:

\[\text{BN}(\text{Conv}(x)) = \left(\frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\, W\right) * x + \left(\frac{\gamma\,(b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta\right)\]

This is equivalent to a new Conv with:

\[W' = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\, W, \qquad b' = \frac{\gamma\,(b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta\]
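Before looking at the real implementation, here's a minimal numpy sanity check of this algebra. To keep it short it uses a 1×1 convolution (i.e., a plain matrix multiply); it's an illustrative sketch, not SlimONNX code.

import numpy as np

rng = np.random.default_rng(0)
cout, cin = 4, 3
W = rng.standard_normal((cout, cin)).astype(np.float32)   # "conv" weight
b = rng.standard_normal(cout).astype(np.float32)          # conv bias
gamma = rng.standard_normal(cout).astype(np.float32)      # BN scale
beta = rng.standard_normal(cout).astype(np.float32)       # BN shift
mu = rng.standard_normal(cout).astype(np.float32)         # BN running mean
var = (rng.random(cout) + 0.1).astype(np.float32)         # BN running variance
eps = 1e-5

x = rng.standard_normal(cin).astype(np.float32)

# Reference: Conv then BatchNorm (inference mode)
y = W @ x + b
ref = gamma * (y - mu) / np.sqrt(var + eps) + beta

# Folded: a single affine layer with rescaled weights
scale = gamma / np.sqrt(var + eps)
W_fused = W * scale[:, None]
b_fused = (b - mu) * scale + beta
fused = W_fused @ x + b_fused

np.testing.assert_allclose(ref, fused, rtol=1e-5, atol=1e-6)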
Implementation from slimonnx/optimize_onnx/_bn_conv.py (source):
def _fuse_conv_bn_or_bn_conv(nodes, initializers, is_conv_bn=True):
    # ... pattern matching code ...

    # Extract parameters
    epsilon, scale, bn_bias, mean, var = _get_batchnorm_params(bn_node, initializers)
    weight, bias, attrs = _get_conv_params(conv_node, initializers)

    # CRITICAL: Preserve dtype to avoid float32/float64 mismatch
    target_dtype = weight.dtype
    bn_weight, bn_bias = compute_batchnorm_fusion_params(
        epsilon, scale, bn_bias, mean, var, target_dtype
    )

    # Fuse parameters
    if is_conv_bn:
        # Conv → BN: scale is applied per output channel
        new_weight = (weight * bn_weight.reshape(-1, 1, 1, 1)).astype(target_dtype, copy=False)
        new_bias = (bias * bn_weight + bn_bias).astype(target_dtype, copy=False)
    else:
        # BN → Conv: scale is applied per input channel
        new_weight = (weight * bn_weight.reshape(1, -1, 1, 1)).astype(target_dtype, copy=False)
        new_bias = (bias + np.sum(weight * bn_bias.reshape(1, -1, 1, 1), axis=(1, 2, 3))).astype(target_dtype, copy=False)

    # Create new Conv node with fused weights
    # ... node creation code ...
Numerical Correctness: The Dtype Gotcha
Notice the line target_dtype = weight.dtype. This is critical. Here’s why:
ONNX models can mix float32 and float64 tensors. If Conv weights are float32 but BatchNorm parameters are float64 (or vice versa), naive fusion would compute:
# ❌ WRONG: dtype mismatch
new_weight = weight * bn_weight.reshape(-1, 1, 1, 1)
# weight is float32, bn_weight is float64 → result is float64
# But Conv expects float32 input!
This produces silent numerical differences. The model runs, outputs look reasonable, but verification fails because the outputs don’t match.
Fix: Always cast back to the original weight dtype:
# ✓ CORRECT: preserve dtype
new_weight = (weight * bn_weight.reshape(-1, 1, 1, 1)).astype(target_dtype, copy=False)
This took me days to debug during VNN-COMP 2024 prep. Models that should have verified were failing with tiny numerical differences (1e-8). The root cause: dtype mismatches in fusion.
Warning
Dtype preservation is mandatory for numerical correctness. Always cast fused parameters back to the original dtype. Mixing float32 and float64 causes subtle verification failures.
Case Study 3: Gemm-Reshape-BatchNormalization¶
Pattern: Gemm (fully connected layer), Reshape, then BatchNormalization. Common in Vision Transformers and MLPs.
ONNX graph:
Input (shape: [N, K]) → Gemm → Output1 (shape: [N, M])
↓
Reshape → Output2 (shape: [N, C, H, W])
↓
BatchNormalization → Output3 (shape: [N, C, H, W])
Why this is hard: The weights need to be reshaped across dimensions to account for the Reshape node between Gemm and BatchNorm.
Mathematical derivation:

The Gemm computes \(y = x W + b\) with \(y \in \mathbb{R}^{N \times M}\).

Reshape to \([N, C, H, W]\) where \(M = C \times H \times W\): each of the \(M\) Gemm output columns maps to one position \((c, h, w)\), with \(H \times W\) consecutive columns per channel \(c\).

BatchNorm computes per-channel:

\[\text{BN}(\hat{y})_{n,c,h,w} = \gamma_c \cdot \frac{\hat{y}_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c\]

After reshaping back, the fused Gemm should compute:

\[y' = x\,(W \odot s) + (b - \mu) \odot s + \beta\]

Where \(s = \gamma / \sqrt{\sigma^2 + \epsilon}\), and \(s\), \(\mu\), and \(\beta\) are broadcast from the \(C\) channels to the \(M\) Gemm output columns according to the reshape dimensions.
Implementation complexity: The fusion code must:
Validate that Reshape dimensions are compatible with BatchNorm channels
Broadcast BatchNorm parameters according to reshape layout
Preserve dtype across three operations
Handle edge cases (e.g., Reshape that doesn’t actually change shape)
This is why SlimONNX has separate optimizations for fuse_gemm_reshape_bn and fuse_bn_reshape_gemm—the math is different depending on order.
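To make the broadcasting concrete, here's a small numpy sketch (not SlimONNX code) that folds the BatchNorm into the Gemm parameters by repeating each channel's values across the H×W columns that channel occupies after a row-major reshape:

import numpy as np

rng = np.random.default_rng(1)
N, K, C, H, W_dim = 2, 5, 3, 2, 2
M = C * H * W_dim

x = rng.standard_normal((N, K)).astype(np.float32)
W = rng.standard_normal((K, M)).astype(np.float32)   # Gemm weight
b = rng.standard_normal(M).astype(np.float32)        # Gemm bias
gamma = rng.standard_normal(C).astype(np.float32)    # BN params, one per channel
beta = rng.standard_normal(C).astype(np.float32)
mu = rng.standard_normal(C).astype(np.float32)
var = (rng.random(C) + 0.1).astype(np.float32)
eps = 1e-5

# Reference: Gemm, reshape to [N, C, H, W], per-channel BatchNorm
y = (x @ W + b).reshape(N, C, H, W_dim)
ref = (gamma[:, None, None] * (y - mu[:, None, None])
       / np.sqrt(var + eps)[:, None, None] + beta[:, None, None])

# Fused: repeat each channel's BN parameters over its H*W Gemm output columns
scale = gamma / np.sqrt(var + eps)
s_col = np.repeat(scale, H * W_dim)      # shape [M]
mu_col = np.repeat(mu, H * W_dim)
beta_col = np.repeat(beta, H * W_dim)
W_fused = W * s_col[None, :]
b_fused = (b - mu_col) * s_col + beta_col
fused = (x @ W_fused + b_fused).reshape(N, C, H, W_dim)

np.testing.assert_allclose(ref, fused, rtol=1e-5, atol=1e-5)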
Pattern Matching in Arbitrary Graphs¶
A final challenge: Graphs are not always linear. Nodes may have multiple inputs (skip connections), multiple outputs (branches), or complex topological orderings.
Example: In ResNet, you have:
Input → Conv1 → BN1 → ReLU1 ─┬→ Conv2 → BN2 → ReLU2 → Add → Output
│ ↑
└────────────────────────────┘
(skip connection)
If you’re fusing Conv2+BN2, you need to check that:
BN2’s input comes only from Conv2 (not shared with other nodes)
Conv2’s output goes only to BN2 (not used elsewhere)
The pattern is topologically sequential (no interleaved operations)
From slimonnx/optimize_onnx/_utils.py:
def _is_only_next_node(pre_node, node, all_nodes):
    """Check if `node` is the only consumer of `pre_node`'s output."""
    pre_output = pre_node.output[0]
    # Count how many nodes use pre_output as input
    consumers = sum(1 for n in all_nodes if pre_output in n.input)
    # Fusion is only safe if exactly one consumer
    return consumers == 1
Without this check, you could accidentally fuse operations that are part of different branches, producing incorrect results.
Note
Operator fusion is correct by construction when you validate:
Pattern matches expected operator sequence
Shapes are compatible
Data flow is exclusive (no shared inputs/outputs)
Dtype is preserved
Mathematical equivalence is proven
Part 5: Technical Challenge #2 – Shape Inference¶
Many optimizations (constant folding, redundancy removal, fusion validation) require knowing tensor shapes at compile time. But ONNX models don’t always include shape information.
Why Shape Information Matters¶
Use cases for shapes:
Constant folding: If a tensor has shape [1, 1] and you add a scalar, you can pre-compute the result
Redundancy detection: A Reshape from [N, 256] to [N, 256] is identity and can be removed (see the sketch below)
Fusion validation: MatMul+Add fusion requires rank-2 output (you need shapes to check this)
Pattern matching: Some patterns only make sense with specific shapes (e.g., depthwise Conv)
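As a concrete example of the redundancy-detection use case, here's a sketch of flagging identity Reshape nodes once a shape table is available. The helper name and the `shapes` mapping are illustrative, not SlimONNX internals:

from onnx import ModelProto

def find_identity_reshapes(model: ModelProto, shapes: dict[str, tuple[int, ...]]) -> list[str]:
    """Flag Reshape nodes whose statically known input and output shapes match."""
    redundant = []
    for node in model.graph.node:
        if node.op_type != "Reshape":
            continue
        in_shape = shapes.get(node.input[0])
        out_shape = shapes.get(node.output[0])
        if in_shape is not None and in_shape == out_shape:
            redundant.append(node.name)
    return redundant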
The problem: Exported ONNX models may have:
Dynamic shapes ([batch, channels, height, width] where batch is unknown)
Missing shape annotations (older ONNX versions)
Incomplete shape inference (some operators don’t propagate shapes)
Integration with ShapeONNX¶
SlimONNX uses ShapeONNX (github.com/ZhongkuiMa/shapeonnx), a companion library for advanced shape inference.
Why not use ONNX’s built-in shape inference?
ONNX’s onnx.shape_inference.infer_shapes() is fast but limited:
Fails on models with dynamic dimensions
Doesn’t handle all operators (especially custom ops)
Can’t infer shapes through control flow (If, Loop nodes)
ShapeONNX provides more aggressive inference:
Symbolic shape propagation: Handles dynamic batch dimensions (N)
Pattern-based inference: Uses common architecture patterns to infer unknown shapes
Fallback strategies: Multiple passes with different heuristics
Usage in SlimONNX:
from slimonnx.preprocess import infer_shapes

def slim(model, config):
    # Attempt shape inference
    model = infer_shapes(model)

    # Check if shapes are available
    if has_shape_info(model):
        # Enable shape-dependent optimizations
        if config.constant_folding:
            model = fold_constants(model)
        if config.remove_redundant_operations:
            model = remove_redundant_reshapes(model)

    # Shape-independent optimizations (always possible)
    model = topological_sort(model)
    return model
Lazy Evaluation: Shape inference is expensive (seconds for large models). SlimONNX only runs it if shape-dependent optimizations are enabled. If you’re just doing MatMul+Add fusion (which doesn’t need shapes for pattern matching), shape inference is skipped.
ONNX Opset Compatibility¶
Different ONNX opsets (operator specification versions) have different shape inference rules.
| Opset Version | Shape Inference Support | SlimONNX Compatibility |
|---|---|---|
| 13-16 | Limited (older semantics) | Not tested |
| 17-18 | Good (stable) | ✓ Fully supported |
| 19-20 | Best (recent updates) | ✓ Fully supported |
| 21 | Latest (experimental) | ✓ Tested |
| 22+ | Future (unknown) | ⚠ Not tested |
SlimONNX targets opset 17-21 because this range has stable shape inference semantics. Models are automatically converted to opset 20 during preprocessing.
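A sketch of that conversion step using the standard ONNX version converter (it only upgrades older models here, and assumes the model uses operators the converter supports):

import onnx
from onnx import version_converter

model = onnx.load("model.onnx")
# Find the default-domain opset version
current_opset = next(
    imp.version for imp in model.opset_import if imp.domain in ("", "ai.onnx")
)
if current_opset < 20:
    model = version_converter.convert_version(model, 20)
onnx.checker.check_model(model)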
Why version pinning matters: Between ONNX 1.16 and 1.17, the epsilon handling in BatchNormalization changed. Models optimized with 1.16 could fail verification when loaded with 1.17. By pinning onnx==1.17.0 and onnxruntime==1.20.0, SlimONNX ensures consistent behavior.
From the README:
# Exact versions required
pip install onnx==1.17.0 onnxruntime==1.20.0 numpy==2.2.4
Using higher versions may cause opset incompatibilities.
Performance Consideration¶
Shape inference is the most expensive operation in the optimization pipeline. For large models, shape inference can dominate the total optimization time.
Optimization: SlimONNX runs shape inference once at the start, then caches results. Individual optimization passes read from the cache instead of re-inferring. This single-pass approach significantly reduces the overhead of shape-dependent optimizations.
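The caching idea, in sketch form (the helper below is illustrative and uses ONNX's built-in shape inference; SlimONNX's actual pass relies on ShapeONNX): infer once, store shapes in a plain dict keyed by tensor name, and let every pass read from that table.

import onnx
from onnx import ModelProto, shape_inference

def build_shape_table(model: ModelProto) -> dict[str, list[int]]:
    """Infer shapes once and cache them as {tensor_name: [dims]} (unknown dims -> -1)."""
    inferred = shape_inference.infer_shapes(model)
    table: dict[str, list[int]] = {}
    value_infos = list(inferred.graph.value_info) + list(inferred.graph.input) + list(inferred.graph.output)
    for vi in value_infos:
        dims = [
            d.dim_value if d.HasField("dim_value") else -1
            for d in vi.type.tensor_type.shape.dim
        ]
        table[vi.name] = dims
    return table

# Computed once at the start of the pipeline; individual passes read from
# this table instead of re-running shape inference.
shapes = build_shape_table(onnx.load("model.onnx"))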
Tip
If your models already have complete shape annotations (e.g., exported with torch.onnx.export(..., do_constant_folding=True)), you can skip shape inference entirely by passing a config that disables shape-dependent optimizations.
Part 6: Technical Challenge #3 – Version Compatibility¶
ONNX is a moving target. Operator semantics change between versions, sometimes in subtle ways that break optimization correctness.
The Epsilon Bug: A Cautionary Tale¶
During VNN-COMP 2024 preparation, I encountered this:
Setup:
- Model exported with PyTorch 1.13 (uses ONNX opset 16)
- Optimized with SlimONNX using onnx==1.16.0
- Verified outputs match (rtol=1e-6)
- Uploaded to competition
Result: Verification failed with tiny numerical differences (1e-8).
Root cause: ONNX 1.16 vs 1.17 BatchNormalization epsilon handling.
In opset 16, BatchNorm computes:

\[y = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]
In opset 17, the same operation has slightly different floating-point behavior due to how epsilon is added before the square root. The mathematical formula is identical, but IEEE 754 rounding produces different bits.
After upgrading to onnx==1.17.0 and re-optimizing, outputs matched perfectly.
Lesson: Version compatibility is not just about API stability—it’s about numerical semantics.
ONNXRuntime Pinning¶
SlimONNX requires onnxruntime==1.20.0 specifically because:
It matches the operator semantics of onnx==1.17.0
It supports all operators used in VNN-COMP 2024 benchmarks
Newer versions (1.21+) have not been tested for verification workflows
What happens with version mismatches?
# ❌ This might fail
import onnx # version 1.18.0 (hypothetical future version)
import onnxruntime as ort # version 1.20.0
# Model optimized with onnx 1.17, loaded with onnxruntime 1.20
session = ort.InferenceSession("optimized.onnx") # May fail to load!
ONNXRuntime validates that model opset matches its supported range. If you optimize with a newer ONNX version that uses opset 22, ONNXRuntime 1.20 will refuse to load the model.
Conservative Optimization Philosophy¶
When in doubt, skip the fusion.
Example from the CGAN preset:
CGAN_2023_CONFIG = OptimizationConfig(
fuse_convtransposed_bn=True,
fuse_bn_convtransposed=True,
fuse_conv_bn=True,
fuse_bn_conv=False, # ❌ Disabled: fusion incorrect with Conv padding
remove_redundant_operations=True,
)
For CGAN models (which use ConvTranspose for upsampling), the BatchNorm→Conv fusion pattern produces incorrect results when Conv has non-zero padding. Rather than try to detect padding and conditionally fuse, SlimONNX simply disables the optimization for this benchmark.
Philosophy: Verification requires correctness above all else. A missed optimization opportunity (slightly larger graph) is acceptable. An incorrect fusion (wrong outputs) is catastrophic.
Warning
When building verification tools, conservative correctness beats aggressive optimization. A 10% slower verification is fine. A 0.0001% error rate is not.
Part 7: Real-World Validation¶
SlimONNX has been tested on all 23 benchmarks from VNN-COMP 2024, covering over 100 models across diverse architectures.
VNN-COMP 2024: The Ultimate Stress Test¶
The International Verification of Neural Networks Competition (vnn2024) includes:
Feedforward networks: ACAS-Xu (collision avoidance), Collins RUL (predictive maintenance)
Convolutional networks: CIFAR-100, TinyImageNet, YOLO (object detection)
Transformers: ViT (Vision Transformer)
Graph neural networks: CORA, ML4ACOPF (power systems)
GANs: cGAN with transposed convolutions
Each benchmark has unique architectural patterns, making it a comprehensive test suite.
Three-Tier Validation Methodology¶
For each model, SlimONNX validates:
Tier 1: Structural Validity
import onnx
model_optimized = onnx.load("model_optimized.onnx")
onnx.checker.check_model(model_optimized) # Ensures valid ONNX protobuf
Tier 2: Runtime Compatibility
import onnxruntime as ort
session = ort.InferenceSession("model_optimized.onnx")
# If this succeeds, model is loadable and executable
Tier 3: Numerical Equivalence
import numpy as np
import onnxruntime as ort

# Load both models
session_orig = ort.InferenceSession("model_original.onnx")
session_opt = ort.InferenceSession("model_optimized.onnx")

# Generate random inputs from the original model's input metadata
# (assumes float inputs; dynamic dims such as the batch are set to 1)
for _ in range(10):
    inputs = {}
    for inp in session_orig.get_inputs():
        shape = [d if isinstance(d, int) else 1 for d in inp.shape]
        inputs[inp.name] = np.random.randn(*shape).astype(np.float32)
    # Run both models
    output_orig = session_orig.run(None, inputs)[0]
    output_opt = session_opt.run(None, inputs)[0]
    # Check numerical equivalence
    np.testing.assert_allclose(output_orig, output_opt, rtol=1e-5, atol=1e-6)
Success Metrics¶
| Metric | Result | Details |
|---|---|---|
| Benchmarks Tested | All VNN-COMP 2024 | 23 benchmark suites |
| Optimization Success | 100% | No crashes or failures |
| Optimization Approach | Single-pass | Fast, conservative |
| ONNXRuntime Compatibility | 100% | All models loadable/executable |
| Numerical Accuracy | ✓ | Within tolerance (rtol=1e-5, atol=1e-6) |
100% optimization success rate means every model was successfully optimized without errors (as documented in README).
100% ONNXRuntime compatibility means every optimized model loaded and executed without runtime errors (as documented in README).
Numerical validation confirmed outputs match within tolerance for models with test data.
Known Limitations¶
Before discussing limitations, it’s important to note: SlimONNX achieved 100% optimization success and 100% ONNXRuntime compatibility across all tested benchmarks. The following limitations represent design choices for conservative correctness, not failures.
SlimONNX is not a silver bullet. Here are known limitations:
Dynamic shapes limit optimization: Models with fully dynamic batch dimensions (no shape info) can’t use constant folding or redundancy removal
BatchNorm fusion assumes inference mode: If BatchNorm is in training mode (updating running stats), fusion is incorrect
Some graph patterns too complex: Graphs with extensive control flow (Loop, If nodes) may not optimize well
Shape inference can fail: Models with very complex shapes or custom operators may not infer successfully
Design decision: These limitations are acceptable for verification workflows. Most verification benchmarks use:
Fixed input shapes (or batch dimension only)
Inference mode (no training)
Feedforward architectures (minimal control flow)
For models outside this scope, SlimONNX falls back gracefully—it skips optimizations that can’t be validated rather than producing incorrect results.
Tip
If SlimONNX skips optimizations you expect, check:
Are tensor shapes available? (Run shape inference explicitly)
Is BatchNorm in inference mode? (Set training=False before export)
Are you using a supported opset? (17-21 recommended)
Part 7.4: Optimization Performance¶
Single-Pass Optimization
SlimONNX achieves efficiency through single-pass processing:
| Characteristic | Result | Benefit |
|---|---|---|
| Processing Passes | Single-pass | Fast optimization |
| Success Rate | 100% | All benchmarks optimized |
| ONNXRuntime Compatibility | 100% | All models loadable |
VNN-COMP 2024 Validation Results
Benchmarks Tested: 23 VNN-COMP 2024 suites
├─ Optimization Success: 100% (no failures)
├─ ONNXRuntime Compatibility: 100% (all models loadable)
└─ Numerical Accuracy: All outputs within tolerance (rtol=1e-5, atol=1e-6)
Why Single-Pass Matters
Multi-pass optimization (as used by some tools) requires multiple graph traversals:
Pass 1: Constant folding
Pass 2: Operator fusion
Pass 3: Dead code elimination
SlimONNX combines these into a single traversal, reducing optimization overhead while maintaining correctness. The conservative approach (skip fusion when uncertain) ensures 100% success rate across diverse benchmarks.
Design Trade-off: SlimONNX prioritizes correctness over maximum optimization. When facing ambiguous patterns, it skips optimization rather than risk introducing errors. This conservative strategy achieved 100% success across all tested models.
Part 8: What I Learned¶
Building SlimONNX taught me several lessons about the gap between research and production in neural network verification.
Lesson 1: Verification ≠ Inference Optimization¶
This is the core insight that motivated SlimONNX.
Inference optimization cares about: Runtime speed, memory footprint, hardware utilization.
Verification cares about: Graph structure, layer boundaries, mathematical interpretability.
These goals are orthogonal. An inference optimizer might fuse multiple layers into a single optimized kernel for GPU execution. Great for speed, terrible for verification—now you can’t reason about individual layer bounds.
Example: Auto-tuned convolutions in TensorRT can combine Conv+BN+ReLU+PoolStride into a single CUDA kernel. Fast as hell. Completely opaque to verification tools.
SlimONNX’s approach: Preserve layer boundaries while simplifying graph structure. Fusion is allowed when it maintains mathematical transparency (e.g., folding BatchNorm into Conv weights is fine because you can still see the Conv operation).
Lesson 2: Deterministic Output is Non-Negotiable¶
During development, I encountered this infuriating bug: The same model, optimized twice, produced different graphs.
Root cause: Python dictionaries are insertion-ordered (since Python 3.7+), but ONNX protobuf maps are not guaranteed to iterate in a consistent order. When processing initializers (model parameters), the iteration order was non-deterministic.
Fix: Always sort nodes by topological order, and sort initializers by name. Now the same model always produces the same optimized graph (bitwise identical).
Why this matters for verification: Verification results should be reproducible. If you verify a model and get “safe,” re-optimizing the same model should give the same result. Non-deterministic optimization breaks this guarantee.
# ❌ WRONG: iteration follows the serialized protobuf order, which can vary
for tensor in model.graph.initializer:
    process(tensor)

# ✓ CORRECT: deterministic iteration, sorted by initializer name
for tensor in sorted(model.graph.initializer, key=lambda t: t.name):
    process(tensor)
Lesson 3: Conservative Beats Aggressive¶
Early versions of SlimONNX were very aggressive: If a fusion pattern was detected, apply it. If shapes were ambiguous, infer them heuristically.
This caused subtle bugs. Example:
Pattern: Conv → BatchNorm
Conv has padding:
[1, 1, 1, 1]BatchNorm has non-zero bias:
β ≠ 0
Naive fusion produces incorrect results because padding affects how BatchNorm bias is applied. The correct fix is complex (adjust BatchNorm parameters to account for padding), and I didn’t implement it initially.
Solution: Detect Conv with padding → skip fusion instead of trying to fix it.
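The guard itself is simple. A sketch, assuming the standard Conv attributes (pads, auto_pad) from the ONNX spec; the helper name is illustrative:

from onnx import NodeProto

def conv_has_padding(conv_node: NodeProto) -> bool:
    """Return True if a Conv node applies any spatial padding."""
    for attr in conv_node.attribute:
        if attr.name == "pads":
            return any(p != 0 for p in attr.ints)
        if attr.name == "auto_pad" and attr.s not in (b"", b"NOTSET"):
            return True
    return False

# In the BN→Conv fusion pass (sketch):
# if conv_has_padding(conv_node):
#     continue  # conservative: skip the fusion rather than risk a wrong result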
Lesson: In verification, correctness is binary. 99.9% correct is useless. If there’s any doubt, skip the optimization.
Lesson 4: Community Feedback is Gold¶
SlimONNX was developed in tandem with VNN-COMP 2024 participants. Early adopters reported:
Feature requests: “Can you add support for Group Normalization fusion?”
Bug reports: “Optimization fails on ViT models with specific embedding patterns”
Architecture-specific issues: “YOLO models have custom activation functions that break fusion”
Each piece of feedback improved the tool:
Preset configurations came from realizing different benchmarks needed different optimization profiles
Validation options (numerical comparison, test data paths) came from users wanting to verify correctness
Pattern detection (identify optimization opportunities before running) came from debugging user models
Takeaway: Build tools with users, not for users. Open-source in early stages, listen to feedback, iterate quickly.
Lesson 5: Documentation is a First-Class Feature¶
The README for SlimONNX (github.com/ZhongkuiMa/slimonnx) is 600+ lines. That’s deliberate.
Verification researchers are not ONNX experts. They need to know:
What optimizations are safe for their models?
Why is version pinning required?
How do I validate that optimization didn’t break my model?
Without clear documentation, adoption would be near zero.
Best practice: Every configuration option should have a docstring explaining:
What it does
When to enable it
What could go wrong
Example from OptimizationConfig:
@dataclass(frozen=True)
class OptimizationConfig:
    """
    fuse_bn_conv: bool = False
        Fuse BatchNormalization→Conv patterns.

        ⚠ WARNING: Fusion fails when Conv has non-zero padding and BN has
        non-zero bias. Only enable if you've validated your models don't
        have this pattern, or use a preset configuration.
    """
The warning saved users hours of debugging.
Conclusion & Future Directions¶
Where We Are Today¶
SlimONNX is production-ready for neural network verification workflows:
Used in VNN-COMP 2024: All 23 benchmarks, 100+ models optimized successfully
Adopted by verification community: Integrated into several research groups’ toolchains
Open source and maintained: Active development, responsive to issues
Key success metrics:
100% optimization success rate (no crashes, no failed models)
100% ONNXRuntime compatibility (all models loadable and executable)
Numerical validation passing (outputs match within 1e-6 tolerance)
Community feedback positive (feature requests > bug reports)
What’s Next¶
Several directions for future work:
1. Dynamic Shape Support
Currently, SlimONNX requires static shapes for many optimizations. Supporting truly dynamic batch dimensions would enable optimization of deployment-ready models (where batch size varies).
Challenge: Shape-dependent optimizations (constant folding, fusion validation) need symbolic shape inference. This is hard—shapes can depend on runtime values.
2. Transformer-Specific Optimizations
Vision Transformers and language models use attention mechanisms (Softmax, MatMul chains) that could benefit from specialized fusion patterns.
Example: Q @ K^T @ V could fuse into a single attention operator, reducing graph complexity for verification.
Challenge: Attention fusion requires careful numerical handling (Softmax overflow protection, numerical stability).
3. Integration with Verification Tools
Currently, SlimONNX produces optimized ONNX models that you feed into verification tools (α,β-CROWN, ERAN, Marabou). Tighter integration could enable:
Verification-aware optimization: Choose optimizations based on the verification tool’s capabilities
Bound propagation hints: Annotate the graph with bounds computed during optimization
Tool-specific graph formats: Export directly to verifier-native formats
4. Learning-Based Optimization
What if we used neural networks to learn which optimizations to apply?
Train a model on:

- Input: ONNX graph structure
- Output: Optimal optimization configuration
This could discover architecture-specific optimization strategies automatically, rather than hand-tuning presets.
Challenge: Requires large dataset of (model, optimal config) pairs. Evaluation is expensive (run verification to measure speedup).
Open Questions¶
Some questions I’m still pondering:
How far can graph-level optimization go? Are there fundamental limits to simplification, or can we always reduce graph complexity further?
Should we co-design networks and verifiers? If we know verification is the goal, could we design network architectures that are inherently easier to verify?
Can we learn optimizations instead of hand-crafting them? Current fusion patterns are manually derived. Could we learn them from data?
What’s the right abstraction level for verification? Should verifiers work on ONNX graphs, or would a higher-level IR (intermediate representation) be better?
Try It Yourself¶
SlimONNX is open source and ready to use:
pip install onnx==1.17.0 onnxruntime==1.20.0 numpy==2.2.4
git clone https://github.com/ZhongkuiMa/slimonnx.git
Basic usage:
from slimonnx import SlimONNX, get_preset
slimonnx = SlimONNX()
# Use a preset for your architecture
config = get_preset("vit_2023") # or "acasxu_2023", "cgan_2023", etc.
slimonnx.slim(
"model.onnx",
"model_optimized.onnx",
config=config,
)
Check out the GitHub repository for:
Complete documentation
All 23 VNN-COMP 2024 benchmark presets
Validation and analysis tools
Examples and tutorials
Contribute¶
Contributions are welcome! If you:
Find bugs or numerical issues
Have architecture-specific optimization ideas
Want to add new fusion patterns
Improve documentation
Please open an issue or submit a pull request. The verification community is small but growing—every contribution helps build trustworthy AI.
Final Thoughts¶
Building SlimONNX taught me that trustworthy AI is not just about algorithms—it’s about tooling, community, and relentless attention to correctness. Neural network verification is hard enough without fighting your optimization tools.
If you’re working on making AI systems safe, verifiable, and trustworthy, I hope SlimONNX makes your life a little easier. And if you find ways to make it better, I’d love to hear from you.
Here’s to building AI we can actually trust. 🔧
Comments & Discussion