Verification Benchmarks and Evaluation#

You developed a verification method. How do you know it’s actually better than existing ones? Without standardized benchmarks, claims become meaningless. Different tools might excel on different problem types, time out on specific architectures, or fail silently on edge cases. Understanding how verification tools are evaluated helps you assess claims critically and advance the field systematically.

Why Standardized Benchmarks Matter#

Benchmarks drive progress. When everyone evaluates using the same formats, instances, infrastructure, and scoring, direct comparison becomes possible. Without standardization, research fragments into incomparable claims where each paper uses custom evaluation criteria.

Reproducibility depends on standardization. If a tool is evaluated on a custom dataset with custom metrics and infrastructure, others cannot easily reproduce or verify the results. Standardized benchmarks—with publicly available instances, shared execution platforms, and documented scoring—make research reproducible and trustworthy.

Fairness requires equal conditions. Comparing a GPU-accelerated tool on a high-end GPU against a CPU tool on commodity hardware is meaningless. Standardized evaluation infrastructure—same hardware, same time limits, same precision tolerances—ensures fair comparison based on algorithmic merit rather than engineering choices.

Progress tracking requires consistency. Benchmarks that change every year or use different instances make it impossible to track cumulative progress. Standardized benchmarks allow new tools to be compared against historical baselines, revealing genuine improvements versus minor variations.

Standardization Enables Fair Evaluation

Standard formats (ONNX networks, VNN-LIB specifications), shared execution platforms, and transparent scoring remove excuses and reveal which methods genuinely work across diverse problem types. They’re essential for the field’s credibility.

The Verification Benchmark Landscape#

Two major competitions dominate neural network verification benchmarking, each with distinct focus and value.

VNN-COMP (International Verification of Neural Networks Competition): The primary general-purpose verification competition, held annually as part of the International Symposium on AI Verification (SAIV). VNN-COMP evaluates tools on feedforward networks, recurrent networks, and diverse properties across multiple domains. It uses standardized formats (ONNX networks, VNN-LIB specifications) and equal-cost cloud infrastructure to ensure fair comparison.

ARCH-COMP AINNCS (Artificial Intelligence and Neural Network Control Systems): A specialized competition category focusing on neural networks embedded in cyber-physical systems. While VNN-COMP tests static network analysis, ARCH-COMP tests networks as feedback controllers in closed-loop systems with continuous dynamics. This is essential for safety-critical control applications such as robotics and autonomous vehicles.

Beyond these major competitions, adversarial robustness leaderboards and domain-specific benchmarks (medical imaging, autonomous vehicles, etc.) complement formal verification efforts.

This guide focuses primarily on VNN-COMP as the standard, then covers ARCH-COMP and practical use of benchmarks.

VNN-COMP: The Primary Benchmark#

Competition Objective: VNN-COMP evaluates neural network verification methods on their ability to (1) verify or falsify as many properties as possible within computational budgets, and (2) scale to diverse problem types. The competition measures both precision (correctness of results) and scalability (handling diverse benchmarks), identifying challenges for the community.

Two-Track Structure: VNN-COMP uses a dual-track system to balance generality with specialization:

  • Regular Track: Benchmarks that at least 50% of participating tool authors vote to include. These represent problem types where multiple tools can meaningfully compete.

  • Extended Track: Benchmarks that receive at least one vote but fall short of the regular-track threshold. These showcase specialized capabilities or emerging domains.

  • Unscored: Benchmarks without any tool votes are described in the final report but not scored.

This structure incentivizes tools to support diverse benchmarks (only one tool’s vote is required for extended-track inclusion) while recognizing that not every tool handles every domain.

Key Datasets in VNN-COMP#

MNIST: The simple baseline. Handwritten digits, small images (28×28), low-dimensional input space. MNIST verification is largely solved—virtually all tools handle it. Use it for prototyping and sanity checks.

CIFAR-10: Object classification standard. Larger images (32×32), more realistic than MNIST. CIFAR-10 has become the standard benchmark for comparing verification methods on vision. Most recent methods publish CIFAR-10 results.

ACAS Xu: Collision avoidance benchmark from aviation. Small networks (300 neurons across 6 layers), yet challenging due to rigorously specified safety properties. Important for validating safety-critical systems.

Extended Benchmarks: VNN-COMP regularly includes extended benchmark tracks for emerging domains. These test novel network architectures, control problems, and application-specific properties.

| Benchmark | Characteristics | Verification Difficulty | Primary Use |
| --- | --- | --- | --- |
| ACAS Xu | Collision avoidance (small network, rigorous formal properties) | Hard (despite small size) | Safety-critical aviation systems |
| MNIST | Digit classification (small images, simple task) | Moderate to easy | Baseline comparisons, prototyping |
| CIFAR-10 | Object classification (moderate resolution) | Moderate | Standard vision benchmark |
| Extended Benchmarks | Diverse: control, learning-enabled systems, novel architectures | Varies (often hard) | Emerging application domains |

VNN-COMP Standardized Formats and Infrastructure#

Standardized Formats: VNN-COMP standardizes evaluation using the following (a small example appears after the list):

  • ONNX for neural network models (reproducible across platforms)

  • VNN-LIB for formal specifications (https://www.vnnlib.org/)

  • Standardized output format: plaintext results with witnesses for counterexamples
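As a purely illustrative sketch of these formats, the snippet below writes a tiny hand-crafted VNN-LIB property and loads an ONNX model with the `onnx` Python package; the file names, variable bounds, and the property itself are hypothetical, not taken from any official benchmark.

```python
# Illustrative only: write a tiny, hand-crafted VNN-LIB property and load an
# ONNX model with the onnx package. File names and bounds are hypothetical.
import onnx

VNNLIB_PROPERTY = """\
; Hypothetical local-robustness query on a 2-input, 2-output network.
(declare-const X_0 Real)
(declare-const X_1 Real)
(declare-const Y_0 Real)
(declare-const Y_1 Real)

; Input box: each input may vary by +/- 0.01 around 0.5.
(assert (>= X_0 0.49))
(assert (<= X_0 0.51))
(assert (>= X_1 0.49))
(assert (<= X_1 0.51))

; Unsafe output region: the property is falsified (sat) if Y_1 can reach Y_0.
(assert (>= Y_1 Y_0))
"""

with open("prop_0.vnnlib", "w") as f:
    f.write(VNNLIB_PROPERTY)

model = onnx.load("network.onnx")               # assumes a network.onnx file exists
onnx.checker.check_model(model)                 # structural sanity check
print([inp.name for inp in model.graph.input])  # input tensor names
```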

Equal-Cost Hardware Infrastructure: Tools choose one of three AWS instance types with roughly equivalent hourly costs:

  • CPU: m5.16xlarge (64 vCPU, 256 GB memory)

  • GPU: p3.2xlarge (1x V100 GPU, 8 vCPU, 61 GB memory)

  • Balanced: g5.8xlarge (1x A10G GPU, 32 vCPU, 128 GB memory)

This cost-equivalence approach allows fair comparison between CPU-only and GPU-accelerated tools without penalizing either approach.

VNN-COMP Scoring and Evaluation#

Scoring Mechanism: Each instance is scored based on correctness:

  • Correct result (unsat or sat): 10 points

  • Incorrect result (penalty): -150 points

  • Timeout/Error/Unknown: 0 points

Benchmark score = (tool’s total instance score / highest total achieved by any tool on that benchmark) × 100. This normalization weights each benchmark equally in the overall ranking, regardless of how many instances it contains.
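A minimal sketch of this scoring rule, assuming per-instance results are recorded as strings; the point constants mirror the values above, and the handling of benchmarks where no tool scores positively is an assumption for illustration.

```python
# Sketch of the scoring rule described above (constants mirror the text; the
# handling of benchmarks where no tool scores positively is an assumption).
INSTANCE_POINTS = {"correct": 10, "incorrect": -150}   # timeout/error/unknown -> 0


def raw_score(results):
    """Sum of instance points for one tool on one benchmark."""
    return sum(INSTANCE_POINTS.get(r, 0) for r in results)


def benchmark_scores(results_by_tool):
    """Normalize each tool's raw score by the best raw score on the benchmark."""
    raw = {tool: raw_score(res) for tool, res in results_by_tool.items()}
    best = max(raw.values())
    if best <= 0:
        return {tool: 0.0 for tool in raw}             # no tool scored positively
    return {tool: 100.0 * score / best for tool, score in raw.items()}


print(benchmark_scores({
    "tool_a": ["correct", "correct", "timeout", "incorrect"],   # 10+10+0-150 = -130
    "tool_b": ["correct", "unknown", "correct", "correct"],     # 10+0+10+10  =   30
}))   # tool_b gets 100.0; tool_a gets a negative normalized score
```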

Runtime Constraints:

  • Per-instance timeout: Set by the benchmark proposer (varies by instance difficulty)

  • Per-benchmark ceiling: Total runtime for all instances in a benchmark is capped at 6 hours

This allows benchmarks to have many short instances or fewer long instances while maintaining consistent total evaluation time.

VNN-COMP Evaluation Phases#

  1. Pre-Competition: Tool authors prepare three bash scripts (install, prepare instance, run instance) following standardized interface

  2. Rules Discussion: Community proposes rules changes, voted on by participating tools

  3. Benchmark Solicitation: Benchmarks proposed, tool participants vote on which to score

  4. Unofficial Measurement: Tool authors can test their tool on all benchmarks multiple times with known seed

  5. Official Measurement: Organizers run all tools on all benchmarks with fresh random seed

  6. Reporting: Participants contribute to summary report; discrepancies can be reported and updated

Notable Patterns: Across VNN-COMP iterations, tools built on fast, GPU-accelerated bound propagation (such as alpha-beta-CROWN, which pairs incomplete bounding with branch-and-bound) often rank highly due to their scalability. Purely complete methods answer the instances they finish with certainty but struggle to scale across large benchmarks. Different tools excel on different problem types—this specialization is expected and healthy. The competition reveals where verification research excels and where new challenges remain.

Note

VNN-COMP results, reports, and benchmark repositories are published annually. Following the annual competition provides a current perspective on verification progress, emerging challenges, and tool capabilities.

ARCH-COMP: Benchmarking Control Systems#

While VNN-COMP focuses on feedforward and general neural network verification, the ARCH-COMP competition includes a specialized category for neural networks embedded in control systems.

ARCH-COMP AINNCS (Artificial Intelligence and Neural Network Control Systems): Part of the ARCH (Applied Verification for Continuous and Hybrid Systems) competition, the AINNCS category specifically evaluates formal verification of neural networks used as feedback controllers in closed-loop systems. This is distinct from static network analysis—it tests how tools handle neural networks embedded in cyber-physical systems with continuous dynamics.

Scope: AINNCS benchmarks verify properties of hybrid systems where neural networks act as controllers. This includes stability properties, constraint satisfaction, and safety specifications for systems with both continuous and discrete components. Participating tools include CORA, CROWN-Reach, GoTube, JuliaReach, and NNV.
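To make the closed-loop setting concrete, here is a minimal sketch in plain NumPy of the structure these benchmarks describe: a stand-in neural controller maps the state to a control input, toy continuous dynamics are advanced one step, and a safety constraint must hold along the trajectory. Everything here—the weights, dynamics, and bound—is hypothetical, and real AINNCS tools propagate sets of states through the loop rather than simulating single trajectories.

```python
# Toy closed-loop structure of the kind AINNCS benchmarks verify (illustrative
# only). Real tools propagate *sets* of states through controller and dynamics;
# this sketch merely simulates one trajectory and checks the safety property.
import numpy as np


def controller(state):
    """Stand-in for a trained neural network controller (hypothetical weights)."""
    W = np.array([[-1.0, -0.5]])
    return float(np.tanh(W @ state)[0])     # scalar control input u


def dynamics(state, u, dt=0.05):
    """One Euler step of toy continuous dynamics: x' = v, v' = u."""
    x, v = state
    return np.array([x + dt * v, v + dt * u])


state = np.array([0.8, 0.0])                # one initial state; verification covers a set
for _ in range(100):
    state = dynamics(state, controller(state))
    # The property a verifier must establish for every trajectory from the initial set:
    assert abs(state[0]) <= 1.0, "safety constraint violated"
```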

When to Use ARCH-COMP: Choose ARCH-COMP benchmarks and tools when your application requires verifying neural networks within control loops. This is essential for robotics, autonomous vehicles, industrial control systems, and other cyber-physical applications where the network is part of a feedback system.

Other Specialized Benchmarks: Beyond VNN-COMP and ARCH-COMP, domain-specific benchmarks focus on particular applications. Medical imaging verification benchmarks test diagnostic networks under realistic imaging variations. Autonomous vehicle benchmarks test perception and control networks. These specialized benchmarks drive progress on specific domain challenges.

Evaluation Metrics and Methodology#

Regardless of which competition you follow, several metrics matter when comparing verification tools.

Certified accuracy: For adversarial robustness, what fraction of inputs can actually be certified? Clean accuracy and certified accuracy are often very different—a network with high clean accuracy may have much lower certified accuracy, and the gap reveals how robust the network truly is. Certified accuracy is the meaningful metric for assurance.

Average certified radius: For robustness verification, what perturbation radius can be certified on average? A larger average radius means a stronger guarantee.

Verification time: How long does verification take? Some tools finish in seconds; others take hours. For practical use, time matters.

Memory usage: How much memory is required? Tools that run out of memory on large networks are useless regardless of algorithm quality.

Timeout rate: How often does the tool time out before reaching a conclusion? A higher timeout rate means the tool is less reliable for practical use.

Tightness of bounds: For incomplete methods, how close are the bounds to the truth? Two methods might both say “unknown,” but one might be within a small factor of the true value while the other is far off. Bound tightness affects how useful the “unknown” result is.

\[
\text{Certified Accuracy} = \frac{\#\text{ inputs verified as safe}}{\#\text{ total inputs}}
\qquad
\text{Average Certified Radius} = \frac{1}{n} \sum_{i=1}^{n} r_i
\]

Computing these metrics consistently allows comparing tools fairly.
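A minimal sketch of computing these two metrics from per-input results; the record format (a verified flag plus a certified radius, with unverified inputs contributing a radius of 0) is an assumption for illustration.

```python
# Sketch: compute the two metrics above from per-input results. The record
# format (a verified flag plus a certified radius) is an assumption; unverified
# inputs contribute a radius of 0 here.

def certified_accuracy(results):
    """Fraction of inputs whose property was verified."""
    return sum(1 for r in results if r["verified"]) / len(results)


def average_certified_radius(results):
    """Mean certified radius over all inputs."""
    return sum(r["radius"] if r["verified"] else 0.0 for r in results) / len(results)


results = [
    {"verified": True, "radius": 0.03},
    {"verified": True, "radius": 0.01},
    {"verified": False, "radius": 0.0},
]
print(certified_accuracy(results))         # 0.666...
print(average_certified_radius(results))   # ~0.0133
```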

Using Benchmarks: Comparing Tools Fairly#

When you want to compare verification tools using benchmarks, several best practices ensure fair and meaningful results.

Control variables: Ensure comparisons are on the same hardware, with same time limits, same memory limits. Different hardware or limits make results non-comparable.

Use the same specification format: Different tools might parse specifications differently. Use VNN-LIB or another standard format to avoid misunderstandings.

Run multiple times: Some tools involve randomness. Report averaged results with confidence intervals, not single runs (a small sketch follows these guidelines).

Reasonable time limits: Very short limits (seconds) might prevent good methods from finishing. Very long limits (hours) make comparison impractical. Use limits that are reasonable for your application.

Report all results: Don’t cherry-pick results. If Tool A solves more instances than Tool B on some benchmarks, report both. Don’t ignore Tool B’s strong performance on specific categories.

Context matters: A method might perform better on small networks but worse on large ones. Look at performance across categories, not just overall rank.

Specialization is fine: Some methods specialize in specific problem classes (e.g., ReLU networks only). Specialization isn’t failure—it’s reasonable engineering. Just be honest about scope.
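On the multiple-runs point above, a small sketch of how averaged results might be reported; the runtimes are made up, and the 1.96 factor assumes a normal approximation for the 95% interval.

```python
# Sketch: report a mean runtime with an approximate 95% confidence interval
# (normal approximation) rather than a single measurement. Numbers are made up.
import numpy as np

runtimes = np.array([12.4, 13.1, 11.9, 12.8, 13.5])    # seconds over 5 repeated runs

mean = runtimes.mean()
sem = runtimes.std(ddof=1) / np.sqrt(len(runtimes))    # standard error of the mean
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem

print(f"runtime: {mean:.1f}s (95% CI [{ci_low:.1f}, {ci_high:.1f}])")
```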

Common Comparison Pitfalls

  • Comparing on different hardware (GPU vs CPU)

  • Using different time limits for different tools

  • Reporting only instances where your method excels

  • Using unnecessarily tight specifications to favor your tool

  • Ignoring qualitative differences (verified vs heuristic)

  • Generalizing from one benchmark to all domains

Hint

When reading verification papers, look at the full results table, not just highlighted numbers. Look at performance across categories. If results are reported only on categories where the method excels, and omit categories where it struggles, be skeptical.

Creating Benchmarks: A Guide for Proposers#

For domain-specific evaluation or contribution to competitions, creating new benchmarks requires care. Here’s what benchmark proposers should consider.

Select representative networks: Choose architectures and sizes that represent your domain. Don’t use only toy networks or only production-scale models—use a spectrum that shows meaningful gradation.

Design meaningful properties: Properties should reflect actual requirements, not arbitrary constraints. For medical imaging, properties should relate to diagnostic safety. For autonomous vehicles, to collision avoidance. For control systems, to stability or constraint satisfaction.

Calibrate difficulty: Include easy, medium, and hard instances. If all instances are trivial, the benchmark doesn’t differentiate tools. If all are intractable, it’s useless for evaluation.

Document specifications precisely: Write properties in a standard format (VNN-LIB for VNN-COMP, an appropriate format for your domain). Ensure they’re interpretable by multiple tools. Ambiguous specifications lead to meaningless results.

Make benchmarks reproducible: Use seed-based generation so any verifier can generate the same instances. Provide a generate_properties.py script (for VNN-COMP) that accepts a seed parameter and generates instance lists deterministically.

For Competition Submission (VNN-COMP): Proposals must include the following (a minimal script sketch follows the list):

  • A generate_properties.py script that accepts a seed parameter to generate instances deterministically

  • .onnx files (neural network models)

  • .vnnlib files (formal specifications)

  • Ability to randomly select/generate instances based on seed
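A minimal sketch of what such a generate_properties.py might look like; the instances.csv layout (ONNX file, VNN-LIB file, per-instance timeout per row) follows the common convention, but the helper function, file names, and toy property generation are hypothetical—consult the current competition rules for the exact interface.

```python
#!/usr/bin/env python3
# Minimal sketch of a generate_properties.py script. The instances.csv layout
# (onnx file, vnnlib file, timeout per row) follows common VNN-COMP convention,
# but the toy property generation and file names are hypothetical.
import csv
import os
import random
import sys


def write_property(path, inputs, epsilon, label):
    """Hypothetical helper: emit a 2-class local-robustness property as VNN-LIB."""
    with open(path, "w") as f:
        for i in range(len(inputs)):
            f.write(f"(declare-const X_{i} Real)\n")
        f.write("(declare-const Y_0 Real)\n(declare-const Y_1 Real)\n")
        for i, x in enumerate(inputs):
            f.write(f"(assert (>= X_{i} {x - epsilon}))\n")
            f.write(f"(assert (<= X_{i} {x + epsilon}))\n")
        other = 1 - label
        # Unsafe region: the wrong class can score at least as high as the label.
        f.write(f"(assert (>= Y_{other} Y_{label}))\n")


def main(seed):
    random.seed(seed)                       # deterministic instance generation
    os.makedirs("vnnlib", exist_ok=True)
    rows = []
    for idx in range(10):                   # 10 instances for this toy benchmark
        inputs = [random.random() for _ in range(4)]   # stand-in for real data points
        label = random.randint(0, 1)
        prop = f"vnnlib/prop_{idx}.vnnlib"
        write_property(prop, inputs, epsilon=0.01, label=label)
        rows.append(["onnx/network.onnx", prop, 60])   # model, spec, timeout in seconds
    with open("instances.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    main(int(sys.argv[1]))                  # seed supplied on the command line
```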

Tip

Start small. A benchmark with dozens of instances suffices for initial evaluation. Grow it as needed. Large benchmarks are valuable, but maintaining them requires effort. Test your benchmark on at least one existing verification tool before proposing it.

Interpreting Benchmark Results#

Benchmark numbers don’t tell the whole story. Understanding context is essential.

Generalization concerns: Strong performance on standard benchmarks doesn’t guarantee performance on your custom domain. Benchmarks reflect the community’s focus, which might differ from your needs. Always validate on domain-specific instances.

Beware of overfitting: Tools can be “tuned” to perform well on specific benchmarks while doing poorly elsewhere. If a tool seems suspiciously good on one benchmark but mediocre elsewhere, skepticism is warranted.

Compare across time: VNN-COMP benchmarks provide year-over-year comparison. Looking at how a tool performs over multiple years reveals whether improvements are genuine or stagnant.

Understand tool strategies: Different tools use fundamentally different approaches. Incomplete methods trade completeness for speed. Complete methods sacrifice scalability for certainty. Compare tools that use compatible strategies—comparing a complete SMT solver to an incomplete abstraction method is apples-to-oranges.

Future Benchmarks and Evolution#

Benchmarks continue evolving to meet research advancement and practical needs:

  • Larger networks: Benchmarks increasingly include production-scale networks, pushing scalability limits

  • Diverse architectures: Transformers, GNNs, recurrent networks, and hybrid models test beyond feedforward assumptions

  • Complex properties: Moving beyond robustness to fairness, interpretability, and domain-specific safety constraints

  • Hybrid and continuous systems: Integration with classical control theory and cyber-physical system verification

  • Practical deployment: Benchmarks incorporating real-world constraints like numerical precision, quantization, and hardware limitations

Final Thoughts#

Benchmarks are essential for progress. They enable fair comparison, reproducibility, and clear progress tracking. When evaluating verification tools and methods, use standard benchmarks for baseline comparisons. When creating custom evaluation, be rigorous about specification, control variables, and honest reporting.

Remember: benchmarks measure what they measure. A tool that excels on benchmarks might still fail in your application. Benchmarks are valuable guides, not absolute truth. Use them wisely, understand their limitations, and supplement with domain-specific testing.
