Soundness and Completeness

Your verification tool returns “unknown.” You run it again with a different method and get “verified.” What does this mean for your network’s actual robustness? And more importantly, why can’t verification tools just give you a yes-or-no answer like a test suite?

The answer lies in understanding two fundamental properties that govern all verification methods: soundness and completeness. These aren’t just academic concepts—they determine whether you can trust your verification results, how much computational effort you need to invest, and which tools are appropriate for your use case. This guide explores these formal properties and their practical implications.

Formal Definitions

Before diving into the practical implications, let’s establish the formal mathematical definitions from the verification literature.

Robustness Verification:

A robustness verification approach aims to certify the lower bound of a model’s performance against any adversary under certain constraints (e.g., \(\ell_\infty\)-bounded attack). We can categorize verification approaches along two orthogonal dimensions [Katz et al., 2017, Singh et al., 2019, Weng et al., 2018]:

  1. Complete vs. Incomplete: Whether the method can always reach a definitive conclusion

  2. Deterministic vs. Probabilistic: Whether guarantees are absolute or hold with high probability

Definition: Robustness Verification (Formal)

An algorithm \(\mathcal{A}\) is a robustness verification algorithm if, for any \((x_0, y_0)\), whenever an adversarial example exists (i.e., \(\exists x \in B_{p,\epsilon}(x_0)\) with \(F_\theta(x) \neq y_0\), where \(F_\theta(x)\) denotes the label predicted by the model \(f_\theta\) on \(x\)), then:

  • Deterministic: \(\mathcal{A}(f_\theta, x_0, y_0, \epsilon) = \text{false}\)

  • Probabilistic: \(\Pr[\mathcal{A}(f_\theta, x_0, y_0, \epsilon) = \text{false}] \geq 1 - \alpha\)

where \(\alpha\) is a pre-defined small threshold (e.g., 0.001).

If \(\mathcal{A}(f_\theta, x_0, y_0, \epsilon) = \text{true}\), we say \(\mathcal{A}\) provides robustness certification for model \(f_\theta\) on \((x_0, y_0)\) against \((\ell_p, \epsilon)\)-adversary.
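
To make the contract concrete, here is a minimal Python sketch of the interface implied by this definition; the names (`RobustnessResult`, `certify`) are illustrative only, not a real library API.

```python
from enum import Enum


class RobustnessResult(Enum):
    TRUE = "true"    # certification: no adversarial example within the epsilon-ball
    FALSE = "false"  # the verifier could not rule out an adversary


def certify(model, x0, y0, epsilon, p=float("inf")) -> RobustnessResult:
    """Hypothetical deterministic verifier A(f_theta, x0, y0, epsilon).

    The definition requires: whenever an adversarial example exists in
    B_{p,epsilon}(x0), this function must return FALSE. Consequently,
    returning TRUE constitutes a robustness certification for (x0, y0).
    """
    raise NotImplementedError  # concrete algorithms are discussed below
```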

Complete Verification:

If, whenever the verification returns “not verified” (false), an adversarial example is guaranteed to exist, we call it complete verification. Otherwise, it is incomplete verification. Complete verification provides both soundness (it never wrongly reports “verified”) and completeness (it never wrongly reports “not verified”).

However, complete verification for ReLU networks is NP-complete [Katz et al., 2017, Weng et al., 2018], making it computationally intractable for large networks. This fundamental barrier—the scalability challenge—prevents complete verification from scaling to common DNN sizes.

The Fundamental Trade-off:

As proven in Weng et al. [2018], unless \(\text{NP} = \text{P}\), no polynomial-time verification algorithm can guarantee that its certified robustness radius is within a constant factor of the optimal radius achievable by complete verification. This means the trade-off between scalability and tightness is not just practical—it is theoretical and fundamental.

Soundness: The Guarantee You Can Trust

Imagine your verification tool says “this network is robust.” Can you trust that answer? Not always—it depends on whether the tool is sound.

What Soundness Means

A verification method is sound if, whenever it says “the property holds,” the property actually holds for the real network. More formally:

\[\text{Sound method: If verifier says SAFE, then the network is truly SAFE}\]

Sound methods never produce false positives. If the verifier proves your network is safe, you can bank on it.
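
A useful practical corollary: a point a sound verifier certifies can never be broken by any attack within the same radius. Below is a minimal sketch of an empirical sanity check, assuming hypothetical `verifier` and `attack` callables and a `model.predict` interface.

```python
def soundness_sanity_check(verifier, attack, model, dataset, epsilon):
    """Attack every point the verifier certifies.

    A single successful attack on a certified point disproves soundness.
    Passing the check is only supporting evidence, never a proof.
    """
    for x0, y0 in dataset:
        if verifier(model, x0, y0, epsilon) == "verified":
            x_adv = attack(model, x0, y0, epsilon)  # strongest attack available
            if x_adv is not None and model.predict(x_adv) != y0:
                raise AssertionError("Certified point broken: the verifier is unsound.")
```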

Why Soundness Matters

Safety-critical applications—autonomous vehicles, medical devices, aviation systems—cannot tolerate false positives. A verification tool that occasionally lies about safety is worse than useless; it’s dangerous. Sound methods provide the mathematical guarantee that “if we say it’s safe, it’s actually safe.”

Verification vs Falsification

Adversarial attacks [Carlini and Wagner, 2017, Goodfellow et al., 2015, Madry et al., 2018] are falsifiers, not verifiers. They aim to find counterexamples (adversarial examples) rather than prove their absence. When an attack finds an adversarial example, it's a real counterexample—you can confirm it by running the network on that input. This makes attacks sound for falsification: any bug found is real. However, when an attack fails to find adversarial examples, you've learned nothing about whether they exist—attacks are incomplete falsifiers. A cleverer attack might succeed where yours failed; that reflects incompleteness, not unsoundness.

By contrast, methods like Interval Bound Propagation (IBP) [Gowal et al., 2019] and CROWN [Weng et al., 2018] are sound. When they certify a network as robust, that certification is mathematically valid—no stronger attack will break it within the proven bounds.
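
The asymmetry is easy to see in code. Below is a minimal PGD-style falsifier sketch in NumPy; `model.gradient` and `model.predict` are assumed interfaces rather than a specific library. Whatever it returns has been checked by an actual forward pass, which is why attack findings are always real; returning `None` proves nothing.

```python
import numpy as np


def pgd_falsify(model, x0, y0, epsilon, steps=40):
    """Search for a counterexample in the l_inf ball of radius epsilon around x0."""
    alpha = 2.5 * epsilon / steps                      # step-size heuristic
    x = x0 + np.random.uniform(-epsilon, epsilon, size=x0.shape)
    for _ in range(steps):
        grad = model.gradient(x, y0)                   # gradient of the loss w.r.t. x
        x = np.clip(x + alpha * np.sign(grad),         # ascend the loss, then project
                    x0 - epsilon, x0 + epsilon)        # (data-range clipping omitted)
        if model.predict(x) != y0:                     # confirm the bug by execution
            return x                                   # a real, checkable counterexample
    return None                                        # no conclusion about robustness
```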

The Cost of Soundness

Soundness comes with a price: conservative approximations. Sound methods must account for all possible behaviors in their analysis region, so they typically produce loose bounds. These conservative bounds might say “unknown” for regions where the network is actually safe, but the verifier simply can’t prove it with its approximations.

When Soundness is Non-Negotiable

Use sound verification methods when failures have serious consequences: safety-critical systems, regulatory compliance, certification requirements. Use adversarial attacks when speed matters more than proofs: development iteration, rapid prototyping, finding obvious bugs. Remember: attacks are sound for bug-finding (any bug found is real), just incomplete (may miss bugs) and unable to prove robustness.

Hint

Recognizing verification vs falsification tools: Verifiers (CROWN, Marabou, ERAN) prove properties hold—they’re sound for verification. Falsifiers (adversarial attacks, fuzzing) find bugs—they’re sound for falsification but can’t prove robustness. Read papers carefully—some tools claim verification soundness while using heuristics that violate it.

Completeness: Finding All the Answers

Even if a verification method is sound, it might not always give you a definite answer. Some verifiers can return “unknown”—meaning they couldn’t determine whether the property holds or not.

What Completeness Means

A verification method is complete if it can always determine whether a property holds or not. No “unknown” results—just “safe” or “unsafe.”

\[\text{Complete method: } \forall x \in \text{Input Domain}, \text{ decides SAFE or UNSAFE}\]

Complete methods always reach a conclusion, though they might take arbitrarily long to do so.

Complete vs Incomplete Methods

Complete verification methods include SMT-based solvers (Reluplex [Katz et al., 2017] and its successor Marabou), MILP solvers [Tjeng et al., 2019], and branch-and-bound algorithms with sufficient refinement. These methods can theoretically answer any verification query with certainty, but they may require significant computational effort.
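
To give a flavor of why these methods are exact, MILP-based verifiers [Tjeng et al., 2019] encode each ReLU \(y = \max(x, 0)\) with pre-activation bounds \(l \le x \le u\) exactly, using one binary variable per unstable neuron:

\[y \ge 0, \quad y \ge x, \quad y \le u \cdot a, \quad y \le x - l\,(1 - a), \quad a \in \{0, 1\}\]

Setting \(a = 1\) forces \(y = x\) (active phase) and \(a = 0\) forces \(y = 0\) (inactive phase), so the encoding loses no precision; the binary variables are also exactly what makes solving the resulting program NP-hard.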

Incomplete methods like CROWN [Weng et al., 2018] and DeepPoly [Singh et al., 2019] use approximations that, while sound, can be too conservative to decide on some inputs. When they can’t determine the answer within their approximations, they return “unknown.”

The “Unknown” Result

When an incomplete verifier returns “unknown,” it doesn’t mean the network is unsafe. It means the verifier’s approximations weren’t tight enough to prove safety, and the method isn’t designed to exhaustively search for counterexamples the way complete methods (e.g., SMT solvers) are. Read “unknown” as “I can’t prove safety with my current approximations,” not as evidence that the property fails.

Computational Cost

Completeness is expensive. Complete methods must handle the worst case, and they do so with refinement strategies like branch-and-bound: splitting the input space into smaller regions and analyzing each one recursively. This branching can explore exponentially many regions, making completeness impractical for large networks or large input regions.

Understanding “Unknown” Results

An “unknown” result from an incomplete verifier is not failure. It reflects the inherent limitation of using approximations. Rather than seeing it as “maybe unsafe,” interpret it as “requires more computational resources or a tighter method to determine.”

Property         | Complete Methods                     | Incomplete Methods
-----------------|--------------------------------------|-----------------------------------
Always decides?  | Yes (theoretically)                  | No, may return “unknown”
Sound?           | Yes                                  | Yes
Speed            | Variable, can be very slow           | Fast, predictable
Scalability      | Limited (small networks)             | Good (moderate to large networks)
Examples         | SMT solvers, MILP, branch-and-bound  | CROWN, DeepPoly, IBP

The Four Quadrants: Combining Soundness and Completeness

These two properties are independent—you can be sound without being complete, complete without being sound, or neither. This creates a matrix of verification approaches.

Sound + Complete: The Gold Standard

SMT solvers and MILP-based verifiers occupy this quadrant. If they terminate, their answer is definitive and correct. “Verified” means truly verified; “counterexample found” provides a real adversarial example. The tradeoff: they’re expensive and don’t scale well to large networks. These methods are reserved for final certification of critical systems or small networks.

Sound + Incomplete: The Practical Choice

Most modern neural network verifiers (CROWN, DeepPoly, IBP) live here. They’re sound—if they say “verified,” it’s true—but may return “unknown.” They’re much faster than complete methods, making them practical for larger networks. Most verification research focuses on tightening these methods, shrinking the “unknown” region.

Unsound Verification: Rare Territory

Some approximation methods might sacrifice verification soundness for speed, but this is rarely acceptable—if a verifier might lie about safety, its guarantees are meaningless. Most practical methods prioritize soundness over completeness.

Sound Falsification, Incomplete Coverage: The Attacker’s Domain

Adversarial attacks, fuzzing, and testing are falsifiers. They find bugs rather than prove their absence. Key properties:

  • Sound for falsification: Any bug or counterexample found is real (verifiable by execution)

  • Incomplete: May miss existing bugs

  • Cannot verify: Not finding a bug doesn’t prove robustness

These methods are fast and valuable for development—they find real bugs quickly. The limitation is they can’t prove safety, not that their findings are unreliable.

              | Complete                                                               | Incomplete
--------------|------------------------------------------------------------------------|------------------------------------------------------------------
Sound         | SMT, MILP (gold standard)                                              | CROWN, DeepPoly (most practical)
Falsification | N/A (falsifiers don’t aim for completeness in the verification sense)  | Adversarial attacks (sound for bug-finding, cannot prove safety)

No Free Lunch: Tradeoffs Are Fundamental

You cannot have soundness, completeness, and speed simultaneously for large networks. Practical verification requires choosing priorities: sound + complete + slow, or sound + fast + incomplete. Alternatively, use falsification methods (attacks) which are fast and find real bugs but cannot prove safety. Your application determines which approach to use.

Deterministic vs. Probabilistic Verification

Beyond the sound/complete dichotomy, verification methods also differ in whether they provide deterministic or probabilistic guarantees.

Deterministic Verification:

Traditional verification methods (SMT solvers [Katz et al., 2017], MILP [Tjeng et al., 2019], bound propagation [Singh et al., 2019, Weng et al., 2018]) are deterministic: given the same input, they always produce the same result. When a deterministic verifier says “verified,” the property holds with certainty, not just high probability.

Probabilistic Verification:

Randomized smoothing [Cohen et al., 2019] represents a fundamentally different paradigm. Instead of verifying the original network \(f_\theta\), it verifies a smoothed version \(g(x) = \mathbb{E}_{\delta}[f_\theta(x + \delta)]\), where the noise \(\delta\) is sampled from a fixed distribution (typically Gaussian). The verification guarantee holds with high probability (e.g., 99.9%), where the probability is over the noise samples drawn during certification and is independent of the input.

Probabilistic Certification Guarantee

For randomized smoothing [Cohen et al., 2019], the certification states: “With probability at least \(1 - \alpha\) (e.g., \(\alpha = 0.001\)), the smoothed classifier \(g\) is robust within radius \(\epsilon\) at input \(x_0\).”

This is different from deterministic methods which state: “The network \(f_\theta\) is provably robust within radius \(\epsilon\) at \(x_0\).”
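
A minimal sketch of the Monte Carlo certification step behind this guarantee, following the simplified two-class radius from Cohen et al. [2019]; `counts_top` is assumed to be the number of times the base classifier predicted the top class on `n` noisy copies of \(x_0\).

```python
from scipy.stats import beta, norm


def certify_smoothed(counts_top, n, sigma, alpha=0.001):
    """Certified l2 radius for the smoothed classifier g (Cohen et al., 2019).

    Returns a radius that holds with probability >= 1 - alpha over the
    noise samples, or None (abstain) when certification fails.
    """
    if counts_top == 0:
        return None
    # One-sided Clopper-Pearson lower confidence bound on p_A
    p_lower = beta.ppf(alpha, counts_top, n - counts_top + 1)
    if p_lower <= 0.5:
        return None                       # abstain: cannot certify this input
    return sigma * norm.ppf(p_lower)      # R = sigma * Phi^{-1}(lower bound on p_A)
```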

Trade-offs:

  • Deterministic methods: Stronger guarantees (certainty vs. high probability), but typically worse scalability or looser bounds

  • Probabilistic methods: Better scalability to large networks and larger perturbations, but guarantees hold only with high probability and apply to smoothed classifiers rather than the original network

Importantly, all existing probabilistic verification approaches are incomplete [Cohen et al., 2019]. The sound/complete and deterministic/probabilistic dimensions are orthogonal, creating a richer taxonomy of verification approaches.

Soundness in Practice: Over-Approximation

How do sound methods achieve their guarantees? They rely on over-approximation.

Imagine computing the set of all possible outputs your network can produce given an input region. This “reachable set” is what we want to analyze. But computing it exactly is intractable. Sound methods instead compute an approximation that contains the true set:

\[\text{Approximated Set} \supseteq \text{True Reachable Set}\]

By containing the true set, any property that holds for the approximation holds for the true outputs. This is the foundation of soundness.

Interval Arithmetic as an Example

The simplest sound method is interval arithmetic: track intervals at each neuron. For input interval [l, u], compute output intervals layer by layer. Each neuron’s output is bounded by the minimum and maximum over its possible inputs.

This is sound because the computed intervals always contain the true outputs. But it’s conservative—intervals ignore dependencies between neurons (for example, interval arithmetic bounds \(x - x\) with \(x \in [0, 1]\) by \([-1, 1]\), even though the true value is always 0), so the bounds grow looser with every layer. Eventually, the intervals become so large that they provide no useful information.
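
A minimal sketch of interval propagation through one affine layer followed by a ReLU, in NumPy; `W`, `b`, `l`, `u` are the layer weights, bias, and elementwise input bounds. Stacking this layer by layer is essentially Interval Bound Propagation.

```python
import numpy as np


def interval_affine_relu(W, b, l, u):
    """Propagate elementwise bounds [l, u] through y = ReLU(W x + b).

    Sound by construction: every reachable output lies inside the
    returned intervals, but the intervals may be much wider than the
    true reachable set (the approximation is conservative).
    """
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    lower = W_pos @ l + W_neg @ u + b    # smallest pre-activation per neuron
    upper = W_pos @ u + W_neg @ l + b    # largest pre-activation per neuron
    return np.maximum(lower, 0), np.maximum(upper, 0)   # ReLU is monotone
```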

Tightness vs Soundness

Sound methods must over-approximate, so all sound methods are conservative to some degree. But they vary in how conservative they are. Tight methods produce approximations close to the true set (good for proving robustness). Loose methods are very conservative (might say “unknown” when safety actually holds).

Importantly, tightness and soundness are orthogonal properties. A method can be:

  • Sound and tight (CROWN on favorable networks)

  • Sound and loose (interval arithmetic)

  • A sound and effective falsifier (adversarial attacks that cleverly find real weaknesses)

Tip

When choosing between sound methods, prioritize tightness. A sound but loose verifier may waste time returning “unknown” for actually-safe regions. Tighter methods (CROWN) are preferable to looser methods (interval arithmetic) when computational budget allows.

Completeness in Practice: Refinement and Branching

Complete methods guarantee they’ll reach a definitive answer, but how? Through refinement strategies.

Branch and Bound

The classic approach is branch-and-bound: recursively split the input domain into smaller regions. For each region, apply an incomplete method (like CROWN) to compute sound bounds on the network’s outputs—for instance, a lower bound on the margin between the correct class and every other class. If the bound proves the property holds, that region is done. If not, split it further and recurse.

Eventually, either all regions are verified (the property holds globally) or the search narrows in on a region small enough to exhibit a concrete counterexample (the property fails). This exhaustive exploration is what guarantees completeness.
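
A minimal input-splitting branch-and-bound sketch, assuming a hypothetical `lower_bound_margin(model, region)` that soundly lower-bounds the margin between the correct class and all others over an input box (e.g., via IBP or CROWN), and a `find_counterexample(model, region)` attack helper.

```python
def branch_and_bound(model, box, lower_bound_margin, find_counterexample,
                     max_regions=10_000):
    """Verify a property by recursively splitting the input box.

    box is a list of (low, high) pairs, one per input dimension.
    Returns "verified", "falsified", or "unknown" if the budget runs out.
    """
    queue, explored = [box], 0
    while queue:
        region = queue.pop()
        explored += 1
        if explored > max_regions:
            return "unknown"                       # practical incompleteness: budget hit
        if lower_bound_margin(model, region) > 0:
            continue                               # margin provably positive: region done
        if find_counterexample(model, region) is not None:
            return "falsified"                     # concrete adversarial example found
        # Bound too loose and no counterexample yet: split the widest dimension
        widths = [high - low for (low, high) in region]
        d = widths.index(max(widths))
        low, high = region[d]
        mid = (low + high) / 2.0
        left, right = list(region), list(region)
        left[d], right[d] = (low, mid), (mid, high)
        queue += [left, right]
    return "verified"
```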

Abstraction Refinement

SMT-based verifiers such as Reluplex use a related strategy: lazy refinement. They start from a relaxed encoding of the network that treats ReLU constraints coarsely, attempt to verify against it, and refine the encoding—for example, by case-splitting on a neuron’s activation phase—only when the relaxation cannot decide the query. They iterate until they either prove the property or find a real counterexample.

Why Completeness is Expensive

The problem is exponential blow-up. With branch-and-bound, the recursion tree can grow exponentially in depth and branching factor. For a modestly sized network and input region, the number of regions to analyze can become astronomical.

Practical Completeness: Timeouts

In practice, “complete” methods hit computational limits. They might have a timeout or memory limit. When the limit is reached, they might return “unknown” (making them incomplete in practice) or keep the best answer found so far. True completeness is theoretically guaranteed but practically limited by computational resources.

Hint

Only invest in complete verification when necessary: small networks, small input regions, or when certification is mandatory. For most development and testing, incomplete sound methods suffice and are far more practical.

Real-World Implications: Choosing Your Guarantees

How do these properties affect your actual verification workflow?

Safety-Critical Systems

Autonomous vehicles, medical devices, aviation systems—these require sound methods. The consequences of false positives (reporting something as safe when it’s not) are unacceptable. Whether you need completeness depends on regulations and liability. Often, sound + incomplete is sufficient if the certified regions are large enough.

Development and Research

During model development, adversarial attacks are perfectly appropriate. They quickly find real bugs and guide improvements. The lack of verification guarantees doesn’t matter because you’re not relying on attacks to certify the final product—you’re using them to find and fix problems. Any bug found by an attack is real and actionable.

Regulatory Compliance and Certification

Some regulations (aviation, medical) may require formal guarantees. Sound verification is necessary. Completeness may be required depending on the standard, though practical incomplete methods increasingly meet regulatory requirements when the verified regions are comprehensive enough.

Performance Tradeoffs

Your choice of soundness and completeness is really a choice about tradeoff priorities:

  • Need fast bug-finding? Use adversarial attacks (fast, finds real bugs, but can’t prove safety)

  • Need practical robustness without absolute guarantees? Use sound, incomplete methods (CROWN, DeepPoly)

  • Need certification for critical systems? Use sound, complete methods (SMT solvers, though slower)
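
One common way to operationalize these priorities is a cheapest-first pipeline: try a fast attack, then a sound incomplete verifier, and only escalate remaining “unknown” cases to a complete method. A sketch, with all three tools as assumed callables:

```python
def layered_check(model, x0, y0, epsilon, attack, incomplete_verifier,
                  complete_verifier=None):
    """Cheapest-first robustness check for a single input.

    attack              -> returns an adversarial example or None
    incomplete_verifier -> returns "verified" or "unknown" (sound)
    complete_verifier   -> optional; slow but decides every query it finishes
    """
    if attack(model, x0, y0, epsilon) is not None:
        return "falsified"                 # real counterexample, no proof needed
    if incomplete_verifier(model, x0, y0, epsilon) == "verified":
        return "verified"                  # sound certificate
    if complete_verifier is not None:
        return complete_verifier(model, x0, y0, epsilon)  # may be very slow
    return "unknown"                       # not evidence of unsafety
```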

Scenario                      | Soundness    | Completeness   | Reasoning
------------------------------|--------------|----------------|----------------------------------------------
Research/development          | Not required | Not required   | Speed and iteration matter more
Production robustness         | Required     | Not required   | Need guarantees but “unknown” is acceptable
Safety-critical certification | Required     | Often required | Regulations may demand completeness
Regulatory compliance         | Required     | Case-by-case   | Depends on specific standards

Common Misconceptions

Misconception 1: “Unknown means unsafe”

False. “Unknown” from an incomplete verifier means the verifier couldn’t prove safety with its approximations. The network might be perfectly safe. It just means you need tighter bounds, a smaller input region, or a complete method to determine the answer.

Misconception 2: “Sound methods are always better”

Not always. Sound verification methods are necessary when you need safety proofs, but adversarial attacks are faster for finding bugs. For development iteration, attacks quickly find real issues that verification methods would take hours to process—and any bug found is actionable.

Misconception 3: “Complete methods always terminate quickly”

No. “Complete” means they’ll eventually reach a conclusion, but that conclusion might take unbounded time. In practice, timeouts make them incomplete, or they return the best answer found so far.

Misconception 4: “Tight bounds are unsound”

These properties are independent. A method can be both sound and tight (good!), or sound and loose (conservative). Tightness measures approximation quality; soundness measures correctness.

Misconception 5: “All verifiers are sound”

Many tools make unsound approximations or use unsound-by-default methods. Some papers claim soundness while using heuristics that violate it. Always check whether the tool’s guarantees are proven or just claimed.

Critical Thinking About Tool Claims

When evaluating verification tools, ask: Is soundness proven mathematically, or just claimed? Does completeness mean theoretical completeness (eventually, with infinite time) or practical completeness (under stated assumptions)? Does “unknown” mean the method is incomplete, or does the method not return “unknown” at all? The answers determine whether you can trust the tool for critical applications.

Final Thoughts

Soundness and completeness are orthogonal properties that define the guarantees verification methods provide. Understanding them helps you choose appropriate tools for your use case:

  • Sound + Complete (SMT solvers): Highest assurance, highest cost

  • Sound + Incomplete (CROWN, DeepPoly): Practical verification, most common

  • Falsification methods (Adversarial attacks): Fastest for bug-finding, sound when bugs found, cannot prove safety

  • Unsound verification (rare): Methods that might give false “verified” results—avoid for safety-critical applications

The uncomfortable truth: for large networks, you cannot have soundness, completeness, and speed simultaneously. You must prioritize. Safety-critical applications prioritize soundness. Research prioritizes speed. Practical systems prioritize the balance between sound guarantees and computational feasibility.

The field is actively developing better algorithms. Tighter bounds for incomplete methods reduce “unknown” results. Faster complete methods make exhaustive verification more practical. The gap between sound and complete narrows gradually, but the fundamental tradeoffs remain.

Understand these properties, choose tools accordingly, and your verification efforts will be far more effective and trustworthy.

Further Reading

This guide provides a comprehensive overview of soundness and completeness in neural network verification. For readers interested in diving deeper, we recommend the following resources organized by topic:

Foundational Theory:

The theoretical foundations of soundness in verification trace back to abstract interpretation. The fundamental definitions and guarantees are established in the verification literature [Katz et al., 2017, Weng et al., 2018].

Complete Verification Methods:

Complete SMT-based verification is exemplified by Reluplex [Katz et al., 2017] and its successor Marabou. MILP-based approaches [Tjeng et al., 2019] provide an alternative complete verification strategy. Recent advances in complete verification combine branch-and-bound with tight incomplete methods for improved scalability.

Incomplete Verification Methods:

Modern incomplete but scalable verification methods include CROWN [Weng et al., 2018], DeepPoly [Singh et al., 2019], and IBP [Gowal et al., 2019]. These methods sacrifice completeness for computational efficiency, making them practical for larger networks.

Probabilistic Verification:

Randomized smoothing [Cohen et al., 2019] provides a fundamentally different approach, offering probabilistic certification guarantees with better scalability to large perturbations.

Complexity and Theoretical Barriers:

The NP-completeness of neural network verification [Katz et al., 2017, Weng et al., 2018] establishes fundamental limits on what can be achieved efficiently. Understanding these theoretical barriers helps set realistic expectations for verification tools.

Related Topics:

For practical implementation details, see Bound Propagation Approaches and Robustness Testing Guide. For understanding the broader context of certified robustness, see Certified Defenses and Randomized Smoothing and Training Robust Networks.

Next Guide

Continue to Verification Taxonomy: A Systematic View to explore the different categories and approaches in neural network verification methods.

[1] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), 39–57. IEEE, 2017.

[2] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning. 2019.

[3] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations. 2015.

[4] Sven Gowal, Krishnamurthy Dj Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. Scalable verified training for provably robust image classification. In Proceedings of the IEEE International Conference on Computer Vision, 4842–4851. 2019.

[5] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: an efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, 97–117. Springer, 2017.

[6] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations. 2018.

[7] Gagandeep Singh, Rupanshu Ganvir, Markus Püschel, and Martin Vechev. Beyond the single neuron convex barrier for neural network certification. In Advances in Neural Information Processing Systems, 15072–15083. 2019.

[8] Vincent Tjeng, Kai Y. Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations. 2019.

[9] Lily Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane Boning, and Inderjit Dhillon. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning, 5276–5285. 2018.