Soundness and Completeness

Your verification tool returns “unknown.” You run it again with a different method and get “verified.” What does this mean for your network’s actual robustness? And more importantly, why can’t verification tools just give you a yes-or-no answer like a test suite?

The answer lies in understanding two fundamental properties that govern all verification methods: soundness and completeness. These aren’t just academic concepts—they determine whether you can trust your verification results, how much computational effort you need to invest, and which tools are appropriate for your use case. This guide explores these formal properties and their practical implications.

Formal Definitions

Before diving into the practical implications, let’s establish the formal mathematical definitions from the verification literature.

Robustness Verification:

A robustness verification approach aims to certify the lower bound of a model’s performance against any adversary under certain constraints (e.g., \(\ell_\infty\)-bounded attack). We can categorize verification approaches along two orthogonal dimensions [Katz et al., 2017, Singh et al., 2019, Weng et al., 2018]:

  1. Complete vs. Incomplete: Whether the method can always reach a definitive conclusion

  2. Deterministic vs. Probabilistic: Whether guarantees are absolute or hold with high probability

Definition: Robustness Verification (Formal)

An algorithm \(\mathcal{A}\) is a robustness verification algorithm if, for any \((x_0, y_0)\), whenever an adversarial example exists (i.e., \(\exists x \in B_{p,\epsilon}(x_0)\) with \(F_\theta(x) \neq y_0\), where \(F_\theta(x)\) denotes the label predicted by the model \(f_\theta\) on \(x\)), then:

  • Deterministic: \(\mathcal{A}(f_\theta, x_0, y_0, \epsilon) = \text{false}\)

  • Probabilistic: \(\Pr[\mathcal{A}(f_\theta, x_0, y_0, \epsilon) = \text{false}] \geq 1 - \alpha\)

where \(\alpha\) is a pre-defined small threshold (e.g., 0.001).

If \(\mathcal{A}(f_\theta, x_0, y_0, \epsilon) = \text{true}\), we say \(\mathcal{A}\) provides robustness certification for model \(f_\theta\) on \((x_0, y_0)\) against \((\ell_p, \epsilon)\)-adversary.
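
To make the contract concrete, here is a minimal Python sketch of the interface implied by this definition; the names (`RobustnessResult`, `certify`) are illustrative only, not a real library API.

```python
from enum import Enum


class RobustnessResult(Enum):
    TRUE = "true"    # certification: no adversarial example within the epsilon-ball
    FALSE = "false"  # the verifier could not rule out an adversary


def certify(model, x0, y0, epsilon, p=float("inf")) -> RobustnessResult:
    """Hypothetical deterministic verifier A(f_theta, x0, y0, epsilon).

    The definition requires: whenever an adversarial example exists in
    B_{p,epsilon}(x0), this function must return FALSE. Consequently,
    returning TRUE constitutes a robustness certification for (x0, y0).
    """
    raise NotImplementedError  # concrete algorithms are discussed below
```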

Complete Verification:

If, whenever the verification returns “not verified” (false), an adversarial example is guaranteed to exist, we call it complete verification. Otherwise, it is incomplete verification. Complete verification provides both soundness (it never wrongly reports “verified”) and completeness (it never wrongly reports “not verified”).

However, complete verification for ReLU networks is NP-complete [Katz et al., 2017, Weng et al., 2018], making it computationally intractable for large networks. This fundamental barrier—the scalability challenge—prevents complete verification from scaling to common DNN sizes.

The Fundamental Trade-off:

As proven in Weng et al. [2018], unless \(\text{NP} = \text{P}\), no polynomial-time verification algorithm can guarantee that its certified robustness radius is within a constant factor of the optimal radius achievable by complete verification. This means the trade-off between scalability and tightness is not just practical—it is theoretical and fundamental.

Soundness: The Guarantee You Can Trust

Imagine your verification tool says “this network is robust.” Can you trust that answer? Not always—it depends on whether the tool is sound.

What Soundness Means

A verification method is sound if, whenever it says “the property holds,” the property actually holds for the real network. More formally:

\[\text{Sound method: If verifier says SAFE, then the network is truly SAFE}\]

Sound methods never produce false positives. If the verifier proves your network is safe, you can bank on it.
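
A useful practical corollary: a point a sound verifier certifies can never be broken by any attack within the same radius. Below is a minimal sketch of an empirical sanity check, assuming hypothetical `verifier` and `attack` callables and a `model.predict` interface.

```python
def soundness_sanity_check(verifier, attack, model, dataset, epsilon):
    """Attack every point the verifier certifies.

    A single successful attack on a certified point disproves soundness.
    Passing the check is only supporting evidence, never a proof.
    """
    for x0, y0 in dataset:
        if verifier(model, x0, y0, epsilon) == "verified":
            x_adv = attack(model, x0, y0, epsilon)  # strongest attack available
            if x_adv is not None and model.predict(x_adv) != y0:
                raise AssertionError("Certified point broken: the verifier is unsound.")
```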

Why Soundness Matters

Safety-critical applications—autonomous vehicles, medical devices, aviation systems—cannot tolerate false positives. A verification tool that occasionally lies about safety is worse than useless; it’s dangerous. Sound methods provide the mathematical guarantee that “if we say it’s safe, it’s actually safe.”

Verification vs Falsification

Adversarial attacks [Carlini and Wagner, 2017, Goodfellow et al., 2015, Madry et al., 2018] are falsifiers, not verifiers. They aim to find counterexamples (adversarial examples) rather than prove their absence. When an attack finds an adversarial example, it's a real counterexample—you can confirm it by running the network on that input. This makes attacks sound for falsification: any bug found is real. However, when an attack fails to find adversarial examples, you've learned nothing about whether they exist—attacks are incomplete falsifiers. A cleverer attack might succeed where yours failed; that reflects incompleteness, not unsoundness.

By contrast, methods like Interval Bound Propagation (IBP) [Gowal et al., 2019] and CROWN [Weng et al., 2018] are sound. When they certify a network as robust, that certification is mathematically valid—no stronger attack will break it within the proven bounds.
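
The asymmetry is easy to see in code. Below is a minimal PGD-style falsifier sketch in NumPy; `model.gradient` and `model.predict` are assumed interfaces rather than a specific library. Whatever it returns has been checked by an actual forward pass, which is why attack findings are always real; returning `None` proves nothing.

```python
import numpy as np


def pgd_falsify(model, x0, y0, epsilon, steps=40):
    """Search for a counterexample in the l_inf ball of radius epsilon around x0."""
    alpha = 2.5 * epsilon / steps                      # step-size heuristic
    x = x0 + np.random.uniform(-epsilon, epsilon, size=x0.shape)
    for _ in range(steps):
        grad = model.gradient(x, y0)                   # gradient of the loss w.r.t. x
        x = np.clip(x + alpha * np.sign(grad),         # ascend the loss, then project
                    x0 - epsilon, x0 + epsilon)        # (data-range clipping omitted)
        if model.predict(x) != y0:                     # confirm the bug by execution
            return x                                   # a real, checkable counterexample
    return None                                        # no conclusion about robustness
```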

The Cost of Soundness

Soundness comes with a price: conservative approximations. Sound methods must account for all possible behaviors in their analysis region, so they typically produce loose bounds. These conservative bounds might say “unknown” for regions where the network is actually safe, but the verifier simply can’t prove it with its approximations.

When Soundness is Non-Negotiable

Use sound verification methods when failures have serious consequences: safety-critical systems, regulatory compliance, certification requirements. Use adversarial attacks when speed matters more than proofs: development iteration, rapid prototyping, finding obvious bugs. Remember: attacks are sound for bug-finding (any bug found is real), just incomplete (may miss bugs) and unable to prove robustness.

Hint

Recognizing verification vs falsification tools: Verifiers (CROWN, Marabou, ERAN) prove properties hold—they’re sound for verification. Falsifiers (adversarial attacks, fuzzing) find bugs—they’re sound for falsification but can’t prove robustness. Read papers carefully—some tools claim verification soundness while using heuristics that violate it.

Completeness: Finding All the Answers

Even if a verification method is sound, it might not always give you a definite answer. Some verifiers can return “unknown”—meaning they couldn’t determine whether the property holds or not.

What Completeness Means

A verification method is complete if it can always determine whether a property holds or not. No “unknown” results—just “safe” or “unsafe.”

\[\text{Complete method: } \forall x \in \text{Input Domain}, \text{ decides SAFE or UNSAFE}\]

Complete methods always reach a conclusion, though they might take arbitrarily long to do so.

Complete vs Incomplete Methods

Complete verification methods include SMT-based solvers (Reluplex [Katz et al., 2017] and its successor Marabou), MILP solvers [Tjeng et al., 2019], and branch-and-bound algorithms with sufficient refinement. These methods can theoretically answer any verification query with certainty, but they may require significant computational effort.
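
To give a flavor of why these methods are exact, MILP-based verifiers [Tjeng et al., 2019] encode each ReLU \(y = \max(x, 0)\) with pre-activation bounds \(l \le x \le u\) exactly, using one binary variable per unstable neuron:

\[y \ge 0, \quad y \ge x, \quad y \le u \cdot a, \quad y \le x - l\,(1 - a), \quad a \in \{0, 1\}\]

Setting \(a = 1\) forces \(y = x\) (active phase) and \(a = 0\) forces \(y = 0\) (inactive phase), so the encoding loses no precision; the binary variables are also exactly what makes solving the resulting program NP-hard.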

Incomplete methods like CROWN [Weng et al., 2018] and DeepPoly [Singh et al., 2019] use approximations that, while sound, can be too conservative to decide on some inputs. When they can’t determine the answer within their approximations, they return “unknown.”

The “Unknown” Result

When an incomplete verifier returns “unknown,” it doesn’t mean the network is unsafe. It means the verifier’s approximations weren’t tight enough to prove safety, and the method isn’t designed to exhaustively search for counterexamples the way complete methods (e.g., SMT solvers) are. Read “unknown” as “I can’t prove safety with my current approximations,” not as evidence that the property fails.

Computational Cost

Completeness is expensive. Complete methods must handle the worst case, and they do so with refinement strategies like branch-and-bound: splitting the input space into smaller regions and analyzing each one recursively. This branching can explore exponentially many regions, making completeness impractical for large networks or large input regions.

Understanding “Unknown” Results

An “unknown” result from an incomplete verifier is not failure. It reflects the inherent limitation of using approximations. Rather than seeing it as “maybe unsafe,” interpret it as “requires more computational resources or a tighter method to determine.”

Property         | Complete Methods                     | Incomplete Methods
-----------------|--------------------------------------|-----------------------------------
Always decides?  | Yes (theoretically)                  | No, may return “unknown”
Sound?           | Yes                                  | Yes
Speed            | Variable, can be very slow           | Fast, predictable
Scalability      | Limited (small networks)             | Good (moderate to large networks)
Examples         | SMT solvers, MILP, branch-and-bound  | CROWN, DeepPoly, IBP

The Four Quadrants: Combining Soundness and Completeness

These two properties are independent—you can be sound without being complete, complete without being sound, or neither. This creates a matrix of verification approaches.

Sound + Complete: The Gold Standard

SMT solvers and MILP-based verifiers occupy this quadrant. If they terminate, their answer is definitive and correct. “Verified” means truly verified; “counterexample found” provides a real adversarial example. The tradeoff: they’re expensive and don’t scale well to large networks. These methods are reserved for final certification of critical systems or small networks.

Sound + Incomplete: The Practical Choice

Most modern neural network verifiers (CROWN, DeepPoly, IBP) live here. They’re sound—if they say “verified,” it’s true—but may return “unknown.” They’re much faster than complete methods, making them practical for larger networks. Most verification research focuses on tightening these methods, shrinking the “unknown” region.

Unsound Verification: Rare Territory

Some approximation methods might sacrifice verification soundness for speed, but this is rarely acceptable—if a verifier might lie about safety, its guarantees are meaningless. Most practical methods prioritize soundness over completeness.

Sound Falsification, Incomplete Coverage: The Attacker’s Domain

Adversarial attacks, fuzzing, and testing are falsifiers. They find bugs rather than prove their absence. Key properties:

  • Sound for falsification: Any bug or counterexample found is real (verifiable by execution)

  • Incomplete: May miss existing bugs

  • Cannot verify: Not finding a bug doesn’t prove robustness

These methods are fast and valuable for development—they find real bugs quickly. The limitation is they can’t prove safety, not that their findings are unreliable.

              | Complete                                                               | Incomplete
--------------|------------------------------------------------------------------------|------------------------------------------------------------------
Sound         | SMT, MILP (gold standard)                                              | CROWN, DeepPoly (most practical)
Falsification | N/A (falsifiers don’t aim for completeness in the verification sense)  | Adversarial attacks (sound for bug-finding, cannot prove safety)

No Free Lunch: Tradeoffs Are Fundamental

You cannot have soundness, completeness, and speed simultaneously for large networks. Practical verification requires choosing priorities: sound + complete + slow, or sound + fast + incomplete. Alternatively, use falsification methods (attacks) which are fast and find real bugs but cannot prove safety. Your application determines which approach to use.

Deterministic vs. Probabilistic Verification

Beyond the sound/complete dichotomy, verification methods also differ in whether they provide deterministic or probabilistic guarantees.

Deterministic Verification:

Traditional verification methods (SMT solvers [Katz et al., 2017], MILP [Tjeng et al., 2019], bound propagation [Singh et al., 2019, Weng et al., 2018]) are deterministic: given the same input, they always produce the same result. When a deterministic verifier says “verified,” the property holds with certainty, not just high probability.

Probabilistic Verification:

Randomized smoothing [Cohen et al., 2019] represents a fundamentally different paradigm. Instead of verifying the original network \(f_\theta\), it verifies a smoothed version \(g(x) = \mathbb{E}_{\delta}[f_\theta(x + \delta)]\), where the noise \(\delta\) is sampled from a fixed distribution (typically Gaussian). The verification guarantee holds with high probability (e.g., 99.9%), where the probability is over the noise samples drawn during certification and is independent of the input.

Probabilistic Certification Guarantee

For randomized smoothing [Cohen et al., 2019], the certification states: “With probability at least \(1 - \alpha\) (e.g., \(\alpha = 0.001\)), the smoothed classifier \(g\) is robust within radius \(\epsilon\) at input \(x_0\).”

This is different from deterministic methods which state: “The network \(f_\theta\) is provably robust within radius \(\epsilon\) at \(x_0\).”
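
A minimal sketch of the Monte Carlo certification step behind this guarantee, following the simplified two-class radius from Cohen et al. [2019]; `counts_top` is assumed to be the number of times the base classifier predicted the top class on `n` noisy copies of \(x_0\).

```python
from scipy.stats import beta, norm


def certify_smoothed(counts_top, n, sigma, alpha=0.001):
    """Certified l2 radius for the smoothed classifier g (Cohen et al., 2019).

    Returns a radius that holds with probability >= 1 - alpha over the
    noise samples, or None (abstain) when certification fails.
    """
    if counts_top == 0:
        return None
    # One-sided Clopper-Pearson lower confidence bound on p_A
    p_lower = beta.ppf(alpha, counts_top, n - counts_top + 1)
    if p_lower <= 0.5:
        return None                       # abstain: cannot certify this input
    return sigma * norm.ppf(p_lower)      # R = sigma * Phi^{-1}(lower bound on p_A)
```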

Trade-offs:

  • Deterministic methods: Stronger guarantees (certainty vs. high probability), but typically worse scalability or looser bounds

  • Probabilistic methods: Better scalability to large networks and larger perturbations, but guarantees hold only with high probability and apply to smoothed classifiers rather than the original network

Importantly, all existing probabilistic verification approaches are incomplete [Cohen et al., 2019]. The sound/complete and deterministic/probabilistic dimensions are orthogonal, creating a richer taxonomy of verification approaches.

Soundness in Practice: Over-Approximation

How do sound methods achieve their guarantees? They rely on over-approximation.

Imagine computing the set of all possible outputs your network can produce given an input region. This “reachable set” is what we want to analyze. But computing it exactly is intractable. Sound methods instead compute an approximation that contains the true set:

\[\text{Approximated Set} \supseteq \text{True Reachable Set}\]

By containing the true set, any property that holds for the approximation holds for the true outputs. This is the foundation of soundness.

Interval Arithmetic as an Example

The simplest sound method is interval arithmetic: track intervals at each neuron. For input interval [l, u], compute output intervals layer by layer. Each neuron’s output is bounded by the minimum and maximum over its possible inputs.

This is sound because the computed intervals always contain the true outputs. But it’s conservative—intervals ignore dependencies between neurons (for example, interval arithmetic bounds \(x - x\) with \(x \in [0, 1]\) by \([-1, 1]\), even though the true value is always 0), so the bounds grow looser with every layer. Eventually, the intervals become so large that they provide no useful information.
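
A minimal sketch of interval propagation through one affine layer followed by a ReLU, in NumPy; `W`, `b`, `l`, `u` are the layer weights, bias, and elementwise input bounds. Stacking this layer by layer is essentially Interval Bound Propagation.

```python
import numpy as np


def interval_affine_relu(W, b, l, u):
    """Propagate elementwise bounds [l, u] through y = ReLU(W x + b).

    Sound by construction: every reachable output lies inside the
    returned intervals, but the intervals may be much wider than the
    true reachable set (the approximation is conservative).
    """
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    lower = W_pos @ l + W_neg @ u + b    # smallest pre-activation per neuron
    upper = W_pos @ u + W_neg @ l + b    # largest pre-activation per neuron
    return np.maximum(lower, 0), np.maximum(upper, 0)   # ReLU is monotone
```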

Tightness vs Soundness

Sound methods must over-approximate, so all sound methods are conservative to some degree. But they vary in how conservative they are. Tight methods produce approximations close to the true set (good for proving robustness). Loose methods are very conservative (might say “unknown” when safety actually holds).

Importantly, tightness and soundness are orthogonal properties. A method can be:

  • Sound and tight (CROWN on favorable networks)

  • Sound and loose (interval arithmetic)

  • A sound and effective falsifier (adversarial attacks that cleverly find real weaknesses)

Tip

When choosing between sound methods, prioritize tightness. A sound but loose verifier may waste time returning “unknown” for actually-safe regions. Tighter methods (CROWN) are preferable to looser methods (interval arithmetic) when computational budget allows.

Completeness in Practice: Refinement and Branching

Complete methods guarantee they’ll reach a definitive answer, but how? Through refinement strategies.

Branch and Bound

The classic approach is branch-and-bound: recursively split the input domain into smaller regions. For each region, apply an incomplete method (like CROWN) to compute sound bounds on the network’s outputs—for instance, a lower bound on the margin between the correct class and every other class. If the bound proves the property holds, that region is done. If not, split it further and recurse.

Eventually, either all regions are verified (the property holds globally) or the search narrows in on a region small enough to exhibit a concrete counterexample (the property fails). This exhaustive exploration is what guarantees completeness.
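
A minimal input-splitting branch-and-bound sketch, assuming a hypothetical `lower_bound_margin(model, region)` that soundly lower-bounds the margin between the correct class and all others over an input box (e.g., via IBP or CROWN), and a `find_counterexample(model, region)` attack helper.

```python
def branch_and_bound(model, box, lower_bound_margin, find_counterexample,
                     max_regions=10_000):
    """Verify a property by recursively splitting the input box.

    box is a list of (low, high) pairs, one per input dimension.
    Returns "verified", "falsified", or "unknown" if the budget runs out.
    """
    queue, explored = [box], 0
    while queue:
        region = queue.pop()
        explored += 1
        if explored > max_regions:
            return "unknown"                       # practical incompleteness: budget hit
        if lower_bound_margin(model, region) > 0:
            continue                               # margin provably positive: region done
        if find_counterexample(model, region) is not None:
            return "falsified"                     # concrete adversarial example found
        # Bound too loose and no counterexample yet: split the widest dimension
        widths = [high - low for (low, high) in region]
        d = widths.index(max(widths))
        low, high = region[d]
        mid = (low + high) / 2.0
        left, right = list(region), list(region)
        left[d], right[d] = (low, mid), (mid, high)
        queue += [left, right]
    return "verified"
```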

Abstraction Refinement

SMT-based verifiers such as Reluplex use a related strategy: lazy refinement. They start from a relaxed encoding of the network that treats ReLU constraints coarsely, attempt to verify against it, and refine the encoding—for example, by case-splitting on a neuron’s activation phase—only when the relaxation cannot decide the query. They iterate until they either prove the property or find a real counterexample.

Why Completeness is Expensive

The problem is exponential blow-up. With branch-and-bound, the recursion tree can grow exponentially in depth and branching factor. For a modestly sized network and input region, the number of regions to analyze can become astronomical.

Practical Completeness: Timeouts

In practice, “complete” methods hit computational limits. They might have a timeout or memory limit. When the limit is reached, they might return “unknown” (making them incomplete in practice) or keep the best answer found so far. True completeness is theoretically guaranteed but practically limited by computational resources.

Hint

Only invest in complete verification when necessary: small networks, small input regions, or when certification is mandatory. For most development and testing, incomplete sound methods suffice and are far more practical.

Real-World Implications: Choosing Your Guarantees

How do these properties affect your actual verification workflow?

Safety-Critical Systems

Autonomous vehicles, medical devices, aviation systems—these require sound methods. The consequences of false positives (reporting something as safe when it’s not) are unacceptable. Whether you need completeness depends on regulations and liability. Often, sound + incomplete is sufficient if the certified regions are large enough.

Development and Research

During model development, adversarial attacks are perfectly appropriate. They quickly find real bugs and guide improvements. The lack of verification guarantees doesn’t matter because you’re not relying on attacks to certify the final product—you’re using them to find and fix problems. Any bug found by an attack is real and actionable.

Regulatory Compliance and Certification

Some regulations (aviation, medical) may require formal guarantees. Sound verification is necessary. Completeness may be required depending on the standard, though practical incomplete methods increasingly meet regulatory requirements when the verified regions are comprehensive enough.

Performance Tradeoffs

Your choice of soundness and completeness is really a choice about tradeoff priorities:

  • Need fast bug-finding? Use adversarial attacks (fast, finds real bugs, but can’t prove safety)

  • Need practical robustness without absolute guarantees? Use sound, incomplete methods (CROWN, DeepPoly)

  • Need certification for critical systems? Use sound, complete methods (SMT solvers, though slower)
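
One common way to operationalize these priorities is a cheapest-first pipeline: try a fast attack, then a sound incomplete verifier, and only escalate remaining “unknown” cases to a complete method. A sketch, with all three tools as assumed callables:

```python
def layered_check(model, x0, y0, epsilon, attack, incomplete_verifier,
                  complete_verifier=None):
    """Cheapest-first robustness check for a single input.

    attack              -> returns an adversarial example or None
    incomplete_verifier -> returns "verified" or "unknown" (sound)
    complete_verifier   -> optional; slow but decides every query it finishes
    """
    if attack(model, x0, y0, epsilon) is not None:
        return "falsified"                 # real counterexample, no proof needed
    if incomplete_verifier(model, x0, y0, epsilon) == "verified":
        return "verified"                  # sound certificate
    if complete_verifier is not None:
        return complete_verifier(model, x0, y0, epsilon)  # may be very slow
    return "unknown"                       # not evidence of unsafety
```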

Scenario                      | Soundness    | Completeness   | Reasoning
------------------------------|--------------|----------------|----------------------------------------------
Research/development          | Not required | Not required   | Speed and iteration matter more
Production robustness         | Required     | Not required   | Need guarantees but “unknown” is acceptable
Safety-critical certification | Required     | Often required | Regulations may demand completeness
Regulatory compliance         | Required     | Case-by-case   | Depends on specific standards

Common Misconceptions

Misconception 1: “Unknown means unsafe”

False. “Unknown” from an incomplete verifier means the verifier couldn’t prove safety with its approximations. The network might be perfectly safe. It just means you need tighter bounds, a smaller input region, or a complete method to determine the answer.

Misconception 2: “Sound methods are always better”

Not always. Sound verification methods are necessary when you need safety proofs, but adversarial attacks are faster for finding bugs. For development iteration, attacks quickly find real issues that verification methods would take hours to process—and any bug found is actionable.

Misconception 3: “Complete methods always terminate quickly”

No. “Complete” means they’ll eventually reach a conclusion, but that conclusion might take unbounded time. In practice, timeouts make them incomplete, or they return the best answer found so far.

Misconception 4: “Tight bounds are unsound”

These properties are independent. A method can be both sound and tight (good!), or sound and loose (conservative). Tightness measures approximation quality; soundness measures correctness.

Misconception 5: “All verifiers are sound”

Many tools make unsound approximations or use unsound-by-default methods. Some papers claim soundness while using heuristics that violate it. Always check whether the tool’s guarantees are proven or just claimed.

Critical Thinking About Tool Claims

When evaluating verification tools, ask: Is soundness proven mathematically, or just claimed? Does completeness mean theoretical completeness (eventually, with infinite time) or practical completeness (under stated assumptions)? Does “unknown” mean the method is incomplete, or does the method not return “unknown” at all? The answers determine whether you can trust the tool for critical applications.

Final Thoughts

Soundness and completeness are orthogonal properties that define the guarantees verification methods provide. Understanding them helps you choose appropriate tools for your use case:

  • Sound + Complete (SMT solvers): Highest assurance, highest cost

  • Sound + Incomplete (CROWN, DeepPoly): Practical verification, most common

  • Falsification methods (Adversarial attacks): Fastest for bug-finding, sound when bugs found, cannot prove safety

  • Unsound verification (rare): Methods that might give false “verified” results—avoid for safety-critical applications

The uncomfortable truth: for large networks, you cannot have soundness, completeness, and speed simultaneously. You must prioritize. Safety-critical applications prioritize soundness. Research prioritizes speed. Practical systems prioritize the balance between sound guarantees and computational feasibility.

The field is actively developing better algorithms. Tighter bounds for incomplete methods reduce “unknown” results. Faster complete methods make exhaustive verification more practical. The gap between sound and complete narrows gradually, but the fundamental tradeoffs remain.

Understand these properties, choose tools accordingly, and your verification efforts will be far more effective and trustworthy.

Further Reading

This guide provides a comprehensive overview of soundness and completeness in neural network verification. For readers interested in diving deeper, we recommend the following resources organized by topic:

Foundational Theory:

The theoretical foundations of soundness in verification trace back to abstract interpretation. The fundamental definitions and guarantees are established in the verification literature [Katz et al., 2017, Weng et al., 2018].

Complete Verification Methods:

Complete SMT-based verification is exemplified by Reluplex [Katz et al., 2017] and its successor Marabou. MILP-based approaches [Tjeng et al., 2019] provide an alternative complete verification strategy. Recent advances in complete verification combine branch-and-bound with tight incomplete methods for improved scalability.

Incomplete Verification Methods:

Modern incomplete but scalable verification methods include CROWN [Weng et al., 2018], DeepPoly [Singh et al., 2019], and IBP [Gowal et al., 2019]. These methods sacrifice completeness for computational efficiency, making them practical for larger networks.

Probabilistic Verification:

Randomized smoothing [Cohen et al., 2019] provides a fundamentally different approach, offering probabilistic certification guarantees with better scalability to large perturbations.

Complexity and Theoretical Barriers:

The NP-completeness of neural network verification [Katz et al., 2017, Weng et al., 2018] establishes fundamental limits on what can be achieved efficiently. Understanding these theoretical barriers helps set realistic expectations for verification tools.

Related Topics:

For practical implementation details, see Bound Propagation Approaches and Robustness Testing Guide. For understanding the broader context of certified robustness, see Certified Defenses and Randomized Smoothing and Training Robust Networks.

Next Guide

Continue to Verification Taxonomy: A Systematic View to explore the different categories and approaches in neural network verification methods.

[1] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), 39–57. IEEE, 2017.

[2] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning. 2019.

[3] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations. 2015.

[4] Sven Gowal, Krishnamurthy Dj Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. Scalable verified training for provably robust image classification. In Proceedings of the IEEE International Conference on Computer Vision, 4842–4851. 2019.

[5] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: an efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, 97–117. Springer, 2017.

[6] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations. 2018.

[7] Gagandeep Singh, Rupanshu Ganvir, Markus Püschel, and Martin Vechev. Beyond the single neuron convex barrier for neural network certification. In Advances in Neural Information Processing Systems, 15072–15083. 2019.

[8] Vincent Tjeng, Kai Y. Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations. 2019.

[9] Lily Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane Boning, and Inderjit Dhillon. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning, 5276–5285. 2018.