Soundness and Completeness in Neural Network Verification#

Your verification tool returns “unknown.” You run it again with a different method and get “verified.” What does this mean for your network’s actual robustness? And more importantly, why can’t verification tools just give you a yes-or-no answer like a test suite?

The answer lies in understanding two fundamental properties that govern all verification methods: soundness and completeness. These aren’t just academic concepts—they determine whether you can trust your verification results, how much computational effort you need to invest, and which tools are appropriate for your use case. This guide explores these formal properties and their practical implications.

Soundness: The Guarantee You Can Trust#

Imagine your verification tool says “this network is robust.” Can you trust that answer? Not always—it depends on whether the tool is sound.

What Soundness Means

A verification method is sound if, whenever it says “the property holds,” the property actually holds for the real network. More formally:

\[\text{Sound method: If verifier says SAFE, then the network is truly SAFE}\]

Sound methods never produce false positives. If the verifier proves your network is safe, you can bank on it.

Why Soundness Matters

Safety-critical applications—autonomous vehicles, medical devices, aviation systems—cannot tolerate false positives. A verification tool that occasionally lies about safety is worse than useless; it’s dangerous. Sound methods provide the mathematical guarantee that “if we say it’s safe, it’s actually safe.”

Sound vs Unsound Methods

Adversarial attacks are fundamentally unsound. When you run PGD attacks and find no adversarial examples, you’ve learned nothing about whether adversarial examples actually exist. You’ve only learned that your particular attack method, starting from your particular initialization, didn’t find them. A more clever attack might succeed where yours failed.

By contrast, methods like Interval Bound Propagation (IBP) and CROWN are sound. When they certify a network as robust, that certification is mathematically valid—no stronger attack will break it within the proven bounds.

The Cost of Soundness

Soundness comes with a price: conservative approximations. Sound methods must account for all possible behaviors in their analysis region, so they typically produce loose bounds. These conservative bounds might say “unknown” for regions where the network is actually safe, but the verifier simply can’t prove it with its approximations.

When Soundness is Non-Negotiable

Use sound verification methods when failures have serious consequences: safety-critical systems, regulatory compliance, certification requirements. Use unsound methods when speed matters more than guarantees: development iteration, rapid prototyping, finding obvious bugs.

Hint

Recognizing sound vs unsound tools: Sound methods come from formal verification research (CROWN, Marabou, ERAN). Unsound methods are attack-based (adversarial attacks, fuzzing, coverage-guided testing). Read papers carefully—some tools claim soundness incorrectly or make unsound approximations.

Completeness: Finding All the Answers#

Even if a verification method is sound, it might not always give you a definite answer. Some verifiers can return “unknown”—meaning they couldn’t determine whether the property holds or not.

What Completeness Means

A verification method is complete if it can always determine whether a property holds or not. No “unknown” results—just “safe” or “unsafe.”

\[\text{Complete method: For every verification query, the verifier decides SAFE or UNSAFE}\]

Complete methods always reach a conclusion, though they might take arbitrarily long to do so.

Complete vs Incomplete Methods

Complete verification methods include SMT solvers (Marabou), MILP solvers, and branch-and-bound algorithms with sufficient refinement. These methods can theoretically answer any verification query with certainty, but they may require significant computational effort.

Incomplete methods like CROWN and DeepPoly use approximations that, while sound, can be too conservative to decide on some inputs. When they can’t determine the answer within their approximations, they return “unknown.”

The “Unknown” Result

When an incomplete verifier returns “unknown,” it doesn’t mean the network is unsafe. It means the verifier’s approximations weren’t tight enough to prove safety, and the method isn’t designed to exhaustively search for counterexamples like SMT solvers do. Think of “unknown” as “I can’t prove it from where I’m standing, but that doesn’t mean you’re wrong.”
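The three-valued outcome can be made concrete in code. The sketch below is a hypothetical triage helper, not any real tool's API: `certify` stands in for a sound-but-incomplete verifier (e.g. a CROWN-style certificate) and `attack` for an unsound counterexample search (e.g. PGD); the toy lambdas and thresholds are illustrative assumptions.

```python
from enum import Enum

class Verdict(Enum):
    SAFE = "safe"        # sound certificate: property proven
    UNSAFE = "unsafe"    # concrete counterexample in hand
    UNKNOWN = "unknown"  # no proof, no counterexample

def triage(certify, attack, query):
    """Combine a sound-but-incomplete certifier with an unsound attack.

    `certify(query)` returns True only when it holds a valid proof;
    `attack(query)` returns a counterexample or None.
    """
    if certify(query):
        return Verdict.SAFE
    cex = attack(query)
    if cex is not None:
        return Verdict.UNSAFE
    # No proof and no counterexample: report "unknown", never "unsafe".
    return Verdict.UNKNOWN

# Toy stand-ins: certify small perturbation radii, attack large ones.
certify = lambda eps: eps <= 0.01                  # hypothetical certificate
attack = lambda eps: eps if eps >= 0.1 else None   # hypothetical PGD stand-in

print(triage(certify, attack, 0.005))  # Verdict.SAFE
print(triage(certify, attack, 0.5))    # Verdict.UNSAFE
print(triage(certify, attack, 0.05))   # Verdict.UNKNOWN
```

The middle band, where the certificate is too loose and the attack too weak, is exactly the "unknown" region that tighter methods try to shrink.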

Computational Cost

Completeness is expensive. Complete methods must handle the worst case: exploring a potentially infinite space of possibilities. They use refinement strategies like branch-and-bound—splitting the input space into smaller regions and analyzing each one recursively. This branching can explore exponentially many regions, making completeness impractical for large networks or large input regions.

Understanding “Unknown” Results

An “unknown” result from an incomplete verifier is not failure. It reflects the inherent limitation of using approximations. Rather than seeing it as “maybe unsafe,” interpret it as “requires more computational resources or a tighter method to determine.”

| Property | Complete Methods | Incomplete Methods |
|---|---|---|
| Always decides? | Yes (theoretically) | No, may return "unknown" |
| Sound? | Yes | Yes |
| Speed | Variable, can be very slow | Fast, predictable |
| Scalability | Limited (small networks) | Good (moderate to large networks) |
| Examples | SMT solvers, MILP, branch-and-bound | CROWN, DeepPoly, IBP |

The Four Quadrants: Combining Soundness and Completeness#

These two properties are independent—you can be sound without being complete, complete without being sound, or neither. This creates a matrix of verification approaches.

Sound + Complete: The Gold Standard

SMT solvers and MILP-based verifiers occupy this quadrant. If they terminate, their answer is definitive and correct. “Verified” means truly verified; “counterexample found” provides a real adversarial example. The tradeoff: they’re expensive and don’t scale well to large networks. These methods are reserved for final certification of critical systems or small networks.

Sound + Incomplete: The Practical Choice

Most modern neural network verifiers (CROWN, DeepPoly, IBP) live here. They’re sound—if they say “verified,” it’s true—but may return “unknown.” They’re much faster than complete methods, making them practical for larger networks. Most verification research focuses on tightening these methods, shrinking the “unknown” region.

Unsound + Complete: Rare Territory

Some search-based methods with exhaustive guarantees might theoretically fit here, but completeness isn’t meaningful without soundness in practice. If your complete method sometimes lies, the completeness guarantee becomes worthless.

Unsound + Incomplete: The Fuzzer’s Domain

Adversarial attacks, fuzzing, and coverage-guided testing belong here. They’re neither sound nor complete—they may miss bugs entirely, or report false positives. But they’re fast and can find concrete failures, making them valuable for development and debugging.

|  | Complete | Incomplete |
|---|---|---|
| Sound | SMT, MILP (gold standard) | CROWN, DeepPoly (most practical) |
| Unsound | Rare/theoretical | Adversarial attacks (fast, practical) |

No Free Lunch: Tradeoffs Are Fundamental

You cannot have soundness, completeness, and speed simultaneously for large networks. Practical verification requires choosing priorities: sound + complete + slow, or sound + fast + incomplete, or fast + complete + unsound. Your application determines which tradeoffs to accept.

Soundness in Practice: Over-Approximation#

How do sound methods achieve their guarantees? They rely on over-approximation.

Imagine computing the set of all possible outputs your network can produce given an input region. This “reachable set” is what we want to analyze. But computing it exactly is intractable. Sound methods instead compute an approximation that contains the true set:

\[\text{Approximated Set} \supseteq \text{True Reachable Set}\]

By containing the true set, any property that holds for the approximation holds for the true outputs. This is the foundation of soundness.

Interval Arithmetic as an Example

The simplest sound method is interval arithmetic: track intervals at each neuron. For input interval [l, u], compute output intervals layer by layer. Each neuron’s output is bounded by the minimum and maximum over its possible inputs.

This is sound because the computed intervals always contain the true outputs. But it’s conservative—the intervals grow as you propagate through layers, becoming looser and looser. Eventually, the intervals become so large that they provide no useful information.
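The layer-by-layer propagation can be sketched in a few lines of NumPy. This is a minimal interval bound propagation example with made-up toy weights (`W1`, `W2` are assumptions, not from any real network); it shows both the soundness and the looseness described above.

```python
import numpy as np

def ibp_linear(W, b, l, u):
    """Propagate the input box [l, u] through y = Wx + b.

    Splitting W into positive and negative parts picks, per output
    coordinate, the box corner that minimizes or maximizes it.
    """
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    lower = W_pos @ l + W_neg @ u + b
    upper = W_pos @ u + W_neg @ l + b
    return lower, upper

def ibp_relu(l, u):
    """ReLU is monotone, so intervals map through elementwise."""
    return np.maximum(l, 0), np.maximum(u, 0)

# Toy two-layer network: linear -> ReLU -> linear.
W1 = np.array([[1.0, -1.0], [2.0, 1.0]]); b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]]);              b2 = np.zeros(1)

l, u = np.array([-0.1, -0.1]), np.array([0.1, 0.1])
l, u = ibp_relu(*ibp_linear(W1, b1, l, u))
l, u = ibp_linear(W2, b2, l, u)
print(l, u)  # [0.] [0.5]
```

The computed interval [0, 0.5] provably contains every reachable output, so the method is sound. But for these weights the true maximum is only 0.3: interval arithmetic treats the two hidden neurons as independent even though they share inputs, and that lost dependency is exactly the conservatism that accumulates layer by layer.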

Tightness vs Soundness

Sound methods must over-approximate, so all sound methods are conservative to some degree. But they vary in how conservative they are. Tight methods produce approximations close to the true set (good for proving robustness). Loose methods are very conservative (might say “unknown” when safety actually holds).

Importantly, tightness and soundness are orthogonal properties. A method can be:

  • Sound and tight (CROWN on favorable networks)

  • Sound and loose (interval arithmetic)

  • Unsound and tight (adversarial attacks that cleverly find weaknesses)

Tip

When choosing between sound methods, prioritize tightness. A sound but loose verifier may waste time returning “unknown” for actually-safe regions. Tighter methods (CROWN) are preferable to looser methods (interval arithmetic) when computational budget allows.

Completeness in Practice: Refinement and Branching#

Complete methods guarantee they’ll reach a definitive answer, but how? Through refinement strategies.

Branch and Bound

The classic approach is branch-and-bound: recursively split the input domain into smaller regions. For each region, apply an incomplete method (like CROWN) to get an upper bound on the loss or output value. If the bound proves the property holds, that region is done. If not, split it further and recurse.

Eventually, either every region is verified (the property holds globally) or splitting isolates a region containing a concrete counterexample (the property fails). This exhaustive exploration is what guarantees completeness.
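The recursion above can be sketched in a page of Python. This is a toy illustration, not a production verifier: a simple quadratic function stands in for the network, and a center-value-minus-Lipschitz-term estimate stands in for a sound but loose incomplete bound like IBP or CROWN.

```python
import numpy as np

def bab_verify(f, bound, l, u, depth=0, max_depth=20):
    """Branch-and-bound sketch: prove f(x) > 0 for every x in the box [l, u].

    `bound(l, u)` is any sound lower bound on min f over the box
    (incomplete: may be too loose to decide). Split the widest dimension
    until each region is certified, a counterexample turns up, or the
    depth budget runs out.
    """
    if bound(l, u) > 0:
        return "SAFE"              # the incomplete bound already suffices here
    mid = (l + u) / 2
    if f(mid) <= 0:
        return "UNSAFE"            # region center is a concrete counterexample
    if depth >= max_depth:
        return "UNKNOWN"           # resource limit: incomplete in practice
    i = int(np.argmax(u - l))      # branch on the widest input dimension
    for lo, hi in ((l[i], mid[i]), (mid[i], u[i])):
        l2, u2 = l.copy(), u.copy()
        l2[i], u2[i] = lo, hi
        verdict = bab_verify(f, bound, l2, u2, depth + 1, max_depth)
        if verdict != "SAFE":
            return verdict
    return "SAFE"                  # every sub-region was certified

# Toy property: f(x) = 1 - x0^2 - x1^2 > 0 on [-0.7, 0.7]^2 (truly safe).
f = lambda x: 1 - x[0] ** 2 - x[1] ** 2
# Sound but loose bound: center value minus a Lipschitz term
# (|grad f| < 2 on this box), too weak to certify the root region alone.
bound = lambda l, u: f((l + u) / 2) - 2.0 * np.linalg.norm(u - l) / 2
l0, u0 = np.array([-0.7, -0.7]), np.array([0.7, 0.7])
print(bound(l0, u0) > 0)             # False: incomplete bound fails at the root
print(bab_verify(f, bound, l0, u0))  # SAFE, after recursive splitting
```

The same incomplete bound that fails on the whole box succeeds once the regions are small enough, which is precisely how branching buys completeness. The exponential cost is also visible: halving in d dimensions multiplies the region count by 2^d per refinement level.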

Abstraction Refinement

SMT solvers use a different strategy: abstraction refinement. They start with a coarse abstraction of the network, verify against it, and when the abstract analysis yields a spurious counterexample, they refine the abstraction to be more precise. They iterate until they either prove the property or find a real counterexample.

Why Completeness is Expensive

The problem is exponential blow-up. With branch-and-bound, the recursion tree can grow exponentially in depth and branching factor. For a modestly sized network and input region, the number of regions to analyze can become astronomical.

Practical Completeness: Timeouts

In practice, “complete” methods hit computational limits. They might have a timeout or memory limit. When the limit is reached, they might return “unknown” (making them incomplete in practice) or keep the best answer found so far. True completeness is theoretically guaranteed but practically limited by computational resources.

Hint

Only invest in complete verification when necessary: small networks, small input regions, or when certification is mandatory. For most development and testing, incomplete sound methods suffice and are far more practical.

Real-World Implications: Choosing Your Guarantees#

How do these properties affect your actual verification workflow?

Safety-Critical Systems

Autonomous vehicles, medical devices, aviation systems—these require sound methods. The consequences of false positives (reporting something as safe when it’s not) are unacceptable. Whether you need completeness depends on regulations and liability. Often, sound + incomplete is sufficient if the certified regions are large enough.

Development and Research

During model development, unsound methods like adversarial attacks are perfectly appropriate. They quickly identify bugs and guide improvements. The lack of guarantees doesn’t matter because you’re not relying on verification to certify the final product—you’re using it to iterate.

Regulatory Compliance and Certification

Some regulations (aviation, medical) may require formal guarantees. Sound verification is necessary. Completeness may be required depending on the standard, though practical incomplete methods increasingly meet regulatory requirements when the verified regions are comprehensive enough.

Performance Tradeoffs

Your choice of soundness and completeness is really a choice about tradeoff priorities:

  • Need fast iteration? Use unsound, incomplete methods (adversarial attacks)

  • Need practical robustness without absolute guarantees? Use sound, incomplete methods (CROWN, DeepPoly)

  • Need certification for critical systems? Use sound, complete methods (SMT solvers, though slower)

| Scenario | Soundness | Completeness | Reasoning |
|---|---|---|---|
| Research/development | Not required | Not required | Speed and iteration matter more |
| Production robustness | Required | Not required | Need guarantees but "unknown" is acceptable |
| Safety-critical certification | Required | Often required | Regulations may demand completeness |
| Regulatory compliance | Required | Case-by-case | Depends on specific standards |

Common Misconceptions#

Misconception 1: “Unknown means unsafe”

False. “Unknown” from an incomplete verifier means the verifier couldn’t prove safety with its approximations. The network might be perfectly safe. It just means you need tighter bounds, a smaller input region, or a complete method to determine the answer.

Misconception 2: “Sound methods are always better”

Not always. Sound methods are necessary when safety matters, but unsound methods are faster and more scalable. For finding bugs during development, unsound attacks often surface concrete failures in seconds that sound methods would need hours to analyze.

Misconception 3: “Complete methods always terminate quickly”

No. “Complete” means they’ll eventually reach a conclusion, but that conclusion might take unbounded time. In practice, timeouts make them incomplete, or they return the best answer found so far.

Misconception 4: “Tight bounds are unsound”

These properties are independent. A method can be both sound and tight (good!), or sound and loose (conservative). Tightness measures approximation quality; soundness measures correctness.

Misconception 5: “All verifiers are sound”

Many tools make unsound approximations or use unsound-by-default methods. Some papers claim soundness while using heuristics that violate it. Always check whether the tool’s guarantees are proven or just claimed.

Critical Thinking About Tool Claims

When evaluating verification tools, ask: Is soundness proven mathematically, or just claimed? Does completeness mean theoretical completeness (eventually, with infinite time) or practical completeness (under stated assumptions)? Does “unknown” mean the method is incomplete, or does the method not return “unknown” at all? The answers determine whether you can trust the tool for critical applications.

Final Thoughts#

Soundness and completeness are orthogonal properties that define the guarantees verification methods provide. Understanding them helps you choose appropriate tools for your use case:

  • Sound + Complete (SMT solvers): Highest assurance, highest cost

  • Sound + Incomplete (CROWN, DeepPoly): Practical verification, most common

  • Unsound + Incomplete (Adversarial attacks): Fastest, best for development

  • Unsound + Complete (rare): Theoretical interest, little practical value

The uncomfortable truth: for large networks, you cannot have soundness, completeness, and speed simultaneously. You must prioritize. Safety-critical applications prioritize soundness. Research prioritizes speed. Practical systems prioritize the balance between sound guarantees and computational feasibility.

The field is actively developing better algorithms. Tighter bounds for incomplete methods reduce “unknown” results. Faster complete methods make exhaustive verification more practical. The gap between sound and complete narrows gradually, but the fundamental tradeoffs remain.

Understand these properties, choose tools accordingly, and your verification efforts will be far more effective and trustworthy.
