Skip to main content
Phase 1: Foundations Guide 4

Threat Models in Neural Network Verification

Threat model formulations including Lp norms, semantic perturbations, and attack specifications for neural network verification.

Concrete scenario: You’re building a self-driving car’s stop-sign classifier. Deploy it, and attackers immediately ask: “what can we change to fool it?”

  • Stick some red stickers on the sign? Maybe the network still recognizes it.
  • Rotate the sign 5 degrees? Depends on training.
  • Change lighting conditions? Natural in the world.
  • Combine all three? Now it’s complex.

Your threat model decides: “which of these attacks matter? How much change is plausible?” The answer determines what you actually verify.

The threat model is your contract with reality. Verification says: “I proved robustness against these specific attacks, under these specific assumptions.” Get the assumptions wrong, and your proof is meaningless.

What Is a Threat Model?

Simple: define the attacker’s power.

  • What can they change? Pixel colors? Text? Sensor readings? All or only some?
  • How much can they change? 1% of the input? 10%? Unlimited?
  • What else matters? Does the perturbation need to look natural? Sound plausible? Remain physical?
  • What do they know? Do they see the network’s weights? Or just its outputs?

This definition becomes your verification target. Prove the network survives these attacks. That’s the guarantee you get.

Common Threat Models

Most verification tools focus on norm-based perturbations. Think of norms as different ways to measure “how much change is the attacker allowed?”

Per-Pixel Budget (L-infinity)

Simplest threat model: “Each pixel can change by at most ε.”

If an image is normalized [0,1], maybe ε = 0.03. Attacker can bump any pixel by up to 3% independently. Red pixel goes from 0.5 to 0.47 or 0.53. That’s it.

Why: Simple to understand. Matches “imperceptible perturbations” (small per-pixel changes are hard to see). Standard in research.

Limitation: Ignores structure. Changing 1000 pixels by 3% is way more noticeable than 1 pixel by 3%, but this model allows both equally.

Total Energy Budget (L2)

Different angle: “Total perturbation energy ≤ ε.”

Think of it like a budget pool. Attacker can distribute changes however they want, as long as the total squared magnitude stays under budget. So: change 1 pixel by a lot, or change 1000 pixels by tiny amounts. Same total cost.

Why: Models real-world perturbations better (lighting changes, blur, slight rotations). Often matches perception better than per-pixel limits.

Limitation: Doesn’t guarantee perceptual similarity. Concentrating all energy in one region might still be imperceptible; spreading it evenly might be very noticeable.

Sum-of-Absolute-Changes Budget (L1)

“Total absolute change ≤ ε (sum all changes).”

Similar to L2, but uses absolute value instead of squares. This naturally encourages sparse attacks: change few things by a lot, rather than many things by a little.

Why: Good for tabular data. Attacker changes maybe 2 features significantly (age, income) rather than tiny tweaks to all features. Realistic in domains like fraud detection or credit scoring.

Limitation: Less common than L-infinity or L2. Fewer verification tools support it.

Limited Modifications (L0)

Strictest constraint: “Attacker can change at most k pixels/features total. No limit on magnitude per change.”

Example: “Adversary paints 5 pixels black. Each pixel: unrestricted (0→1 is OK). But only 5 pixels total.”

Why: Models realistic attackers with limited control (physical stickers on signs, tamper with few sensors). More powerful per-change but fewer changes allowed.

The verification catch: Checking “all subsets of k pixels” explodes combinatorially. 32×32 image, k=10 pixels: how many subsets? Way too many. Standard verification tools break here.

Quick Comparison:

  • L-infinity: Easy to verify. Simple. Slightly conservative (doesn’t match perception perfectly).
  • L2: Also easy to verify. Better matches real-world perturbations (lighting, blur). More flexible budget.
  • L1: Rare. Middle ground: encourages sparsity like L0, but convex (easier to verify).
  • L0: Hard to verify. Models limited adversaries (few control points). Combinatorial explosion.

Real-World Threat Models

Math-based norms are clean, but reality is messier.

Semantic Perturbations

Text example: paraphrase a sentence. Meaning stays same, words differ. Image example: rotate a photo 5 degrees. Object identity unchanged, but pixels shift everywhere.

Problem: “semantically equivalent” has no formula. How do you define it mathematically? You can’t, really. Current practice: approximate with constrained transformations (rotation ≤ 5°, brightness ≤ 10%, etc.). Or use learned similarity metrics (embeddings). Neither is perfect.

State: Mostly unsolved. You can verify robustness to specific semantic shifts (known rotations, known brightness ranges), but general semantic robustness remains open.

Physical Attacks

Adversarial stickers on road signs. Patches on clothing. Lighting changes in a scene. These perturbations have physical constraints:

  • Must be spatially localized (not scattered pixels)
  • Must be printable/realizable
  • Must survive varying viewing angles, lighting
  • Must respect 3D geometry if on objects

These are more realistic than math norms but harder to verify. Some tools verify specific physical transformations (known rotations, brightness). Full physical adversarial robustness? Still mostly unsolved.

Attacker Knowledge

Besides “how much perturbation?”, ask: “what does attacker know?”

White-box (worst-case): Attacker knows everything. All weights, architecture, training procedure. Can compute gradients, find perfect perturbations instantly.

Black-box (more realistic): Attacker only sees outputs. No weights. Must search blindly or try queries.

Which is stronger? White-box. If you’re robust against someone who knows your internals and can compute your gradients, you’re automatically robust against someone who only sees your outputs.

Verification assumption: Assume white-box. Prove robustness against the smartest adversary. Everything else follows for free.

Other Verification Properties

Threat models often focus on adversarial perturbations, but other properties matter too.

Reachability

Question: “Given sensor noise in range [a, b], what control outputs can the network produce?”

Verify the outputs stay in safe ranges (steering angles between -45°/45°, treatment doses between 0-1000mg). Unlike robustness (which asks “does output stay same?”), reachability asks “what’s the range of possible outputs?”

Useful for controllers and safety-critical systems.

Monotonicity

Question: “Does higher income always lead to higher credit score?”

Verify the network respects common-sense relationships. In credit scoring: higher income → higher score (monotone). In medicine: higher cholesterol → higher disease risk. These are domain constraints that verification can enforce.

Useful for ensuring fairness and interpretability.

Fairness

Question: “Does the network treat different demographic groups equally?”

Verify the network doesn’t discriminate. Harder to formalize than L_p robustness (what does “fair” even mean?), but important for hiring models, loan approval, medical diagnosis.

One angle: verify predictions don’t change if you swap gender/race (keeping qualifications identical). That’s a form of fairness verification.

Choosing Your Threat Model

Three questions:

1. What’s realistic? If actual attacks are rotations/lighting, don’t verify L-infinity. Verify what matters.

2. How stringent? Safety-critical (cars, medical)? Conservative threat model, tight budget. Non-critical? More relaxed.

3. Can you verify it? L-infinity and L2 are easy. L0 is hard. Physical/semantic are still research. Pick something that tools can handle.

Start with L-infinity or L2 (well-understood). If they don’t match your real threats, specialize.

Core Principle

Threat model is your contract. It says: “I verified robustness under these assumptions.” Different assumptions, different guarantees. Get the threat model wrong, get false confidence.

Document it clearly. “Verified L-infinity robust with budget 0.03” means something. “Verified robust” means nothing.

Key Takeaways

Threat Models Define Verification: What you verify depends on what attacks matter to you. Digital perturbations? Physical attacks? Semantic changes? Each requires different threat models.

Standard vs Realistic: L-infinity and L2 are mathematically clean and well-supported. Physical and semantic threat models are more realistic but harder to formalize.

White-Box Assumption: Verification assumes the smartest adversary—one who knows everything. This is conservative; everything else is easier.

No One-Size-Fits-All: Choose threat models that match your actual deployment. Start simple (L-infinity), refine if needed.

Next Guide

Continue to The Verification Problem to learn the mathematical formulation of neural network verification and the fundamental complexity barriers.