Threat Models in Neural Network Verification

Before verifying a neural network, you must answer a fundamental question: what attacks are you defending against? A network robust to small pixel changes might fail completely under semantic transformations. A classifier that handles $\ell_\infty$ perturbations well might be vulnerable to $\ell_2$ attacks. The threat model---your formal specification of what adversarial perturbations are possible---determines what verification proves and what it doesn’t.

Understanding threat models is essential for meaningful verification. Verify against the wrong threat model, and you’ve proven robustness to attacks that don’t matter while remaining vulnerable to attacks that do. This guide explores the different threat models used in neural network verification and how to choose appropriate ones for your application.

What Is a Threat Model?

A threat model formally specifies the capabilities and constraints of an adversary. In neural network verification, it defines:

What the adversary can change (which input features or dimensions)
How much they can change it (perturbation budget $\epsilon$ )
What constraints they face (perceptual similarity, physical realizability)
What knowledge they have (white-box vs black-box)

Formal Definition: Threat Model

A threat model defines an adversarial perturbation set $\Delta(x)$ for each input $x$ . An adversary can modify $x$ to any $x' \in \Delta(x)$ . The network is robust under this threat model if:

$\forall x', x' \in \Delta(x) \Rightarrow f(x') = f(x)$

Different choices of $\Delta(x)$ define different threat models with different robustness implications.

The threat model is a contract between the verifier and the world. Verification proves: “under the specified threat model, the network behaves correctly.” Change the threat model, and the guarantee changes. Choose a threat model that doesn’t match reality, and verification provides false security.

$\ell_p$ Norm Threat Models

The most common threat models constrain perturbations using $\ell_p$ norms. These define perturbation budgets in different geometric senses.

$\ell_\infty$ Threat Model

The $\ell_\infty$ (maximum) norm constrains the maximum change to any single input dimension:

$B_{\infty,\epsilon}(x_0) = \{x : \|x - x_0\|_\infty < \epsilon\}$

where $\|x - x_0\|_\infty = \max_i |x_i - x_{0,i}|$ .

Intuition: Every pixel (or input feature) can change by at most $\epsilon$ . This creates a hypercube centered at $x_0$ in input space.

Common uses: Image classification with bounded pixel perturbations. Standard values for normalized images: $\epsilon = 1/255, 2/255, 8/255$ .

Why it’s popular: Simple to understand, matches intuition about “small changes to each pixel,” and widely used in adversarial robustness literature.

Limitations: Doesn’t account for perceptual similarity---changing all pixels by $\epsilon$ can create very noticeable artifacts. Treats all pixels equally, ignoring structure.

$\ell_2$ Threat Model

The $\ell_2$ (Euclidean) norm constrains the total Euclidean distance:

$B_{2,\epsilon}(x_0) = \{x : \|x - x_0\|_2 < \epsilon\}$

where $\|x - x_0\|_2 = \sqrt{\sum_i (x_i - x_{0,i})^2}$ .

Intuition: The total perturbation magnitude across all dimensions is bounded. This creates a hypersphere centered at $x_0$ .

Common uses: Image classification where total perturbation energy matters more than per-pixel changes. Some physical perturbations (lighting changes, blur) are better modeled with $\ell_2$ .

Why it’s useful: Allows larger perturbations to some pixels if others change less. Can be more perceptually aligned than $\ell_\infty$ for some perturbation types.

Limitations: Still doesn’t capture semantic similarity. A large $\ell_2$ perturbation to a few pixels might be imperceptible if those pixels are in uniform regions.

$\ell_1$ Threat Model

The $\ell_1$ (Manhattan) norm constrains the sum of absolute changes:

$B_{1,\epsilon}(x_0) = \{x : \|x - x_0\|_1 < \epsilon\}$

where $\|x - x_0\|_1 = \sum_i |x_i - x_{0,i}|$ .

Intuition: The total absolute change across all dimensions is bounded. Encourages sparse perturbations.

Common uses: Sparse perturbations where an adversary can change few features significantly. Useful for tabular data where only a few fields should change.

Why it’s useful: Models scenarios where attackers can modify a few features substantially but not all features slightly.

Limitations: Less commonly used than $\ell_\infty$ or $\ell_2$ , so fewer tools support it natively.

$\ell_0$ Threat Model

The $\ell_0$ (“norm”) constrains the number of dimensions that can be changed:

$B_{0,k}(x_0) = \{x : \|x - x_0\|_0 \leq k\}$

where $\|x - x_0\|_0 = |\{i : x_i \neq x_{0,i}\}|$ counts the number of non-zero differences.

Intuition: An adversary can modify at most $k$ input features/pixels, but each modified feature can change by any amount (subject to domain constraints like [0,1] for images). This creates a discrete, non-convex perturbation set.

Common uses:

Pixel attacks: Modify a small number of pixels arbitrarily (e.g., 10 pixels in a 32x32 image)
Feature manipulation: Change few features in tabular data or sparse inputs
Backdoor attacks: Inject a small trigger pattern (few pixels) to cause misclassification

Why it’s useful: Models realistic scenarios where adversaries have limited control points but full authority over those points. More flexible than $\ell_1$ for sparse perturbations since individual changes aren’t bounded.

Challenges for verification:

Combinatorial explosion: With $d$ -dimensional input and budget $k$ , there are $\binom{d}{k}$ possible pixel sets to perturb
Non-convexity: Unlike $\ell_1$ , $\ell_2$ , $\ell_\infty$ , the $\ell_0$ ball is not convex, making standard verification techniques inapplicable
Limited tooling: Most verification tools don’t support $\ell_0$ natively; requires specialized approaches or enumeration

Verification approaches: Exact verification requires checking all $\binom{d}{k}$ combinations (intractable for large $d, k$ ). Practical methods use heuristics, sampling, or restrict to specific pixel sets.

Limitations: Much harder to verify than $\ell_p$ norms with $p \geq 1$ . Verification scales poorly with $k$ and input dimension $d$ . Often approximated with $\ell_1$ constraints in practice.

Norm	Definition	Geometric Shape	Perturbation Character	Verification Difficulty	Typical Use Cases
$\ell_\infty$	$\max_i \\|x_i - x_{0,i}\\|$	Hypercube (all dimensions equally bounded)	Each pixel/feature changes by at most $\epsilon$	Moderate --- Convex, well-supported	Image perturbations (FGSM, PGD), pixel-bounded attacks
$\ell_2$	$\sqrt{\sum_i (x_i - x_{0,i})^2}$	Hypersphere (Euclidean ball)	Total energy bounded; allows uneven distribution	Moderate --- Convex, well-supported, randomized smoothing	C&W attacks, global image changes, blur/lighting
$\ell_1$	$\sum_i \\|x_i - x_{0,i}\\|$	Cross-polytope (diamond shape)	Total absolute change bounded; encourages sparsity	Harder --- Convex but less tool support	Sparse features, tabular data, few-field attacks
$\ell_0$	$\\|\{i : x_i \neq x_{0,i}\}\\|$	Discrete union of subspaces	At most $k$ features change (by any amount)	Very Hard --- Non-convex, combinatorial, limited tools	Pixel attacks, backdoor triggers, few-pixel adversarial examples

Key Insights:

Convexity: $\ell_\infty$ , $\ell_2$ , $\ell_1$ form convex sets --- verification methods (LP, bound propagation) apply directly. $\ell_0$ is non-convex --- requires specialized techniques.
Tool support: $\ell_\infty$ and $\ell_2$ are best supported (Marabou, alpha-beta-CROWN, PRIMA, randomized smoothing). $\ell_1$ has moderate support. $\ell_0$ has minimal support.
Tightness vs. sparsity tradeoff: $\ell_\infty$ bounds each dimension tightly. $\ell_2$ allows more flexibility. $\ell_1$ encourages few large changes. $\ell_0$ strictly limits number of changes.
Physical realism: $\ell_2$ often better matches perception (total energy). $\ell_\infty$ models worst-case per-pixel. $\ell_0$ models limited-access adversaries.

Beyond $\ell_p$ Norms: Semantic and Physical Threat Models

$\ell_p$ norms capture mathematical perturbation bounds but don’t necessarily align with perceptual or semantic similarity. Alternative threat models address these limitations.

Semantic Threat Models

Semantic perturbations preserve meaning while changing the input. For text: paraphrasing, word substitutions, or grammatical changes that maintain semantic content. For images: rotations, translations, lighting changes, or style transfers that preserve object identity.

$\Delta_{\text{semantic}}(x) = \{x' : \text{semantically\_similar}(x, x')\}$

Challenge: “Semantically similar” is hard to formalize. Unlike $\ell_p$ norms, there’s no simple mathematical definition. Current approaches use:

Constrained transformations (rotation by at most theta degrees, brightness change by at most beta)
Learned similarity metrics (using embeddings or perceptual losses)
Domain-specific constraints (grammaticality for text, physical realizability for images)

Current state: Semantic verification is an active research area. Tools exist for specific semantic perturbations (rotations, translations) but general semantic verification remains challenging.

Physical Threat Models

Physical attacks involve modifications that can be realized in the physical world. Adversarial stickers on stop signs, adversarial patches worn as clothing, or lighting changes in a scene.

Physical constraints include:

Spatial locality: Perturbations must be spatially localized (e.g., a patch, not scattered pixels)
Printability: Colors must be printable with standard printers
Physical consistency: Perturbations must appear consistent across viewing angles, lighting conditions
3D constraints: Perturbations on 3D objects must respect surface geometry

$\Delta_{\text{physical}}(x) = \{x' : \text{physically\_realizable}(x, x') \land \text{perceptually\_similar}(x, x')\}$

Why they matter: Autonomous vehicles face physical adversarial examples (stickers on road signs). Security systems face adversarial patches (on clothing or objects). Physical threat models are more realistic for real-world deployments than unrestricted $\ell_p$ perturbations.

Verification approaches: Some work verifies robustness to specific physical transformations (rotations, lighting changes). Full physical adversarial verification remains largely unsolved.

White-Box vs Black-Box Threat Models

Beyond specifying perturbation sets, threat models also specify adversary knowledge:

White-Box (Perfect Knowledge)

The adversary knows:

Network architecture (all layer types, sizes, connections)
Network parameters (all weights and biases)
Training procedure and data distribution
Verification method being used

White-box attacks compute gradients to find adversarial perturbations efficiently. They represent the strongest attacker: one with complete information.

Verification under white-box threat models assumes the adversary has full access to the network. This is the worst-case assumption---if the network is verified robust under white-box assumptions, it’s also robust under weaker adversary models.

Black-Box (Limited Knowledge)

The adversary has only:

Query access to the network (can submit inputs and observe outputs)
No knowledge of architecture or parameters
No gradient information

Black-box attacks must infer decision boundaries through queries. They’re slower and less efficient than white-box attacks but more realistic for deployed systems.

Verification is typically white-box: Most verification methods assume full knowledge of the network. This conservative assumption means verification proves robustness against the strongest adversaries. If you verify under white-box assumptions, you’re also protected against black-box attackers.

Verification vs Attacker Knowledge

Verification methods typically assume white-box adversaries (strongest threat model). Proving robustness under white-box assumptions implies robustness under weaker assumptions (black-box). This conservative approach provides maximum assurance.

Beyond Adversarial Robustness: Other Verification Properties

While adversarial robustness dominates neural network verification research, other critical properties require formal guarantees. These properties go beyond “robustness to perturbations” and address correctness, safety, and ethical constraints.

Reachability Analysis

Reachability analysis determines what output values a network can produce given constraints on inputs. Unlike robustness (proving outputs don’t change under perturbations), reachability asks: what outputs are reachable?

Formal definition: Given input constraints $\mathcal{X}$ (e.g., $x \in [l, u]$ ), compute the reachable output set:

$\mathcal{Y}_{\text{reach}} = \{f(x) : x \in \mathcal{X}\}$

Why it matters:

Safety verification: Ensure outputs stay within safe bounds. For autonomous vehicle control, verify that steering angles remain in safe ranges given sensor input uncertainties.
Property verification: Check if a network can ever produce invalid outputs (e.g., negative probabilities, out-of-range predictions).
Counterexample search: Identify inputs that cause specific undesirable outputs (e.g., high-confidence misclassifications).

Applications:

Control systems: Verify neural network controllers maintain safe states (e.g., aircraft autopilot never commands extreme maneuvers)
Medical diagnosis: Ensure predicted treatment doses remain within therapeutic ranges given measurement uncertainties
Specification checking: Verify output constraints hold for all valid inputs (e.g., “classifier never outputs class A for inputs in region B”)

Verification approaches:

Abstract interpretation: Over-approximate reachable sets using intervals, zonotopes, or polyhedra (fast but may be loose)
SMT-based methods: Encode reachability as constraint satisfaction; complete but expensive
Symbolic propagation: Compute symbolic expressions for outputs in terms of inputs; tightest for linear regions

Reachability vs. robustness:

Robustness: Given $x_0$ , prove $f(x) = f(x_0)$ for all $x \in \mathcal{X}$ (classification doesn’t change)
Reachability: Given $\mathcal{X}$ , compute all possible $f(x)$ values (what can happen?)

Reachability is more general---robustness is a special case where you verify that $\mathcal{Y}_{\text{reach}}$ satisfies a specific property (e.g., argmax doesn’t change).

Example: For a neural network controller $f: \mathbb{R}^n \to \mathbb{R}^m$ with input sensor readings in $[l, u]$ , reachability analysis computes the range of possible control outputs. If this range exceeds safe thresholds, the controller is unsafe.

Monotonicity Properties

Monotonicity requires that the network’s output increases (or decreases) when specific inputs increase. This captures domain knowledge about input-output relationships.

Formal definition: A network $f$ is monotonically increasing in input dimension $i$ if:

$\forall x, x': x'_i > x_i \land x'_j = x_j \, (\forall j \neq i) \Rightarrow f(x') \geq f(x)$

Why it matters:

Domain constraints: Many domains have natural monotonicity requirements:
- Credit scoring: Higher income should lead to higher credit score (not lower)
- Medical risk: Higher blood pressure should lead to higher disease risk
- Pricing: Higher demand should lead to higher price
Interpretability: Monotonicity makes model behavior predictable and understandable
Fairness: Prevents unexpected inversions (e.g., more education leading to lower predicted salary violates fairness expectations)

Types of monotonicity constraints:

Individual monotonicity: $f$ increases in specific dimension $i$
Partial monotonicity: $f$ increases in subset of dimensions
Joint monotonicity: $f$ increases when multiple specific dimensions increase together
Output monotonicity: Specific output $f_k(x)$ increases when input $x_i$ increases

Verification approaches:

Gradient-based: Verify $\frac{\partial f}{\partial x_i} \geq 0$ everywhere (requires computing gradient bounds)
Symbolic differentiation: Compute symbolic gradient expressions; verify they’re non-negative
Sampling-based: Check monotonicity on finite samples (incomplete but practical)
SMT encoding: Encode non-monotonicity as constraint; unsatisfiable implies monotone

Challenges:

ReLU networks aren’t globally monotone (piecewise linear with varying slopes)
Verifying monotonicity over the full input space is expensive
Joint monotonicity (multiple dimensions) creates exponential complexity

Practical use: Certify that neural network-based credit scoring models respect domain constraints (higher income/employment leads to higher score). This prevents model mistakes that violate common sense and regulatory requirements.

Fairness Properties

Fairness requires that a network treats different demographic groups equitably. Unlike robustness or monotonicity, fairness is explicitly about protected attributes (race, gender, age) and equitable outcomes.

Common fairness definitions:

Individual fairness: Similar individuals should receive similar predictions.

$\|f(x) - f(x')\| \leq L \cdot d(x, x')$

where $d(x, x')$ is a similarity metric that respects fairness (doesn’t heavily weight protected attributes).
Group fairness (demographic parity): Outcomes should be independent of protected attributes.

$P(f(x) = 1 | \text{group A}) = P(f(x) = 1 | \text{group B})$
Equalized odds: True positive and false positive rates should be equal across groups.

$P(f(x) = 1 | y = 1, \text{group A}) = P(f(x) = 1 | y = 1, \text{group B})$
Counterfactual fairness: Predictions shouldn’t change if protected attribute changes (all else equal).

$f(x|A=a) = f(x|A=a') \quad \forall a, a'$

where $A$ is a protected attribute (e.g., gender).

Why fairness verification matters:

Legal compliance: Anti-discrimination laws (GDPR, ECOA) require fair automated decisions
Ethical responsibility: ML systems shouldn’t perpetuate or amplify societal biases
Trust and adoption: Unfair models lose user trust and face regulatory scrutiny

Verification approaches:

Statistical testing: Measure group fairness metrics on test data (incomplete---doesn’t guarantee fairness on unseen inputs)
Constraint-based verification: Encode fairness as formal constraints; verify using SMT or optimization
Counterfactual verification: For each input $x$ , verify that changing protected attribute doesn’t change output
Robustness-based fairness: Treat protected attributes as perturbations; verify local robustness

Connection to robustness: Individual fairness is essentially local robustness with a fairness-aware distance metric. If $x$ and $x'$ differ only in protected attributes, individual fairness requires $f(x) \approx f(x')$ .

Challenges:

Fairness-accuracy tradeoff: Enforcing strict fairness can reduce accuracy
Multiple fairness definitions conflict: Satisfying all fairness notions simultaneously is often impossible
Intersectionality: Fairness across multiple protected attributes (race and gender) is complex
Causality: Fairness requires understanding causal relationships, not just correlations

Practical use: Verify that hiring models, loan approval systems, and criminal risk assessment tools treat protected groups fairly. Formal verification provides stronger guarantees than testing alone---it proves fairness properties hold for all inputs, not just observed data.

Example verification property: For a hiring model $f(x)$ where $x$ includes gender, verify that changing only the gender feature (keeping qualifications constant) doesn’t change the hiring decision:

$\forall x: f(x|\text{gender}=M) = f(x|\text{gender}=F)$

This is a form of counterfactual fairness that can be verified using SMT solvers or bound propagation techniques adapted to track protected attribute influence.

Verification Beyond Adversarial Robustness

While $\ell_p$ robustness dominates neural network verification research, properties like reachability, monotonicity, and fairness are equally critical for trustworthy AI. These properties require different verification techniques and encode domain-specific correctness requirements. A comprehensive verification approach addresses multiple property types, not just adversarial robustness.

Choosing the Right Threat Model

Selecting an appropriate threat model requires understanding your application’s actual threats:

Consider the attack surface:

Digital inputs (uploaded images, text, data): $\ell_p$ perturbations model pixel/feature manipulation
Physical inputs (camera images, sensor data): Physical threat models account for real-world constraints
Interactive systems (chatbots, recommendation systems): Semantic perturbations model realistic user behavior

Consider adversary capabilities:

Sophisticated attackers (security-critical systems): Assume white-box, use tight $\epsilon$
Accidental perturbations (medical imaging equipment variations): Model specific noise patterns or transformations
Environmental variation (weather, lighting): Use physical or semantic threat models

Consider deployment context:

Safety-critical (autonomous vehicles, medical): Use conservative threat models, verify worst-case
General purpose (consumer applications): Balance robustness with performance, accept some uncertainty
Adversarial environments (security, finance): Assume adaptive attackers, verify multiple threat models

Hint: Start conservative, refine based on reality. Begin with white-box $\ell_\infty$ or $\ell_2$ threat models (well-understood, widely verified). If these prove too restrictive or don’t match actual threats, refine to more specific threat models. Document your threat model assumptions clearly---they define the scope of your verification guarantees.

Common Pitfalls

Mismatched threat models: Verifying $\ell_\infty$ robustness when actual threats are rotations provides false security. Always align threat models with realistic attacks.

Overly restrictive budgets: $\epsilon$ too small makes verification meaningless (trivially robust). Too large makes it infeasible (no network can achieve it).

Ignoring practical constraints: $\ell_p$ perturbations allow modifications that aren’t physically realizable. For physical systems, verify against physical threat models.

Single threat model assumption: Real attackers explore multiple attack vectors. Verify against multiple complementary threat models for comprehensive coverage.

Unstated assumptions: “Verified robust” is meaningless without specifying the threat model. Always document: which norm, what $\epsilon$ , white-box or black-box, any additional constraints.

Final Thoughts

The threat model is the foundation of verification. It defines what you’re protecting against and what guarantees verification provides. Choose threat models carefully---they must match your actual deployment threats while remaining tractable for verification.

Modern verification tools support $\ell_\infty$ and $\ell_2$ norms well. Physical and semantic threat models are emerging research areas. As the field matures, more sophisticated threat models will become verifiable, bridging the gap between mathematical perturbations and real-world threats.

Understanding threat models clarifies what verification means for your application. It’s not about proving absolute security---it’s about proving robustness under specified, realistic assumptions. Combined with comprehensive testing and monitoring, threat-model-based verification builds systems you can reasonably trust.

What Is a Threat Model?

ℓp\ell_pℓp​ Norm Threat Models

ℓ∞\ell_\inftyℓ∞​ Threat Model

ℓ2\ell_2ℓ2​ Threat Model

ℓ1\ell_1ℓ1​ Threat Model

ℓ0\ell_0ℓ0​ Threat Model

Beyond ℓp\ell_pℓp​ Norms: Semantic and Physical Threat Models

Semantic Threat Models

Physical Threat Models

White-Box vs Black-Box Threat Models

Beyond Adversarial Robustness: Other Verification Properties

Reachability Analysis

Monotonicity Properties

Fairness Properties

Choosing the Right Threat Model

Common Pitfalls

Final Thoughts

$\ell_p$ Norm Threat Models

$\ell_\infty$ Threat Model

$\ell_2$ Threat Model

$\ell_1$ Threat Model

$\ell_0$ Threat Model

Beyond $\ell_p$ Norms: Semantic and Physical Threat Models