Phase 2: Methods & Tools Guide 1

Beyond ReLU: Modern Activation Functions

How activation function choices impact neural network verification, from ReLU's optimal convex hull to the challenges of GeLU, Swish, and other modern activations.

ReLU dominated neural network design for over a decade---but the landscape is shifting. Modern architectures use diverse activation functions: transformers employ GeLU, mobile networks optimize with Swish, and classical applications still rely on Sigmoid and Tanh. This diversity brings expressive power and improved performance. But what happens when you try to verify networks using these activations? The verification landscape changes dramatically.

The fundamental challenge is this: neural network verification requires analyzing non-linear functions through linear approximations. Different activation functions have different geometric properties, which means they require different---and sometimes dramatically different---linear bounds. Understanding these differences is crucial for anyone designing networks that need to be formally verified.

In this post, we’ll explore how activation function choices impact verification, examining the mathematical foundations that make some activations verification-friendly and others computationally expensive. By the end, you’ll understand why ReLU remains the verification king, and when the benefits of modern activations justify their verification costs.

Theoretical Framework: Activation Functions as Unary Operations

To understand why activation functions create unique verification challenges, we must first understand the fundamental decomposition of neural network computation (see operation decomposition). Neural networks can be viewed as compositions of two conceptually distinct operation types:

Binary Operations (Affine Transformations):

x_i = b + \sum_{j} w_j \cdot a_j

These linear transformations are exact from a verification perspective. Bound propagation through affine functions is straightforward---no approximation needed.
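As a concrete illustration, here is a minimal NumPy sketch of exact interval propagation through one affine layer (the function name and example values are ours, purely illustrative): splitting the weight matrix by sign picks, for each output, the combination of input endpoints that minimizes or maximizes it.

```python
import numpy as np

def affine_interval_bounds(W, b, l, u):
    """Exact elementwise bounds of W @ x + b when each x[i] lies in [l[i], u[i]]."""
    W_pos = np.clip(W, 0, None)   # positive weights: bound follows the input's direction
    W_neg = np.clip(W, None, 0)   # negative weights: bound flips the input's direction
    lower = W_pos @ l + W_neg @ u + b
    upper = W_pos @ u + W_neg @ l + b
    return lower, upper

# Example: a 2x3 affine layer with every input in [-1, 1]
W = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, -1.0]])
b = np.array([0.1, -0.2])
print(affine_interval_bounds(W, b, -np.ones(3), np.ones(3)))
```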

Unary Operations (Activation Functions):

a_i = \phi(x_i)

These non-linear transformations are the source of all verification complexity. Every approximation, every loss of precision, every verification failure ultimately traces back to how we handle these unary operations.

Why This Decomposition Matters:

The unary/binary decomposition reveals a fundamental asymmetry in verification:

  1. Binary operations (affine) are verification-friendly: linear, exact, compositional. We can propagate bounds through matrix multiplications and additions with perfect precision (modulo numerical errors).

  2. Unary operations (activations) are verification-hard: non-linear, requiring approximation, breaking compositionality. The quality of our activation function approximations determines the quality of our entire verification.

Theoretical Consequence:

For a network with L layers, each containing n neurons, we have Ln activation functions---and each one requires a relaxation. The accumulated looseness from these Ln approximations compounds through the network. Tighter activation bounds → exponentially tighter output bounds in deep networks.

This is why activation function choice is not just an architectural decision---it’s a fundamental verification bottleneck. The rest of this blog explores how different activation functions create different bottlenecks, and what we can do about it.

The Verification Challenge with Activation Functions

When we verify a neural network, bound propagation algorithms track the possible range of values at each layer. For linear layers, this is straightforward: output bounds follow directly from input bounds through matrix multiplication. But activation functions are non-linear, and here’s where verification gets interesting.

Why Activation Functions Matter for Verification

A neural network is fundamentally a composition of linear transformations (matrix multiplications) and non-linear transformations (activation functions). To verify properties of the full network, we must propagate bounds through both---but the activation functions create a bottleneck. The quality of bounds at activation functions directly determines the quality of bounds throughout the entire network. A loose approximation at one activation layer compounds through subsequent layers, potentially making the overall verification impossible.

The Core Problem: Linear Relaxation

Here’s the central technical challenge: bound propagation requires linear expressions to propagate bounds through layers. But activation functions are non-linear. We can’t propagate non-linear bounds---we need to approximate each activation with linear bounds.

More formally, for an activation function ϕ(x) and input bounds [l, u], we need two linear functions:

\text{Lower bound: } \underline{\phi}(x) = \alpha_l x + \beta_l
\text{Upper bound: } \overline{\phi}(x) = \alpha_u x + \beta_u
\text{satisfying: } \underline{\phi}(x) \leq \phi(x) \leq \overline{\phi}(x) \quad \forall x \in [l, u]

These two linear functions “sandwich” the true activation function. The key requirement: the bounds must be sound---they must fully contain the true activation’s output.

Key Distinction: The tightness of these linear bounds directly determines how tight verification bounds are. Loose bounds → loose verification → possibly unverifiable networks. Tight bounds → tight verification → more networks can be certified.
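To make the “sandwich” requirement concrete, here is a tiny sampling-based checker (illustrative only; sampling can refute soundness but never prove it) that tests whether candidate linear bounds stay on the correct side of an activation over [l, u].

```python
import numpy as np

def check_sandwich(phi, l, u, alpha_l, beta_l, alpha_u, beta_u, n=10_000, tol=1e-12):
    """Sample [l, u] and verify alpha_l*x + beta_l <= phi(x) <= alpha_u*x + beta_u."""
    x = np.linspace(l, u, n)
    y = phi(x)
    return bool(np.all(alpha_l * x + beta_l <= y + tol) and
                np.all(y <= alpha_u * x + beta_u + tol))

# ReLU over [-1, 2] with the triangle relaxation: lower bound y = 0, upper bound the chord
relu = lambda x: np.maximum(0.0, x)
l, u = -1.0, 2.0
print(check_sandwich(relu, l, u, 0.0, 0.0, u / (u - l), -l * u / (u - l)))  # True
```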

ReLU - The Verification Baseline

ReLU has dominated both practice and verification research for good reason. To understand why, let’s look at its mathematical structure.

Why ReLU Became the Standard

ReLU’s definition is deceptively simple: ReLU(x) = max(0, x). This piecewise linear structure is fundamental to its verification-friendliness.

The function has two linear regions:

  • For x ≥ 0: The function is exactly y = x
  • For x < 0: The function is exactly y = 0

There’s only one “kink” at x = 0. This piecewise linearity is revolutionary for verification because it means we can sometimes compute exact bounds, not approximations.

ReLU’s Linear Bounds

Given input bounds [l, u], ReLU’s bounds depend on where the interval sits relative to zero. The formal definition from the verification literature distinguishes between stable and unstable neurons:

Definition: ReLU Polytope for Unstable Neurons

For neuron z = ReLU(ẑ), let l and u be the lower and upper bounds of its pre-activation ẑ. If l < 0 < u (unstable neuron), the ReLU can be bounded by the following linear constraints:

z \geq 0, \quad z \geq \hat{z}, \quad z \leq \frac{u}{u - l}(\hat{z} - l)

These constraints define a region called the ReLU polytope. When both l and u are tight, this polytope is exactly the convex hull of the ReLU over its input range---the tightest convex relaxation for this neuron.

The three cases are:

Case 1 - Fully Active (l \geq 0):

\underline{\text{ReLU}}(x) = \overline{\text{ReLU}}(x) = x \quad \text{(exact)}

Case 2 - Fully Inactive (u \leq 0):

\underline{\text{ReLU}}(x) = \overline{\text{ReLU}}(x) = 0 \quad \text{(exact)}

Case 3 - Unstable (l < 0 < u):

\underline{\text{ReLU}}(x) = 0, \quad \overline{\text{ReLU}}(x) = \frac{u}{u - l}(x - l)

In Case 3, the upper bound is the secant line (chord) connecting (l, 0) to (u, u). This defines a triangle---the convex hull of the ReLU function over the interval---and this triangular relaxation is optimal. No other linear bounds can be tighter without violating soundness.

Key Insight: ReLU’s piecewise linearity enables exact bounds in many cases and optimal (convex hull) bounds in the general case. This is why ReLU enables efficient verification.
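The three cases translate directly into code. Below is a minimal sketch (our own helper, returning each bound as a (slope, intercept) pair, not taken from any particular tool) of how a bound-propagation method would pick ReLU’s linear bounds from the pre-activation interval.

```python
def relu_linear_bounds(l, u):
    """Return ((a_l, b_l), (a_u, b_u)) with a_l*x + b_l <= ReLU(x) <= a_u*x + b_u on [l, u]."""
    if l >= 0:                      # Case 1: fully active, ReLU(x) = x exactly
        return (1.0, 0.0), (1.0, 0.0)
    if u <= 0:                      # Case 2: fully inactive, ReLU(x) = 0 exactly
        return (0.0, 0.0), (0.0, 0.0)
    slope = u / (u - l)             # Case 3: unstable, upper bound is the chord (l, 0)-(u, u)
    return (0.0, 0.0), (slope, -slope * l)

print(relu_linear_bounds(-1.0, 2.0))   # ((0.0, 0.0), (0.667, 0.667)) up to rounding
```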

Formal Proof: Triangle Relaxation Optimality

Theorem (Triangle Relaxation is Optimal Convex Hull): For ReLU activation y = max(0, x) over interval [l, u] with l < 0 < u, the triangle relaxation defined by:

y \geq 0, \quad y \geq x, \quad y \leq \frac{u}{u - l}(x - l)

is the tightest possible convex linear relaxation. No sound single-neuron linear bounds can be tighter.

Proof:

We prove this by showing the triangle region is exactly the convex hull of the ReLU function over [l, u].

Step 1: ReLU over [l,u] defines a non-convex region

The true ReLU function over [l, u] traces two line segments:

  • For x ∈ [l, 0]: y = 0 (horizontal segment from (l, 0) to (0, 0))
  • For x ∈ [0, u]: y = x (diagonal segment from (0, 0) to (u, u))

This is non-convex: the set of points {(x, max(0, x)) : x ∈ [l, u]} forms an L-shape.

Step 2: Convex hull construction

The convex hull is the smallest convex set containing all points on the ReLU curve. For our L-shaped curve, the convex hull is the triangle with vertices:

V_1 = (l, 0), \quad V_2 = (0, 0), \quad V_3 = (u, u)

This triangle is bounded by three linear constraints:

  1. Lower bound y ≥ 0: Horizontal line through V_1 and V_2
  2. Lower bound y ≥ x: Line through V_2 and V_3 (note: this constraint only binds for x ≥ 0, but we include it for completeness)
  3. Upper bound y ≤ (u/(u−l))(x − l): Secant line (chord) through V_1 and V_3

Step 3: These bounds are minimal

Any tighter linear bounds would violate soundness:

  • Raising the lower bound above y ≥ 0 would exclude the point (l, 0) on the ReLU curve, violating soundness
  • An upper bound with slope greater than u/(u−l) that still covers (u, u) drops below 0 at x = l, excluding the point (l, 0)
  • An upper bound with slope less than u/(u−l) that still covers (l, 0) drops below the point (u, u) on the diagonal

Conclusion: The triangle relaxation is the convex hull---the tightest possible convex linear over-approximation. □

Implication: This theorem establishes a convex barrier for single-neuron ReLU verification. Methods using triangle relaxation (DeepPoly, CROWN, Fast-Lin) are already optimal for individual neurons. Tightness improvements require multi-neuron analysis (see theoretical barriers).

Complexity Implications: Why ReLU Causes Exponential Branching

While the triangle relaxation is optimal for incomplete verification, complete verification of ReLU networks has exponential complexity. Understanding why requires analyzing ReLU’s piecewise structure.

Theorem (Exponential Branching for Complete ReLU Verification): For a neural network with n ReLU neurons, complete verification (finding exact output bounds) requires exploring 2^n activation patterns in the worst case.

Proof:

Each ReLU neuron i with pre-activation x_i ∈ [l_i, u_i], where l_i < 0 < u_i, can be in two states:

  • Active: x_i > 0 ⇒ y_i = x_i
  • Inactive: x_i ≤ 0 ⇒ y_i = 0

For n unstable ReLUs (those with l_i < 0 < u_i), there are 2^n possible activation patterns. Each pattern defines a different linear region of the network function.

The network’s output in each linear region is determined by which ReLUs are active/inactive. To find the exact maximum or minimum output, a complete verifier must:

  1. Enumerate all 2^n activation patterns
  2. For each pattern, solve a linear program over that region
  3. Take the maximum/minimum across all regions

Consequence: Complete methods like Reluplex, Marabou, and MILP-based approaches have exponential worst-case complexity. This is why:

  • Small networks (fewer than 1000 ReLUs) are tractable
  • Large networks (more than 10000 ReLUs) are intractable without incomplete relaxations

Connection to MILP Encoding:

MILP-based complete verification encodes each ReLU with a binary integer variable δ_i ∈ {0, 1}:

\begin{aligned}
y_i &\geq 0 \\
y_i &\geq x_i \\
y_i &\leq x_i - l_i(1 - \delta_i) \\
y_i &\leq u_i \delta_i
\end{aligned}

The binary variables δ_i encode activation states. MILP solvers use branch-and-bound to explore the 2^n combinations---essentially navigating the exponential space implicitly rather than explicitly.
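The encoding can be sanity-checked numerically without any solver: for every x in [l, u], the true pair (x, ReLU(x)) together with the matching indicator δ should satisfy all four constraints. A small illustrative check (function name and values are ours):

```python
import numpy as np

def satisfies_relu_milp(x, y, delta, l, u, tol=1e-9):
    """Check the four MILP constraints for one ReLU neuron with pre-activation bounds [l, u]."""
    return (y >= -tol and
            y >= x - tol and
            y <= x - l * (1 - delta) + tol and
            y <= u * delta + tol)

l, u = -1.5, 2.0
for x in np.linspace(l, u, 9):
    y = max(0.0, x)                  # true ReLU output
    delta = 1 if x > 0 else 0        # matching activation state
    assert satisfies_relu_milp(x, y, delta, l, u)
print("big-M encoding is consistent with ReLU on all samples")
```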

Why Incomplete Methods Avoid Exponential Complexity:

Incomplete methods (LP-based, abstract interpretation) use the triangle relaxation instead:

  • No binary variables → Linear program (polynomial time)
  • Over-approximates all 2^n regions with a single convex polytope
  • Trades exactness (completeness) for tractability (polynomial complexity)

This is the fundamental soundness-completeness-scalability tradeoff (see theoretical barriers).

ReLU’s Limitations

Despite its verification advantages, ReLU has architectural limitations. Dead neurons (neurons that always output 0) are common, especially in deeper networks. Gradient flow issues in the inactive region can slow training. These limitations motivate alternative activations despite their verification challenges.

Sigmoid and Tanh - Classical Smooth Activations

Before ReLU, Sigmoid and Tanh dominated neural networks. These smooth, differentiable activations have bounded outputs, making them useful for many applications. But their mathematical structure creates verification challenges.

Sigmoid: The Logistic Function

The sigmoid function maps any input to a probability-like output in (0, 1):

\sigma(x) = \frac{1}{1 + e^{-x}}

The S-curve shape is smooth and differentiable everywhere, with an inflection point at the center where the slope is steepest.

To bound sigmoid, we construct linear bounds using geometric approaches like tangent lines and secant lines. The sigmoid derivative is σ'(x) = σ(x)(1 − σ(x)), which is useful in deriving these bounds.

The key challenge is that sigmoid bounds are inherently approximate---they’re never exact the way ReLU’s can be. The fundamental insight: sigmoid bound quality depends heavily on the input interval width, and the bounds are inherently loose compared to piecewise linear activations.

Mathematical Analysis of Sigmoid Approximation

For interval [l, u], optimal linear bounds for sigmoid follow a tangent-secant strategy:

Upper Bound (Secant Line):

The secant (chord) connecting (l, σ(l)) to (u, σ(u)) serves as the upper bound (see the caveat below on when it is sound). Its slope is:

\lambda_u = \frac{\sigma(u) - \sigma(l)}{u - l}

The upper bound linear function is:

\overline{\sigma}(x) = \lambda_u (x - l) + \sigma(l) = \frac{\sigma(u) - \sigma(l)}{u - l}(x - l) + \sigma(l)

Why a secant for the upper bound? Sigmoid is convex on (−∞, 0] and concave on [0, ∞). When the interval lies in the convex region (u ≤ 0), the secant sits above the curve, so it is a sound upper bound. When the interval lies in the concave region or crosses the inflection point, the roles can swap: verification tools switch to tangent-based upper bounds (and secant-based lower bounds) as needed to preserve soundness.

Lower Bound (Tangent Line):

The tangent line provides the tightest lower bound. But which tangent? CROWN uses an optimized tangent point d ∈ [l, u] chosen to minimize over-approximation.

The tangent at point d has slope σ'(d) = σ(d)(1 − σ(d)):

\underline{\sigma}(x) = \sigma'(d) \cdot (x - d) + \sigma(d)

Optimizing the Tangent Point:

CROWN finds the optimal d* by solving:

d^* = \arg\min_{d \in [l, u]} \text{Area}(\text{over-approximation region})

This minimization ensures the tightest possible tangent lower bound. The key insight: the optimal tangent point is not necessarily at the interval center---it depends on the sigmoid’s curvature over [l, u].

CROWN’s Binary Search Algorithm:

Computing d* analytically is difficult due to sigmoid’s transcendental nature. CROWN uses binary search:

Algorithm: Binary Search for Optimal Tangent Point

Input: Interval [l, u], tolerance ε, probe step δ
Output: Optimal tangent point d*

1. Initialize: d_low = l, d_high = u
2. While (d_high - d_low) > ε:
   a. d_mid = (d_low + d_high) / 2
   b. Compute the tangent at d_mid: T_mid(x) = σ'(d_mid)(x - d_mid) + σ(d_mid)
   c. Evaluate the area: A_mid = ∫[l,u] max(0, σ(x) - T_mid(x)) dx
   d. Evaluate A_left and A_right: the same area for tangent points d_mid - δ and d_mid + δ
   e. If A_left < A_mid: d_high = d_mid
      Else if A_right < A_mid: d_low = d_mid
      Else: break (local minimum found)
3. Return d* = d_mid
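Below is a minimal Python sketch of this search---our own illustrative code, not CROWN's implementation---using a sampled trapezoid rule in place of the exact area integral.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gap_area(d, l, u, n=400):
    """Area between the sigmoid and its tangent at d over [l, u] (the over-approximation)."""
    xs = np.linspace(l, u, n)
    tangent = sigmoid(d) * (1 - sigmoid(d)) * (xs - d) + sigmoid(d)
    return np.trapz(np.maximum(0.0, sigmoid(xs) - tangent), xs)

def optimal_tangent_point(l, u, eps=1e-6, delta=1e-4):
    """Binary search for the tangent point d* that minimizes the gap area."""
    d_low, d_high = l, u
    d_mid = 0.5 * (d_low + d_high)
    while d_high - d_low > eps:
        d_mid = 0.5 * (d_low + d_high)
        a_mid = gap_area(d_mid, l, u)
        if gap_area(d_mid - delta, l, u) < a_mid:
            d_high = d_mid           # the area still decreases to the left
        elif gap_area(d_mid + delta, l, u) < a_mid:
            d_low = d_mid            # the area still decreases to the right
        else:
            break                    # local minimum bracketed within delta
    return d_mid

print(optimal_tangent_point(-3.0, 1.0))
```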

Complexity Analysis:

  • Per-neuron cost: O(log(1/ε)) iterations (binary search)
  • Each iteration evaluates sigmoid and its derivative (constant time)
  • Total for n sigmoid neurons: O(n log(1/ε))

This is efficient---logarithmic in precision, linear in network size. However, it’s still more expensive than ReLU’s constant-time triangle relaxation.

Error Bounds for Sigmoid Approximation:

The approximation error depends on the interval width w = u − l and the sigmoid’s curvature:

Theorem (Sigmoid Approximation Error): For interval [l, u] with width w = u − l, the maximum distance between the true sigmoid and its linear bounds is:

\text{Error} \leq \frac{w^2}{8} \cdot \max_{x \in [l,u]} |\sigma''(x)|

where σ''(x) = σ'(x)(1 − 2σ(x)) is the second derivative.

Consequence:

  • Narrow intervals (w small) → quadratic error reduction → tight bounds
  • Wide intervals (w large) → large error → loose bounds
  • Transition region: |σ''(x)| peaks near |x| ≈ 1.3 (at the inflection point itself, σ''(0) = 0), so intervals spanning the S-curve’s transition incur the worst-case error
  • Deep in saturation (|x| ≫ 0): |σ''(x)| ≈ 0 → sigmoid nearly constant → tighter bounds

This explains why sigmoid verification is harder than ReLU: unavoidable approximation error that scales quadratically with interval width.
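A quick numerical check of the theorem, using the secant as the linear bound over a sample interval (illustrative code; the constant w²/8 · max|σ''| is the standard linear-interpolation error bound):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_second(x):
    s = sigmoid(x)
    return s * (1 - s) * (1 - 2 * s)

l, u = -2.5, -0.5
xs = np.linspace(l, u, 4001)
secant = sigmoid(l) + (sigmoid(u) - sigmoid(l)) / (u - l) * (xs - l)
empirical_gap = np.max(np.abs(sigmoid(xs) - secant))
theoretical = (u - l) ** 2 / 8 * np.max(np.abs(sigmoid_second(xs)))
print(f"empirical max gap {empirical_gap:.4f} <= theoretical bound {theoretical:.4f}")
```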

Hint: Sigmoid bounds are tightest when the input interval is narrow or sits in a low-curvature region---right around the inflection point (x ≈ 0), where the curve is locally almost linear, or deep in saturation, where it is nearly constant. They become very loose when a single wide interval stretches from the transition region into saturation: there the function is nearly constant, but linear bounds whose slope is dictated by the whole interval cannot capture this.

Tanh: Sigmoid’s Symmetric Cousin

Tanh is essentially a scaled and shifted sigmoid:

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1

The relationship is exact: tanh(x) = 2σ(2x) − 1. Tanh outputs lie in (−1, 1) and the function is symmetric around the origin, unlike sigmoid’s asymmetric range of (0, 1).

Bounding tanh uses the same strategy as sigmoid: tangent line for lower bound, secant line for upper bound. The bounds have similar tightness properties---good near the center, loose in saturation regions.

Mathematical Analysis of Tanh Approximation

Tanh shares sigmoid’s approximation strategy but with subtle differences due to symmetry. For interval [l, u]:

Upper Bound (Secant):

\overline{\tanh}(x) = \frac{\tanh(u) - \tanh(l)}{u - l}(x - l) + \tanh(l)

Lower Bound (Optimized Tangent):

\underline{\tanh}(x) = \tanh'(d^*) \cdot (x - d^*) + \tanh(d^*)

where d* ∈ [l, u] is found via binary search (same algorithm as for sigmoid).

Tanh derivative: tanh'(x) = 1 − tanh²(x) = sech²(x)

Key Difference from Sigmoid:

Tanh is odd-symmetric: tanh(−x) = −tanh(x). This symmetry provides computational advantages:

  1. Symmetric intervals: If [l, u] = [−a, a] (symmetric around the inflection point at 0), the bounding problem is itself symmetric, so the optimal tangent points for the lower and upper bounds mirror each other around 0, halving the search effort
  2. Error analysis simplifies: Symmetric intervals have predictable error characteristics

Error Bound:

Similar to sigmoid, tanh approximation error scales quadratically with interval width:

\text{Error} \leq \frac{w^2}{8} \cdot \max_{x \in [l,u]} |\tanh''(x)|

where tanh''(x) = −2 tanh(x) sech²(x).

When Tanh vs. Sigmoid is Tighter:

  • Symmetric intervals ([−a, a]): Tanh’s odd symmetry simplifies the tangent search, potentially giving tighter bounds
  • Asymmetric intervals: Both require binary search; tightness depends on curvature
  • Saturation: Both are loose, but tanh saturates symmetrically (±1) vs. sigmoid asymmetrically (0, 1)

Practical implication: For verification, tanh and sigmoid have similar computational cost and tightness. Choice depends on network architecture rather than verification efficiency.

Note: Comparing Sigmoid and Tanh: While mathematically related, tanh’s symmetry sometimes makes it easier to bound in certain verification scenarios. However, both suffer from loose bounds in saturation regions, and both are more difficult to verify than ReLU.

GeLU - The Transformer Era Activation

The rise of transformers brought a new activation to prominence: GeLU (Gaussian Error Linear Unit). Unlike ReLU’s simplicity or sigmoid’s classic smoothness, GeLU has unique properties that create new verification challenges.

What is GeLU?

GeLU is defined as:

\text{GeLU}(x) = x \cdot \Phi(x)

where Φ(x) is the cumulative distribution function of the standard normal distribution. Intuitively, it’s a probabilistic activation---the neuron’s output is scaled by the probability that a standard normal variate is less than the input.

For computational efficiency, GeLU is typically approximated as:

\text{GeLU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right)

This approximation is nearly exact and is what most frameworks actually use.
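A short sketch comparing the exact erf-based definition with the tanh approximation, written out directly from the two formulas above (function names are ours):

```python
import numpy as np
from math import erf, sqrt, pi

def gelu_exact(x):
    """x * Phi(x), with Phi the standard normal CDF expressed via erf."""
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gelu_tanh(x):
    """The tanh-based approximation used by most frameworks."""
    return 0.5 * x * (1.0 + np.tanh(sqrt(2.0 / pi) * (x + 0.044715 * x ** 3)))

xs = np.linspace(-6.0, 6.0, 1001)
max_dev = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh approximation| on [-6, 6]: {max_dev:.2e}")  # small everywhere
```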

Why Transformers Use GeLU

GeLU became the standard in transformers (BERT, GPT, etc.) because it has unique mathematical properties. Unlike ReLU, which is monotonically increasing, GeLU is non-monotonic---it dips to a minimum in the negative region before rising. At the same time, its smooth, probabilistic gating acts as a soft version of ReLU’s hard threshold, a combination that has proven effective in transformer architectures.

The Verification Challenge: Non-Monotonicity

Non-monotonicity fundamentally complicates verification. With monotonic activations (ReLU, sigmoid, tanh), we can use simple strategies: minimum at lower bound, maximum at upper bound. But with GeLU, the minimum might be in the middle of our input interval.

Consider an input interval where the true minimum is in the interior. A naive linear bound that only uses endpoints will miss this interior minimum and produce overly loose bounds.

To handle this, modern verification tools use the convex hull of the activation function:

Convex hull bounds find linear lower and upper bounds that minimize looseness while respecting all interior points.

GeLU requires special handling because its non-monotonicity means the convex hull is more complex than simple tangent/secant approaches.
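The sketch below makes the problem concrete (illustrative numbers): over an interval straddling GeLU's dip, the true minimum lies strictly below both endpoint values, so an endpoints-only strategy under-covers the output range.

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

l, u = -3.0, 1.0
xs = np.linspace(l, u, 10_001)
ys = np.array([gelu(x) for x in xs])

print(f"min of endpoint values: {min(gelu(l), gelu(u)):.4f}")                    # about -0.004 at x = -3
print(f"true min on the interval: {ys.min():.4f} at x = {xs[ys.argmin()]:.2f}")  # about -0.17 near x = -0.75
```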

Modern Architecture Alert: GeLU’s adoption in transformers (BERT, GPT, etc.) means modern NLP models are harder to verify than earlier RNN/CNN-based systems. This is an important trade-off: transformers gain expressiveness and performance, but verification becomes more challenging.

Tip: When implementing GeLU verification, use adaptive bounding strategies that detect and handle the minimum region separately. Many tools (like alpha-beta-CROWN) have optimized handling for GeLU and similar non-monotonic activations.

Swish - Mobile-Friendly Activation

EfficientNets and mobile neural networks popularized Swish, a self-gated activation with interesting properties:

\text{Swish}(x) = x \cdot \sigma(\beta x)

The parameter β controls smoothness: at β = 0, Swish becomes linear; as β increases, it approaches ReLU.

Like GeLU, Swish is non-monotonic (it has a minimum), creating similar verification challenges. The β parameter affects bound tightness: larger β makes the function closer to ReLU and potentially easier to bound, while smaller β makes it smoother and harder to approximate.

Note: The β parameter is a trade-off knob. For training, you might choose β based on performance. For verification, you need to account for how β affects bound quality. Smaller β is more expressive but harder to verify.

Swish is slightly simpler to bound than GeLU in practice, but still requires non-monotonic handling strategies. Most verification tools handle Swish similarly to GeLU.
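A short sketch of how β moves Swish between the linear and ReLU-like regimes, and where its interior minimum sits (printed values are approximate):

```python
import numpy as np

def swish(x, beta):
    return x / (1.0 + np.exp(-beta * x))

xs = np.linspace(-6.0, 6.0, 10_001)
for beta in (0.5, 1.0, 4.0):
    ys = swish(xs, beta)
    print(f"beta={beta}: min {ys.min():+.3f} at x = {xs[ys.argmin()]:+.2f}")
# Larger beta pushes the curve toward ReLU: the dip becomes shallower and moves
# toward zero. Smaller beta gives a smoother curve with a wider, deeper dip.
```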

MaxPool - Not Just an Activation

While not technically an activation function, MaxPool shares verification challenges similar to non-linear activations. It’s worth understanding because many CNN-based networks must be verified with MaxPool.

MaxPool’s Definition

MaxPool operates on a neighborhood of values and outputs the maximum:

\text{MaxPool}(x_1, \ldots, x_n) = \max_i x_i

For a pooling window in a 2D convolution, we compute the maximum over all values within the window.

The Verification Puzzle

MaxPool creates a combinatorial challenge. For input bounds [l_i, u_i] on each x_i, the naive interval bounds are:

\text{Lower bound: } \max_i l_i, \quad \text{Upper bound: } \max_i u_i

However, linear relaxation of maxpool is more complex. We need to encode which input could be the maximum in different cases. The challenge is that multiple inputs might be the maximum in different scenarios, creating a combinatorial explosion.

Recent methods like Ti-Lin and MaxLin compute neuron-wise tightest linear bounds for maxpool by minimizing the over-approximation zone. These bounds depend on the specific input bounds and which inputs could potentially be winners. The solution is to use convex hull bounds that account for all possible maximum scenarios with optimized hyperplanes.
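A minimal sketch of the naive interval bounds, plus the set of inputs that could still “win” the max (an input is a candidate only if its upper bound reaches the largest lower bound); the size of this candidate set is exactly what makes the linear relaxation combinatorial. The helper names and example values are ours.

```python
import numpy as np

def maxpool_interval_bounds(l, u):
    """Naive interval bounds for max(x_1, ..., x_n) with each x_i in [l_i, u_i]."""
    return np.max(l), np.max(u)

def candidate_winners(l, u):
    """Indices i that could achieve the maximum: u_i must reach max_j l_j."""
    return [i for i in range(len(l)) if u[i] >= np.max(l)]

l = np.array([-1.0, 0.2, -0.5, 0.1])
u = np.array([ 0.3, 0.8,  0.1, 0.9])
print(maxpool_interval_bounds(l, u))   # (0.2, 0.9)
print(candidate_winners(l, u))         # [0, 1, 3] -- input 2 can never be the max
```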

Architectural Note: MaxPool is not an activation function---it’s a pooling operation. However, it is non-linear and non-differentiable at points where two inputs tie for the maximum, which creates similar verification challenges. Treat MaxPool similarly to activation functions when verifying networks.

Comparison - The Verification Landscape

Let’s bring all these activations together and understand their trade-offs.

| Activation | Mathematical Property | Monotonic? | Bound Tightness | Computational Cost | Typical Use Cases |
| --- | --- | --- | --- | --- | --- |
| ReLU | Piecewise linear | Yes | Excellent (optimal convex hull) | Very Low | CNNs, feedforward networks, most baselines |
| Sigmoid | Smooth S-curve | Yes | Good (inflection), Poor (saturation) | Low | Gates in RNNs/LSTMs, probability outputs, legacy networks |
| Tanh | Smooth, symmetric | Yes | Good (center), Poor (saturation) | Low | LSTMs, centered representations |
| GeLU | Non-monotonic, smooth | No | Moderate (convex hull needed) | Medium | Transformers, BERT, GPT, NLP models |
| Swish | Non-monotonic, self-gated | No | Moderate (β-dependent) | Medium | EfficientNet, mobile networks |
| MaxPool | Piecewise linear, multi-input | Yes (per input) | Good (but combinatorial) | Low-Medium | CNNs, spatial downsampling |

Verification Impact and Tool Support

| Activation | Best Bounding Strategy | Main Challenge | Tool Support |
| --- | --- | --- | --- |
| ReLU | Triangle relaxation | None (architecturally, dead neurons) | Universal (all tools) |
| Sigmoid | Tangent + secant | Saturation looseness | Good (most modern tools) |
| Tanh | Tangent + secant | Saturation looseness | Good (most modern tools) |
| GeLU | Convex hull / adaptive | Non-monotonicity handling | Emerging (alpha-beta-CROWN, recent versions) |
| Swish | Convex hull / adaptive | Non-monotonicity, β parameterization | Limited (newer tools, workarounds needed) |
| MaxPool | Convex hull over outcomes | Combinatorial explosion | Good (most tools support pooling) |

The Unifying Principle: Convex Hull Approximation

Beneath all these different bounding strategies lies a unifying mathematical principle: the convex hull. For any activation function over a bounded input interval, the tightest possible linear over-approximation is the convex hull---the smallest convex set containing the function.

ReLU’s triangle relaxation is the convex hull. Sigmoid/Tanh’s tangent/secant bounds approximate the convex hull. GeLU’s adaptive strategies aim for the convex hull. This unified perspective reveals why some activations are inherently harder: non-monotonic activations have more complex convex hulls that require sophisticated algorithms to compute.

Research tools like WraAct (OOPSLA 2025) and WraLU (POPL 2024) focus specifically on computing convex hull approximations efficiently for arbitrary activation functions.
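As a generic, brute-force illustration of the principle (not how production tools do it), the sketch below samples an arbitrary 1-D activation over [l, u], computes the 2-D convex hull of the sampled graph with SciPy, and reads linear upper bounds off the upward-facing hull facets. For ReLU it recovers exactly the chord of the triangle relaxation.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_upper_bounds(phi, l, u, n=400):
    """Linear upper bounds y <= a*x + b read off the upper facets of the sampled convex hull."""
    xs = np.linspace(l, u, n)
    hull = ConvexHull(np.column_stack([xs, phi(xs)]))
    bounds = []
    # hull.equations rows are [nx, ny, c] with nx*x + ny*y + c <= 0 for points in the hull;
    # facets whose normal points upward (ny > 0) give upper bounds y <= -(nx*x + c)/ny.
    for nx, ny, c in hull.equations:
        if ny > 1e-12:
            bounds.append((-nx / ny, -c / ny))   # (slope, intercept)
    return bounds

relu = lambda x: np.maximum(0.0, x)
chords = {(round(a, 6), round(b, 6)) for a, b in hull_upper_bounds(relu, -1.0, 2.0)}
print(chords)   # {(0.666667, 0.666667)}: the chord y <= (2/3)x + 2/3 of the triangle relaxation
```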

Comparative Theoretical Analysis: Why ReLU is Verification-Friendly

Having examined individual activations, we can now analyze the fundamental theoretical differences that make some activations easier to verify than others.

Piecewise Linear vs. Smooth Activations

The core distinction is linearity structure:

ReLU (Piecewise Linear):

  • Definition: max(0, x) --- composed of two linear pieces
  • Linearity: Exactly linear in each region (active/inactive)
  • Convex hull: Finite set of vertices (3 for the unstable case: (l, 0), (0, 0), (u, u))
  • Relaxation: Triangle constraints are exact convex hull (optimal)
  • Complexity: Constant-time bound computation per neuron

Sigmoid/Tanh (Smooth):

  • Definition: σ(x) = 1/(1 + e^(−x)) --- transcendental function
  • Linearity: Nowhere linear (smooth S-curve)
  • Convex hull: Infinite boundary (continuous curve, not finitely many vertices)
  • Relaxation: Tangent/secant approximates convex hull (suboptimal)
  • Complexity: O(log(1/ε)) binary search for optimal tangent

Theorem (ReLU vs. Smooth Activation Fundamental Difference):

For piecewise linear activations (ReLU), the convex hull over any interval [l, u] is a polytope with finitely many vertices. Linear relaxation can be exact (achieving the convex hull).

For smooth activations (sigmoid, tanh), the convex hull boundary is a continuous curve. Any linear relaxation is strictly suboptimal---there is always non-zero approximation error.

Proof Sketch:

  • ReLU: Two linear pieces → convex hull is intersection of half-planes → polytope with 3 vertices (unstable case)
  • Sigmoid: Infinitely differentiable → convex hull boundary is smooth curve → cannot be represented exactly by finite linear constraints
  • Any tangent/secant bounds sigmoid → gap between linear bound and true curve → non-zero error

Consequence: ReLU’s piecewise linearity enables zero-error relaxation (for stable neurons) or optimal-error relaxation (for unstable neurons). Smooth activations always have inherent approximation error.

Approximation Error Comparison Across Activations

Let’s quantify approximation errors for different activations over interval [l, u] with width w = u − l.

| Activation | Error Scaling | Optimal Case Error | Worst Case Error | Dependent On |
| --- | --- | --- | --- | --- |
| ReLU (stable) | O(1) constant | 0 (exact) | 0 (exact) | N/A (always exact) |
| ReLU (unstable) | O(w) linear | 0 (at vertices) | Triangle area proportional to w | Pre-activation bounds |
| Sigmoid | O(w^2) quadratic | Near 0 (in saturation) | Proportional to w^2 times max curvature (near inflection) | Interval width, curvature |
| Tanh | O(w^2) quadratic | Near 0 (in saturation) | Proportional to w^2 times max curvature (near inflection) | Interval width, curvature |
| GeLU | O(w^2) quadratic | Near 0 (in tail regions) | O(w^2) (near minimum) | Non-monotonicity handling |
| Swish | O(w^2) quadratic | Near 0 (β-dependent) | O(w^2) (near minimum) | β parameter, interval |

Key Insights:

  1. ReLU: Linear error scaling --- error proportional to the interval width w. Tighter pre-activation bounds → proportionally tighter relaxation.

  2. Smooth activations: Quadratic error scaling --- error proportional to w^2. Doubling the interval width → 4x the error (see the numerical check after this list). This is why wide bounds severely hurt sigmoid/tanh verification.

  3. Non-monotonic (GeLU, Swish): Quadratic error + combinatorial --- Not only quadratic error, but also need to handle interior extrema. Requires convex hull computation over non-convex regions.
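A compact numerical check of the first two insights (our own measurement, using the symmetric interval [−w/2, w/2] for ReLU and an interval near the sigmoid's curvature peak): the ReLU triangle gap shrinks linearly in w, the sigmoid secant gap roughly quadratically.

```python
import numpy as np

def relu_triangle_gap(w):
    """Max vertical gap of the triangle relaxation on the unstable interval [-w/2, w/2]."""
    l, u = -w / 2, w / 2
    xs = np.linspace(l, u, 2001)
    upper = u / (u - l) * (xs - l)                 # the chord from (l, 0) to (u, u)
    return np.max(upper - np.maximum(0.0, xs))     # equals w/4, attained at x = 0

def sigmoid_secant_gap(w, center=-1.3):
    """Max gap between the sigmoid and its secant on an interval of width w."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    l, u = center - w / 2, center + w / 2
    xs = np.linspace(l, u, 2001)
    secant = sig(l) + (sig(u) - sig(l)) / (u - l) * (xs - l)
    return np.max(np.abs(sig(xs) - secant))

for w in (2.0, 1.0, 0.5):
    print(f"w={w}: ReLU gap {relu_triangle_gap(w):.4f}, sigmoid gap {sigmoid_secant_gap(w):.4f}")
# Halving w halves the ReLU gap but cuts the sigmoid gap roughly by a factor of four.
```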

Theoretical Limits for Each Activation Type

ReLU: Convex Barrier (T4 Tightness Level)

Single-neuron ReLU relaxation (triangle) is optimal (see theoretical barriers). No sound linear relaxation can be tighter without:

  • Multi-neuron analysis (PRIMA: k-neuron polytopes, T5-T6 tightness)
  • Non-linear relaxations (SDP: quadratic constraints, T7 tightness)
  • Sacrificing soundness (incomplete heuristics)

Fundamental limit: Triangle relaxation is the convex barrier for ReLU. Methods using it (DeepPoly, CROWN, Fast-Lin) are all at T4 tightness---they cannot improve individually without breaking to multi-neuron analysis.

Sigmoid/Tanh: No Convex Barrier (Continuous Improvement Possible)

Unlike ReLU, smooth activations have no finite convex barrier:

  • Binary search precision ε can be arbitrarily small → O(log(1/ε)) cost
  • Tighter tangent approximations possible with more computation
  • No theoretical limit on single-neuron tightness (only practical convergence limits)

Implication: Sigmoid/tanh verification can be improved arbitrarily by:

  1. Smaller ε in binary search
  2. Higher-order approximations (quadratic, cubic tangents)
  3. Adaptive refinement strategies

But ReLU verification cannot improve beyond triangle relaxation without multi-neuron analysis.

GeLU/Swish: Non-Monotonicity Barrier

Non-monotonic activations introduce a combinatorial barrier:

  • Interior extrema require case analysis (is minimum inside interval?)
  • Convex hull computation is NP-hard for general non-convex functions
  • Heuristic approximations (adaptive bounds) have no optimality guarantees

Current state: No proven-optimal linear relaxation exists for GeLU/Swish. All methods are heuristic approximations.

Unified Framework: Verification Difficulty Hierarchy

\begin{aligned}
&\text{ReLU (stable)} &&\text{[Easiest: exact, constant-time]} \\
&\quad < \text{ReLU (unstable)} &&\text{[Easy: optimal convex hull, constant-time]} \\
&\quad < \text{Sigmoid/Tanh} &&\text{[Medium: suboptimal, } O(\log(1/\varepsilon)) \text{ time]} \\
&\quad < \text{GeLU/Swish} &&\text{[Hard: heuristic, non-monotonic handling]}
\end{aligned}

This hierarchy reflects both computational cost (time complexity) and approximation quality (error scaling).

Note: See the Further Reading section at the end of this guide for references on these verification techniques.

Hint: When choosing activations for verifiable networks, ask: Is the additional expressiveness worth the verification complexity? For ReLU, the answer is always “yes, verification is efficient.” For modern activations, you’re making a conscious trade-off.

Practical Implications: Choosing Activations for Verification

Understanding these verification challenges should influence your architectural decisions.

Design Principles for Verifiable Networks

  1. Default to ReLU: If your application allows it, ReLU remains the gold standard for verification. It offers the best combination of performance and verifiability.

  2. Use modern activations intentionally: Don’t default to GeLU or Swish just because transformers popularized them. If you need the specific benefits---better gradient flow, smoother approximations, performance improvements---then justify the verification cost.

  3. Understand your verification budget: How much computational cost can you accept for verification? ReLU verification can run in minutes; modern activation verification might require hours.

  4. Hybrid strategies: Consider using ReLU for most layers and modern activations only where they provide critical benefits. Or use ReLU in the early layers (where bounds are most critical) and modern activations in later layers (where bounds accumulate less).

Tool Support and Workarounds

Not all verification tools support all activations equally:

  • ReLU: Universal support. All tools can verify ReLU networks
  • Sigmoid/Tanh: Excellent support in all modern tools (CROWN, DeepPoly)
  • GeLU: Supported in CROWN-based tools (auto_LiRPA). Limited in other tools.
  • Swish: Limited support. Not officially supported in major tools (workarounds needed)

If your tool doesn’t natively support an activation, workarounds exist:

  • Implement the activation as a custom layer with your own bounds
  • Use a tool that supports your activation
  • Approximate the activation with a supported one (though this introduces approximation error)

The Future of Activation Verification

The field is evolving rapidly:

  • Learning-based bounds: New approaches learn optimal bounds during training, adapting to the specific activation functions in your network
  • Custom activation design: Researchers are designing activations specifically for verifiability while maintaining expressiveness
  • Adaptive approximation: Tools increasingly use adaptive strategies that choose bounding approaches per layer based on the activation and input bounds

This is an area of active research, with new techniques emerging regularly.

Tip: Stay informed about verification tool updates. As new bounding techniques emerge, previously hard-to-verify networks may become tractable. Subscribe to tool releases and research papers in neural network verification.

Important: Always verify the exact network you deploy. Swapping an activation function, even for a “similar” alternative, can invalidate previous verification results. Re-run verification after any architectural changes.

Final Thoughts

ReLU’s dominance in verification is no accident---it’s a direct consequence of its piecewise linear structure and the resulting optimal convex hull properties. Modern activation functions bring real benefits: better performance, more expressive power, and alignment with cutting-edge architectures. But they come with a verification cost.

The key insight is understanding trade-offs. Sigmoid and Tanh introduce smoothness approximation challenges. GeLU and Swish add non-monotonicity complexity. MaxPool creates combinatorial challenges. Each represents a conscious choice: expressiveness or performance in exchange for verification difficulty.

The convex hull framework unifies how we think about all these activations. Whether ReLU, GeLU, or Swish, we’re ultimately computing linear bounds that approximate the convex hull. Understanding this geometric perspective helps you make informed architectural choices.

Modern verification tools are improving rapidly. What’s hard to verify today may be straightforward tomorrow as new bounding techniques emerge. At the same time, new architectures and activations continue pushing verification boundaries. It’s a dynamic field.

As you design networks that need to be formally verified, remember:

  1. ReLU remains your best friend for tractable verification
  2. Modern activations are valuable but not free---understand the verification cost
  3. The convex hull is the unifying principle for all bounding strategies
  4. Tool support varies---check what your verification tool actually supports
  5. Hybrid approaches work: ReLU where verification matters most, modern activations where performance benefits justify the cost

Whether you’re building safety-critical systems, certifying robustness properties, or simply curious about the intersection of network design and verification, understanding activation function verification gives you powerful insights into how to build networks that are both expressive and verifiable.

Further Reading

This guide provides comprehensive coverage of how activation functions impact neural network verification. For readers interested in diving deeper, we recommend the following resources organized by topic:

Foundational Verification Methods:

The foundation of activation function handling in verification is established by linear relaxation approaches. CROWN introduced efficient linear bound propagation for general activations including ReLU, sigmoid, and tanh. DeepPoly developed abstract interpretation techniques for certifying networks with ReLU, sigmoid, tanh, and maxpool. The theoretical tightness of linear relaxations is analyzed by Salman et al.

Automated Verification Frameworks:

Modern verification frameworks provide automatic support for diverse activation functions. The auto_LiRPA library enables automatic linear relaxation for arbitrary compositions of activations, making it practical to verify networks with modern activation functions.

Activation Functions in Modern Architectures:

Understanding the original motivations for different activation functions provides context for verification challenges. The design and discovery of modern activation functions through architecture search introduced non-monotonic properties that complicate verification.

Theoretical Foundations:

The ReLU polytope and convex hull framework provide the mathematical foundation for understanding activation function bounding. Recent work focuses on computing tight convex hull approximations efficiently for arbitrary activation functions.

Related Topics:

For understanding the broader context of verification, see bound propagation for how these activation bounds propagate through networks, and soundness and completeness for the theoretical guarantees these methods provide. For practical considerations, see verification scalability to understand how activation choices affect verification runtime.

Next Guide: Continue to Bound Propagation to learn about IBP, CROWN, DeepPoly, and other bound propagation techniques that form the foundation of scalable neural network verification.