Beyond ReLU: Verifying Networks with Modern Activation Functions#

ReLU dominated neural network design for over a decade—but the landscape is shifting. Modern architectures use diverse activation functions: transformers employ GeLU, mobile networks optimize with Swish, and classical applications still rely on Sigmoid and Tanh. This diversity brings expressive power and improved performance. But what happens when you try to verify networks using these activations? The verification landscape changes dramatically.

The fundamental challenge is this: neural network verification requires analyzing non-linear functions through linear approximations. Different activation functions have different geometric properties, which means they require different—and sometimes dramatically different—linear bounds. Understanding these differences is crucial for anyone designing networks that need to be formally verified.

In this post, we’ll explore how activation function choices impact verification, examining the mathematical foundations that make some activations verification-friendly and others computationally expensive. By the end, you’ll understand why ReLU remains the verification king, and when the benefits of modern activations justify their verification costs.

The Verification Challenge with Activation Functions#

When we verify a neural network, bound propagation algorithms track the possible range of values at each layer. For linear layers, this is straightforward: output bounds follow directly from input bounds through matrix multiplication. But activation functions are non-linear, and here’s where verification gets interesting.

Why Activation Functions Matter for Verification#

A neural network is fundamentally a composition of linear transformations (matrix multiplications) and non-linear transformations (activation functions). To verify properties of the full network, we must propagate bounds through both—but the activation functions create a bottleneck. The quality of bounds at activation functions directly determines the quality of bounds throughout the entire network. A loose approximation at one activation layer compounds through subsequent layers, potentially making the overall verification impossible.

The Core Problem: Linear Relaxation#

Here’s the central technical challenge: bound propagation requires linear expressions to propagate bounds through layers. But activation functions are non-linear. We can’t propagate non-linear bounds—we need to approximate each activation with linear bounds.

More formally, for an activation function \(\phi(x)\) and input bounds \([l, u]\), we need two linear functions:

\[
\begin{aligned}
\text{Lower bound: } \underline{\phi}(x) &= \alpha_l x + \beta_l \\
\text{Upper bound: } \overline{\phi}(x) &= \alpha_u x + \beta_u \\
\text{satisfying: } \underline{\phi}(x) &\leq \phi(x) \leq \overline{\phi}(x) \quad \forall x \in [l, u]
\end{aligned}
\]

These two linear functions “sandwich” the true activation function. The key requirement: the bounds must be sound—they must fully contain the true activation’s output.
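To make the sandwich requirement concrete, here is a minimal sketch, in plain NumPy, of an empirical check that candidate \((\alpha, \beta)\) coefficients actually bracket an activation on \([l, u]\). The function name and coefficients are illustrative only, and dense sampling is a sanity test, not a proof of soundness; the example coefficients come from the ReLU triangle relaxation derived later in this post.

```python
import numpy as np

def check_linear_bounds(phi, l, u, alpha_l, beta_l, alpha_u, beta_u, n=10_000):
    """Empirically check alpha_l*x + beta_l <= phi(x) <= alpha_u*x + beta_u on [l, u].

    Dense sampling cannot prove soundness (the function could escape the
    bounds between samples), but it quickly catches wrong coefficients.
    """
    x = np.linspace(l, u, n)
    y = phi(x)
    lower = alpha_l * x + beta_l
    upper = alpha_u * x + beta_u
    return bool(np.all(lower <= y + 1e-9) and np.all(y <= upper + 1e-9))

relu = lambda x: np.maximum(x, 0.0)
# ReLU on [-1, 2]: lower bound y = 0, upper bound the chord y = (2/3)x + 2/3
print(check_linear_bounds(relu, -1.0, 2.0, 0.0, 0.0, 2/3, 2/3))  # True
```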

Key Distinction

The tightness of these linear bounds directly determines how tight verification bounds are. Loose bounds → loose verification → possibly unverifiable networks. Tight bounds → tight verification → more networks can be certified.

ReLU - The Verification Baseline#

ReLU has dominated both practice and verification research for good reason. To understand why, let’s look at its mathematical structure.

Why ReLU Became the Standard#

ReLU’s definition is deceptively simple: \(\text{ReLU}(x) = \max(0, x)\). This piecewise linear structure is fundamental to its verification-friendliness.

The function has two linear regions:

  • For \(x \geq 0\): the function is exactly \(y = x\)

  • For \(x < 0\): the function is exactly \(y = 0\)

There’s only one “kink” at \(x = 0\). This piecewise linearity is revolutionary for verification because it means we can sometimes compute exact bounds, not approximations.

ReLU’s Linear Bounds#

Given input bounds \([l, u]\), ReLU’s bounds depend on where the interval sits relative to zero:

\[
\begin{aligned}
&\text{Case 1 - Fully Active } (l \geq 0)\text{:} && \underline{\text{ReLU}}(x) = \overline{\text{ReLU}}(x) = x \quad \text{(exact)} \\
&\text{Case 2 - Fully Inactive } (u \leq 0)\text{:} && \underline{\text{ReLU}}(x) = \overline{\text{ReLU}}(x) = 0 \quad \text{(exact)} \\
&\text{Case 3 - Crossing } (l < 0 < u)\text{:} && \underline{\text{ReLU}}(x) = 0, \quad \overline{\text{ReLU}}(x) = \frac{u}{u - l}(x - l)
\end{aligned}
\]

In Case 3, the upper bound is the secant line (chord) connecting \((l, 0)\) to \((u, u)\). This defines a triangle—the convex hull of the ReLU function over the interval—and this triangular relaxation is optimal. No other linear bounds can be tighter without violating soundness.
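Here is a minimal sketch of this case analysis as code. The function name is mine, and note that some tools additionally tune the crossing-case lower-bound slope per neuron rather than always using zero.

```python
def relu_relaxation(l, u):
    """Linear bounds for ReLU on [l, u], returned as ((a_l, b_l), (a_u, b_u))
    so that a_l*x + b_l <= ReLU(x) <= a_u*x + b_u.  Follows the three cases above."""
    if l >= 0:                       # Case 1: fully active, ReLU(x) = x exactly
        return (1.0, 0.0), (1.0, 0.0)
    if u <= 0:                       # Case 2: fully inactive, ReLU(x) = 0 exactly
        return (0.0, 0.0), (0.0, 0.0)
    slope = u / (u - l)              # Case 3: chord from (l, 0) to (u, u)
    return (0.0, 0.0), (slope, -slope * l)

print(relu_relaxation(-1.0, 2.0))    # ((0.0, 0.0), (0.666..., 0.666...))
```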

Key Insight

ReLU’s piecewise linearity enables exact bounds in many cases and optimal (convex hull) bounds in the general case. This is why ReLU enables efficient verification.

ReLU’s Limitations#

Despite its verification advantages, ReLU has architectural limitations. Dead neurons (neurons that always output 0) are common, especially in deeper networks. Gradient flow issues in the inactive region can slow training. These limitations motivate alternative activations despite their verification challenges.

Sigmoid and Tanh - Classical Smooth Activations#

Before ReLU, Sigmoid and Tanh dominated neural networks. These smooth, differentiable activations have bounded outputs, making them useful for many applications. But their mathematical structure creates verification challenges.

Sigmoid: The Logistic Function#

The sigmoid function maps any input to a probability-like output in (0, 1):

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

The S-curve shape is smooth and differentiable everywhere, with an inflection point at the center where the slope is steepest.

To bound sigmoid, we construct linear bounds using geometric approaches like tangent lines and secant lines. The sigmoid derivative is \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\), which is useful in deriving these bounds.

The key challenge is that sigmoid bounds are inherently approximate; unlike ReLU, they are never exact. The fundamental insight: sigmoid bounds depend heavily on the input interval width and are loose compared to piecewise linear activations, especially over wide intervals.
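As a rough illustration of the tangent-and-secant idea, here is a simplified sketch (the function names are mine, not from any particular tool). It uses a chord and a midpoint tangent in the purely convex and purely concave regions, and falls back to loose constant bounds when the interval crosses zero; tools such as CROWN and DeepPoly use tighter constructions for that crossing case.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_relaxation(l, u):
    """Return ((a_l, b_l), (a_u, b_u)) with a*x + b bounding sigmoid on [l, u].
    Assumes l < u.  Simplified: chord + midpoint tangent in the one-sided
    cases, constant bounds in the crossing case."""
    sl, su = sigmoid(l), sigmoid(u)
    chord_a = (su - sl) / (u - l)
    chord_b = sl - chord_a * l
    m = 0.5 * (l + u)
    sm = sigmoid(m)
    tan_a = sm * (1.0 - sm)          # sigma'(m) = sigma(m) * (1 - sigma(m))
    tan_b = sm - tan_a * m
    if u <= 0:                       # convex region: tangent below, chord above
        return (tan_a, tan_b), (chord_a, chord_b)
    if l >= 0:                       # concave region: chord below, tangent above
        return (chord_a, chord_b), (tan_a, tan_b)
    # crossing zero: constant bounds, sound because sigmoid is increasing
    return (0.0, sl), (0.0, su)

print(sigmoid_relaxation(-4.0, -1.0))   # tangent lower bound, chord upper bound
```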

Hint

Sigmoid bounds are tightest near the inflection point (\(x \approx 0\)) where the function is most linear. In the saturation regions (\(x \ll 0\) or \(x \gg 0\)), the bounds become very loose because the function is nearly constant but the linear bounds don’t capture this.

Tanh: Sigmoid’s Symmetric Cousin#

Tanh is essentially a scaled and shifted sigmoid:

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1\]

This relationship is exact, not an approximation. Tanh outputs lie in (-1, 1) and the function is symmetric about the origin, unlike sigmoid’s asymmetric range of (0, 1).

Bounding tanh uses the same strategy as sigmoid: tangent line for lower bound, secant line for upper bound. The bounds have similar tightness properties—good near the center, loose in saturation regions.
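Because the identity \(\tanh(x) = 2\sigma(2x) - 1\) is exact, a sound linear bound \(ax + b\) for \(\sigma\) over \([2l, 2u]\) rescales into a sound bound \(4ax + 2b - 1\) for \(\tanh\) over \([l, u]\). A quick numerical check of the identity, as a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_via_sigmoid(x):
    """The exact identity tanh(x) = 2*sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

x = np.linspace(-4.0, 4.0, 1001)
print(np.max(np.abs(tanh_via_sigmoid(x) - np.tanh(x))))   # agrees up to floating-point rounding
```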

Note

Comparing Sigmoid and Tanh: While mathematically related, tanh’s symmetry sometimes makes it easier to bound in certain verification scenarios. However, both suffer from loose bounds in saturation regions, and both are more difficult to verify than ReLU.

GeLU - The Transformer Era Activation#

The rise of transformers brought a new activation to prominence: GeLU (Gaussian Error Linear Unit). Unlike ReLU’s simplicity or sigmoid’s classic smoothness, GeLU has unique properties that create new verification challenges.

What is GeLU?#

GeLU is defined as:

\[\text{GeLU}(x) = x \cdot \Phi(x)\]

where \(\Phi(x)\) is the cumulative distribution function of the standard normal distribution. Intuitively, it’s a probabilistic activation—the neuron’s output is scaled by the probability that a standard normal variate is less than the input.

For computational efficiency, GeLU is typically approximated as:

\[\text{GeLU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}(x + 0.044715x^3)\right]\right)\]

This approximation is nearly exact and is what most frameworks actually use.
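Both forms are easy to compare numerically. Below is a minimal sketch using the exact CDF via the error function, \(\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))\), against the tanh approximation above; the function names and the \([-5, 5]\) range are my own choices for illustration.

```python
import math

def gelu_exact(x):
    """GeLU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The tanh approximation most frameworks use."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

worst = max(abs(gelu_exact(i / 100.0) - gelu_tanh(i / 100.0))
            for i in range(-500, 501))
print(f"max |exact - tanh approx| on [-5, 5]: {worst:.1e}")  # small; the curves nearly coincide
```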

Why Transformers Use GeLU#

GeLU became the standard in transformers (BERT, GPT, etc.) because of its mathematical properties. Unlike ReLU, which is monotonically increasing, GeLU is non-monotonic: it dips to a small minimum in the negative region. The result is a smooth, probabilistic weighting of inputs rather than a hard gate, which has proven to train well in transformer architectures.

The Verification Challenge: Non-Monotonicity#

Non-monotonicity fundamentally complicates verification. With monotonic activations (ReLU, sigmoid, tanh), we can use simple strategies: minimum at lower bound, maximum at upper bound. But with GeLU, the minimum might be in the middle of our input interval.

Consider an input interval where the true minimum is in the interior. A naive linear bound that only uses endpoints will miss this interior minimum and produce overly loose bounds.
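A small numerical illustration of the problem (not a certified bound): on \([-2, 1]\), taking the smaller endpoint value misses GeLU's interior dip near \(x \approx -0.75\).

```python
import math

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

l, u = -2.0, 1.0
endpoint_min = min(gelu(l), gelu(u))                    # what a monotonic rule would report
xs = [l + (u - l) * i / 10_000 for i in range(10_001)]
sampled_min = min(gelu(x) for x in xs)                  # dense sampling, illustration only

print(f"endpoint-only minimum:    {endpoint_min:.4f}")  # about -0.05
print(f"sampled interior minimum: {sampled_min:.4f}")   # about -0.17, near x = -0.75
```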

To handle this, modern verification tools use the convex hull of the activation function: linear lower and upper bounds that minimize looseness while remaining sound over every point in the interval, including any interior minimum.

GeLU requires special handling because its non-monotonicity means the convex hull is more complex than simple tangent/secant approaches.

Modern Architecture Alert

GeLU’s adoption in transformers (BERT, GPT, etc.) means modern NLP models are harder to verify than earlier RNN/CNN-based systems. This is an important trade-off: transformers gain expressiveness and performance, but verification becomes more challenging.

Tip

When implementing GeLU verification, use adaptive bounding strategies that detect and handle the minimum region separately. Many tools (like alpha-beta-CROWN) have optimized handling for GeLU and similar non-monotonic activations.

Swish - Mobile-Friendly Activation#

EfficientNets and mobile neural networks popularized Swish, a self-gated activation with interesting properties:

\[\text{Swish}(x) = x \cdot \sigma(\beta x)\]

The parameter \(\beta\) controls smoothness: at \(\beta = 0\), Swish reduces to the linear function \(x/2\); as \(\beta \to \infty\), it approaches ReLU.

Like GeLU, Swish is non-monotonic (it has a minimum), creating similar verification challenges. The \(\beta\) parameter affects bound tightness: larger \(\beta\) makes the function closer to ReLU and potentially easier to bound, but smaller \(\beta\) makes it smoother and harder to approximate.

Note

β Parameter Effects: The \(\beta\) parameter is a trade-off knob. For training, you might choose \(\beta\) based on performance. For verification, you need to account for how \(\beta\) affects bound quality. Smaller \(\beta\) is more expressive but harder to verify.

Swish is slightly simpler to bound than GeLU in practice, but still requires non-monotonic handling strategies. Most verification tools handle Swish similarly to GeLU.
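The \(\beta\) trade-off described above is easy to see numerically. Here is a minimal sketch that locates Swish's dip by sampling; the \(\beta\) values and range are arbitrary choices for illustration.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish(x) = x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-6.0, 3.0, 90_001)
for beta in (0.5, 1.0, 5.0):
    y = swish(x, beta)
    i = int(np.argmin(y))
    print(f"beta={beta}: sampled minimum {y[i]:.3f} at x ~ {x[i]:.2f}")
# Larger beta pushes the dip toward zero and makes it shallower (closer to ReLU);
# smaller beta deepens and widens it, which tends to loosen linear bounds.
```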

MaxPool - Not Just an Activation#

While not technically an activation function, MaxPool poses verification challenges similar to those of non-linear activations. It’s worth understanding because many CNN-based networks that need verification contain MaxPool layers.

MaxPool’s Definition#

MaxPool operates on a neighborhood of values and outputs the maximum:

\[\text{MaxPool}(x_1, \ldots, x_n) = \max_i x_i\]

For a pooling window in a 2D convolution, we compute the maximum over all values within the window.

The Verification Puzzle#

MaxPool creates a combinatorial challenge. For input bounds \([l_i, u_i]\) for each \(x_i\), the naive bounds are:

\[\text{Lower bound: } \max_i l_i, \quad \text{Upper bound: } \max_i u_i\]

However, a linear relaxation of MaxPool is more complex: the bounds must encode which input attains the maximum, and because different inputs can be the winner under different scenarios, the number of cases grows combinatorially.

Recent methods like Ti-Lin and MaxLin compute neuron-wise tightest linear bounds for maxpool by minimizing the over-approximation zone. These bounds depend on the specific input bounds and which inputs could potentially be winners. The solution is to use convex hull bounds that account for all possible maximum scenarios with optimized hyperplanes.
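The constant interval bounds from the formula above are simple to compute, and a cheap pruning step already removes inputs that can never win the max. The sketch below covers only this interval case, not the linear relaxations used by methods like Ti-Lin or MaxLin; the function name is mine.

```python
def maxpool_interval_bounds(bounds):
    """Interval bounds for max(x_1, ..., x_n) given per-input bounds [(l_i, u_i), ...]."""
    lower = max(l for l, _ in bounds)    # the max is at least the largest lower bound
    upper = max(u for _, u in bounds)    # the max cannot exceed the largest upper bound
    # Inputs whose upper bound falls below `lower` can never be the maximum.
    candidates = [i for i, (_, u) in enumerate(bounds) if u >= lower]
    return lower, upper, candidates

lo, hi, winners = maxpool_interval_bounds([(-1.0, 0.5), (0.2, 0.9), (-3.0, 0.1)])
print(lo, hi, winners)   # 0.2 0.9 [0, 1] -- the third input can never win the max
```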

Architectural Note

MaxPool is not an activation function but a pooling operation. However, it is piecewise linear and non-differentiable at points where two inputs tie for the maximum, which creates similar verification challenges. Treat MaxPool similarly to activation functions when verifying networks.

Comparison - The Verification Landscape#

Let’s bring all these activations together and understand their trade-offs.

| Activation | Mathematical Property | Monotonic? | Bound Tightness | Computational Cost | Typical Use Cases |
|---|---|---|---|---|---|
| ReLU | Piecewise linear | Yes | Excellent (optimal convex hull) | Very Low | CNNs, feedforward networks, most baselines |
| Sigmoid | Smooth S-curve | Yes | Good (inflection), Poor (saturation) | Low | Gates in RNNs/LSTMs, probability outputs, legacy networks |
| Tanh | Smooth, symmetric | Yes | Good (center), Poor (saturation) | Low | LSTMs, centered representations |
| GeLU | Non-monotonic, smooth | No | Moderate (convex hull needed) | Medium | Transformers, BERT, GPT, NLP models |
| Swish | Non-monotonic, self-gated | No | Moderate (β-dependent) | Medium | EfficientNet, mobile networks |
| MaxPool | Piecewise linear, multi-input | Yes (per input) | Good (but combinatorial) | Low-Medium | CNNs, spatial downsampling |

Verification Impact and Tool Support#

| Activation | Best Bounding Strategy | Main Challenge | Tool Support |
|---|---|---|---|
| ReLU | Triangle relaxation | None for verification (dead neurons are an architectural issue) | Universal (all tools) |
| Sigmoid | Tangent + secant | Saturation looseness | Good (most modern tools) |
| Tanh | Tangent + secant | Saturation looseness | Good (most modern tools) |
| GeLU | Convex hull / adaptive | Non-monotonicity handling | Emerging (alpha-beta-CROWN, recent versions) |
| Swish | Convex hull / adaptive | Non-monotonicity, β parameterization | Limited (newer tools, workarounds needed) |
| MaxPool | Convex hull over outcomes | Combinatorial explosion | Good (most tools support pooling) |

The Unifying Principle: Convex Hull Approximation#

Beneath all these different bounding strategies lies a unifying mathematical principle: the convex hull. For any activation function over a bounded input interval, the tightest possible linear over-approximation is the convex hull—the smallest convex set containing the function.

ReLU’s triangle relaxation is the convex hull. Sigmoid/Tanh’s tangent/secant bounds approximate the convex hull. GeLU’s adaptive strategies aim for the convex hull. This unified perspective reveals why some activations are inherently harder: non-monotonic activations have more complex convex hulls that require sophisticated algorithms to compute.

Research tools like WraAct (OOPSLA 2025) and WraLU (POPL 2024) focus specifically on computing convex hull approximations efficiently for arbitrary activation functions.
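To get a feel for what such tools compute, here is a toy sketch: the lower convex hull of densely sampled points \((x, \phi(x))\) via Andrew's monotone chain. It traces an approximate lower envelope of a single activation; it is illustrative only, since the true function can dip below the hull between samples, and it says nothing about the multi-neuron hulls those papers actually construct.

```python
import math

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lower_hull(points):
    """Lower convex hull of 2-D points sorted by x (Andrew's monotone chain)."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    hull = []
    for p in points:
        # Pop while the last two hull points and p make a non-left turn.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

pts = [(-3.0 + 6.0 * i / 400, gelu(-3.0 + 6.0 * i / 400)) for i in range(401)]
print(len(pts), "samples ->", len(lower_hull(pts)), "lower-hull vertices")
```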


Hint

When choosing activations for verifiable networks, ask: is the additional expressiveness worth the verification complexity? For ReLU there is no trade-off to weigh, since verification is efficient by default. For modern activations, you’re making a conscious choice.

Practical Implications: Choosing Activations for Verification#

Understanding these verification challenges should influence your architectural decisions.

Design Principles for Verifiable Networks#

  1. Default to ReLU: If your application allows it, ReLU remains the gold standard for verification. It offers the best combination of performance and verifiability.

  2. Use modern activations intentionally: Don’t default to GeLU or Swish just because transformers popularized them. If you need the specific benefits—better gradient flow, smoother approximations, performance improvements—then justify the verification cost.

  3. Understand your verification budget: How much computational cost can you accept for verification? ReLU verification can run in minutes; modern activation verification might require hours.

  4. Hybrid strategies: Consider using ReLU for most layers and modern activations only where they provide critical benefits. Or use ReLU in the early layers (where bounds are most critical) and modern activations in later layers (where bounds accumulate less).

Tool Support and Workarounds#

Not all verification tools support all activations equally:

  • ReLU: Universal support. All tools can verify ReLU networks

  • Sigmoid/Tanh: Excellent support in all modern tools (CROWN, DeepPoly, alpha-beta-CROWN)

  • GeLU: Supported in CROWN-based tools (alpha-beta-CROWN, auto_LiRPA). Not in DeepPoly.

  • Swish: Limited support. Not officially supported in major tools (workarounds needed)

If your tool doesn’t natively support an activation, workarounds exist:

  • Implement the activation as a custom layer with your own bounds

  • Use a tool that supports your activation

  • Approximate the activation with a supported one (though this introduces approximation error; see the sketch below)
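As a toy example of the third workaround, the sketch below measures the pointwise gap introduced by swapping Swish (\(\beta = 1\)) for ReLU over a range of pre-activations. The chosen range, and the idea of folding this gap into extra slack, are assumptions for illustration; propagating such an error soundly through a deep network is non-trivial and tool-specific.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))        # Swish with beta = 1

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-8.0, 8.0, 160_001)
gap = np.max(np.abs(swish(x) - relu(x)))
print(f"max |Swish - ReLU| on [-8, 8]: {gap:.3f}")   # roughly 0.28 per neuron
```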

The Future of Activation Verification#

The field is evolving rapidly:

  • Learning-based bounds: New approaches learn optimal bounds during training, adapting to the specific activation functions in your network

  • Custom activation design: Researchers are designing activations specifically for verifiability while maintaining expressiveness

  • Adaptive approximation: Tools increasingly use adaptive strategies that choose bounding approaches per layer based on the activation and input bounds

This is an area of active research, with new techniques emerging regularly.

Tip

Stay informed about verification tool updates. As new bounding techniques emerge, previously hard-to-verify networks may become tractable. Subscribe to tool releases and research papers in neural network verification.

Important

Always verify the exact network you deploy. Swapping an activation function, even for a “similar” alternative, can invalidate previous verification results. Re-run verification after any architectural changes.

Final Thoughts#

ReLU’s dominance in verification is no accident—it’s a direct consequence of its piecewise linear structure and the resulting optimal convex hull properties. Modern activation functions bring real benefits: better performance, more expressive power, and alignment with cutting-edge architectures. But they come with a verification cost.

The key insight is understanding trade-offs. Sigmoid and Tanh introduce smoothness approximation challenges. GeLU and Swish add non-monotonicity complexity. MaxPool creates combinatorial challenges. Each represents a conscious choice: expressiveness or performance in exchange for verification difficulty.

The convex hull framework unifies how we think about all these activations. Whether ReLU, GeLU, or Swish, we’re ultimately computing linear bounds that approximate the convex hull. Understanding this geometric perspective helps you make informed architectural choices.

Modern verification tools are improving rapidly. What’s hard to verify today may be straightforward tomorrow as new bounding techniques emerge. At the same time, new architectures and activations continue pushing verification boundaries. It’s a dynamic field.

As you design networks that need to be formally verified, remember:

  1. ReLU remains your best friend for tractable verification

  2. Modern activations are valuable but not free—understand the verification cost

  3. The convex hull is the unifying principle for all bounding strategies

  4. Tool support varies—check what your verification tool actually supports

  5. Hybrid approaches work: ReLU where verification matters most, modern activations where performance benefits justify the cost

Whether you’re building safety-critical systems, certifying robustness properties, or simply curious about the intersection of network design and verification, understanding activation function verification gives you powerful insights into how to build networks that are both expressive and verifiable.
