**content/learning-paths/servers-and-cloud-computing/fexpa/_index.md**
---
title: Accelerate the exponential function

minutes_to_complete: 15

who_is_this_for: This is an introductory topic for developers interested in implementing and optimizing the exponential function. The Scalable Vector Extension (SVE), introduced with the Armv8-A architecture, includes a dedicated instruction, FEXPA. Although initially absent from the Scalable Matrix Extension (SME), the FEXPA instruction has been made available in SME version 2.2.

learning_objectives:
- Implement the exponential function with SVE intrinsics
- Optimize it with the FEXPA instruction

prerequisites:
- An AArch64 computer running Linux or macOS. You can use cloud instances; refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/) for a list of cloud service providers.
- Some familiarity with SIMD programming and SVE intrinsics.

author: Arnaud Grasset, Claudio Martino, Alexandre Romana

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
operatingsystems:
- Linux
- macOS
tools_software_languages:
- C
- C++

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
---
title: Conclusion
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Conclusion
The SVE FEXPA instruction can speed up the computation of the exponential function by performing the table lookup and bit manipulation in hardware.

In conclusion, SME support for FEXPA lets you embed the expensive exponential approximation directly into the matrix computation path. That translates into:
- Fewer instructions (no back-and-forth between SME and scalar/SVE code)
- Potentially higher aggregate throughput (more exponentials per cycle)
- Lower power and bandwidth (data is kept in the SME engine)
- Cleaner fusion with GEMM/GEMV workloads

Together, these benefits make exponential-heavy workloads significantly faster on Arm CPUs.
**content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md**
---
title: FEXPA
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## The FEXPA instruction

With SVE, Arm introduced an instruction called FEXPA: the Floating-Point Exponential Accelerator.

Let’s segment the fraction part of the IEEE 754 floating-point representation into several sub-fields (Index, Exp, and Remaining bits), with respective lengths of Idxb, Expb, and Remb bits.

| IEEE754 precision | Idxb | Expb | Remb |
|-------------------------|------|------|------|
| Half precision (FP16) | 5 | 5 | 0 |
| Single precision (FP32) | 6 | 8 | 9 |
| Double precision (FP64) | 6 | 11 | 35 |

The FEXPA instruction can be described for any real number x ∈ [2^(Remb + Expb) + 1, 2^(Remb + Expb) + 2^Expb - 1) as:

$$FEXPA(x)=2^{x-constant}$$

where

$$constant=2^{Remb + Expb} + bias$$

The instruction takes the floating-point value x as input. It copies some of the fraction bits, y, into the result’s exponent, which is then interpreted as 2^(y-bias), and retrieves the correct fraction value from a hardware lookup table indexed by the lower fraction bits. The result is therefore 2^(x-constant).

## Usage of lookup tables
Lookup tables can be combined with polynomial approximations. In this approach, the exponential function is reformulated as:

$$e^x = e^{(m \times 2^L + j) \times ln2/2^L + r} = 2^m \times (2^{j/2^L} + 2^{j/2^L} \times p(r)) $$

where

$$r∈[-ln2/2^{L+1}, +ln2/2^{L+1}], j \in [0, 2^L - 1]$$

and p(r) approximates e^r -1.

If the 2^L possible values of 2^(j/2^L) are precomputed in a table T, the exponential can be approximated as:
$$ e^x = 2^m \times T[j] \times (1 + p(r)) $$

With a table of size 2^L, the evaluation interval for the approximation polynomial is narrowed by a factor of 2^L. This reduction leads to improved accuracy for a given polynomial degree due to the narrower approximation range. Alternatively, for a given accuracy target, the degree of the polynomial—and hence its computational complexity—can be reduced.

## Exponential implementation with FEXPA
FEXPA can be used to perform the table lookup rapidly. With this instruction, a degree-2 polynomial is sufficient to reach the same accuracy as the implementation shown previously:

```C
svfloat32_t lane_consts = svld1rq(pg, ln2_lo); // Load only ln2_lo

/* Compute k as round(x/ln2) using shift = 1.5*2^(23-6) + 127 */
svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x, shift);
svfloat32_t k = svsub_x(pg, z, shift);

/* Compute r as x - k*ln2 with Cody and Waite */
svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x);
r = svmls_lane(r, k, lane_consts, 0);

/* Compute the scaling factor 2^k */
svfloat32_t scale = svexpa(svreinterpret_u32(z));

/* Compute poly(r) = exp(r) - 1 (2nd degree polynomial) */
svfloat32_t p01 = svmla_x (pg, svdup_f32(c0), r, svdup_f32(c1)); // c0 + c1 * r
svfloat32_t poly = svmul_x (pg, r, p01);                        // c0 * r + c1 * r^2

/* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */
svfloat32_t result = svmla_f32_x(pg, scale, poly, scale);
```
---
title: First implementation
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## First implementation
Given what we covered in the previous sections, the exponential function can be implemented with SVE intrinsics in the following way:

```C
svfloat32_t lane_consts = svld1rq(pg, constants); // Load ln2_lo, c0, c2, c4 in register

/* Compute k as round(x/ln2) using shift = 1.5*2^23 + 127 */
svfloat32_t z = svmad_x(pg, svdup_f32(inv_ln2), x, shift);
svfloat32_t k = svsub_x(pg, z, shift);

/* Compute r as x - k*ln2 with Cody and Waite */
svfloat32_t r = svmsb_x(pg, svdup_f32(ln2_hi), k, x);
r = svmls_lane(r, k, lane_consts, 0);

/* Compute the scaling factor 2^k */
svfloat32_t scale = svreinterpret_f32_u32(svlsl_n_u32_m(pg, svreinterpret_u32_f32(z), 23));

/* Compute poly(r) = exp(r) - 1 */
svfloat32_t p12 = svmla_lane(svdup_f32(c1), r, lane_consts, 2); // c1 + c2 * r
svfloat32_t p34 = svmla_lane(svdup_f32(c3), r, lane_consts, 3); // c3 + c4 * r
svfloat32_t r2 = svmul_x(pg, r, r);
svfloat32_t p14 = svmla_x(pg, p12, p34, r2); // c1 + c2 * r + c3 * r^2 + c4 * r^3
svfloat32_t p0 = svmul_lane(r, lane_consts, 1); // c0 * r
svfloat32_t poly = svmla_x(pg, p0, r2, p14); // c0 * r + c1 * r^2 + c2 * r^3 + c3 * r^4 + c4 * r^5

/* exp(x) = scale * exp(r) = scale * (1 + poly(r)) */
svfloat32_t result = svmla_f32_x(pg, scale, poly, scale);
```
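The intrinsics above are easier to follow in scalar form. The sketch below mirrors the same steps in plain C: the Cody-and-Waite split of ln2 uses commonly published values, and the polynomial coefficients c0..c4 are the truncated Taylor ones (1, 1/2, 1/6, 1/24, 1/120), standing in for the tuned minimax set; both choices are assumptions for illustration.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static float exp_scalar(float x)
{
    const float inv_ln2 = 0x1.715476p0f;    /* 1/ln2 */
    const float ln2_hi  = 0x1.62e4p-1f;     /* high part of ln2 */
    const float ln2_lo  = 0x1.7f7d1cp-20f;  /* low part of ln2 */
    const float shift   = 12583039.0f;      /* 1.5*2^23 + 127 */

    /* k = round(x/ln2) via the add-and-subtract trick */
    float z = x * inv_ln2 + shift;
    float k = z - shift;

    /* r = x - k*ln2 with the Cody and Waite split */
    float r = x - k * ln2_hi;
    r = r - k * ln2_lo;

    /* scale = 2^k: the low significand bits of z already hold k + bias,
       so shifting its bit pattern left by 23 builds the exponent field */
    uint32_t zb, sb;
    float scale;
    memcpy(&zb, &z, sizeof zb);
    sb = zb << 23;
    memcpy(&scale, &sb, sizeof scale);

    /* poly(r) = exp(r) - 1, truncated Taylor coefficients */
    float r2  = r * r;
    float p12 = 0.5f + r * (1.0f / 6.0f);            /* c1 + c2 * r */
    float p34 = (1.0f / 24.0f) + r * (1.0f / 120.0f);/* c3 + c4 * r */
    float p14 = p12 + p34 * r2;      /* c1 + c2 r + c3 r^2 + c4 r^3 */
    float poly = r + r2 * p14;       /* c0 r + ... + c4 r^5, c0 = 1 */

    /* exp(x) = scale * (1 + poly(r)) */
    return scale + poly * scale;
}
```

Each line corresponds one-to-one to an intrinsic in the vector version, which makes the role of the shift constant and the left-shift-by-23 easier to see.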
**content/learning-paths/servers-and-cloud-computing/fexpa/theory.md**
---
title: Theory
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## The exponential function
The exponential function is a fundamental mathematical function used across a wide range of algorithms in signal processing, High-Performance Computing, and Machine Learning. Optimizing its computation has been the subject of extensive research for decades. The precision of the computation depends both on the selected approximation method and on the rounding errors inherent to finite-precision arithmetic; this precision is directly traded off against performance when implementing the exponential function.

## Range reduction
Polynomial approximations are among the most widely used methods for software implementations of the exponential function. The accuracy of a Taylor series approximation of the exponential function improves with the polynomial’s degree, but always deteriorates as the evaluation point moves further from the expansion point. By applying range reduction techniques, however, the approximation can be restricted to a very narrow interval where the function is well conditioned. This approach consists of reformulating the exponential function in the following way:

$$e^x=e^{k \times ln2+r}=2^k \times e^r$$

where

$$x=k \times ln2+r, k \in Z, r \in [-ln2/2, +ln2/2]$$

Since k is an integer, the evaluation of 2^k can be efficiently performed using bit manipulation techniques, while e^r can be approximated with a polynomial p(r). Hence:

$$e^x \approx 2^k \times p(r)$$

It is important to note that the polynomial p(r) is evaluated exclusively over the interval [-ln2/2, +ln2/2]. So, the computational complexity can be optimized by selecting the polynomial degree based on the required precision of p(r) within this narrow range. Rather than relying on a Taylor polynomial, a minimax polynomial approximation can be used to minimize the maximum approximation error over the considered interval.

## Decomposition of the input
The decomposition of an input value as x = k × ln2 + r can be done in two steps:
- Compute k as: k = round(x/ln2), where round(.) is the round-to-nearest function
- Compute r as: r = x - k × ln2

Rounding of k is performed by adding an adequately chosen large constant to a floating-point value and subtracting it immediately afterward (the original value is rounded because of the finite precision of the floating-point representation). Although explicit rounding instructions are available in both SVE and SME, this method remains advantageous because the addition of the constant can be fused with the multiplication by the reciprocal of ln2. However, this approach assumes that the floating-point rounding mode is set to round-to-nearest, which is the default mode in Armv9-A. By integrating the bias into the constant, 2^k can also be computed directly by shifting the intermediate value.
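The add-and-subtract rounding trick looks like this for single precision (assuming round-to-nearest mode and |x| well below 2^22):

```c
#include <assert.h>

/* Round to the nearest integer by exploiting finite float precision:
   adding 1.5*2^23 pushes the fractional bits out of the significand. */
static float round_by_shift(float x)
{
    const float shift = 0x1.8p23f;  /* 1.5 * 2^23 */
    volatile float z = x + shift;   /* volatile keeps the two steps separate */
    return z - shift;
}
```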

Rounding error during the second step will introduce a global error as we will have:

$$ x \approx k \times ln2 + r $$

To reduce the rounding errors during the computation of the reduced argument r, the Cody and Waite argument reduction technique is used.

## Computation of the scaling factor
By leveraging the structure of floating-point number formats, it becomes relatively straightforward to compute 2^k for k∈Z. In the IEEE-754 standard, normalized floating-point numbers in binary interchange format are represented as:

$$ (-1)^s \times 2^{(exponent - bias)} \times (1.fraction)_2 $$

where s is the sign bit and 1.fraction represents the significand.

The value 2^k can be encoded by setting both the sign and fraction bits to zero and assigning the exponent field the value k + bias. If k is an 8-bit integer, 2^k can be efficiently computed by adding the bias value and positioning the result into the exponent bits of a 32-bit floating-point number using a logical shift.
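In C, this amounts to one addition and one shift; the sketch below assumes k stays in the normalized range [-126, 127]:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build 2^k directly from its IEEE-754 bit pattern: sign and fraction
   are zero, and the exponent field holds k + 127. */
static float pow2_int(int k)
{
    uint32_t bits = (uint32_t)(k + 127) << 23;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```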

Taking this approach a step further, a fast approximation of the exponential function can be achieved using bit manipulation techniques alone. Specifically, adding a bias to an integer k and shifting the result into the exponent field can be accomplished by computing an integer i as follows:

$$i=2^{23} \times (k+bias) = 2^{23} \times k+2^{23} \times bias$$

This formulation assumes a 23-bit significand, but the method can be generalized to other floating-point precisions.

Now, consider the case where k is a real number. The fractional part of k will propagate into the significand bits of the resulting 2^k approximation. However, this side effect is not detrimental: it effectively acts as a form of linear interpolation, thereby improving the overall accuracy of the approximation. To approximate the exponential function, the following identity can be used:

$$e^x = 2^{x/ln2}$$

As previously discussed, this value can be approximated by computing a 32-bit integer:

$$i = 2^{23} \times x/ln2 + 2^{23} \times bias = a \times x + b $$
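Putting these pieces together gives the well-known fast approximation often attributed to Schraudolph; the sketch below computes i = a·x + b and reinterprets the integer as a float. Because the fractional bits provide only linear interpolation between powers of two, the relative error can reach roughly 6%.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Fast exp from bit manipulation alone: i = (2^23/ln2)*x + 2^23*bias. */
static float fast_exp(float x)
{
    const float a = 8388608.0f / 0.69314718f;  /* 2^23 / ln2 */
    const float b = 1065353216.0f;             /* 2^23 * 127 */
    int32_t i = (int32_t)(a * x + b);
    float f;
    memcpy(&f, &i, sizeof f);
    return f;
}
```

This is the cheapest point on the accuracy/performance trade-off discussed at the start of this section; the polynomial and lookup-table refinements recover the lost accuracy.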