2022-09-24

Best non-trigonometric floating point approximation of tanh(x) in 10 instructions or less

Description

I need a reasonably accurate fast hyperbolic tangent for a machine that has no built-in floating point trigonometry, so e.g. the usual tanh(x) = (exp(2x) - 1) / (exp(2x) + 1) formula is going to need an approximation of exp(2x).
All other instructions like addition, subtraction, multiplication, division, and even FMA (= MUL+ADD in 1 op) are present.

Right now I have several approximations, but none of them are satisfactory in terms of accuracy.

[Update from the comments:]

  • The instruction for trunc()/floor() is available
  • There is a way to transparently reinterpret floats as integers and do all kinds of bit ops
  • There is a family of instructions called SEL.xx (.GT, .LE, etc.) which compare 2 values and choose what to write to the destination
  • DIVs are twice as slow, so nothing exceptional, DIVs are okay to use

Approach 1

Accuracy: ±1.2% absolute error, see here.

Pseudocode (A = accumulator register, T = temporary register):

[1] FMA T, 36.f / 73.f, A, A   // T := 36/73 + X^2
[2] MUL A, A, T                // A := X(36/73 + X^2)
[3] ABS T, A                   // T := |X(36/73 + X^2)|
[4] ADD T, T, 32.f / 73.f      // T := |X(36/73 + X^2)| + 32/73
[5] DIV A, A, T                // A := X(36/73 + X^2) / (|X(36/73 + X^2)| + 32/73)

Approach 2

Accuracy: ±0.9% absolute error, see here.

Pseudocode (A = accumulator register, T = temporary register):

[1] FMA T, 3.125f, A, A        // T := 3.125 + X^2
[2] DIV T, 25.125f, T          // T := 25.125/(3.125 + X^2)
[3] MUL A, A, 0.1073f          // A := 0.1073*X
[4] FMA A, A, A, T             // A := 0.1073*X + 0.1073*X*25.125/(3.125 + X^2)
[5] MIN A, A, 1.f              // A := min(0.1073*X + 0.1073*X*25.125/(3.125 + X^2), 1)
[6] MAX A, A, -1.f             // A := max(min(0.1073*X + 0.1073*X*25.125/(3.125 + X^2), 1), -1)

Approach 3

Accuracy: ±0.13% absolute error, see here.

Pseudocode (A = accumulator register, T = temporary register):

[1] FMA T, 14.f, A, A          // T := 14 + X^2
[2] FMA T, -133.f, T, T        // T := (14 + X^2)^2 - 133
[3] DIV T, A, T                // T := X/((14 + X^2)^2 - 133)
[4] FMA A, 52.5f, A, A         // A := 52.5 + X^2
[5] MUL A, A, RSQRT(15.f)      // A := (52.5 + X^2)/sqrt(15)
[6] FMA A, -120.75f, A, A      // A := (52.5 + X^2)^2/15 - 120.75
[7] MUL A, A, T                // A := ((52.5 + X^2)^2/15 - 120.75)*X/((14 + X^2)^2 - 133)
[8] MIN A, A, 1.f              // A := min(((52.5 + X^2)^2/15 - 120.75)*X/((14 + X^2)^2 - 133), 1)
[9] MAX A, A, -1.f             // A := max(min(((52.5 + X^2)^2/15 - 120.75)*X/((14 + X^2)^2 - 133), 1), -1)

The question

Is there anything better that can possibly fit in 10 non-trigonometric float32 instructions?



No comments:

Post a Comment