Sunday, March 11, 2007

Polynomial Evaluation Optimization

Polynomial expressions are commonly used to evaluate various of mathematical functions, such as log(x), sin(x). Motivated by the parallel prefix algorithm, I developed a new polynomial evaluation algorithm (at least, as far as I know) to improve polynomial evaluation on modern processors. The test results show that the new method could achieve nearly 2 times speed-up on x86 32-bit system and 3 times speed-up x86 64-bit system (due to the additional XMM registers available in 64-bit system.) for a simple 6-order polynomial. Higher-order polynomial should be even higher speed-up. Quietly amazing!

Typically, a polynomial
y = sum_{i=0}{n} c_i * x^i
will be strength-reduced by a modern compiler as
y := c_n;
y := y * x + c_{n-1};
...
y := y * x + c0;

Even though such an optimization will reduce the multiplication and addition into the optimal ones, i.e., (n-1) multiplications and additions, but it didn't consider the data dependency among instructions. In this optimization, each instruction will depend on the previous one. It is difficult for compiler to schedule them on a highly pipelined processor and hide high ALU latency (compared to ALU throughput.) However, the polynomial can be re-written in the following way:

y = sum_{i=0}{k-1} c_i * x^i + x^k sum_{i=0}{n-k} c_{k+i} * x^i

i.e., the polynomial is split into two parts, L and R, and it is calculated as L + x^k * R. Obviously, L and R can be computed independently or in parallel to full utilize ALU pipeline. The splitting can be recursively and the final result is a balance tree with the multiplicative-addition operator and hence the computation can be done in log(n) steps if there are up to n/2 processors available. Considering that modern processors have multiple function units and each unit is a highly pipelined one, the number of (virtual) processors are quite high, and so is the speed-up.

No comments: