Thursday, March 22, 2007

new R600 photos

VR-Zone posted new photos of the R600-based X2900 XTX. It looks very different from the previous photos; those were most likely engineering boards. The new board is much shorter and should be about the same length as NVIDIA's G8800 GTX. More photos and a schematic are also posted on the VR-Zone forum.

The only thing I want to know now is exactly when this monster will be available.

Wednesday, March 14, 2007

Welcome back, Vector Processor!

I have been keeping an eye on the development of GPGPU since 2000, and in mid-2006 I developed a GPU-based cache simulator and other GPU-based applications. Now I have been learning about NVIDIA's new G80 and CUDA. Hey, I have to say: welcome back, Vector Processor! The trend is further confirmed by AMD's Fusion, which integrates the CPU and GPU, and by Intel's Larrabee (or CGPU).

CUDA reveals many details of the architectural design of the G80. I will write them down in more detail later.

Sunday, March 11, 2007

Polynomial Evaluation Optimization

Polynomial expressions are commonly used to evaluate various mathematical functions, such as log(x) and sin(x). Motivated by the parallel prefix algorithm, I developed a new polynomial evaluation algorithm (new, at least as far as I know) to speed up polynomial evaluation on modern processors. The test results show that the new method achieves nearly a 2x speed-up on a 32-bit x86 system and a 3x speed-up on a 64-bit x86 system (thanks to the additional XMM registers available in 64-bit mode) for a simple degree-6 polynomial. Higher-degree polynomials should see even larger speed-ups. Quite amazing!

Typically, a polynomial
y = sum_{i=0}^{n} c_i * x^i
will be strength-reduced by a modern compiler as
y := c_n;
y := y * x + c_{n-1};
...
y := y * x + c_0;
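
As a concrete illustration, here is a minimal C sketch of this Horner form for the degree-6 case mentioned above (my own illustrative code, not the code I benchmarked):

/* Horner evaluation of a degree-6 polynomial with coefficients
   c[0]..c[6]: each multiply-add depends on the previous one,
   forming a serial chain of six dependent operations. */
double horner6(double x, const double c[7])
{
    double y = c[6];
    y = y * x + c[5];
    y = y * x + c[4];
    y = y * x + c[3];
    y = y * x + c[2];
    y = y * x + c[1];
    y = y * x + c[0];
    return y;
}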

Even though such an optimization reduces the work to the minimal operation count, i.e., n multiplications and n additions for a degree-n polynomial, it does not consider the data dependency among instructions. In this form, each instruction depends on the previous one, so it is difficult for the compiler to schedule them on a highly pipelined processor and hide the high ALU latency (relative to ALU throughput). However, the polynomial can be rewritten in the following way:

y = sum_{i=0}^{k-1} c_i * x^i + x^k * sum_{i=0}^{n-k} c_{k+i} * x^i

i.e., the polynomial is split into two parts, L and R, and it is calculated as L + x^k * R. Obviously, L and R can be computed independently, i.e., in parallel, to fully utilize the ALU pipeline. The splitting can be applied recursively, and the final result is a balanced tree of multiply-add operations, so the computation can be done in O(log n) steps if up to n/2 processors are available. Considering that modern processors have multiple functional units and each unit is highly pipelined, the number of (virtual) processors is quite high, and so is the speed-up.
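
A minimal C sketch of one way to apply this splitting recursively to the same degree-6 polynomial (again my own illustration, not the benchmarked code):

/* Recursive splitting of a degree-6 polynomial c[0]..c[6].
   The pairs t01, t23, t45 are mutually independent, as are L and R,
   and x2/x4 are computed alongside them, so the longest dependency
   chain is roughly half as deep as the six-step Horner chain. */
double split6(double x, const double c[7])
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double t01 = c[1] * x + c[0];   /* c0 + c1*x           */
    double t23 = c[3] * x + c[2];   /* c2 + c3*x           */
    double t45 = c[5] * x + c[4];   /* c4 + c5*x           */
    double L = t23 * x2 + t01;      /* c0 + ... + c3*x^3   */
    double R = c[6] * x2 + t45;     /* c4 + c5*x + c6*x^2  */
    return R * x4 + L;              /* L + x^4 * R         */
}

Whether the measured 2x-3x speed-up materializes depends on how well the compiler keeps these independent chains in separate registers, which is presumably where the extra XMM registers in 64-bit mode help.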