Ruby has a cleaner and more comprehensive object-oriented framework than Python. I started learning Ruby and embedding it in some C programs. It turns out that Ruby is still at an early stage and has a lot of rough edges. One issue I ran into is the shebang handling.
When Ruby reads the input file, it checks whether the shebang (the first line, if it starts with '#!') refers to Ruby itself. If not, it invokes the executable named by the shebang. But by "itself," it means literally "ruby". This causes errors when your Ruby-embedded interpreter is not recognized as "itself": the interpreter keeps re-invoking itself recursively and finally fails for lack of resources.
A simple workaround is to put "ruby" into the name of your Ruby-embedded interpreter so that it passes as the Ruby interpreter "itself."
A more systematic solution is to check the shebang against argv[0] instead.
Thursday, September 20, 2007
Tuesday, August 21, 2007
TILERA's TILE64 multicore processor
TILERA has announced the TILE64 multicore processor. Compared to Intel's Terascale research prototype, TILE64 not only contains a similar number of cores but also offers many additional features.
Amazingly, TILE64's interconnect is also a mesh network, which scales better than a crossbar and performs better than a ring. TILE64 features 5 meshes: 2 dedicated to data transfer between tiles and memory (I guess cache-to-cache transfers too) and the other 3 for application use.
The core of each tile is a full-featured general-purpose processor, in contrast to the primitive cores announced by other startups. It has not only a cache hierarchy but also virtual memory support (MMU and TLB). The core itself is a 3-way VLIW pipeline. Most likely, though, the core has no FP support, considering its current process and targeted applications. The spec says it has 5MB of cache in total; each tile likely has an 8KB I-cache, an 8KB D-cache, and a 64KB L2 cache (or cache tile).
Interestingly, TILE64 provides cache coherence and a shared L2 cache. It is still unclear how the caches are kept coherent. But to deliver TILE64's full performance, the cache hierarchy probably won't be used for the main data flow because of its high latency; the dedicated mesh networks among tiles should be used instead to minimize the cache-coherence overhead.
TILE64 also provides a friendly development environment: Linux is ready to run on it.
So far, only Sun's Niagara2 offers comparable features to TILE64, but with a flat memory hierarchy. It will be interesting to see which one is better suited for network processing.
Saturday, July 14, 2007
cut vs. awk
I had been ignoring 'awk' for a long time and relied on 'cut' to extract sections of a file for text processing. But 'cut' is very limited in its delimiters, whereas 'awk' has a friendlier delimiter definition. For example,
| cut -d' ' -f 3
will only treat a single white space as the delimiter, but
| awk '{print $3}'
will treat one or more white spaces as the delimiter, which helps a lot with output formatted using white space instead of TAB.
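A quick demonstration of the difference, using a made-up line with two spaces between the first two fields:

```shell
# Sample line with TWO spaces between "alpha" and "beta".
line="alpha  beta gamma"

# cut treats every single space as a delimiter, so field 2 is the empty
# string between the two spaces, and field 3 is "beta".
echo "$line" | cut -d' ' -f3

# awk treats a run of whitespace as one delimiter, so $3 is "gamma".
echo "$line" | awk '{print $3}'
```

The two commands disagree as soon as the input contains consecutive spaces, which is exactly the case with most column-aligned program output.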
Wednesday, July 4, 2007
Cell vs. GPU vs. CPU
A slide presents the transistor counts and chip areas of Cell, GPUs, and CPUs. Below are other resource comparisons I am interested in.
| | Cell | HPC Cell | Intel Clovertown | AMD Winsor+ | ATI R580 X1950 | nVidia G80 GTX |
|---|---|---|---|---|---|---|
| Memory on Chip | 512KB(L2)+ 256KB(LS)x8 | 512KB(L2)+ 256KB(LS)x8 | 4MB(L2)x2 | 1MB(L2)x2 | (TC) | 16KBx16(SHM)+ (TC) |
| Memory off Chip | 25.6 GB/s | 25.6 GB/s | 10.41 GB/s (FSB) | 17.0 GB/s (DDR2-1066x2) | 64.0 GB/s | 86.4 GB/s |
| IO | 25.6 GB/s | 25.6 GB/s | 10.41 GB/s (FSB) | 22.4 GB/s (HT) | 4.0 GB/s (PCIex16) | 4.0 GB/s (PCIex16) |
Wednesday, May 16, 2007
G80 more details from R600
R600 finally came out, but only a mid-range card can be ordered for now. The reviews (1, 2, etc.) didn't show many advantages over NVIDIA's G80. NVIDIA also put out a FUD presentation. It is quite interesting to read, and it reveals more details of both G80 and R600.
- G80 has one special function unit per ALU, which has never been documented or shown in any NVIDIA presentation or in the CUDA manuals.
- R600 is a super-scalar VLIW architecture, which differs from the R580 design in using 5 scalar units instead of 1 vector unit and 1 scalar unit. That is an evolutionary step for GPU architectures. The analysis of the super-scalar VLIW architecture is fair in pointing out the efficiency issue. But G80 is not a true scalar architecture either; it is still a kind of vector processor and has efficiency issues of its own (though, yes, from somewhat different perspectives). With compiler improvements, R600 will have more advantages if latency is critical to the computation.
- It is reported that R600 is a cache-heavy design, with most of the die devoted to SRAM. Besides the texture cache and vertex cache, R600 has an additional read/write cache to virtualize registers. This will benefit GPGPU (depending on the configuration of that read/write cache). If the read/write cache is a "real" cache, it will improve R600's programmability and performance compared to CUDA on G80. The ever-increasing complexity of graphics workloads and the popularity of GPGPU have created demands on the GPU memory hierarchy. R600 may be optimized for stream computing.
Tuesday, April 17, 2007
CentOS5 with Acrobat Reader
The CentOS5 distribution comes with gtk-2.10. The Acrobat Reader package from adobe.com seems to have a problem handling that, due to the function that retrieves the gtk version. The following patch fixes it.
--- acroread.orig 2007-04-17 16:47:25.000000000 +0800
+++ acroread 2007-04-17 16:47:55.000000000 +0800
@@ -415,7 +415,7 @@
return 1
fi
- echo $mfile| sed 's/libgtk-x11-\([0-9]*\).0.so.0.\([0-9]\)00.\([0-9]*\)\|\(.*\)/\1\2\3/g'
+ echo $mfile| sed 's/libgtk-x11-\([0-9]*\).0.so.0.\([0-9]*\)00.\([0-9]*\)\|\(.*\)/\1\2\3/g'
return 0
fi
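The reason the original pattern breaks on gtk-2.10 is that its soname version field now has two digits before the trailing "00", which the single-digit group `\([0-9]\)` cannot match; the patched `\([0-9]*\)` can. This can be checked directly (the soname below is an example; the exact micro version on a given system may differ):

```shell
# gtk-2.10 installs a soname like libgtk-x11-2.0.so.0.1000.4.  The
# patched pattern extracts "2", "10", and "4" and concatenates them:
echo "libgtk-x11-2.0.so.0.1000.4" | \
  sed 's/libgtk-x11-\([0-9]*\).0.so.0.\([0-9]*\)00.\([0-9]*\)\|\(.*\)/\1\2\3/g'
# i.e. gtk 2.10.4 comes out as "2104"
```

With the original single-digit group, the "1000" field is mis-parsed and the script computes a wrong version number, which is what trips up the launcher.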
Thursday, March 22, 2007
new R600 photos
VR-Zone posted new photos of the R600-based X2900 XTX. It looks very different from the previous photos; those must have been engineering boards. The new board is much shorter and should be the same length as NVIDIA's G8800 GTX. More photos and a schematic are also posted on the VR-Zone forum.
The only thing I want to know is when this monster will be available, the exact date.
Wednesday, March 14, 2007
Welcome back! Vector Processor!
Since 2000 I have been keeping an eye on the development of GPGPU, and by mid-2006 I had developed a GPU-based cache simulator and other GPU-based applications. Now I have learned about NVIDIA's new G80 and CUDA. Hey, I have to say: welcome back, Vector Processor! This is further confirmed by AMD's Fusion, which integrates CPU and GPU, and by Intel's Larrabee, or CGPU.
CUDA reveals many details of G80's architecture design. I will write them down in more detail later.
Sunday, March 11, 2007
Polynomial Evaluation Optimization
Polynomial expressions are commonly used to evaluate various mathematical functions, such as log(x) and sin(x). Motivated by the parallel prefix algorithm, I developed a new (at least, as far as I know) algorithm to speed up polynomial evaluation on modern processors. The test results show that the new method achieves nearly 2x speed-up on an x86 32-bit system and 3x speed-up on an x86 64-bit system (thanks to the additional XMM registers available in 64-bit mode) for a simple 6th-order polynomial. Higher-order polynomials should see even higher speed-ups. Quite amazing!
Typically, a polynomial
y = sum_{i=0}^{n} c_i * x^i
will be strength-reduced by a modern compiler as
y := c_n;
y := y * x + c_{n-1};
...
y := y * x + c_0;
Such an optimization reduces the work to the optimal count, i.e., n multiplications and n additions, but it does not consider the data dependency among instructions: each instruction depends on the previous one. That makes it difficult for the compiler to schedule them on a highly pipelined processor and hide the high ALU latency (compared to the ALU throughput). However, the polynomial can be rewritten as follows:
y = sum_{i=0}^{k-1} c_i * x^i + x^k * sum_{i=0}^{n-k} c_{k+i} * x^i
i.e., the polynomial is split into two parts, L and R, and computed as L + x^k * R. Obviously, L and R can be computed independently, in parallel, to fully utilize the ALU pipeline. The splitting can be applied recursively, and the final result is a balanced tree over the multiply-add operator; hence the computation can be done in log(n) steps if up to n/2 processors are available. Considering that modern processors have multiple function units and each unit is highly pipelined, the number of (virtual) processors is quite high, and so is the speed-up.