Berkeley Lab Researchers Analyze Performance, Potential of Cell Processor
May 30, 2006
BERKELEY, Calif. — Though it was designed as the heart of the upcoming Sony PlayStation3 game console, the STI Cell processor has created quite a stir in the computational science community, where the processor’s potential as a building block for high performance computers has been widely discussed and speculated upon.
To gauge Cell’s potential, computer scientists at the U.S. Department of Energy’s Lawrence Berkeley National Laboratory evaluated the processor’s performance on several scientific application kernels and compared it with that of other processor architectures. The group presented its results in a paper at the ACM International Conference on Computing Frontiers, held May 2-6, 2006, in Ischia, Italy.
The paper, “The Potential of the Cell Processor for Scientific Computing,” was written by Samuel Williams, Leonid Oliker, Parry Husbands, Shoaib Kamil and Katherine Yelick of Berkeley Lab’s Future Technologies Group and by John Shalf from NERSC.
“Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency,” the authors wrote in their paper. “We also conclude that Cell’s heterogeneous multi-core implementation is inherently better suited to the HPC environment than homogeneous commodity multicore processors.”
Cell, designed by a partnership of Sony, Toshiba, and IBM, combines a high-performance implementation of a software-controlled memory hierarchy with the considerable floating-point resources that demanding numerical algorithms require. Cell departs radically from conventional multiprocessor and multi-core architectures. Instead of using identical cooperating commodity processors, it uses a conventional high-performance PowerPC core that controls eight simple SIMD (single instruction, multiple data) cores called synergistic processing elements (SPEs). Each SPE contains a synergistic processing unit (SPU), a local memory, and a memory flow controller.
Despite its radical departure from mainstream general-purpose processor design, Cell is particularly compelling because it will be produced at such high volumes that it will be cost-competitive with commodity CPUs. At the same time, the slowing growth of commodity microprocessor clock rates and increasing chip power demands have become a concern to computational scientists, encouraging the community to consider alternatives like STI Cell. The authors examined the potential of using the forthcoming STI Cell processor as a building block for future high-end parallel systems by investigating performance across several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations on regular grids, and 1D and 2D fast Fourier transforms.
According to the authors, the current implementation of Cell is most often noted for its extremely high single-precision (32-bit) floating-point performance, but the majority of scientific applications require double precision (64-bit). Although Cell’s peak double-precision performance is still impressive relative to its commodity peers (eight SPEs at 3.2 GHz = 14.6 Gflop/s), the group quantified how modest hardware changes, which they named Cell+, could improve double-precision performance.
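As a rough check, the 14.6 Gflop/s peak quoted above can be reproduced with simple arithmetic. The sketch below rests on assumptions that this article does not state: that each SPE can issue one 4-wide single-precision fused multiply-add per cycle (8 flops), and one 2-wide double-precision fused multiply-add (4 flops) only once every 7 cycles. The function name and constants are illustrative only.

```python
# Back-of-the-envelope peak rates for Cell's eight SPEs.
# ASSUMPTION (not from the article): per-SPE issue rates are
#   single precision: 8 flops every cycle (4-wide fused multiply-add)
#   double precision: 4 flops every 7 cycles (2-wide fused multiply-add)

SPES = 8          # number of SPEs, from the article
CLOCK_GHZ = 3.2   # clock rate, from the article


def peak_gflops(flops_per_issue, cycles_per_issue):
    """Aggregate peak rate in Gflop/s across all SPEs."""
    return SPES * CLOCK_GHZ * flops_per_issue / cycles_per_issue


single = peak_gflops(8, 1)
double = peak_gflops(4, 7)

print(f"single precision peak: {single:.1f} Gflop/s")
print(f"double precision peak: {double:.1f} Gflop/s")
print(f"single/double ratio:   {single / double:.0f}x")
```

Under these assumptions the double-precision figure rounds to the 14.6 Gflop/s quoted above.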
The authors developed a performance model for Cell and used it to compare Cell directly with the AMD Opteron, Intel Itanium2 and Cray X1 architectures. The model then guided the development of implementations that were run on IBM’s Full System Simulator to provide even more accurate performance estimates.
The authors argue that Cell’s three-level memory architecture, which decouples main memory accesses from computation and is explicitly managed by the software, provides several advantages over mainstream cache-based architectures. First, performance is more predictable, because the load time from an SPE’s local store is constant. Second, long block transfers from off-chip DRAM can achieve a much higher percentage of memory bandwidth than individual cache-line loads. Finally, for predictable memory access patterns, communication and computation can be effectively overlapped by careful scheduling in software.
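The benefit of overlapping communication with computation can be illustrated with a simple timing model. The sketch below is not the authors’ performance model. It assumes a double-buffering scheme in which one block is computed while the next is transferred, and the function names and timing values are hypothetical.

```python
# Toy timing model for overlapping block transfers with computation.
# ASSUMPTIONS (illustrative, not from the article): every block takes the
# same transfer time and the same compute time, and two buffers are used,
# so the transfer of block i+1 runs concurrently with the compute of block i.


def serial_time(n_blocks, transfer, compute):
    """Each block is fetched, then computed, with no overlap."""
    return n_blocks * (transfer + compute)


def double_buffered_time(n_blocks, transfer, compute):
    """First fetch is exposed, then each step costs the slower of the
    two overlapped activities, and the final compute is exposed."""
    if n_blocks == 0:
        return 0.0
    return transfer + (n_blocks - 1) * max(transfer, compute) + compute


# Hypothetical workload: 100 blocks, 1.0 time units to transfer each block,
# 1.5 time units to compute on it.
n, t, c = 100, 1.0, 1.5
print("serial:         ", serial_time(n, t, c))
print("double-buffered:", double_buffered_time(n, t, c))
```

In this model the overlapped schedule approaches the cost of the slower activity alone as the number of blocks grows, which is why predictable access patterns matter: the software must know which block to fetch next before it is needed.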
While the authors’ current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking. On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors, even though Cell’s peak double-precision performance is fourteen times lower than its peak single-precision performance. If Cell were to include at least one fully utilizable pipelined double-precision floating-point unit, as proposed in the Cell+ implementation, these speedups would easily double.
The full paper can be read at http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf.
The paper was written primarily by members of LBNL’s Future Technologies Group, part of Berkeley Lab’s Computational Research Division (http://crd.lbl.gov/), which creates computational tools and techniques that enable scientific breakthroughs by conducting applied research and development in computer science, computational science, and applied mathematics.