Green Flash Project Runs Logical Prototype Successfully
May 11, 2010
The Green Flash project, which is exploring the feasibility of building a new class of energy-efficient supercomputers for climate modeling, has successfully reached its first milestone by running the dynamical core of the Global Cloud Resolving Model (GCRM) on logical prototypes of both single- and dual-core Green Flash processors, with eight-core processors coming soon.
“The logical prototype simulates the entire circuit design of the proposed processor,” says John Shalf, head of NERSC’s Advanced Technologies Group and principal investigator of Green Flash.
The prototype was designed in collaboration with Tensilica, Inc., using Tensilica’s Xtensa LX-2 extensible processor core as the basic building block. It has been running cycle-accurate hardware emulations of the circuit design on a BEE3 FPGA platform, which is used for computer architecture research by the RAMP Consortium (Research Accelerator for Multi-Processors). A next-generation, limited-area model version of GCRM has been used as the test code.
David Donofrio of Berkeley Lab’s Computational Research Division (CRD), who works on the hardware design of Green Flash, ran the first single-core prototype in a demonstration at the SC08 conference in Austin, Texas in November 2008. This was followed by a multiprocessor demo at the SC09 conference. Green Flash was first proposed publicly in the paper “Towards Ultra-High Resolution Models of Climate and Weather,” written by Michael Wehner and Lenny Oliker of CRD and Shalf of NERSC.
One solution for three problems
The Green Flash project addresses three research problems simultaneously — a climate science problem, a computer architecture/hardware problem, and a software problem. This multidisciplinary development process is commonly referred to as hardware/software co-design.
The climate science problem stems from the resolution of current climate models, which is too coarse to directly calculate the behavior of cumulus convective cloud systems (see “Bringing Clouds into Focus,” page xx). Direct numerical simulation of individual cloud systems would require horizontal grid resolutions approaching 1 km. To develop a 1 km cloud model, scientists would need a supercomputer that is 1,000 times more powerful than what is available today.
But building a supercomputer that powerful with conventional microprocessors (the kind used to build personal computers) would cost about $1 billion and would require 200 megawatts of electricity to operate, enough energy to power a small city of 100,000 residents. That constitutes the computer architecture problem. In fact, the energy consumption of conventional computers is now recognized as a major problem not just for climate science, but for all of computing, from cell phones to the largest-scale systems.
Shalf, Wehner, and Oliker see a possible solution to these challenges — achieving high performance with a limited power budget and with economic viability — in the low-power embedded microprocessors found in cell phones, iPods, and other electronic devices. Unlike the general-purpose processors found in personal computers and most supercomputers, where versatility comes at a high cost in power consumption and heat generation, embedded processors are designed to perform only what is required for specific applications, so their power needs are much lower. The embedded processor market also offers a robust set of design tools and a well-established economic model for developing application-specific integrated circuits (ASICs) that achieve power efficiency by tailoring the design to the requirements of the application. Chuck McParland of CRD has been examining issues of manufacturability and cost projections for the Green Flash design to demonstrate the cost-effectiveness of this approach.
Meeting the performance target for the climate model using this technology approach will require on the order of 20 million processors. Conventional approaches to programming are unable to scale to such massive concurrency. The software problem addressed by the Green Flash project involves developing new programming models that are designed with million-way concurrency in mind, and exploiting auto-tuning technology to automate the optimization of the software design to operate efficiently on such a massively parallel system.
To meet this challenge, Tony Drummond of CRD and Norm Miller of the Earth Sciences Division are analyzing the code requirements, and Shoaib Kamil, a graduate student in computer science at the University of California, Berkeley (UCB) who is working at NERSC, has been developing an auto-tuning framework for the climate code. This framework automatically extracts sections of the Fortran source code of the climate model and optimizes them for Green Flash and a variety of other architectures, including multicore processors and graphics processors.
An innovative aspect of the Green Flash research is the hardware/software co-design process, in which early versions of both the processor design and the application code are developed and tested simultaneously. The RAMP emulation platform allows scientists to run the climate code on different hardware configurations and evaluate those designs while they are still on the drawing board. Members of the RAMP consortium on the UC Berkeley campus, including John Wawrzynek and Krste Asanovic (both of whom have joint appointments at NERSC), Greg Gibeling, and Dan Burke, have been working closely with David Donofrio of NERSC and the Green Flash hardware team throughout the development process. A RAMP test at UCB has successfully emulated more than 1,000 cores.
At the same time, auto-tuning tools for code generation test different software implementations on each hardware configuration to increase performance, scalability, and power efficiency. Marghoob Mohiyuddin, another UCB graduate student at NERSC, has been working on automating the hardware/software co-design process. With a dual-core processor configuration on the RAMP emulator, Mohiyuddin can test more than 200 configurations in one day, which is 125 times faster than conventional approaches to design space exploration. The result will be a combination of hardware and software optimized to solve the cloud modeling problem.
The researchers estimate that the proposed Green Flash supercomputer, using about 20 million embedded microprocessors, would deliver the 1 km cloud model results and cost perhaps $75 million to construct (a more precise figure is one of the project goals). This computer would consume less than 4 megawatts of power and achieve a peak performance of 200 petaflops.
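The conventional and Green Flash figures quoted above can be compared with a few lines of arithmetic. This is only a sketch that restates the article's own estimates; the variable names are illustrative, and the 4-megawatt value is treated as an upper bound.

```python
# Estimates quoted in the article for a 1 km cloud-model machine.
conventional_cost_usd = 1_000_000_000  # "about $1 billion"
conventional_power_mw = 200            # "200 megawatts"
green_flash_cost_usd = 75_000_000      # "perhaps $75 million" (rough estimate)
green_flash_power_mw = 4               # "less than 4 megawatts" (upper bound)
processors = 20_000_000                # "about 20 million"

power_ratio = conventional_power_mw / green_flash_power_mw
cost_ratio = conventional_cost_usd / green_flash_cost_usd
# Whole-system power divided evenly over the processor count.
watts_per_processor = green_flash_power_mw * 1e6 / processors

print(f"power reduction: at least {power_ratio:.0f}x")
print(f"cost reduction:  roughly {cost_ratio:.1f}x")
print(f"system power per processor: at most {watts_per_processor:.2f} W")
```

Because the cost figure is explicitly provisional, the cost ratio should be read as an order-of-magnitude indication rather than a result.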
Maximizing efficiency
Figure 1. The on-chip network fabric for the Green Flash system-on-chip. A concentrated torus network fabric yields the highest performance and most power-efficient design for scientific codes.
Following the design philosophy that the best way to reduce power consumption and increase efficiency is to reduce waste, the Green Flash team chose an architecture with a very simple in-order core and no branch prediction. Because the climate model’s demands for memory and communication are high, both aspects drive the core design, which includes a local store to maximize use of the available dynamic RAM (DRAM) bandwidth.
As Figure 1 shows, the design uses a torus network fabric with two on-chip networks. Most of the communication among the climate model’s subdomains is nearest neighbor, and experiments showed that a concentrated torus topology provides superior performance and energy efficiency for codes in which a nearest-neighbor communication pattern dominates. The researchers are currently targeting a core with a clock speed of 500 MHz, a 32-Kbyte conventional error correction code (ECC)-protected cache per core, and a 128-Kbyte local store. The availability of a conventional cache will allow code to be incrementally ported to use the local store. Each socket of 128 cores will have a 50-Gbyte-per-second interface to DRAM.
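To put the socket parameters in perspective, the sketch below works out how many sockets a 20-million-core system implies and how much DRAM bandwidth each core would see. It assumes that 1 Gbyte means 10^9 bytes and that the socket bandwidth is shared evenly among cores; neither assumption is stated in the article.

```python
CORES_TOTAL = 20_000_000        # target processor count quoted in the article
CORES_PER_SOCKET = 128
DRAM_BYTES_PER_SEC = 50e9       # 50 Gbyte/s per socket (assumes 1 Gbyte = 1e9 bytes)
CLOCK_HZ = 500e6                # 500 MHz target clock

sockets = CORES_TOTAL // CORES_PER_SOCKET
# Assumption: bandwidth is divided evenly among the cores of a socket.
bw_per_core = DRAM_BYTES_PER_SEC / CORES_PER_SOCKET
bytes_per_cycle = bw_per_core / CLOCK_HZ

print(f"sockets needed:          {sockets}")
print(f"DRAM bandwidth per core: {bw_per_core / 1e6:.1f} Mbyte/s")
print(f"DRAM bytes per cycle:    {bytes_per_cycle:.2f}")
```

Under these assumptions each core receives less than one byte from DRAM per clock cycle, which is consistent with the emphasis on a local store that keeps reused data on chip.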
Achieving the target execution rate on 20 million processors requires computing on a local mesh size that is 8 × 8 × 10 cells. If the code were to run on conventional cache-based hardware, it would spend 90 percent of its time in communication due to the overhead penalty of exchanging extremely small messages between cores. But Green Flash has added specialized hardware to each core to enable extremely low-overhead messaging between cores, bringing the communication overhead below 20 percent of the total execution time. This ultra-low-overhead streaming interface bypasses the cache to minimize latency and connects to one of the on-chip torus networks. The narrow network is for address exchange; the wider torus network is for bulk data exchange using asynchronous direct memory access (DMA) data transfers. The address space for each processor’s local store is mapped into the global address space, and the data exchange is done as a DMA from local store to local store.
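A rough calculation illustrates why such a small subdomain is communication-heavy and what the lower overhead buys. The sketch assumes a one-cell-wide halo on all six faces of the 8 × 8 × 10 block (the article does not specify the halo layout) and assumes that compute time per step stays fixed while only the communication share changes.

```python
def halo_cells(nx, ny, nz, width=1):
    """Ghost cells around an nx x ny x nz block, including edges and corners.

    Assumption: a halo of `width` cells on all six faces.
    """
    padded = (nx + 2 * width) * (ny + 2 * width) * (nz + 2 * width)
    return padded - nx * ny * nz


def speedup_from_lower_overhead(comm_before, comm_after):
    """Speedup when compute time is fixed and the communication share shrinks.

    Total time is compute / (1 - comm_fraction), so the ratio of totals is
    (1 - comm_after) / (1 - comm_before).
    """
    return (1.0 - comm_after) / (1.0 - comm_before)


interior = 8 * 8 * 10
halo = halo_cells(8, 8, 10)
print(f"interior cells: {interior}, halo cells: {halo}")
print(f"speedup going from 90% to 20% communication: "
      f"{speedup_from_lower_overhead(0.90, 0.20):.1f}x")
```

With these assumptions the halo is nearly as large as the interior, and cutting communication from 90 percent to 20 percent of execution time corresponds to about an eightfold reduction in time per step.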
From a logical programming view, all processors are directly connected to each other, but physically they are connected by a concentrated torus network that maps onto the chip’s 2D planar geometry. To further simplify programming, a traditional cache hierarchy is also in place so that codes can be ported gradually to the more efficient interprocessor network.
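The idea of a concentrated 2D torus can be sketched as follows: several cores share one router, and routers are linked to their four neighbors with wraparound at the edges. The concentration factor of 4 and the 8 × 4 router grid are hypothetical values chosen so that 128 cores fit; the article does not give the actual layout.

```python
def router_coords(core_id, concentration, width):
    """Map a core to the (x, y) position of the router it shares.

    `concentration` cores attach to each router (hypothetical value below).
    """
    router = core_id // concentration
    return (router % width, router // width)


def torus_neighbors(x, y, width, height):
    """Four nearest-neighbor routers on a 2D torus (edges wrap around)."""
    return [
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    ]


# Hypothetical layout: 128 cores, 4 cores per router, routers in an 8 x 4 torus.
WIDTH, HEIGHT, CONCENTRATION = 8, 4, 4
x, y = router_coords(127, CONCENTRATION, WIDTH)
print((x, y), torus_neighbors(x, y, WIDTH, HEIGHT))
```

Every router has exactly four neighbors regardless of its position, so nearest-neighbor exchanges take one hop even at the edges of the grid.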
Figure 2. Photonic switching elements. (1) Light is coupled onto a perpendicular path; (2) messages propagate straight through. Insensitivity to distance and the absence of complex structures are strong advantages over a purely electrical interconnect.
To minimize power, the researchers are investigating the use of hybrid electronic-photonic interconnects for the inter-core network, which could prove to be an efficient way of transferring long messages. Designers place photonic detectors and emitters along with specialized low-power photonic switching elements on a special interconnect layer and interface them with processing elements using conventional electronic routers. Figure 2 shows how the switching elements work. Large-scale communications occur over photonic links, which have several strong advantages over electronic networks. Energy consumption for photonics is less dependent on signaling rate and distance compared to electronics, and the photonic switches are much simpler, as they do not require buffers or repeaters.
Preliminary research with messaging patterns from scientific applications shows that such hybrid networks have the potential to bring major gains in efficiency, due to their lower power consumption combined with fast propagation speed. Early research studies done in collaboration with the Lightwave Research Laboratory at Columbia University, for example, show that a hybrid electronic-photonic interconnect composed of ring resonators can deliver 27 times better energy efficiency than electrical interconnects alone.
To optimize the code and reduce the computational burden, the Green Flash team created an autotuning framework that automatically searches a range of optimizations to improve the application kernels’ computational efficiency. The autotuner first systematically applies compiler optimizations and then uses domain-specific knowledge of the algorithm to take more aggressive steps, such as loop reordering, to produce optimal but functionally equivalent code. In this way, it maintains performance across a diverse set of architectures.
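The empirical search at the heart of an autotuner can be sketched in a few lines. The example below is a toy written in pure Python, not the Green Flash framework (which works on the climate model's Fortran source). It generates variants of a small stencil kernel that differ in loop order and tile size, checks each against a reference for functional equivalence, times it, and keeps the fastest. All function names and parameter values are illustrative.

```python
import itertools
import time


def _update(grid, out, i, j):
    # 4-point average of the neighbors; identical arithmetic in every variant.
    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                        + grid[i][j - 1] + grid[i][j + 1])


def stencil_reference(grid):
    """Straightforward, untiled kernel used as the correctness reference."""
    out = [row[:] for row in grid]
    for i in range(1, len(grid) - 1):
        for j in range(1, len(grid[0]) - 1):
            _update(grid, out, i, j)
    return out


def make_variant(loop_order, tile):
    """Build a tiled kernel with the given inner loop order ("ij" or "ji")."""
    def kernel(grid):
        n, m = len(grid), len(grid[0])
        out = [row[:] for row in grid]
        for i0 in range(1, n - 1, tile):
            for j0 in range(1, m - 1, tile):
                rows = range(i0, min(i0 + tile, n - 1))
                cols = range(j0, min(j0 + tile, m - 1))
                if loop_order == "ij":
                    for i in rows:
                        for j in cols:
                            _update(grid, out, i, j)
                else:
                    for j in cols:
                        for i in rows:
                            _update(grid, out, i, j)
        return out
    return kernel


def autotune(grid, loop_orders=("ij", "ji"), tiles=(4, 8, 16)):
    """Return (seconds, loop_order, tile) for the fastest correct variant."""
    expected = stencil_reference(grid)
    best = None
    for order, tile in itertools.product(loop_orders, tiles):
        kernel = make_variant(order, tile)
        if kernel(grid) != expected:   # reject variants that change the result
            continue
        start = time.perf_counter()
        kernel(grid)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best[0]:
            best = (elapsed, order, tile)
    return best


grid = [[float((7 * i + 3 * j) % 11) for j in range(34)] for i in range(34)]
print(autotune(grid))
```

The winning variant depends on the machine it runs on, which is the point: the same search, rerun on a different architecture, may select a different loop order or tile size.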
Figure 3. Effect of optimization on a single loop in the climate model. In addition to greatly reducing the instruction count, optimization reduced the cache footprint of this loop by more than 100 times. With software tuning, Green Flash can reduce a per-core computational requirement of 3.5 gigaflops to a more feasible 0.5 gigaflops.
Figure 3 shows the autotuning results for the climate model. The researchers ran the autotuning framework using the Tensilica architectural simulator, reducing the cache footprint and overall instruction count and increasing the kernel’s computational density. They first generated the original requirement of 3.5 gigaflops per core using a machine that ran with approximately 5 percent efficiency. Autotuners, combined with hardware optimizations, will play a key role in dramatically increasing the efficiency of Green Flash. Through these combined optimizations, Green Flash is expected to realize a two-orders-of-magnitude increase in efficiency.
The hardware-software codesign method tailors the hardware to autotuned software to get better energy efficiency. The autotuning technology can automate the search for the best combination of tuned software and hardware in a coordinated design cycle. As Figure 4 shows, this cotuning approach incorporates extensive software tuning into the hardware design process. The autotuned software tailors the application to the hardware design point under consideration by empirically searching software implementations to find the best mapping of software to microarchitecture.

Figure 4. Cotuning in the Green Flash design. (a) Conventional autotuning uses source code generators and search heuristics to empirically choose an efficient software implementation given a high-level representation of a kernel. (b) Hardware-software cotuning extends conventional hardware design space exploration by using autotuning to tailor software to each hardware design point.
Figure 5. The advantages of cotuning for three kernel types common in scientific applications. AE and PE points denote configurations with highest area and power efficiencies. Improvements varied from 2x to 50x.
As a demonstration of this cotuning methodology, the Green Flash team used the Smart Memories multiprocessor (based on Tensilica cores) as the target architecture and three widely used kernels from scientific computing: dense matrix-matrix multiplication, stencil codes, and sparse matrix vector multiplication. As part of exploring the hardware design space, they varied four hardware parameters: number of cores, whether caches are managed by hardware or software, cache size per core, and total memory bandwidth available. They estimated the area and power of each hardware configuration that had the corresponding best software configuration, which they obtained through autotuning. As Figure 5 shows, power and area efficiencies improved dramatically for the three kernels.
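The structure of such a cotuning loop can be sketched as an outer sweep over hardware design points with an inner software search at each point. In the example below, the four hardware parameters match those named above, but every value and every performance, area, and power formula is invented for illustration; none of them are the Smart Memories figures or the team's models.

```python
import itertools

# Hypothetical design space: the four parameters named in the article,
# with made-up values.
HARDWARE_SPACE = {
    "cores": (1, 4, 16),
    "software_managed": (False, True),
    "cache_kb": (16, 64),
    "bandwidth_gbs": (1.6, 3.2),
}
SOFTWARE_TILES = (8, 16, 32, 64)   # hypothetical software tuning parameter


def modeled_performance(hw, tile):
    """Toy performance model (invented): min of compute and memory limits."""
    working_set_bytes = 3 * tile * tile * 8
    fits = working_set_bytes <= hw["cache_kb"] * 1024
    reuse = tile if fits else 1            # tiles that spill lose all reuse
    compute_bound = float(hw["cores"])
    memory_bound = hw["bandwidth_gbs"] * reuse / 8.0
    if hw["software_managed"]:
        memory_bound *= 1.2                # invented bonus for explicit control
    return min(compute_bound, memory_bound)


def modeled_area(hw):
    return hw["cores"] * (1.0 + hw["cache_kb"] / 64.0)     # invented


def modeled_power(hw):
    return hw["cores"] * 0.1 + hw["bandwidth_gbs"] * 0.05  # invented


def cotune():
    """Autotune software at every hardware point, then rank the points."""
    names = list(HARDWARE_SPACE)
    results = []
    for values in itertools.product(*HARDWARE_SPACE.values()):
        hw = dict(zip(names, values))
        # Inner loop: pick the best software configuration for this hardware.
        tile = max(SOFTWARE_TILES, key=lambda t: modeled_performance(hw, t))
        perf = modeled_performance(hw, tile)
        results.append({
            "hw": hw,
            "tile": tile,
            "perf": perf,
            "area_eff": perf / modeled_area(hw),
            "power_eff": perf / modeled_power(hw),
        })
    ae_point = max(results, key=lambda r: r["area_eff"])
    pe_point = max(results, key=lambda r: r["power_eff"])
    return results, ae_point, pe_point


results, ae_point, pe_point = cotune()
print("most area-efficient: ", ae_point["hw"], "tile", ae_point["tile"])
print("most power-efficient:", pe_point["hw"], "tile", pe_point["tile"])
```

The key design choice is that each hardware point is judged only by its best-tuned software, so a design is never penalized for being paired with code tuned for a different configuration.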
The hardware-software codesign process enables scientific application developers to directly participate in the design process for future supercomputers in an unprecedented way. With this fast, accurate emulation environment, designers can run and benchmark the actual climate model as it is being developed and use cotuning to quickly search a large design space.
Scaling up
In considering any system of this scale, a myriad of system software issues come to the forefront, such as scalable operating systems, fault resilience infrastructure, and the development of entirely new programming models to make billion-way parallelism more tractable.
Although the fault resilience problem is certainly not trivial, neither is it unusual. Across silicon design processes with the same design rules, hard failure rates are proportional to the number of system sockets and typically stem from mechanical failures. Soft error rates are proportional to the chip surface area, not how many cores are on a chip. And bit error rates tend to increase with clock rate. The Green Flash architecture is unremarkable in all these respects and should not pose challenges beyond those that a conventional approach faces.
To deal with hard errors, designers often add redundant cores per chip to cover defects, a strategy that is entirely feasible for the Green Flash design. Moreover, Green Flash’s low power dissipation per chip (7 to 15 W) will reduce the mechanical and thermal stresses that often result in a hard error. To address soft errors, the design includes all the basics for reliability and error recovery in the memory subsystem, including full ECC protection for all hierarchical levels. Green Flash’s low target clock frequency provides a higher signal-to-noise ratio for on-chip data transfers, which helps keep bit error rates down. Finally, to enable faster rollback if an error does occur, the design makes it possible to incorporate a nonvolatile RAM controller onto each SMP so that each node can perform a local rollback as needed. This strategy enables much faster rollback than user-space checkpointing.
The researchers are also exploring novel programming models together with hardware support to express fine-grained parallelism. The goal of this development thrust is to create a new software model that can provide a stable platform for software development for the next decade and beyond for all scales of scientific computing.
They have developed direct hardware support for both the message passing interface (MPI) and partitioned global address space (PGAS) programming models to enable scaling of these familiar single program, multiple data (SPMD) programming styles to much larger-scale systems. The modest hardware support enables relatively well-known programming paradigms to utilize massive on-chip concurrency and to use hierarchical parallelism to enable use of larger messages for interchip communication. GCRM’s icosahedral formulation of the climate problem can expose a massive degree of parallelism through domain decomposition, which can use a 20-million processor computing system. The autotuning framework is rapidly evolving into a generalized code generator, which allows the programmer to express the solver kernels at a much higher level of abstraction — enabling a productive programming environment that supports portability, performance, and correctness without exposing scientists to the details of the computer architecture. This approach is expected to support a broad range of codes that have such inherent explicit parallelism.
However, not all applications will be able to express parallelism through simple divide-and-conquer problem partitioning. The Green Flash team is beginning to explore new asymmetric and asynchronous approaches to achieving strong-scaling performance improvements from explicit parallelism. Techniques that resemble classic static dataflow methods are garnering renewed interest because of their ability to flexibly schedule work and to accommodate state migration to correct load imbalances and failures.
In the case of the GCRM climate code, dataflow techniques can be used to concurrently schedule the physics computations with the dynamical core of the climate code, thereby doubling concurrency without moving to a finer domain decomposition. This approach also benefits from the unique interprocessor communication interfaces developed for Green Flash. Once the new parallelization procedure has been demonstrated successfully on a range of leading extreme-scale applications, other similar codes can adopt it, accelerating development efforts for the entire field.
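The dependency structure described here, with physics and dynamics scheduled concurrently from the same input state and then combined, can be sketched with futures. The two tendency functions below are hypothetical stand-ins, not GCRM code, and the example illustrates only the scheduling pattern, not the performance of real hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def dynamics_tendency(state):
    """Hypothetical stand-in for the dynamical core's contribution."""
    return [0.5 * x for x in state]


def physics_tendency(state):
    """Hypothetical stand-in for the physics computations."""
    return [1.0 for _ in state]


def timestep(state, pool):
    """Advance one step; both tasks depend only on `state`, so they can overlap."""
    dyn = pool.submit(dynamics_tendency, state)
    phys = pool.submit(physics_tendency, state)
    # The update is the only point that waits on both results.
    return [s + d + p for s, d, p in zip(state, dyn.result(), phys.result())]


with ThreadPoolExecutor(max_workers=2) as pool:
    print(timestep([2.0, 4.0], pool))
```

Because neither task reads the other's output, a scheduler is free to place them on different cores within the same subdomain, which is how concurrency can grow without a finer domain decomposition.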
Designs that follow the Green Flash approach have the potential to open a market demand for massively concurrent components that can also be the building blocks for mid- and extreme-scale computing systems. “We believe that our decision to draw from the embedded computing industry will produce technology that reduces economic and manufacturing barriers to constructing computing systems useful to science,” Shalf says. “It will also ensure that selected technologies have broad market impact for everything from the smallest handheld to the largest supercomputer. The investment will thus be the center of a sustainable software-hardware universe supported by applications across the IT industry.”
The Green Flash prototype research is funded by Berkeley Lab’s Laboratory Directed Research and Development program.




