Forward to the Past: A Different Way to Cope with Dark Silicon

Tuesday, February 8th, 2011 by Max Baron

Leigh’s comment on whether dark silicon is a design problem or a fundamental law presents an opportunity to explore an “old” processor architecture, the Ambric architecture, whose implementation made use of dark silicon but did not escape the limitations that power budgets impose on Moore’s Law.

Mike Butts introduced the Ambric architecture at the 2006 Fall Microprocessor Forum, an event at which I served as technical content chairperson. Tom Halfhill, my colleague at the time, wrote an article about Ambric’s approach, and in February 2007 Ambric won In-Stat’s 2006 Microprocessor Report Analysts’ Choice Award for Innovation.

I’ll try to describe the architecture for those who are not familiar with it.

The Ambric architecture’s configuration went beyond the classical MIMD definition. It was described as a globally asynchronous, locally synchronous (GALS) architecture, a description that for chip designers held connotations of clockless processing. That description, however, does not detract in any way from the innovation or from the award for which I voted.

The streaming Ambric architecture, as I saw it at the time, could be described as a heterogeneous mix of two types of processing cores plus memories and interconnect.

Ambric’s programming innovation involved software objects assigned to specific combinations of cores and/or memory, whose execution could proceed in their own time and at their own clock rates; this is probably the reason for the software-defined term “asynchronous architecture.” The cores themselves were clocked, however, and some could be clocked at different rates, though probably in sync to avoid metastability.

The two types of processor cores provided by the Am2045, the chip introduced at the event, were described as SRs (Streaming RISC), engaged mainly in managing communications and utilities for the second type of core: the high-performance SRDs (Streaming RISC with DSP extensions), which were the heavy-lifting cores of the architecture.

Perhaps the most important part of Ambric’s innovation was the concept of objects assigned to combinations of one or more cores that could be treated as software/hardware black boxes. The black boxes could be interconnected via registers and control logic that made them behave as if they were FIFOs.

I believe this is the most important part of the innovation because it almost entirely removes the overhead of thread synchronization. By removing this major obstacle to exploiting highly parallelizable workloads, such as those encountered in DSP applications, Ambric opened the architecture to execution by hundreds and possibly thousands of cores, but at the price of reduced generality and of more human involvement in routing objects across the interconnect to get the best performance from processor cores and memory.
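To make the black-box idea concrete, here is a minimal sketch, in ordinary Python rather than Ambric’s own tools, of objects that run at their own pace and communicate only through bounded FIFO channels. The three-stage pipeline and its names are my own invention for illustration; the point is that back-pressure from the FIFOs replaces explicit thread synchronization.

    import threading
    import queue

    def producer(out_ch):
        # Streams ten samples downstream; blocks only when the FIFO is full.
        for sample in range(10):
            out_ch.put(sample)
        out_ch.put(None)  # end-of-stream marker

    def scaler(in_ch, out_ch):
        # A "black box" object: reads from one FIFO, writes to another,
        # entirely in its own time.
        while True:
            sample = in_ch.get()
            if sample is None:
                out_ch.put(None)
                break
            out_ch.put(sample * 2)

    def consumer(in_ch, results):
        while True:
            sample = in_ch.get()
            if sample is None:
                break
            results.append(sample)

    # Bounded queues stand in for the register-and-control FIFO interconnect.
    a = queue.Queue(maxsize=4)
    b = queue.Queue(maxsize=4)
    results = []
    stages = [threading.Thread(target=producer, args=(a,)),
              threading.Thread(target=scaler, args=(a, b)),
              threading.Thread(target=consumer, args=(b, results))]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    print(results)  # [0, 2, 4, ..., 18]

Note that no locks or condition variables appear in the stage code itself; each object simply blocks on its channels, which is essentially what the register/FIFO interconnect did in hardware.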

At a smaller technology node that provides, for example, four times more transistors, the Ambric architecture can cover the die with cores and memories, but it can’t quadruple its computing speed (switchings per second) because of power-budget limitations, whether those are imposed by temperature limits or by battery capacity. Designers can only decrease the chip’s VDD to match the chip’s power dissipation to its power budget, but in doing so they must also reduce clock frequency and the associated performance.
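A back-of-envelope calculation shows why. The sketch below assumes the textbook dynamic-power relation, power proportional to C x VDD^2 x f, and assumes that attainable clock frequency scales roughly linearly with supply voltage; both are simplifications, and the numbers are illustrative only.

    def relative_power(cap, vdd, freq):
        # Dynamic power ~ switched capacitance * VDD^2 * frequency,
        # all expressed relative to the previous node (1.0 = old value).
        return cap * vdd**2 * freq

    # Four times the transistors, same VDD and clock: 4x the power. Not allowed.
    print(relative_power(4.0, 1.0, 1.0))    # -> 4.0

    # Hold power constant instead. With f ~ VDD, power ~ C * VDD^3,
    # so VDD must drop by a factor of 4**(1/3), about 1.59.
    vdd = (1.0 / 4.0) ** (1.0 / 3.0)        # ~0.63 of the old VDD
    freq = vdd                              # ~0.63 of the old clock
    print(relative_power(4.0, vdd, freq))   # -> ~1.0, back within budget
    print(4.0 * freq)                       # -> ~2.5x throughput, not 4x

Under these assumptions, quadrupling the transistor count buys roughly 2.5x the aggregate throughput at a fixed power budget: exactly the gap between the transistors Moore’s Law delivers and the performance a chip can actually deliver.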

The idea of connecting “black boxes” originated with designers of analog computers and hybrid analog/digital computers at least 60 years ago. It was the approach employed in designing computers just before the introduction of the von Neumann architecture. Ambric’s innovation in creating a software/hardware combination is probably independent of that past.

Compared with Ambric’s approach, the UCSD/MIT idea is based on a number of different compiler-created, efficient small cores, each specialized to execute the short code sequences critical to the computer’s performance. The UCSD/MIT architecture can enjoy more generality in executing workloads, on condition that specific small cores are created for the types of workloads targeted. By raising small-core frequency without creating dangerous hot spots, the architecture can deliver performance while keeping within power-budget boundaries, but it too can’t deliver increased compute performance at the same rate that Moore’s Law delivers transistors.
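For readers who want the flavor of that dispatch model, here is a toy sketch of my own, not the UCSD/MIT toolchain: profile-identified hot sequences run on a specialized small core when one exists, and everything else falls back to the general-purpose core. All names here are hypothetical.

    def dot_product_core(a, b):
        # Stands in for a compiler-generated small core hard-wired
        # for one performance-critical inner loop.
        return sum(x * y for x, y in zip(a, b))

    # The "chip" only has small cores for the workloads it was built for.
    SPECIALIZED_CORES = {"dot_product": dot_product_core}

    def general_core(op, *args):
        # The fully general (and less power-efficient) path; a real chip
        # would execute arbitrary code here.
        ops = {"add": lambda a, b: a + b}
        return ops[op](*args)

    def dispatch(op, *args):
        # Run on a matching small core if one was synthesized for this
        # code sequence; otherwise pay the general-core cost.
        core = SPECIALIZED_CORES.get(op)
        return core(*args) if core else general_core(op, *args)

    print(dispatch("dot_product", [1, 2, 3], [4, 5, 6]))  # small core -> 32
    print(dispatch("add", 2, 3))                          # general core -> 5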
