Articles by Carolina Blanch, Rogier Baert, and Maja D'Hondt

Carolina Blanch Perez del Notario was born in Pamplona, Spain. She received her M.S. degree in Telecommunications Engineering in Spain and an M.S. in Artificial Intelligence from the Katholieke Universiteit Leuven. Since 2002 she has been working as a Research Engineer at imec, Belgium. Her research interests include video coding and transmission, cross-layer optimizations, and run-time resource management techniques. Rogier Baert was born in Vinkel, The Netherlands. He received his M.S. in Electrical Engineering from TU Eindhoven. Since 2006 he has been working as a Researcher at imec, Belgium. His research interests include simulation, optimization, and design-space exploration of integrated circuits and systems. Maja D'Hondt was born in Antwerp, Belgium. She obtained an M.S. in Computer Science and a PhD from the Vrije Universiteit Brussel, Belgium. Before joining imec she worked on a research project with ASML in Amsterdam and held a full research position at INRIA in France. Since 2007 she has been a team leader at imec, where her team develops middleware for embedded and high-performance systems.

Implementing system-wide load balancing

Friday, May 20th, 2011 by Carolina Blanch, Rogier Baert, and Maja D'Hondt

To enable intelligent adaptation in multiprocessing designs that perform system-wide load balancing, we implemented several functions and added them to the devices. Examples are continuous monitoring of resources and workload, as well as bandwidth monitoring of the network link. This allows a device to react promptly when the workload exceeds the available resources, which can happen because of (1) variations in the workload or (2) reductions in the available processing power, for example when the battery runs down. In both cases, the processing of some tasks should be transferred to more powerful devices on the network.
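The trigger for such a transfer can be sketched as a simple check: compare the monitored workload against the processing budget, with the budget shrinking when the battery is low. This is a minimal illustration in Python; the function name, thresholds, and the MIPS-based cost model are assumptions for the sketch, not part of imec's actual middleware.

```python
def needs_offload(workload_mips, capacity_mips, battery_level,
                  low_battery_scale=0.5, headroom=0.9):
    """Return True when some tasks should migrate to another device.

    workload_mips  - monitored processing demand on this device
    capacity_mips  - nominal processing capacity of this device
    battery_level  - 0.0 (empty) .. 1.0 (full); a low battery reduces
                     the effective capacity (illustrative policy)
    """
    effective = capacity_mips * (low_battery_scale if battery_level < 0.2 else 1.0)
    # Keep some headroom so migration starts before the device saturates.
    return workload_mips > headroom * effective
```

With full battery, a device loaded at 95% of capacity would request offloading, while one at 80% would not; with a nearly empty battery the effective capacity halves, so even a 50% load triggers migration.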

In our experiment, device A and device B both implement the decoding of video streams, followed by an image analysis function (a logo detector used for quality control) and re-encoding of the video streams (see Figure 2). Depending on the resource availability at either device, some stream processing tasks are transferred seamlessly to the less loaded device. Finally, the output of all processed streams is sent to a display client. This task migration automatically lowers the workload at the overloaded device, while all videos are visualized without any artifacts.

Load balancing at the system level involves migrating tasks between devices in the network.

We implement a Device Runtime Manager (RM) at each device that monitors the workload, resources, video quality, and link bandwidth within that device. Note that bandwidth monitoring is required to guarantee that the processed data can be accommodated in the available bandwidth.

While the Device RM can take care of load balancing within the device, a Network Runtime Manager is implemented to perform load balancing between devices in the network. To do so, it receives the monitored information from each individual Device RM and decides on which device to execute each task. This way, when resources at device A are insufficient for the existing workload, the Network RM shifts some stream processing tasks from device A to device B. Obviously, this task distribution between devices A and B depends on the resource availability at both devices. In our experiment, the screen on the client display shows on which device each displayed video stream is currently processed. In the example in Figure 2, due to lack of resources on device A, the processing of 6 video streams has been migrated to device B, while the remaining 3 are processed on device A.
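The Network RM's placement decision can be sketched as a greedy heuristic: each stream goes to the device that currently has the most free capacity. The per-stream cost and the device capacities below are illustrative numbers chosen so the outcome mirrors the 3/6 split of Figure 2; the real Network RM uses the monitored information from the Device RMs rather than fixed costs.

```python
def assign_streams(n_streams, cost_per_stream, capacities):
    """Greedy placement: each stream goes to the device with the most
    free capacity left. `capacities` maps device name -> capacity units."""
    free = dict(capacities)
    placement = {}
    for stream in range(n_streams):
        device = max(free, key=free.get)      # least-loaded device
        free[device] -= cost_per_stream
        placement[stream] = device
    return placement

# 9 streams of equal cost, device A weaker than device B:
result = assign_streams(9, 10, {"A": 40, "B": 70})
```

With these (made-up) numbers, 3 streams land on device A and 6 on device B, matching the distribution shown in Figure 2.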

In a similar way, Figure 3 shows the monitored workload on devices A and B. Around time 65 s, several stream processing tasks are added to device A, overloading it. To decrease the load on device A, the Network RM gradually shifts tasks from device A to device B, resulting in the load distribution seen from time 80 s onward. This way, load balancing between devices in the network overcomes the processing limitations of device A.

The processing load distribution over time for devices A and B.

Another way to implement load balancing between devices is to scale down the task requirements. For video decoding applications, for example, this may translate into reducing the spatial or temporal resolution of the video stream. In another experimental setup, when the workload on a decoding client becomes too high, the server transcodes the video streams to a lower resolution before sending them to the client device. This naturally reduces the workload on the client device at the cost of increased processing at the server.
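The server's transcoding decision can be sketched as a mapping from the client's monitored load to a target resolution. The resolution ladder and load thresholds below are assumptions for illustration; any real deployment would tune them to the content and the client hardware.

```python
def pick_resolution(client_load,
                    ladder=((1920, 1080), (1280, 720), (640, 360)),
                    thresholds=(0.7, 0.9)):
    """Choose the spatial resolution the server transcodes to, based on
    the client's load (0.0 = idle, 1.0 = saturated). Higher load ->
    lower resolution -> lighter decoding workload at the client."""
    if client_load < thresholds[0]:
        return ladder[0]   # client has headroom: full resolution
    if client_load < thresholds[1]:
        return ladder[1]   # moderate overload: step down once
    return ladder[2]       # heavy overload: lowest resolution
```

A lightly loaded client keeps full 1080p streams; as its load crosses the thresholds the server steps down to 720p and then 360p, trading server-side transcoding effort for client-side decoding relief.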

Note that we chose to place the Network RM at device A, but it could be located at any other network device, as the overhead involved is very low. Note also that in both experiments the decision to migrate tasks or adapt the content is taken to fit the processing constraints and to guarantee the desired quality-of-service.

However, another motivation for load balancing can be to minimize the energy consumption at a specific device, extending its battery lifetime, or even controlling the processors' temperature. Think of cloud computing data centers, where cooling systems are critical and account for a large fraction of the energy consumed. Load balancing between servers could be used to control the processors' temperature by switching processors on and off or by shifting tasks between them. This could potentially reduce the cooling effort and pave the way for greener and more efficient data centers.

We are facing a future where applications are becoming increasingly complex and demanding, with a highly dynamic and unpredictable workload. To tackle these challenges we require high flexibility of adaptation and cooperation from devices in the network. Our research and experiments contribute to an environment with intelligent and flexible devices that are able to optimally exploit and share their own resources and those available in the network, while adapting to system dynamics such as workload, resources, and bandwidth.

Dynamic runtime task assignment

Tuesday, April 19th, 2011 by Carolina Blanch, Rogier Baert, and Maja D'Hondt

Innovative multimedia applications – think of multi-camera surveillance or multipoint videoconferencing – are demanding in both processing power and bandwidth. Moreover, a video processing workload is highly dynamic and subject to stringent real-time constraints. To process these types of applications, the trend is to use commodity hardware, because it provides higher flexibility and reduces the hardware cost. This hardware is often heterogeneous: a mix of central processing units (CPUs), graphics processing units (GPUs), and digital signal processors (DSPs).

But implementing these multimedia applications efficiently onto one or more heterogeneous processors is a challenge, especially if you also want to take into account the highly dynamic workload. State-of-the-art solutions tackle this challenge through fixed assignments of specific tasks to types of processors. However, such static assignments lack the flexibility to adapt to workload or resource variations and often lead to poor efficiency and over-dimensioning.

Another strategy would be to implement dynamic task assignment. To prove that this is feasible, we have implemented middleware that performs both run-time monitoring of workloads and resources, and run-time task assignment onto and between multiple heterogeneous processors. As platforms, we used an NVIDIA Quadro FX 3700 GPU and dual quad-core Intel Xeon processors.

Load-balancing between the cores of a device

Our experiment consists of running multiple pipelines in which MPEG-4 decoding, frame conversion, and AVC video encoding are serialized tasks. Of these, the most demanding task, motion estimation (part of video encoding), can run either on a CPU or, CUDA-accelerated, on the GPU. Figure 1 compares our experimental runtime assignment strategy with two static assignment strategies that mimic the behavior of state-of-the-art OS-based systems.
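Having two implementations of the same task is what makes runtime assignment possible: the middleware can route motion estimation to either processor depending on current load. This is a hedged sketch of that dispatch step; the class and function names are invented for illustration, and real CUDA dispatch would of course involve actual kernel launches rather than a string return value.

```python
class MotionEstimationTask:
    """A task that ships with both a CPU and a CUDA implementation
    (sketch; in our pipelines this is the most demanding encoder stage)."""
    def __init__(self, has_cuda_version=True):
        self.has_cuda_version = has_cuda_version

def dispatch(task, cpu_load, gpu_load):
    """Route the task to the less-loaded processor. Tasks without a
    CUDA version can only go to the CPU."""
    if task.has_cuda_version and gpu_load < cpu_load:
        return "GPU"
    return "CPU"
```

With the CPU at 90% load and the GPU at 40%, a CUDA-capable task is sent to the GPU; a CPU-only task stays on the CPU regardless.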

This chart shows how the throughput increases by dynamic task assignment within a device.

The first static assignment lets the operating system assign all tasks to the CPU cores. The second assigns all CUDA-accelerated tasks to the GPU, while the remaining tasks are scheduled on the CPU cores. The latter, by enabling GPU implementations of the most demanding tasks, increases the number of streams that can be processed from 10 to 15. However, at this point the GPU becomes the bottleneck and limits the number of processed frames.

The last, dynamic, strategy overcomes both CPU and GPU limitations and bottlenecks by finding an optimal balance between CPU and GPU assignments at runtime. This achieves a throughput increase of 20% compared with the fixed assignment to GPU and CPU. The efficiency and flexibility of the available hardware is thus increased, while the overhead remains marginal, around 0.05% of the total execution time.
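One simple way to realize such a runtime balance is to estimate, for each pending task, on which processor it would finish sooner given the current queue backlogs, and assign it there. This is a minimal sketch under assumed per-task execution times (the 30 ms / 10 ms figures below are invented for illustration, not measurements from our setup).

```python
def balance(cpu_queue_ms, gpu_queue_ms, task_cpu_ms, task_gpu_ms, pending):
    """Assign each of `pending` identical tasks to whichever processor
    would complete it sooner, given current queue backlogs (in ms)."""
    assignment = []
    for _ in range(pending):
        if cpu_queue_ms + task_cpu_ms <= gpu_queue_ms + task_gpu_ms:
            cpu_queue_ms += task_cpu_ms
            assignment.append("CPU")
        else:
            gpu_queue_ms += task_gpu_ms
            assignment.append("GPU")
    return assignment
```

If a task takes 30 ms on the CPU but only 10 ms on the GPU and both queues start empty, the first tasks pile onto the GPU until its backlog makes the CPU competitive, so neither processor sits idle while the other is the bottleneck.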

From the device to the cloud

Load balancing within the device is a first step that is necessary to maximize the device’s capabilities and to cope with demanding and variable workloads. But to overcome the limited processing capacity of a device, a second step is needed: load balancing at the system level. Only this will allow exploiting the potential of a highly connected environment where resources from other devices can be shared.

In addition, today's applications tend to become more complex and demanding in terms of both processing and bandwidth. On top of this, users keep demanding lighter, portable, multifunctional devices with longer battery life. Naturally, this makes it a serious challenge for these devices to meet the processing power required by many applications.

The way to solve this is to balance the load between multiple devices by offloading tasks from overloaded or processing-constrained devices to more capable ones that can process them remotely. This is linked to the "thin client" and "cloud computing" concepts, where the limited processing resources on a device are virtually expanded by shifting processing tasks to other devices in the same network.

As an example, think of home networks or local area networks connecting multiple heterogeneous devices. Some of these are light portable devices, such as iPhones and PDAs, with limited processing capabilities and battery life, while others, such as media centers or desktops at home, or processing nodes and servers in the network infrastructure, offer much more processing power.

One consequence of migrating tasks from lighter devices to more capable ones is that the communication and throughput between devices increase. In particular, for video applications, where video is remotely decoded and transmitted, the bandwidth demand can be very high. Fortunately, upcoming wireless technologies are providing increasingly high bandwidth and connectivity, enabling load balancing. This is the case for LTE femtocells, where up to 100 Mbps downlink is available, and for wireless HD communication systems in the 60 GHz range, where even 25 Gbps is promised.
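Because migration itself consumes bandwidth, the same monitoring that drives task placement must also check that the migrated streams fit on the link, which is exactly why the Device RM monitors link bandwidth. A minimal sketch of that feasibility check follows; the safety margin of 80% is an assumed policy, not a measured figure.

```python
def migration_fits(stream_bitrates_mbps, link_capacity_mbps, safety=0.8):
    """Check that the streams to be migrated fit within a safety
    fraction of the link capacity, leaving margin for bursts and
    signaling traffic (illustrative policy)."""
    return sum(stream_bitrates_mbps) <= safety * link_capacity_mbps
```

Three 8 Mbps streams fit comfortably on a 100 Mbps link, but three 30 Mbps high-definition streams would exceed the 80 Mbps budget, so the Network RM should then either keep some tasks local or reduce the stream resolution first.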

However, meeting the bandwidth needs is only one of the many challenges posed. Achieving efficient and flexible load balancing in the network also requires a high degree of cooperation and intelligence from devices in the network. This implies not only processing capabilities at the network side but also monitoring, decision making, and signaling capabilities.

Experimental work at imec has shown that, as challenging as it may sound, the future in which all devices in the network efficiently communicate and share resources is much closer than we think.