========================================================================
2009 International Workshop on Multi-Core Computing Systems (MuCoCoS'09)
Fukuoka, Japan, March 16-19, 2009, in conjunction with CISIS'09
========================================================================

PROGRAM

Monday, March 16, 2009
========================================================================
[11:00-12:10] SESSION MUCOCOS-S1
Session Chair: Sabri Pllana, University of Vienna, Austria
------------------------------------------------------------------------
[11:00-11:10] Opening remarks
Sabri Pllana, University of Vienna, Austria

[11:10-11:40] Evaluating the run-time performance of Kahn process
network implementation techniques on shared-memory multiprocessors.
Zeljko Vrba, Paal Halvorsen, Carsten Griwodz

ABSTRACT.
Software development tools have not adapted to the growing popularity
of multi-core CPUs, and developers are still “stuck” with low-level and
high-cost thread abstractions. The situation is becoming even more
complicated with the advent of heterogeneous computing. In this
article, we point out some drawbacks of threads and other abstractions
currently in use, and propose Kahn process networks (KPN) as a more
high-level and efficient abstraction for developing parallel
applications. We show that the native POSIX mechanisms (threads and
message queues) perform suboptimally as an implementation vehicle for
KPNs, and we present an implementation of a run-time environment that
can execute KPNs with less overhead. Our evaluation shows the
advantages and disadvantages of statically mapping Kahn processes to
CPUs.

[11:40-12:10] Experimental Study of Multithreading to Improve Memory
Hierarchy Performance of Multi-core Processors for Scientific
Applications.
Enes Bajrovic, Eduard Mehofer

ABSTRACT.
In this paper we study performance characteristics and parallelization
strategies for recently shipped, powerful multicore processors (IBM
Power6 and Sun T2 Plus) for high-end scientific computing. The central
aspect is data locality. First, we investigate the impact of good and
bad data locality by modifying data accesses. Next, we study the impact
of multithreading with respect to data locality based on the
data-parallel programming approach. The level of parallelism is
increased by assigning multiple threads to one core in order to hide
processor stalls caused by bad data locality. We measure the impact of
data locality and multithreading in terms of execution time and
bandwidth for a synthetic micro-benchmark, a matrix multiplication
kernel, and an application from bioinformatics. The results indicate
that substantial performance improvements can be obtained with minor
effort by utilizing multithreading.

[12:10-14:00] LUNCH BREAK
------------------------------------------------------------------------
[14:00-15:30] SESSION MUCOCOS-S2
Session Chair: Fatos Xhafa, Technical University of Catalonia, Spain
------------------------------------------------------------------------
[14:00-14:30] PaSTeL: Parallel Runtime and Algorithms for Small
Datasets.
Brice Videau, Erik Saule, Jean-François Méhaut

ABSTRACT.
In this paper, we put forward PaSTeL [1], an engine dedicated to
parallel algorithms. PaSTeL offers both a programming model for
building parallel algorithms and an execution model based on
work-stealing. Special care has been taken to use optimized thread
activation and synchronization mechanisms. To illustrate the use of
PaSTeL, a subset of the STL’s algorithms was implemented; these
algorithms were also used in the performance experiments. PaSTeL’s
performance is evaluated on a laptop computer using two cores and on a
16-core platform. PaSTeL shows better performance than other
implementations of the STL, especially on small datasets.

[14:30-15:00] Introducing hardware TLP support in the Cell processor.
Roberto Giorgi, Zdravko Popovic, Nikola Puzovic

ABSTRACT.
The focus of our study is the support for fine/medium grained Thread
Level Parallelism (TLP) by using a hardware scheduling unit and relying
on existing simple cores. Simple cores are grouped into clusters in
order to provide a scalable solution. As a proof of concept we use an
implementation based on the Cell Broadband Engine (CBE). Cell is a
multiprocessor on a chip developed by Sony, Toshiba and IBM that
contains one general purpose core and eight coprocessor elements that
accelerate multimedia and vector processing. The aim of this paper is
to present a possible implementation of DTA that is based on the Cell
processor while keeping the scalability of the original DTA
infrastructure.

[15:00-15:30] A Multipurpose Clustering Algorithm for Task Partitioning
in Multicore Embedded Reconfigurable Systems.
S. Arash Ostadzadeh, Roel J. Meeuws, Kamana Sidgel, Koen Bertels

ABSTRACT.
Recently, multicore systems have become a dominant architecture, and
this demands addressing new challenges in order to take full advantage
of their efficiency. Reconfigurable computing has also received a great
deal of attention due to its ability to increase the performance of an
application through hardware execution, while possessing the
flexibility of a software solution. Grouping tasks within an
application contributes to coarse-grained partitioning, which can
eventually improve the performance of the system. In this paper, we
introduce a clustering framework along with a flexible multipurpose
clustering algorithm which initiates task clustering at the functional
level based on dynamic profiling information. The clustering framework
can be used as the basic step to modify the granularity of tasks in the
hardware/software partitioning and scheduling phases.
As a result, an elaborate mapping onto the system resources and
possibly a higher degree of task parallelism can be obtained. Here the
framework particularly addresses two primary objectives: forming
workload-balanced and loosely-coupled clusters. To evaluate its
efficiency, we used an MJPEG application as a case study. The
experimental results comply with the desired clustering metrics, which
were defined through the objectives.

[15:30-16:00] COFFEE BREAK
------------------------------------------------------------------------
[16:00-17:30] SESSION MUCOCOS-S3
Session Chair: Eduard Mehofer, University of Vienna, Austria
------------------------------------------------------------------------
[16:00-16:30] Optimistic Parallel Discrete Event Simulation Based on
Multi-core Platform and its Performance Analysis.
Nianle Su, Hongtao Hou, Feng Yang, Qun Li, Weiping Wang

ABSTRACT.
Computer processor development has entered the multi-core era,
providing a good opportunity to spread parallel discrete event
simulation. The parallel programming model and the synchronization
problem that arise when parallelizing discrete event simulation on a
multi-core platform are discussed. A parallel discrete event simulator
based on a multi-core platform was designed and implemented using the
optimistic synchronization algorithm. On an HP multi-core server with
up to 8 cores, both the overheads of the parallel simulator and the
effects of event granularity, process number, lookahead, and event
locality on the simulation performance were tested using the Phold
model. The experimental results show that optimistic parallel discrete
event simulation based on a multi-core platform could achieve good
speedup for simulation applications with coarse-grained events.

[16:30-17:00] Efficient use of processing cores on heterogeneous
multicore architecture.
Fabien Calcado, Stephane Louise, Vincent David, Alain Mérigot

ABSTRACT.
One of the major challenges of multicore architectures is not only to
aim toward high performance but also to efficiently harness the
computing power of these systems. This is especially true for embedded
systems where problems of energy and silicon efficiency are critical.
Multicore architectures provide significant gains for explicitly
multi-threaded or dataflow applications. However, single-task
applications commonly found in embedded systems do not fit well with
today's multicore architectures. To maximize performance and efficiency
of the chip, communication and allocation-synchronization problems need
to be addressed in concert with a coherent and carefully crafted
approach to the programming interface. This paper focuses on
allocation-synchronization problems and their programming interface on
these architectures. The proposed mechanism is based on an intermediate
level of parallelism and provides a solution to allocate and
synchronize processing cores with an easy-to-use instruction set
architecture. It avoids global synchronization of cores when
interruptions or exceptions occur on the main processor. This increases
core utilization among all applications executed on the chip and thus
chip efficiency. A preliminary evaluation has shown significant
improvements in terms of performance, energy and silicon efficiency of
the chip.

[17:00-17:30] Designing Regular Network-on-Chip Topologies under
Technology, Architecture and Software Constraints.
Francisco Gilabert, Daniele Ludovici, Simone Medardoni, Davide Bertozzi,
Luca Benini, Georgi Gaydadjiev

ABSTRACT.
Regular multi-core processors are appearing in the embedded system
market as high performance software programmable solutions. The use of
regular interconnect fabrics for them allows fast design time, ease of
routing, predictability of electrical parameters and good scalability.
k-ary n-mesh topologies are candidate solutions for these systems,
borrowed from the domain of off-chip interconnection networks. However,
the on-chip integration has to deal with unique challenges at different
levels of abstraction. From a technology viewpoint, interconnect
reverse scaling causes critical paths to go across global links. Poor
interconnect performance might also impact IP core speed depending on
the synchronization mechanism at the interface. Finally, this might
also conflict with the requirements that communication libraries
employed in the MPSoC domain pose on the underlying interconnect
fabric. This paper provides a comprehensive overview of these topics,
by characterizing physical feasibility of representative k-ary n-mesh
topologies and by providing silicon-aware system-level performance
figures.

========================================================================
Workshop Proceedings are Published by the IEEE Computer Society Press
========================================================================