Publications
PEPPHER-related publications of the Consortium in the year 2010:
-
Christoph Kessler, Welf Löwe. Optimized composition of performance-aware parallel components. Proc. CPC-2010 15th Int. Workshop on Compilers for Parallel Computers, Vienna, Austria, July 2010.
Abstract:
We describe the principles of a novel framework for performance-aware composition of sequential and explicitly parallel software components with implementation variants. Automatic composition results in a table-driven implementation that, for each parallel call of a performance-aware component, looks up the expected best implementation variant, processor allocation and schedule given the current problem and processor group sizes. The dispatch tables are computed off-line at component deployment time by an interleaved dynamic programming algorithm from time-prediction metacode provided by the component supplier. -
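As an illustration of the table-driven dispatch this abstract describes, the sketch below (Python for brevity; the variant names, size classes, and table contents are invented, not taken from the paper) looks up the expected best implementation variant for the current problem size and processor-group size:

```python
# Hypothetical sketch of table-driven variant dispatch: a table maps
# (problem-size class, processor-group size) to the predicted-best
# implementation variant; each performance-aware call is a table lookup.

def insertion_sort(xs, procs): return sorted(xs)   # stand-in variants
def quicksort(xs, procs): return sorted(xs)
def mergesort(xs, procs): return sorted(xs)

# Dispatch table, filled in by hand here; in the framework it is
# computed off-line at deployment time.
DISPATCH = {
    ("small", 1): insertion_sort,
    ("large", 1): quicksort,
    ("large", 4): mergesort,
}

def size_class(n):
    return "small" if n <= 32 else "large"

def sort_component(xs, procs):
    """Performance-aware call: look up the expected best variant."""
    variant = DISPATCH[(size_class(len(xs)), procs)]
    return variant(xs, procs)
```

In the actual framework the table entries come from the interleaved dynamic programming algorithm run over supplier-provided time-prediction metacode; here they are hand-filled only to show the call path.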
Daniel Cederman and Philippas Tsigas. Supporting lock-free composition of concurrent data objects. In the Proceedings of the 2010 ACM International Conference on Computing Frontiers (CF 2010).
Abstract:
Lock-free data objects offer several advantages over their blocking counterparts, such as being immune to deadlocks and convoying and, more importantly, being highly concurrent. However, composing the operations they provide into larger atomic operations, while still guaranteeing efficiency and lock-freedom, is a challenging algorithmic task.
We present a lock-free methodology for composing highly concurrent linearizable objects together by unifying their linearization points. This makes it possible to relatively easily introduce atomic lock-free move operations to a wide range of concurrent objects. Experimental evaluation has shown that the operations originally supported by the data objects keep their performance behavior under our methodology. -
Daniel Cederman, Philippas Tsigas, Muhammad Tayyab Chaudhry. Towards a Software Transactional Memory for Graphics Processors. In the Proceedings of the 10th Eurographics Symposium on Parallel Graphics and Visualization (EGPGV 2010).
Abstract:
The introduction of general purpose computing on many-core graphics processor systems, and the general shift in the industry towards parallelism, has created a demand for ease of parallelization. Software transactional memory (STM) simplifies development of concurrent code by allowing the programmer to mark sections of code to be executed concurrently and atomically in an optimistic manner. In contrast to locks, STMs are easy to compose and do not suffer from deadlocks. We have designed and implemented two STMs for graphics processors, one blocking and one non-blocking. The design issues involved in designing these two STMs are described and explained in the paper, together with experimental results comparing the performance of the two STMs. -
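The optimistic read/validate/commit cycle that any STM performs can be sketched as follows. This is a deliberately simplified, sequential toy model (all class and method names are invented), not the blocking or non-blocking GPU designs from the paper:

```python
# Minimal sketch of an optimistic STM: reads are versioned, writes are
# buffered in the transaction, and commit validates that nothing read
# has changed in the meantime. Sequential toy model only.

class TVar:
    """A transactional variable with a version counter."""
    def __init__(self, value):
        self.value, self.version = value, 0

class Transaction:
    def __init__(self):
        self.read_set = {}    # TVar -> version observed at read time
        self.write_set = {}   # TVar -> buffered new value

    def read(self, tv):
        if tv in self.write_set:          # read-your-own-writes
            return self.write_set[tv]
        self.read_set[tv] = tv.version
        return tv.value

    def write(self, tv, value):
        self.write_set[tv] = value        # buffered until commit

    def commit(self):
        # Validate: abort if any TVar we read was updated since.
        for tv, seen in self.read_set.items():
            if tv.version != seen:
                return False
        for tv, value in self.write_set.items():
            tv.value, tv.version = value, tv.version + 1
        return True
```

A transaction that read a location aborts at commit time if that location's version has changed, which is what makes the execution optimistic.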
Anders Gidenstam, Håkan Sundell and Philippas Tsigas. Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency. In the Proceedings of the 14th International Conference on Principle of Distributed Systems (OPODIS 2010), Lecture Notes in Computer Science Vol.: 6490, Springer-Verlag 2010.
Abstract:
A lock-free FIFO queue data structure is presented in this paper. The algorithm supports multiple producers and multiple consumers and weak memory models. It has been designed to be cache-aware and work directly on weak memory models. It utilizes the cache behavior in concert with lazy updates of shared data, and a dynamic lock-free memory management scheme to decrease unnecessary synchronization and increase performance. Experiments on an 8-way multi-core platform show significantly better performance for the new algorithm compared to previous fast lock-free algorithms. -
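The retry-loop structure typical of such lock-free queues can be sketched in the style of the classic Michael-Scott algorithm, which this line of work builds on. The sketch below simulates compare-and-swap sequentially, so it only illustrates the control flow, not the cache-awareness, weak-memory handling, or lock-free memory management that are the paper's contributions:

```python
# Conceptual CAS-based FIFO queue sketch. 'cas' simulates a single-word
# compare-and-swap, so this toy runs sequentially.

class Node:
    def __init__(self, value=None):
        self.value, self.next = value, None

def cas(obj, field, expected, new):
    """Simulated compare-and-swap on an attribute."""
    if getattr(obj, field) is expected:
        setattr(obj, field, new)
        return True
    return False

class Queue:
    def __init__(self):
        dummy = Node()                       # sentinel node
        self.head = self.tail = dummy

    def enqueue(self, value):
        node = Node(value)
        while True:
            tail = self.tail
            if cas(tail, "next", None, node):    # link the new node
                cas(self, "tail", tail, node)    # swing tail (may fail)
                return
            cas(self, "tail", tail, tail.next)   # help a lagging tail

    def dequeue(self):
        while True:
            head = self.head
            nxt = head.next
            if nxt is None:
                return None                      # queue empty
            if cas(self, "head", head, nxt):
                return nxt.value
```

The helping step in `enqueue` is what gives lock-freedom: a thread whose CAS fails advances the shared tail on behalf of the thread that linked the node.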
Daniel Cederman and Philippas Tsigas. Supporting Lock-Free Composition of Concurrent Data Objects. In the Proceedings of the 3rd Swedish Workshop on Multi-Core Computing (MCC 2010).
Abstract:
We present a lock-free methodology for composing highly concurrent linearizable objects together by unifying their linearization points. This makes it possible to relatively easily introduce atomic lock-free move operations to a wide range of concurrent objects. Experimental evaluation has shown that the operations originally supported by the data objects keep their performance behavior under our methodology. -
Rafia Inam, Daniel Cederman and Philippas Tsigas. A* Algorithm for Graphics Processors. In the Proceedings of the 3rd Swedish Workshop on Multi-Core Computing (MCC 2010).
Abstract:
Today's computer games have thousands of agents moving at the same time in areas inhabited by a large number of obstacles. In such an environment it is important to be able to calculate multiple shortest paths concurrently in an efficient manner. The highly parallel nature of the graphics processor suits this scenario perfectly. We have implemented a graphics processor based version of the A* path finding algorithm together with three algorithmic improvements that allow it to work faster and on bigger maps. The first makes use of pre-calculated paths for commonly used routes. The second uses multiple threads that work concurrently on the same path. The third improvement makes use of a scheme that hierarchically breaks down large search spaces. In the latter the algorithm first calculates the path on a high-level abstraction of the map, lowering the number of nodes that need to be visited. This algorithmic technique makes it possible to calculate more paths concurrently on large map settings compared to what was possible using the standard A* algorithm. Experimental results comparing the efficiency of the algorithmic techniques on an NVIDIA GeForce GTX 260 with 24 multi-processors are also presented in the paper. -
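The core search that each GPU thread would run is ordinary A*; a minimal sequential version with a Manhattan-distance heuristic is sketched below (the pre-computed paths, intra-path threading, and hierarchical map abstraction from the paper are not shown):

```python
# Minimal grid A*: returns the shortest path length between two cells,
# or None if the goal is unreachable. Sequential illustration only.
import heapq

def astar(blocked, start, goal, width, height):
    """blocked: set of (x, y) obstacle cells on a width x height grid."""
    def h(p):  # admissible Manhattan-distance heuristic
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_heap = [(h(start), 0, start)]
    best_g = {start: 0}
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node == goal:
            return g
        if g > best_g[node]:
            continue  # stale heap entry
        x, y = node
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nb[0] < width and 0 <= nb[1] < height):
                continue  # off the map
            if nb in blocked:
                continue  # obstacle
            if g + 1 < best_g.get(nb, float("inf")):
                best_g[nb] = g + 1
                heapq.heappush(open_heap, (g + 1 + h(nb), g + 1, nb))
    return None
```

The hierarchical improvement described in the abstract would run this same search first on a coarsened map and then refine, visiting far fewer nodes per query.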
Anders Gidenstam, Håkan Sundell and Philippas Tsigas. Efficient Lock-Free Queues that Mind the Cache. In the Proceedings of the 3rd Swedish Workshop on Multi-Core Computing (MCC 2010).
Abstract:
This paper discusses a lock-free FIFO queue data structure that is presented in [3]. The algorithm supports multiple producers and multiple consumers and weak memory models. It has been designed to be cache-aware and work directly on weak memory models. It utilizes the cache behavior in concert with lazy updates of shared data, and a dynamic lock-free memory management scheme to decrease unnecessary synchronization and increase performance. Experiments on an 8-way multi-core platform show significantly better performance for the algorithm compared to previous fast lock-free queue algorithms. -
Pete Cooper, Uwe Dolinsky, Alastair F. Donaldson, Andrew Richards, Colin Riley, George Russell. Offload - Automating Code Migration to Heterogeneous Multicore Systems. In Proceedings of the 5th International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC'10), Lecture Notes in Computer Science 5952, pages 337-352. Springer, 2010.
Abstract:
We present Offload, a programming model for offloading parts of a C++ application to run on accelerator cores in a heterogeneous multicore system. Code to be offloaded is enclosed in an offload scope; all functions called indirectly from an offload scope are compiled for the accelerator cores. Data defined inside/outside an offload scope resides in accelerator/host memory respectively, and code to move data between memory spaces is generated automatically by the compiler. This is achieved by distinguishing between host and accelerator pointers at the type level, and compiling multiple versions of functions based on pointer parameter configurations using automatic call-graph duplication. We discuss solutions to several challenging issues related to call-graph duplication, and present an implementation of Offload for the Cell BE processor, evaluated using a number of benchmarks. -
Alastair F. Donaldson, Uwe Dolinsky, Andrew Richards, George Russell. Automatic Offloading of C++ for the Cell BE Processor: a Case Study Using Offload. In Proceedings of the 2010 International Workshop on Multi-Core Computing Systems (MuCoCoS'10), pages 901-906. IEEE Computer Society, 2010.
Abstract:
Offload C++ is an extended version of the C++ language, together with a compiler and runtime system, for automatically offloading general-purpose C++ code to run on the Synergistic Processor Elements (SPEs) of the Cell Broadband Engine (BE) processor. We introduce Offload C++ by presenting a case study using the approach to offload parts of an image processing application. The case study introduces the language extensions; illustrates the core technology on which the technique is based: automatic call-graph duplication, and automatic generation of data-movement code; shows how parallelism can be achieved by offloading work to multiple SPEs simultaneously, while the Power Processor Element (PPE) core simultaneously performs additional work; and demonstrates our solutions to dealing with complex language features such as function pointers and multiple compilation units. -
Cédric Augonnet, Jérôme Clet-Ortega, Samuel Thibault, Raymond Namyst. Data-Aware Task Scheduling on Multi-Accelerator based Platforms, International Conference on Parallel and Distributed Systems 2010.
Abstract:
To fully tap into the potential of heterogeneous machines composed of multicore processors and multiple accelerators, simple offloading approaches in which the main trunk of the application runs on regular cores while only specific parts are offloaded on accelerators are not sufficient. The real challenge is to build systems where the application would permanently spread across the entire machine, that is, where parallel tasks would be dynamically scheduled over the full set of available processing units. To face this challenge, we previously proposed StarPU, a runtime system capable of scheduling tasks over multicore machines equipped with GPU accelerators. StarPU uses a software virtual shared memory (VSM) that provides a high-level programming interface and automates data transfers between processing units so as to enable a dynamic scheduling of tasks. We now present how we have extended StarPU to minimize the cost of transfers between processing units in order to efficiently cope with multi-GPU hardware configurations. To this end, our runtime system implements data prefetching based on asynchronous data transfers, and uses data transfer cost prediction to influence the decisions taken by the task scheduler. We demonstrate the relevance of our approach by benchmarking two parallel numerical algorithms using our runtime system. We obtain significant speedups and high efficiency over multicore machines equipped with multiple accelerators. We also evaluate the behaviour of these applications over clusters featuring multiple GPUs per node, showing how our runtime system can combine with MPI. -
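The scheduling heuristic outlined above, minimizing predicted execution time plus predicted data-transfer time, can be sketched as follows. The function, the unit names, and the cost numbers are illustrative only; this is not StarPU's API:

```python
# Data-aware scheduling sketch: place a task on the processing unit
# minimizing predicted execution time plus the transfer cost of every
# input operand not already resident on that unit.

def schedule(task, units, location):
    """units: {unit: {kernel: predicted exec time}};
    location: {data name: unit currently holding a valid copy};
    task["inputs"]: {data name: predicted transfer cost}."""
    def cost(u):
        transfer = sum(t for d, t in task["inputs"].items()
                       if location.get(d) != u)
        return units[u][task["kernel"]] + transfer
    return min(units, key=cost)

units = {"cpu":  {"gemm": 10.0},
         "gpu0": {"gemm": 2.0},
         "gpu1": {"gemm": 2.0}}
task = {"kernel": "gemm", "inputs": {"A": 3.0, "B": 3.0}}
```

With both inputs cached on gpu1, the scheduler prefers gpu1 even though gpu0 is equally fast, because the transfer term vanishes; that is the multi-GPU locality effect the paper exploits.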
Rikard Hulten, Christoph W. Kessler, Jörg Keller: Optimized on-chip-pipelined mergesort on the Cell/B.E. Proc. Euro-Par 2010, August 2010, Springer LNCS.
Abstract:
Limited bandwidth to off-chip main memory is a performance bottleneck in chip multiprocessors for streaming computations, such as Cell/B.E., and this will become even more problematic with an increasing number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important program optimization. In earlier work, we have proposed such a transformation technique: on-chip pipelining. On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network, in order to reduce the volume of main memory accesses, and thereby improves the throughput for memory-intensive computations. At the same time, throughput is also constrained by the limited amount of on-chip memory available for buffering forwarded data. By optimizing the mapping of tasks to cores, balancing a trade-off between load balancing, buffer memory consumption, and communication load on the on-chip bus, a larger buffer size can be applied, resulting in less DMA communication and scheduling overhead. In this paper, we consider parallel mergesort on Cell/B.E. as a representative memory-intensive application in detail, and focus on the global merging phase, which is dominating the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique for the Cell processor, describe our implementation, evaluate experimentally the influence of buffer sizes and mapping optimizations, and show that optimized on-chip pipelining indeed speeds up merging, for realistic problem sizes, by up to 61% on a QS-20 and up to 143% on a PS3, compared to the merge phase of CellSort, previously the fastest merge sort implementation on Cell. -
Johan Enmyren, Christoph W. Kessler: SkePU: A Multi-Backend Skeleton Programming Library for Multi-GPU Systems. Proc. HLPP-2010, September 2010, Baltimore, USA. ACM.
Abstract:
We present SkePU, a C++ template library which provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP backend. It also supports multi-GPU systems. Copying data between the host and the GPU device memory can be a performance bottleneck. A key technique in SkePU is the implementation of lazy memory copying in the container type used to represent skeleton operands, which makes it possible to avoid unnecessary memory transfers. We evaluate SkePU with small benchmarks and a larger application, a Runge-Kutta ODE solver. The results show that a skeleton approach to GPU programming does not necessarily lose that much in performance compared to a specialized library. It also shows that utilizing several GPUs can, even for simple tasks, give some performance boost for sufficiently large inputs. We see that SkePU offers good performance with a more complex and realistic task such as ODE solving, with up to 10 times faster run times when using SkePU with a GPU backend compared to a sequential solver running on a fast CPU. -
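The lazy memory copying mentioned in the abstract can be sketched with a container that tracks where the valid copy of its data lives and transfers only on demand. This is a Python simulation with a plain list standing in for device memory; none of these names are SkePU's actual interface:

```python
# Lazy-copying operand container sketch: host data is "uploaded" only
# when a device computation first needs it, and "downloaded" only when
# the host reads results. Transfers are counted to show the saving.

class Vector:
    def __init__(self, data):
        self._host = list(data)
        self._device = None
        self._host_valid = True
        self._device_valid = False
        self.transfers = 0            # simulated host<->device copies

    def device_data(self):
        """Called by a device-side skeleton; uploads lazily."""
        if not self._device_valid:
            self._device = list(self._host)   # simulated H->D copy
            self.transfers += 1
            self._device_valid = True
        return self._device

    def device_wrote(self):
        self._host_valid = False      # host copy is now stale

    def host_data(self):
        """Called by host code; downloads lazily."""
        if not self._host_valid:
            self._host = list(self._device)   # simulated D->H copy
            self.transfers += 1
            self._host_valid = True
        return self._host

def device_map(f, vec):
    """Stand-in for a GPU map skeleton operating on the device copy."""
    buf = vec.device_data()
    for i, x in enumerate(buf):
        buf[i] = f(x)
    vec.device_wrote()
```

Two consecutive device-side calls trigger only one host-to-device transfer; the copy back happens only when the host actually reads the result.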
Johan Enmyren, Usman Dastgeer, Christoph W. Kessler: Towards a Tunable Multi-Backend Skeleton Programming Framework for Multi-GPU Systems. Proc. MCC-2010, November 2010, Gothenburg, Sweden.
Abstract:
SkePU is a C++ template library that provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP backend. It also supports multi-GPU systems. Currently available skeletons in SkePU include map, reduce, mapreduce, map-with-overlap, maparray, and scan. The performance of SkePU generated code is comparable to that of hand-written code, even for more complex applications such as ODE solving. In this paper, we describe how to make SkePU tunable, by adding the mechanism of execution plans that can configure a skeleton so that, at run time, the predicted best suitable resource and platform is chosen automatically, depending on operand data sizes. We also discuss how the approach can be extended to provide a fully auto-tunable skeleton programming system, which is a work in progress. -
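An execution plan as described here is essentially a mapping from operand-size intervals to the predicted-best backend and its parameters. A minimal sketch follows; the thresholds, backend names, and parameters are invented for illustration:

```python
# Execution-plan sketch: tuning produces size thresholds; each skeleton
# call dispatches through the plan to pick a backend for its operand size.
import bisect

class ExecutionPlan:
    def __init__(self, thresholds, choices):
        # sizes <= thresholds[0] -> choices[0], etc.;
        # the last choice is open-ended.
        self.thresholds, self.choices = thresholds, choices

    def backend_for(self, size):
        return self.choices[bisect.bisect_right(self.thresholds, size)]

plan = ExecutionPlan(
    thresholds=[1_000, 100_000],
    choices=[("cpu", {}),
             ("openmp", {"threads": 8}),
             ("cuda", {"blocksize": 256})])
```

At run time the skeleton consults `backend_for(len(operand))` and launches the chosen backend with the stored parameters; tuning only ever rewrites the plan, not the skeleton code.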
Emmanuel Agullo, Cédric Augonnet, Jack Dongarra, Hatem Ltaief, Raymond Namyst, Jean Roman, Samuel Thibault, Stanimire Tomov. Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators. Symposium on Application Accelerators in High Performance Computing (SAAHPC), 2010.
Abstract:
Although the hardware has dramatically changed in the last few years, nodes of multicore chips augmented by Graphics Processing Units (GPUs) seem to be a trend of major importance. Previous approaches for scheduling dense linear operations on such a complex node led to high performance but at the double cost of not using the potential of all the cores and producing a static and non-generic code. In this extended abstract, we present a new approach for scheduling dense linear algebra operations on multicore architectures with GPU accelerators using a dynamic scheduler capable of using the full potential of the node [1]. We underline the benefits both in terms of programmability and performance. We illustrate our approach with a Cholesky factorization relying on cutting edge GPU and CPU kernels [2], [3] achieving roughly 900 Gflop/s on an eight-core node accelerated with three NVIDIA Tesla GPUs.
Cédric Augonnet, Samuel Thibault, Raymond Namyst, Pierre-André Wacrenier. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience 2010.
Abstract:
In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly-optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way. -
Sabri Pllana et al. "European Multicore Processing Projects," IEEE Micro, vol. 30, no. 5, pp. 98-101, Sep./Oct. 2010.
Abstract:
Not Available.
Public PEPPHER deliverables in the year 2010:
-
D1.1 PEPPHER Component Model.
-
D3.1 Auto-tuned Sorting for GPUs and Lock-Free Composition of Data Objects.
-
D7.1 The PEPPHER Website.
-
D7.2 Dissemination activities and results in first project year.
PEPPHER-related publications of the Consortium in the year 2011:
-
Usman Dastgeer, Johan Enmyren, Christoph Kessler. Auto-tuning SkePU: A Multi-Backend Skeleton Programming Framework for Multi-GPU Systems. Proc. IWMSE-2011, Hawaii, USA, May 2011, ACM.
Abstract:
SkePU is a C++ template library that provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP backend. It also supports multi-GPU systems. Currently available skeletons in SkePU include map, reduce, mapreduce, map-with-overlap, maparray, and scan. The performance of SkePU generated code is comparable to that of hand-written code, even for more complex applications such as ODE solving. In this paper, we discuss initial results from auto-tuning SkePU using an off-line, machine learning approach where we adapt skeletons to a given platform using training data. The prediction mechanism at execution time uses off-line pre-calculated estimates to construct an execution plan for any desired configuration with minimal overhead. The prediction mechanism accurately predicts execution time for repetitive executions and includes a mechanism to predict execution time for user functions of different complexity. The tuning framework covers selection between different backends as well as choosing optimal parameter values for the selected backend. We will discuss our approach and initial results obtained for different skeletons (map, mapreduce, reduce). -
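The off-line training step can be sketched as fitting a simple cost model per backend from a few measured (size, time) pairs and then selecting the backend with the lowest predicted time. The linear model form and all numbers below are invented for illustration and are far simpler than the machine-learning approach in the paper:

```python
# Off-line tuning sketch: fit time ~= a + b*size per backend from
# training measurements, then predict the best backend for unseen sizes.

def fit_linear(samples):
    """Least-squares fit of time = a + b*size over (size, time) pairs."""
    n = len(samples)
    sx = sum(s for s, _ in samples)
    st = sum(t for _, t in samples)
    sxx = sum(s * s for s, _ in samples)
    sxt = sum(s * t for s, t in samples)
    b = (n * sxt - sx * st) / (n * sxx - sx * sx)
    a = (st - b * sx) / n
    return a, b

def best_backend(models, size):
    """Pick the backend with the lowest predicted time for this size."""
    return min(models, key=lambda m: models[m][0] + models[m][1] * size)

training = {  # invented measurements: (size, time) pairs per backend
    "cpu":  [(1_000, 1.0), (100_000, 100.0)],   # cheap start, poor scaling
    "cuda": [(1_000, 5.0), (100_000, 15.0)],    # launch overhead, fast
}
models = {be: fit_linear(ts) for be, ts in training.items()}
```

The crossover between the two fitted lines captures the typical situation where accelerator launch overhead dominates for small inputs, while the accelerator wins for large ones.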
Antonina Danylenko, Welf Löwe, Christoph Kessler. Comparing Machine Learning Approaches for Context-Aware Composition. In Proc. 10th Int. Conf. on Software Composition, 30 June – 1 July, 2011, Zürich, Switzerland.
Abstract:
Context-Aware Composition makes it possible to automatically select optimal variants of algorithms, data-structures, and schedules at runtime using generalized dynamic Dispatch Tables. These tables grow exponentially with the number of significant context attributes. To make Context-Aware Composition scale, we suggest four alternatives to Dispatch Tables, all well-known in the field of machine learning: Decision Trees, Decision Diagrams, Naive Bayes and Support Vector Machines classifiers. We assess their decision overhead and memory consumption theoretically and practically in a number of experiments on different hardware platforms. Decision Diagrams turn out to be more compact compared to Dispatch Tables, almost as accurate, and faster in decision making. Using Decision Diagrams in Context-Aware Composition leads to better scalability, i.e., Context-Aware Composition can be applied at more program points and regard more context attributes than before. -
Martin Wimmer and Jesper Larsson Träff. A work-stealing framework for mixed-mode parallel applications, MTAAP Workshop, IEEE International Parallel and Distributed Processing Symposium (IPDPS), Alaska, USA, May 2011.
Abstract:
Parallelizing complex applications even for well-behaved parallel systems often calls for different parallelization approaches within the same application. In this paper we discuss three applications from the literature that for both reasons of efficiency and expressive convenience benefit from a mixture of task and more tightly coupled data parallelism. These three applications, namely Quicksort, list ranking, and LU factorization with partial pivoting, are paradigms for recursive, mixed-mode parallel algorithms that can neither easily nor efficiently be expressed in either a purely data-parallel or a purely task-parallel fashion. As a solution we present a shared-memory programming framework that allows tasks to dynamically spawn subtasks with a given degree of parallelism for implementing tightly coupled parallel parts of the algorithm. All three paradigmatic applications can naturally be expressed in this framework, which in turn can be supported by an extended, non-conventional work-stealing scheduler, which we also briefly sketch. Using our new algorithm for work-stealing with deterministic team-building we are able to show, beyond the improved, more natural implementability, in many cases better scalability and sometimes absolute performance than with less natural implementations based on pure task-parallelism executed with conventional work-stealing. Detailed performance results using an Intel 32-core system substantiate our claims. -
Martin Wimmer and Jesper Larsson Träff. Work-stealing for mixed-mode parallelism by deterministic team-building, ACM Symposium on Parallelism in Algorithms and Architectures (SPAA); San Jose, California, June 2011.
Abstract:
We show how to extend classical work-stealing to deal with tightly coupled data parallel tasks that can require any number of threads r >= 1 for their execution, and term this extension work-stealing with deterministic team-building. As threads become idle, they attempt either to join a team of threads designated for a task requiring r > 1 threads for its execution, or to steal a task; no central coordination is required. Team building and stealing are done according to a deterministic hierarchy and involve at most a logarithmic number of possibly randomized steal attempts. Threads attempting to join the team for a task requiring a large number of threads may help smaller teams while waiting for the large team to form. Once a team has been formed the threads can in close coordination execute the data parallel task. Implementation can be done with standard lock-free data structures, and takes only a single extra compare-and-swap (CAS) operation per thread to build a team. In the degenerate case where all tasks require only a single thread, the implementation coincides with a locality aware work-stealing implementation. Using a prototype C++ implementation of our extended work-stealing algorithm, a mixed-mode parallel Quicksort algorithm with a data parallel partitioning step has been implemented. We compare our (improved) implementation of this algorithm on top of our extended work-stealing scheduler to a standard task-parallel implementation with this scheduler, and with Intel Cilk Plus and Threading Building Blocks. In addition, we also compare to the optimized parallel MCSTL Quicksort. Results are shown for a 32-core Intel Nehalem EX system and a 16-core Sun T2+ system supporting up to 128 hardware threads. The mixed-mode parallel algorithm performs consistently better than the fork-join implementation, often significantly. -
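Team formation itself can be illustrated with a highly simplified, sequential simulation: idle workers claim membership in a team for a task that needs r > 1 threads, and the task executes once the r-th member arrives. In the real scheduler each join is a single CAS along a deterministic hierarchy; all names below are invented:

```python
# Sequential toy simulation of team-building for a data-parallel task
# that requires r threads. A list append stands in for the lock-free
# CAS-based join of the actual scheduler.

class TeamTask:
    def __init__(self, name, r):
        self.name, self.r = name, r
        self.members = []
        self.ran_with = None      # set once the full team executes

    def try_join(self, worker):
        """Returns True for the worker whose join completes the team."""
        self.members.append(worker)
        if len(self.members) == self.r:
            self.ran_with = list(self.members)   # run the task
            return True
        return False

def run_workers(workers, task):
    """Idle workers attempt to join in turn until the team is complete."""
    for w in workers:
        if task.try_join(w):
            return w             # the team-completing worker
    return None

task = TeamTask("partition", r=3)
```

With r = 1 every "team" completes immediately on the first join, which mirrors the paper's observation that the scheme degenerates to ordinary (locality-aware) work-stealing.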
George Russell , Colin Riley, Alastair Donaldson, Alexander S. van Amesfoort, Neil Henning, Uwe Dolinsky, Andrew Richards. The Impact of Diverse Memory Architectures on Multicore Consumer Software, MSPC 2011, San Jose, California, June 2011.
Abstract:
Memory architectures need to adapt in order for performance and scalability to be achieved in software for multicore systems. In this paper, we discuss the impact of techniques for scalable memory architectures, especially the use of multiple, non-cache-coherent memory spaces, on the implementation and performance of consumer software. Primarily, we report extensive real-world experience in this area gained by Codeplay Software Ltd., a software tools company working in the area of compilers for video games and GPU software. We discuss the solutions we use to handle variations in memory architecture in consumer software, and the impact such variations have on software development effort and, consequently, development cost. This paper introduces preliminary findings regarding impact on software, in advance of a larger-scale analysis planned over the next few years. The techniques discussed have been employed successfully in the development and optimisation of a shipping AAA cross-platform video game. -
M. Sandrieser, S. Benkner and S. Pllana. Improving Programmability of Heterogeneous Many-Core Systems via Explicit Platform Descriptions. Fourth International Workshop on Multicore Software Engineering (IWMSE11), co-located with ICSE 2011. Hawaii, USA, May 21, 2011. Copyright © ACM.
Abstract:
In this paper we present ongoing work towards a programming framework for heterogeneous hardware and software environments. Our framework aims at improving programmability and portability for heterogeneous many-core systems via a Platform Description Language (PDL) for expressing architectural patterns and platform information. We developed a prototypical code generator that takes as input an annotated serial task-based program and outputs, parametrized via PDL descriptors, code for a specific target heterogeneous computing system. By varying the target PDL descriptor, code for different target configurations can be generated without the need to modify the input program. We utilize a simple task-based programming model for demonstration of our approach and present preliminary results indicating its applicability on a state-of-the-art heterogeneous system. -
M. Sandrieser, S. Benkner and S. Pllana. Explicit Platform Descriptions for Heterogeneous Many-Core Architectures. 16th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS). Anchorage (Alaska) USA, May 20, 2011. IPDPS Workshops 2011. Copyright © IEEE.
Abstract:
Heterogeneous many-core architectures offer a way to cope with energy consumption limitations of various computing systems from small mobile devices to large data-centers. However, programmers typically must consider a large diversity of architectural information to develop efficient software. In this paper we present our ongoing work towards a Platform Description Language (PDL) that makes it possible to capture key architectural patterns of commonly used heterogeneous computing systems. PDL architecture patterns support programmers and toolchains by providing platform information in a well-defined and explicit manner. We have developed a source-to-source compiler that utilizes PDL descriptors to transform sequential task-based programs to a form that is convenient for execution on heterogeneous many-core computing systems. We show various usage scenarios of our PDL and demonstrate our approach for a commonly used scientific kernel. -
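The role of an explicit platform descriptor can be sketched as follows: the same annotated task set is combined with different descriptors to produce target-specific output (here, just a task-to-worker mapping). The descriptor schema below is invented for illustration and is not the actual PDL:

```python
# Platform-descriptor sketch: a declarative description of each target
# system drives a trivial "code generator" that maps tasks to workers.
# Swapping the descriptor retargets the same input program.

PLATFORMS = {
    "cell": {"master": "ppe",
             "workers": [f"spe{i}" for i in range(6)],
             "worker_memory_kb": 256},
    "gpu_node": {"master": "cpu",
                 "workers": ["gpu0", "gpu1"],
                 "worker_memory_kb": 1_048_576},
}

def generate(tasks, platform):
    """Round-robin the annotated tasks over the descriptor's workers."""
    workers = PLATFORMS[platform]["workers"]
    return {t: workers[i % len(workers)] for i, t in enumerate(tasks)}
```

The point of the descriptor is that `generate` never mentions a concrete machine: retargeting from the Cell-like system to the GPU node is purely a matter of selecting a different descriptor.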
Håkan Sundell, Anders Gidenstam, Marina Papatriantafilou and Philippas Tsigas. A Lock-Free Algorithm for Concurrent Bags. In the Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2011), ACM Press.
No abstract yet.
-
Nhan Nguyen and Philippas Tsigas. Progress Guarantees when Composing Lock-free Objects. In the Proceedings of the 17th International European Conference on Parallel and Distributed Computing (EUROPAR 2011), Lecture Notes in Computer Science, Springer-Verlag 2011.
No abstract yet.
-
Håkan Sundell and Philippas Tsigas. A Lock-Free Algorithm for Concurrent Bags. In the Proceedings of the 4th Swedish Workshop on Multi-Core Computing (MCC 2011).
No abstract yet.
-
Nhan Nguyen and Philippas Tsigas. Progress Guarantees when Composing Lock-free Objects. In the Proceedings of the 4th Swedish Workshop on Multi-Core Computing (MCC 2011).
No abstract yet.
-
Usman Dastgeer, Christoph W. Kessler and Samuel Thibault. Flexible runtime support for efficient skeleton programming on hybrid systems. Proc. ParCo-2011 Int. Conference on Parallel Computing, Ghent, Belgium, Sep. 2011.
No abstract yet.
-
Usman Dastgeer and Christoph W. Kessler. A performance-portable generic component for 2D convolution computations on GPU-based systems. Proc. 4th Swedish Workshop on Multicore Computing MCC-2011, Linköping, Sweden.
No abstract yet.
-
Akhtar Ali, Usman Dastgeer and Christoph W. Kessler. OpenCL on shared memory multicore CPUs. Proc. 4th Swedish Workshop on Multicore Computing MCC-2011, Linköping, Sweden.
No abstract yet.
Siegfried Benkner, Sabri Pllana, Jesper Larsson Träff, Philippas Tsigas, Uwe Dolinsky, Cédric Augonnet, Beverly Bachmayer, Christoph Kessler, David Moloney, Vitaly Osipov, "PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems," IEEE Micro, vol. 31, no. 5, pp. 28-41, Sep./Oct. 2011, doi:10.1109/MM.2011.67
Abstract:
PEPPHER, a three-year European FP7 project, addresses efficient utilization of hybrid (heterogeneous) computer systems consisting of multicore CPUs with GPU-type accelerators. This article outlines the PEPPHER performance-aware component model, performance prediction means, runtime system, and other aspects of the project. A larger example demonstrates performance portability with the PEPPHER approach across hybrid systems with one to four GPUs. -
Christoph W. Kessler. Programming Techniques for the Cell Processor. it - Information Technology 53(2): 66-75, Special issue on Multicore, April 2011, Oldenbourg-Verlag.
Abstract:
Cell Broadband Engine is a heterogeneous multicore processor designed mainly for applications in scientific computing, graphics, and gaming with high performance requirements. We give an overview of its architecture, review some selected development tools and programming frameworks, and describe techniques for writing efficient programs for Cell. -
Daniel Cederman and Philippas Tsigas. In GPU Computing Gems Jade Edition. Wen-Mei Hwu (Editor-in-Chief), Morgan Kaufmann, ISBN: 978-0-12-385963-1.
No abstract yet.
-
Christoph W. Kessler and Welf Löwe. Optimized composition of performance-aware parallel components. Concurrency and Computation: Practice and Experience, 2011. Published online in Wiley Online Library, DOI: 10.1002/cpe.1844, Sep. 2011.
No abstract yet.
Public PEPPHER deliverables in the year 2012:
-
D1.2 PEPPHER Coordination Language and Component Composition Techniques.
-
D1.3 Performance modeling and portability.
-
D2.2 Code Optimization API and Framework.
-
D3.2 Synchronization Library and Algorithmic Toolbox.
-
D4.2 Extensions and Adaptations of the PEPPHER Runtime System.
-
D7.3 Exploitation and Dissemination in Second Project Year.
PEPPHER-related publications of the Consortium in the year 2012:
-
Christoph Kessler, Usman Dastgeer, Samuel Thibault, Raymond Namyst, Andrew Richards, Uwe Dolinsky, Siegfried Benkner, Jesper Larsson Träff and Sabri Pllana. Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems. Proc. DATE-2012 Conference on Design Automation and Test in Europe, Dresden, Germany, March 2012. Copyright © 2012 EDAA, ISBN: 978-3-9810801-8-6.
Abstract:
We discuss three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems. Together, these approaches also support performance portability, as currently investigated in the EU FP7 project PEPPHER. In particular, we consider (1) a library-based approach, here represented by the integration of the SkePU C++ skeleton programming library with the StarPU runtime system for dynamic scheduling and dynamic selection of suitable execution units for parallel tasks; (2) a language-based approach, here represented by the Offload-C++ high-level language extensions and Offload compiler to generate platform-specific code; and (3) a component-based approach, specifically the PEPPHER component system for annotating user-level application components with performance metadata, thereby preparing them for performance-aware composition. We discuss the strengths and weaknesses of these approaches and show how they could complement each other in an integrated programming framework for heterogeneous multicore systems.
-
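The common mechanism behind several of these entries, choosing among implementation variants based on predicted cost for the current call context, can be sketched in a few lines. The code below is an invented toy, not PEPPHER, SkePU or StarPU code; the variants, cost models and dispatch function are all illustrative assumptions.

```python
# Toy sketch of performance-aware variant selection: each implementation
# variant carries a cost predictor (a function of problem size n); at call
# time, the variant with the lowest predicted cost is invoked.

def cpu_sort(data):          # sequential variant
    return sorted(data)

def gpu_sort(data):          # stand-in for an accelerator variant
    return sorted(data)      # same semantics, different (pretended) cost

VARIANTS = [
    # (implementation, hypothetical predicted cost for problem size n)
    (cpu_sort, lambda n: 5.0 * n),           # cheap per element, no overhead
    (gpu_sort, lambda n: 1000.0 + 0.5 * n),  # fixed offload cost + fast kernel
]

def dispatch(data):
    n = len(data)
    impl, _ = min(VARIANTS, key=lambda v: v[1](n))
    return impl(data)

# Small inputs stay on the CPU variant; large ones select the "GPU" variant.
assert dispatch([3, 1, 2]) == [1, 2, 3]
```

Under these made-up cost models the crossover happens around n = 222, so the same call site transparently switches variants as the problem grows.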
Siegfried Benkner, Enes Bajrovic, Erich Marth, Martin Sandrieser, Raymond Namyst and Samuel Thibault. High-Level Support for Pipeline Parallelism on Manycore Architectures. Euro-Par 2012, Rhodes Island, Greece, August 27--31, 2012. Published by Springer in the LNCS series.
Abstract:
With the increasing architectural diversity of many-core architectures, the challenges of parallel programming and code portability will sharply rise. The EU project PEPPHER addresses these issues with a component-based approach to application development on top of a task-parallel execution model. Central to this approach are multi-architectural components which encapsulate different implementation variants of application functionality tailored for different core types. An intelligent run-time system selects and dynamically schedules component implementation variants for efficient parallel execution on heterogeneous many-core architectures. On top of this model we have developed language, compiler and runtime support for a specific class of applications that can be expressed using the pipeline pattern. We propose C/C++ language annotations for specifying pipeline patterns and describe the associated compilation and runtime infrastructure. Experimental results indicate that with our high-level approach performance comparable to manual parallelization can be achieved.
-
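As a rough illustration of the pipeline pattern this paper targets, here is a minimal sketch of a multi-stage pipeline over bounded queues, with each stage in its own thread so successive items are processed concurrently. The stage functions and queue sizes are invented; the actual PEPPHER infrastructure generates such plumbing from C/C++ annotations rather than requiring it by hand.

```python
# Minimal pipeline-pattern sketch: stages connected by bounded queues,
# terminated by an end-of-stream sentinel.
import queue
import threading

DONE = object()  # end-of-stream sentinel

def stage(fn, inq, outq):
    while True:
        item = inq.get()
        if item is DONE:
            outq.put(DONE)   # propagate shutdown downstream
            return
        outq.put(fn(item))

def run_pipeline(items, fns):
    qs = [queue.Queue(maxsize=4) for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(fns)]
    for t in threads:
        t.start()
    for item in items:
        qs[0].put(item)
    qs[0].put(DONE)
    results = []
    while True:
        out = qs[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

# e.g. a decode -> process -> encode pipeline over a stream of items
out = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2, str])
assert out == ['2', '4', '6', '8', '10']
```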
Daniel Cederman, Bapi Chatterjee and Philippas Tsigas. Understanding the Performance of Concurrent Data Structures on Graphics Processors. Euro-Par 2012, Rhodes Island, Greece, August 27--31, 2012. Published by Springer in the LNCS series.
Abstract:
In this paper we revisit the design of concurrent data structures -- specifically queues -- and examine their performance portability with regard to the move from conventional CPUs to graphics processors. We have looked at both lock-based and lock-free algorithms and have for comparison implemented and optimized the same algorithms on both graphics processors and multi-core CPUs. Particular interest has been paid to study the difference between the old Tesla and the new Fermi architectures in this context. We provide a comprehensive evaluation and analysis of our implementations on all platforms. Our results indicate that the queues are in general performance portable, but that platform specific optimizations are possible to increase performance. The Fermi architecture, with optimized atomic operations, was shown to provide excellent scalability for both lock-based and lock-free queues.
-
Lu Li, Usman Dastgeer and Christoph Kessler. Adaptive off-line tuning for optimized composition of components for heterogeneous many-core systems. Proc. Seventh International Workshop on Automatic Performance Tuning (iWAPT-2012), held in conjunction with the VECPAR-2012 Conference, Kobe, Japan, 17 July 2012.
Abstract:
In recent years heterogeneous multi-core systems have been given much attention. However, performance optimization on these platforms remains a big challenge. Optimizations performed by compilers are often limited by the lack of dynamic information about the run-time environment, which often makes applications not performance portable. One current approach is to provide multiple implementations for the same interface that can be used interchangeably depending on the call context, and to expose the composition choices to a compiler, a deployment-time composition tool and/or a run-time system. Using off-line machine learning techniques makes it possible to improve the precision and reduce the runtime overhead of run-time composition, leading to improved performance portability. In this work we extend the run-time composition mechanism in the PEPPHER composition tool by off-line composition and present an adaptive machine learning algorithm for generating compact and efficient dispatch data structures with low training time. As dispatch data structure we propose an adaptive decision tree structure, with an adaptive training algorithm that allows us to control the trade-off between training time, dispatch precision and run-time dispatch overhead. We have evaluated our optimization strategy with simple kernels (matrix multiplication and sorting) as well as applications from the Rodinia benchmark suite on two GPU-based heterogeneous systems. On average, the precision for composition choices reaches 83.6 percent with approximately 34 minutes of off-line training time.
-
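The decision-tree dispatch structure described in this abstract can be illustrated with a toy sketch. The splitting rule, cost models and depth cap below are my own simplifications for illustration, not the paper's training algorithm: the problem-size range is subdivided until one variant wins over a whole subrange, which then becomes a leaf of the dispatch tree.

```python
# Toy offline construction of a dispatch decision tree over problem sizes.
def best(costs, n):
    return min(costs, key=lambda name: costs[name](n))

def build_tree(costs, lo, hi, depth=0, max_depth=20):
    # leaf when both endpoints agree on the winner, or at the depth cap
    if best(costs, lo) == best(costs, hi) or depth == max_depth:
        return best(costs, (lo + hi) // 2)
    mid = (lo + hi) // 2
    return (mid, build_tree(costs, lo, mid, depth + 1, max_depth),
                 build_tree(costs, mid + 1, hi, depth + 1, max_depth))

def lookup(tree, n):
    # run-time dispatch: walk the tree, O(depth) comparisons per call
    while isinstance(tree, tuple):
        mid, left, right = tree
        tree = left if n <= mid else right
    return tree

# Hypothetical measured cost models for two variants of the same component:
costs = {"cpu": lambda n: 5 * n, "gpu": lambda n: 1000 + n}
tree = build_tree(costs, 1, 10**6)
assert lookup(tree, 10) == "cpu"      # small problems: CPU variant wins
assert lookup(tree, 10**5) == "gpu"   # large problems: GPU variant wins
```

The tree stays compact because large regions where one variant dominates collapse into single leaves; only the neighborhood of the crossover point is refined.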
Usman Dastgeer, Lu Li and Christoph Kessler. Performance-Aware Dynamic Composition of Applications for Heterogeneous Multicore Systems with the PEPPHER Composition Tool. Proc. 16th Int. Workshop on Compilers for Parallel Computers (CPC-2012), Padova, Italy, Jan. 2012.
Abstract:
The PEPPHER component model defines an environment for annotation of native C/C++ based components for homogeneous and heterogeneous multicore and manycore systems, including GPU and multi-GPU based systems. For the same computational functionality, captured as a component, different sequential and explicitly parallel implementation variants using various types of execution units might be provided, together with metadata such as explicitly exposed tunable parameters or programmer-provided static predictor functions for performance metrics such as execution time. The goal is to compose an application from these components and variants such that, depending on the run-time context, the most suitable implementation variant will be chosen automatically for each invocation. In heterogeneous systems with multiple memory units, such as GPU-based systems, operand location is also to be taken into account as part of the run-time context, as data transfer costs can be very significant. In this paper, we describe the PEPPHER composition tool, which is a whole-program pre-compiler that explores the set of components and their implementation variants required for an application, generates wrapper (proxy, stub) code that encapsulates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code. In this way, the composition tool also provides a high-level programming front-end for the task-based PEPPHER runtime system (StarPU). The current tool prototype supports code generation for sequential C, CUDA and OpenCL. We describe the design of the PEPPHER composition tool and give examples of how it is used to PEPPHERize applications.
-
Usman Dastgeer and Christoph Kessler. A performance-portable generic component for 2D convolution computations on GPU-based systems. Proc. MULTIPROG-2012 Workshop at HiPEAC-2012, Paris, Jan. 2012.
Abstract:
In this paper, we describe our work on providing a generic yet optimized GPU (CUDA/OpenCL) implementation for the 2D MapOverlap skeleton. We explain our implementation with the help of a 2D convolution application, implemented using the newly developed skeleton. The memory (constant and shared memory) and adaptive tiling optimizations are applied and their performance implications are evaluated on different classes of GPUs. We present two different metrics to calculate the optimal tiling factor dynamically in an automated way, which helps in retaining the best performance without manual tuning when moving to new GPU architectures. With our approach, we can achieve average speedups by a factor of 3.6, 2.3, and 2.4 over an otherwise optimized (without tiling) implementation on NVIDIA C2050, GTX280 and 8800 GT GPUs respectively. Above all, the performance portability is achieved without requiring any manual changes in the skeleton program or the skeleton implementation.
-
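One way such a tiling factor can be bounded automatically is by the GPU's shared-memory capacity. The formula below is an invented stand-in for illustration, not one of the paper's two metrics: it grows the tile as long as the tile plus the convolution halo still fits in shared memory.

```python
# Illustrative sketch: largest tiling factor t such that a t-wide block
# tile, including the filter halo, fits in shared memory.
def max_tiling_factor(block_w, block_h, filter_w, filter_h,
                      shared_mem_bytes, elem_bytes=4):
    t = 1
    while True:
        tile_w = (t + 1) * block_w + filter_w - 1   # candidate t+1, with halo
        tile_h = block_h + filter_h - 1
        if tile_w * tile_h * elem_bytes > shared_mem_bytes:
            return t                                # t+1 no longer fits
        t += 1

# e.g. 16x16 thread blocks, 5x5 filter, 48 KiB shared memory (Fermi-like)
f = max_tiling_factor(16, 16, 5, 5, 48 * 1024)
assert f >= 1
```

Because the limit is derived from a queried hardware parameter rather than hard-coded, the same skeleton code adapts when moved to a GPU generation with a different shared-memory size, which is the performance-portability point the paper makes.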
Akhtar Ali, Usman Dastgeer and Christoph Kessler. OpenCL on shared memory multicore CPUs. Proc. MULTIPROG-2012 Workshop at HiPEAC-2012, Paris, Jan. 2012.
http://www.ida.liu.se/~chrke/pub/Ali-HiPEAC-MULTIPROG-2012-wksh-final16.pdf
Abstract:
Shared memory multicore processor technology is pervasive in mainstream computing. This new architecture challenges programmers to write code that scales over these many cores to exploit the full computational power of these machines. OpenMP and Intel Threading Building Blocks (TBB) are two of the popular frameworks used to program these architectures. Recently, OpenCL has been defined as a standard by the Khronos group, which focuses on programming a possibly heterogeneous set of processors with many cores such as CPU cores, GPUs, and DSP processors. In this work, we evaluate the effectiveness of OpenCL for programming multicore CPUs in a comparative case study with OpenMP and Intel TBB for five benchmark applications: matrix multiply, LU decomposition, 2D image convolution, Pi value approximation and image histogram generation. The evaluation includes the effect of compiler optimizations for different frameworks, OpenCL performance on different vendors' platforms and the performance gap between CPU-specific and GPU-specific OpenCL algorithms for execution on a modern GPU. Furthermore, a brief usability evaluation of the three frameworks is also presented.
-
Usman Dastgeer, Lu Li and Christoph Kessler. The PEPPHER Composition Tool: Performance-Aware Dynamic Composition of Applications for GPU-based Systems. Proc. 2012 Int. Workshop on Multi-Core Computing Systems (MuCoCoS 2012), Nov. 16, 2012, Salt Lake City, Utah, USA, in conjunction with the Supercomputing Conference (SC12).
Abstract:
The PEPPHER component model defines an environment for annotation of native C/C++ based components for homogeneous and heterogeneous multicore and manycore systems, including GPU and multi-GPU based systems. For the same computational functionality, captured as a component, different sequential and explicitly parallel implementation variants using various types of execution units might be provided, together with metadata such as explicitly exposed tunable parameters. The goal is to compose an application from its components and variants such that, depending on the run-time context, the most suitable implementation variant will be chosen automatically for each invocation. We describe and evaluate the PEPPHER composition tool, which explores the application's components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code. With several applications, we demonstrate how the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath.
-
Mudassar Majeed, Usman Dastgeer and Christoph Kessler. Structured Development of Scalable Scientific Applications for GPU Clusters. In Proc. MCC'12 Fifth Swedish Workshop on Multicore Computing, Nov. 2012, Stockholm.
Abstract:
SkePU is a C++ template library with a simple and unified interface for expressing data parallel computations in terms of generic components, called skeletons, on multi-GPU systems using CUDA and OpenCL. SkePU provides programmability, portability and even performance portability, but up to now applications written using SkePU could only run on a single multi-GPU node. We present the extension of SkePU for GPU clusters and demonstrate it with four scientific applications. The results show scalability of these applications over multiple nodes without the need to modify the SkePU application source code.
-
Lu Li, Usman Dastgeer and Christoph Kessler. Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. In Proc. MCC'12 Fifth Swedish Workshop on Multicore Computing, Nov. 2012, Stockholm.
Abstract:
An important performance optimization in GPU-based heterogeneous systems is the selection of the fastest implementation variant among the several provided for the same functionality, with respect to the currently available resources and relevant properties of the run-time context such as problem sizes. Empirical models based on blind executions, which require no or little human effort for adaptive optimizations, become practically feasible if the training cost can be reduced to a reasonable level. In previous work we proposed an adaptive pruning algorithm for efficient selection of training examples, a decision-tree based method for representing, predicting and selecting the fastest implementation variants for given run-time call context properties, and a composition tool for building the overall composed application from its components. The results showed reasonably good prediction precision with a low sampling rate and accordingly reduced sampling and training time. In this paper we extend the algorithm with two improved pruning strategies: an oversampling method that improves precision by more informed pruning decisions at only modestly increased sampling cost, and threshold-based pruning to further reduce sampling time as well as run-time and space overhead. Our experimental evaluation shows that these two strategies help to increase precision and/or decrease overhead.
-
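Threshold-based pruning of the kind this abstract mentions can be sketched as follows. The rule below is my own simplification, not the paper's algorithm: stop subdividing a problem-size interval once the winning variant's margin over the runner-up exceeds a threshold at both endpoints, assuming the ranking will not flip inside.

```python
# Toy sketch of threshold-based pruning of training samples.
def ranked(costs, n):
    return sorted(costs, key=lambda v: costs[v](n))

def margin(costs, n):
    order = ranked(costs, n)
    return costs[order[1]](n) / costs[order[0]](n)  # runner-up / winner

def sample_points(costs, lo, hi, threshold=1.5, out=None):
    if out is None:
        out = set()
    out.update((lo, hi))
    if hi - lo <= 1:
        return out
    same_winner = ranked(costs, lo)[0] == ranked(costs, hi)[0]
    if same_winner and margin(costs, lo) > threshold \
                   and margin(costs, hi) > threshold:
        return out                      # pruned: skip the whole interior
    mid = (lo + hi) // 2
    sample_points(costs, lo, mid, threshold, out)
    sample_points(costs, mid + 1, hi, threshold, out)
    return out

# Hypothetical cost models; sampling concentrates near the crossover point.
costs = {"cpu": lambda n: 5 * n, "gpu": lambda n: 1000 + n}
pts = sample_points(costs, 1, 10**6)
assert len(pts) < 2000   # a tiny fraction of the million candidate sizes
```

The trade-off the paper studies is visible even in the toy: a higher threshold samples more points near the crossover (better precision, more training time), a lower one prunes more aggressively.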
Christoph Kessler, Usman Dastgeer, Mudassar Majeed, Nathalie Furmento, Samuel Thibault, Raymond Namyst, Siegfried Benkner, Sabri Pllana, Jesper Larsson Träff and Martin Wimmer. Leveraging PEPPHER Technology for Performance Portable Supercomputing. In the proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC 2012), 10-16 November 2012, Salt Lake City, Utah, USA. IEEE Computer Society.
Abstract:
PEPPHER is a 3-year EU FP7 project that develops a novel approach and framework to enhance performance portability and programmability of heterogeneous multi-core systems. Its primary target is single-node heterogeneous systems, where several CPU cores are supported by accelerators such as GPUs. This poster briefly surveys the PEPPHER framework for single-node systems, and elaborates on the prospects for leveraging the PEPPHER approach to generate performance-portable code for heterogeneous multi-node systems.
-
Martin Sandrieser, Siegfried Benkner and Sabri Pllana. Using explicit platform descriptions to support programming of heterogeneous many-core systems, Parallel Computing, Volume 38, Issues 1-2, January-February 2012, Pages 52-65, ISSN 0167-8191, doi:10.1016/j.parco.2011.10.008.
Abstract:
Heterogeneous many-core systems constitute a viable approach for coping with power constraints in modern computer architectures and can now be found across the whole computing landscape ranging from mobile devices, to desktop systems and servers, all the way to high-end supercomputers and large-scale data centers. While these systems promise to offer superior performance-power ratios, programming heterogeneous many-core architectures efficiently has been shown to be notoriously difficult. Programmers typically are forced to take into account a plethora of low-level architectural details and usually have to resort to a combination of different programming models within a single application. In this paper we propose a platform description language (PDL) that enables capturing key architectural patterns of commonly used heterogeneous computing systems. PDL architecture descriptions support both programmers and toolchains by providing platform-specific information in a well-defined and explicit manner. We have developed a prototype source-to-source compilation framework that utilizes PDL descriptors to transform sequential task-based programs with source code annotations into a form that is convenient for execution on heterogeneous many-core systems. Our framework relies on a component-based approach that accommodates different implementation variants of tasks, customized for different parts of a heterogeneous platform, and utilizes an advanced runtime system for exploiting parallelism through dynamic task scheduling. We show various usage scenarios of our PDL and demonstrate the effectiveness of our framework for a commonly used scientific kernel and a financial application on different configurations of a state-of-the-art CPU/GPU system.
-
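The idea behind an explicit platform description can be illustrated with a small sketch. The real PDL is a dedicated (XML-based) language; the dictionary schema, unit names and query helpers below are invented for illustration only: a descriptor exposes the processing units and memory regions of a node so that tools query them instead of hard-coding architectural details.

```python
# Invented, PDL-inspired platform descriptor and two example queries.
PLATFORM = {
    "name": "example-node",
    "units": [
        {"id": "cpu0", "kind": "cpu", "cores": 8,   "memory": "main"},
        {"id": "gpu0", "kind": "gpu", "cores": 448, "memory": "device0"},
    ],
    "memories": {"main": 32 * 2**30, "device0": 3 * 2**30},  # bytes
}

def units_of_kind(platform, kind):
    # e.g. let a toolchain enumerate all accelerators of a given kind
    return [u["id"] for u in platform["units"] if u["kind"] == kind]

def memory_of(platform, unit_id):
    # capacity of the memory region a processing unit is attached to
    unit = next(u for u in platform["units"] if u["id"] == unit_id)
    return platform["memories"][unit["memory"]]

assert units_of_kind(PLATFORM, "gpu") == ["gpu0"]
assert memory_of(PLATFORM, "gpu0") == 3 * 2**30
```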
Daniel Cederman and Philippas Tsigas. Supporting Lock-Free Composition of Concurrent Data Objects: Moving Data Between Containers. IEEE Transactions on Computers, 16 Oct. 2012. IEEE Computer Society Digital Library.
http://doi.ieeecomputersociety.org/10.1109/TC.2012.248
Abstract:
Lock-free data objects offer several advantages over their blocking counterparts, such as being immune to deadlocks, priority inversion and convoying. They have also been shown to work well in practice. However, composing the operations they provide into larger atomic operations, while still guaranteeing efficiency and lock-freedom, is a challenging algorithmic task. We present a lock-free methodology for composing a wide variety of concurrent linearizable objects together by unifying their linearization points. This makes it possible to relatively easily introduce atomic lock-free move operations to a wide range of concurrent lock-free containers. This move operation allows data to be transferred from one container to another, in a lock-free way, without blocking any of the operations supported by the original container. For a data object to be suitable for composition using our methodology it needs to fulfil a set of requirements. These requirements are, however, generic enough to be fulfilled by a large set of objects. To show this we have performed case studies on six commonly used lock-free objects (a stack, a queue, a skip-list, a deque, a doubly linked-list and a hash-table) to demonstrate the general applicability of the methodology. We also show that the operations originally supported by the data objects keep their performance behavior under our methodology.
-
Jörg Keller, Christoph Kessler and Rikard Hulten. Optimized on-chip pipelining for memory-intensive computations on multi-core processors with explicit memory hierarchy. Journal of Universal Computer Science 18(14):1987-2023, Oct. 2012.
http://www.jucs.org/jucs_18_14/optimized_on_chip_pipelining
Abstract:
Limited bandwidth to off-chip main memory tends to be a performance bottleneck in chip multiprocessors, and this will become even more problematic with an increasing number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, transforming the program into more memory-efficient code is an important program optimization. On-chip pipelining reorganizes the computation so that partial results of subtasks are forwarded immediately between the cores over the high-bandwidth internal network, in order to reduce the volume of main memory accesses, and thereby improves the throughput for memory-intensive computations. At the same time, throughput is also constrained by the limited amount of on-chip memory available for buffering forwarded data. By optimizing the mapping of tasks to cores, balancing a trade-off between load balancing, buffer memory consumption, and communication load on the on-chip network, a larger buffer size can be applied, resulting in less DMA communication and scheduling overhead. In this article, we consider parallel mergesort as a representative memory-intensive application in detail, and focus on the global merging phase, which is dominating the overall sorting time for larger data sets. We work out the technical issues of applying the on-chip pipelining technique, and present several algorithms for optimized mapping of merge trees to the multiprocessor cores. We also demonstrate how some of these algorithms can be used for mapping of other streaming task graphs. We describe an implementation of pipelined parallel mergesort for the Cell Broadband Engine, which serves as an exemplary target. 
We evaluate experimentally the influence of buffer sizes and mapping optimizations, and show that optimized on-chip pipelining indeed speeds up merging times, for realistic problem sizes, by up to 70% on QS20 and 143% on PS3 compared to the merge phase of CellSort, which was previously the fastest merge sort implementation on Cell.
-
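The core idea of the global merge phase can be shown in miniature. The sketch below is a sequential generator version, not actual Cell SPE code: each merge node consumes its children's output streams lazily and forwards partial results immediately, so only small buffers are needed instead of writing whole intermediate runs to main memory.

```python
# Toy merge tree over sorted runs; generators model the streaming
# (buffer-bounded) forwarding between on-chip merge nodes.
import heapq

def merge_node(left, right):
    # stream-merge two sorted iterators, yielding elements one by one
    yield from heapq.merge(left, right)

def merge_tree(runs):
    # build a balanced binary merge tree over the sorted input runs
    streams = [iter(r) for r in runs]
    while len(streams) > 1:
        streams = [merge_node(streams[i], streams[i + 1])
                   for i in range(0, len(streams) - 1, 2)] + \
                  (streams[-1:] if len(streams) % 2 else [])
    return streams[0]

runs = [[1, 5, 9], [2, 6], [3, 7, 8], [0, 4]]
assert list(merge_tree(runs)) == list(range(10))
```

The mapping problem the article optimizes, which tree nodes share a core and how large each forwarding buffer may be, does not arise in this toy, but the streaming dataflow it balances is the one shown here.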
Phuong Hoai Ha, Philippas Tsigas and Otto J. Anshus. Wait-free Programming for General Purpose Computations on Graphics Processors. IEEE Transactions on Computers, IEEE Press.
Abstract:
The fact that graphics processors (GPUs) are today's most powerful computational hardware for the dollar has motivated researchers to utilize the ubiquitous and powerful GPUs for general-purpose computing. However, unlike CPUs, GPUs are optimized for processing 3D graphics (e.g. graphics rendering), a kind of data-parallel application, and consequently, several GPUs do not support strong synchronization primitives to coordinate their cores. This prevents the GPUs from being deployed more widely for general-purpose computing. This paper aims at bridging the gap between the lack of strong synchronization primitives in the GPUs and the need for strong synchronization mechanisms in parallel applications. Based on the intrinsic features of typical GPU architectures, we construct strong synchronization objects such as wait-free and t-resilient read-modify-write objects for a general model of GPU architectures without hardware synchronization primitives such as test-and-set and compare-and-swap. Accesses to the wait-free objects have time complexity O(N), where N is the number of processes. The wait-free objects have the optimal space complexity O(N^2). Our result demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without strong synchronization primitives in hardware and that wait-free programming is possible for such GPUs.