Communication Architectures for Scalable GPU-centric Computing Systems

Klenk, Benjamin

PDF, English

Citation of documents: Please do not cite the URL displayed in your browser's address bar; instead, use the DOI, URN, or persistent URL below, whose long-term accessibility we can guarantee.

DOI: 10.11588/heidok.00023954
URN: urn:nbn:de:bsz:16-heidok-239547
URL: http://www.ub.uni-heidelberg.de/archiv/23954

Abstract

In recent years, power consumption has become the main concern in High Performance Computing (HPC). This has led to heterogeneous computing systems in which Central Processing Units (CPUs) are supported by accelerators, such as Graphics Processing Units (GPUs). While GPUs used to be seen as slave devices to which the main processor offloads computation, today’s systems tend to deploy more GPUs than CPUs. Eventually, the GPU will become a first-class processor, bearing increasing responsibilities. Promoting the GPU to a first-class processor comes with many challenges, such as progress guarantees, dynamic memory management, and scheduling. However, one of the main challenges is the GPU’s inability to orchestrate communication, which is currently handled entirely by the CPU. This work addresses that issue and presents solutions that allow GPUs to source and sink network traffic independently. Many important aspects are addressed, ranging from the application level to how networking hardware is accessed. First, important large-scale exascale applications are studied to better understand their communication behavior and requirements. Several metrics are presented, including the time spent on communication, message sizes, and the length of the queues required to match messages with receive requests. The analysis reveals that messages become smaller at scale, which makes matching messages with receive requests an important problem to address. The next part analyzes how the GPU can directly access the network; various communication models are presented and benchmarked. It is shown that a flat address space of distributed GPU memories achieves higher bandwidth than put/get communication or CPU-controlled message passing, but allows less communication to be overlapped with computation. Overall, GPU-controlled communication is always superior, both in terms of time-to-solution and energy consumption.
The final part addresses communication management on GPUs, which is required to provide high-level communication abstractions. Besides other fundamental building blocks, a message-matching algorithm is presented that yields performance similar to that of CPUs. However, it is also shown that the messaging protocol can be relaxed to improve performance significantly, leveraging the massive parallelism provided by the GPU’s architecture.

Document type:	Dissertation
Supervisor:	Fröning, Prof. Dr. Holger
Date of thesis defense:	9 January 2018
Date Deposited:	12 Jan 2018 08:50
Date:	2018
Faculties / Institutes:	The Faculty of Mathematics and Computer Science > Dean's Office of The Faculty of Mathematics and Computer Science; Service facilities > Institut f. Technische Informatik (ZITI)
DDC-classification:	004 Data processing, Computer science; 600 Technology (Applied sciences)
Controlled Keywords:	Distributed System, Graphics Processing Unit