Accelerating Network Communication and I/O in Scientific High Performance Computing Environments

Neuwirth, Sarah Marie

PDF, English - main document

Citation of documents: Please do not cite the URL displayed in your browser's address bar. Instead, use the DOI, URN, or persistent URL below, as we can guarantee their long-term accessibility.

DOI: 10.11588/heidok.00025757
URN: urn:nbn:de:bsz:16-heidok-257571
URL: http://www.ub.uni-heidelberg.de/archiv/25757

Abstract

High performance computing has become one of the major drivers of technological innovation and scientific discovery. Performance gains were originally driven by increasing operating frequencies and technology scaling; a recent slowdown in this evolution has led to the development of multi-core architectures, which are supported by accelerator devices such as graphics processing units (GPUs). With the upcoming exascale era, overall power consumption and the gap between compute capabilities and I/O bandwidth have become major challenges. System performance is now dominated by the time spent in communication and I/O, which depends heavily on the capabilities of the network interface. To cope with the extreme concurrency and heterogeneity of future systems, the software ecosystem of the interconnect needs to be carefully tuned to excel in reliability, programmability, and usability.

This work identifies and addresses three major gaps in today's interconnect software systems. The I/O gap describes the disparity in operating speeds between computing capabilities and secondary storage tiers. The communication gap arises from the communication overhead needed to synchronize distributed large-scale applications and from the mixed workload placed on the network. The last gap is the so-called concurrency gap, which arises from the extreme concurrency of modern systems and the resulting learning curve that scientific application developers face in exploiting the hardware capabilities.

The first contribution is the introduction of the network-attached accelerator approach, which moves accelerators into a "stand-alone" cluster connected through the Extoll interconnect. This novel communication architecture enables direct accelerator-to-accelerator communication without any host interaction, as well as an optimal mapping of applications to compute resources. The effectiveness of this approach is evaluated for two classes of accelerators: Intel Xeon Phi coprocessors and NVIDIA GPUs.

The next contribution comprises the design, implementation, and evaluation of support for legacy codes and protocols over the Extoll interconnect technology. By providing TCP/IP protocol support over Extoll, it is shown that the performance benefits of the interconnect can be fully leveraged by a broader range of applications, including seamless support for legacy codes.

The third contribution is twofold. First, a comprehensive analysis of the Lustre networking protocol semantics and interfaces is presented. Afterwards, these insights are utilized to map the LNET protocol semantics onto the Extoll networking technology. The result is a fully functional Lustre network driver for Extoll. An initial performance evaluation demonstrates promising bandwidth and message rate results.

The last contribution comprises the design, implementation, and evaluation of two easy-to-use load balancing frameworks, which transparently distribute the I/O workload across all available storage system components. The solutions maximize the parallelization and throughput of file I/O. The frameworks are evaluated on the Titan supercomputing system for three I/O interfaces. For example, for large-scale application runs, the performance of POSIX I/O and MPI-IO can be improved by up to 50% on a per-job basis, while HDF5 shows performance improvements of up to 32%.

Translation of abstract (German)

Hochleistungsrechnen hat sich zu einem der bedeutendsten Standbeine im Bereich der technischen und wissenschaftlichen Errungenschaften entwickelt. Ursprünglich wurde die Leistungssteigerung solcher Systeme durch die kontinuierliche Steigerung der Taktfrequenz gewährleistet. Die Verlangsamung dieses Trends hat zu der Entwicklung von Mehrkernarchitekturen geführt, welche zusätzlich durch sogenannte Beschleuniger wie etwa Graphikprozessoren unterstützt werden. In Verbindung mit der bevorstehenden Exascale Ära haben sich vor allem der Gesamtstromverbrauch und die Kluft zwischen Rechenkapazität und I/O Bandbreite als limitierende Faktoren herauskristallisiert. Die gegenwärtige Systemleistung wird vor allem durch die Kommunikations- und I/O-Zeit beschränkt. Dieses Phänomen hängt insbesondere mit den Eigenschaften der Netzwerkschnittstelle zusammen. Um den extremen Anforderungen in Hinblick auf Parallelität und Heterogenität zukünftiger Systeme gerecht zu werden, bedarf es einer sorgfältigen Abstimmung der Software-Komponenten, insbesondere im Hinblick auf Zuverlässigkeit, Programmierbarkeit und Benutzerfreundlichkeit.

Diese Arbeit identifiziert und widmet sich insbesondere drei Leistungslücken (engl. gaps), die in den heutigen Softwareumgebungen im Bereich der Verbindungsnetzwerke beobachtet werden können. Die sogenannte I/O Gap beschreibt die Leistungslücke, die entsteht durch die unterschiedlichen Taktfrequenzen von Rechenkapazitäten und der Speicherhierarchie. Die sogenannte Communication Gap beschreibt den Kommunikationsoverhead, der bei der Synchronisierung von Großanwendungen entsteht, aber auch die gemischte Kommunikationslast, die auf das Netzwerk ausgeübt wird. Die letzte Leistungslücke wird durch die sogenannte Concurrency Gap beschrieben. Diese Leistungslücke entsteht durch die extreme Parallelität von modernen Hochleistungsrechnern und den für Programmierer damit verbundenen Mehraufwand, die Hardware-Eigenschaften dieser Systeme auszunutzen.

Im ersten Beitrag wird der Ansatz der Network-Attached Accelerators („Netzwerk-Beschleuniger“) als neue Kommunikationsarchitektur vorgestellt. Dieses neuartige Konzept ermöglicht es, Beschleuniger aus der bislang statischen Hardware-Anordnung zu entkoppeln und diese über das Netzwerk allen Rechenknoten gleichförmig zur Verfügung zu stellen. Vorteile dieser Architekturform sind unter anderem, dass Grafikprozessoren direkt über das Netzwerk miteinander kommunizieren können, aber auch das dynamische Abbilden von Anwendungen auf Rechenressourcen zur Laufzeit. Die Leistungsfähigkeit dieses Ansatzes wird anhand der Intel Xeon Phi Co-Prozessoren und NVIDIA Grafikprozessoren analysiert.

Im nächsten Beitrag wird der Fokus auf die Unterstützung von sogenannten Legacy Anwendungen und Protokollen über Hochgeschwindigkeitsnetzwerke wie Extoll gelegt. Durch die Erweiterung der Extoll Softwareumgebung zur Unterstützung der TCP/IP Protokollfamilie können Legacy Anwendungen das Leistungsspektrum der Extoll Technologie voll ausschöpfen, ohne dafür modifiziert werden zu müssen. Dies öffnet die Tür für ein breiteres Spektrum an Anwendungen.

Der dritte Beitrag liefert zunächst eine ausführliche Analyse des Lustre Netzwerkprotokolls, welches LNET genannt wird. Im Anschluss werden diese Erkenntnisse genutzt, um die Protokoll-Semantik von LNET effizient auf die Extoll Netzwerktechnologie abzubilden. Es wird ein voll funktionsfähiger Lustre Netzwerktreiber (Lustre Network Driver) für Extoll implementiert. Eine initiale Leistungsanalyse zeigt vielversprechende Ergebnisse, gerade im Hinblick auf die erzielte Bandbreite und Nachrichtenrate.

Der letzte Beitrag besteht in der Konzeption, Implementierung und Evaluation zweier benutzerfreundlicher Anwender-Frameworks, welche die Datenlast einer Großanwendung transparent über alle verfügbaren Komponenten des Speichersystems verteilen. Diese beiden Lösungen dienen der optimalen Parallelisierung und Maximierung der Bandbreite in Bezug auf das Lesen und Schreiben von Dateien. Die beiden Frameworks werden auf dem Hochleistungsrechner Titan mit drei verschiedenen I/O Schnittstellen (I/O interfaces) evaluiert. Für Großanwendungen kann die Leistung von POSIX und MPI-IO beispielsweise um bis zu 50% gesteigert werden, für HDF5 kann eine Leistungssteigerung von bis zu 32% erzielt werden.

Document type:	Dissertation
Supervisor:	Brüning, Prof. Dr. Ulrich
Date of thesis defense:	17 December 2018
Date Deposited:	09 Jan 2019 07:52
Date:	2019
Faculties / Institutes:	The Faculty of Mathematics and Computer Science > Dean's Office of The Faculty of Mathematics and Computer Science; Service facilities > Institut f. Technische Informatik (ZITI)
DDC-classification:	004 Data processing Computer science; 600 Technology (Applied sciences)
Controlled Keywords:	Distributed System, Parallel Computing
Uncontrolled Keywords:	Network-Attached Accelerators, Network Communication