The Hebrew University of Jerusalem FFMK: A FAST AND FAULT-TOLERANT MICROKERNELBASED SYSTEM FOR EXASCALE COMPUTING Amnon Barak Hermann Härtig Wolfgang E. Nagel Alexander Reinefeld Hebrew University Jerusalem (HUJI) TU Dresden, Operating Systems Group (TUDOS) TU Dresden, Center for Information Services and HPC (ZIH) Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB) CARSTEN WEINHOLD, TU DRESDEN b) SYSTEM MODEL 7 kernel-based System for Exascale Computing The Hebrew University of Jerusalem a) c) b) d) M M M M M M M M FFMK: Building an Exascale Operating System 2 The Hebrew University of Jerusalem NODE ARCHITECTURE Application Application Application Platform Management Service OS InfiniBand, TCP/IP … MPI Application Application Runtime Runtime Linux L4 Microkernel FFMK: Building an Exascale Operating System 3 4 L4 + L LINUX L4 microkernel controls the node Light-weight and low-noise Linux App Virtualization: L4Linux on L4 microkernel Unmodified Linux programs (MPI, …) Linux process = L4 task + L4 threads Linux syscalls / exceptions: generic forwarding to L4Linux kernel L4Linux L4 Microkernel / Hypervisor Core Core Core Core FFMK: L4 Microkernel + L4Linux as an HPC Operating System Core 4 DECOUPLED EXECUTION Decoupling: move Linux thread to new L4 thread on its own core Linux App L4Linux L4 Microkernel / Hypervisor Core Core Core Core FFMK: L4 Microkernel + L4Linux as an HPC Operating System Core 5 DECOUPLED EXECUTION Decoupling: move Linux thread to new L4 thread on its own core Linux syscall: Move back to Linux Linux App L4Linux L4 Microkernel / Hypervisor Core Core Core Core FFMK: L4 Microkernel + L4Linux as an HPC Operating System Core 6 DECOUPLED EXECUTION Decoupling: move Linux thread to new L4 thread on its own core Linux syscall: Move back to Linux Linux App L4 syscalls: Scheduling Threads Memory L4Linux Direct I/O device access L4 Microkernel / Hypervisor Core Core Core Core FFMK: L4 Microkernel + L4Linux as an HPC Operating System Core 7 DECOUPLING: EP The Hebrew University of Jerusalem Standard Min Standard Max Decoupled Min Decoupled Max Run Time in Seconds 18,40 18,35 18,30 18,25 18,20 18,15 18,10 18,05 18,00 30 90 150 210 270 330 390 450 510 570 630 690 750 Number of Cores Adam Lackorzynski, Carsten Weinhold, Hermann Härtig, "Decoupled: Low-Effort Noise-Free Execution on Commodity Systems", ROSS 2016, June 2016, Kyoto, Japan FFMK: Building an Exascale Operating System 8 The Hebrew University of Jerusalem DECOUPLING: BSP Standard Decoupled Run Time in Seconds 18,70 18,60 18,50 18,40 18,30 18,20 18,10 30 90 150 210 270 330 390 450 510 570 630 690 750 Number of Cores Adam Lackorzynski, Carsten Weinhold, Hermann Härtig, "Decoupled: Low-Effort Noise-Free Execution on Commodity Systems", ROSS 2016, June 2016, Kyoto, Japan FFMK: Building an Exascale Operating System 9 The Hebrew University of Jerusalem NODE ARCHITECTURE Application Application Application Checkpointing Platform Management Service OS InfiniBand, TCP/IP Aries, … MPI Application Application Runtime Runtime Linux L4 Microkernel FFMK: Building an Exascale Operating System 10 eckpoint Scheduling l Lifetime Tests The Hebrew University of Jerusalem CHECKPOINTING Economic model to optimize checkpointing: Application lifetime vs checkpoint interval Scheduling of checkpoints to reduce conflicts on SSD burst buffers Online vs offline scheduling examined FFMK: Building an Exascale Operating System 11 Figure: Changing jobs total lifetime by increasing SSD no. using online and The Hebrew University of Jerusalem Application Application Application NODE ARCHITECTURE Load Balancing Platform Management Service OS InfiniBand, TCP/IP Aries, … MPI Application Application Runtime Runtime Linux L4 Microkernel FFMK: Building an Exascale Operating System 12 Dynamic Load Balancing LOAD BALANCING The Hebrew University of Jerusalem Balance workload – Balance workload Four objectives of dynamic load balancing 12 FFMK: Building an Exascale Operating System 13 LOAD BALANCING Dynamic Load Balancing The Hebrew University of Jerusalem Four objectives of dynamic load balancing – Balance workload Balance workload Dynamic Load Balancing Four objectives of dynamic load balancing Minimize communication – Reduce communication between partibetween partitions tions (due to data dependencies) – Balance workload 12 13 FFMK: Building an Exascale Operating System 14 Dynamic Load Balancing LOAD BALANCING Dynamic Load Balancing The Hebrew University of Jerusalem Four objectives of dynamic load balancing – Balance workload Four objectives of dynamic load balancing – Balance workload Dynamic Load Balancing Balance workload – Reduce communication between Fourpartiobjectives of dynamic load balanc Dynamic Load Balancing Four objectives of dynamic load balancing Dynamic LoadtoBalancing tions (due data dependencies) – Balance workload communication between parti– Balance– Reduce workload tions (due to data dependencies) – Reduce migration, i.e. communication – Reduce communication between pa Four objectivescommunication of dynamic load balancing Minimize when changing the partitioning tions (due to data dependencies) – Balance workload between partitions – Reduce migration, i.e. communicat – Reduce communication between partiwhen changing the partitioning tions (due to data dependencies) 12 – Reduce migration, i.e. communication Minimize migration when changing the partitioning 13 14 FFMK: Building an Exascale Operating System 15 LOAD BALANCING Dynamic Load Balancing The Hebrew University of Jerusalem Four objectives of dynamic load balancing – Balance workload Dynamic Load Balancing Balance workload Dynamic Load Balancing Dynamic Load Balancing Four objectives of dynamic load balancing Four objectives of dynamic load balancing objectives – Four Balance workload of dynamic load balancing – Balance workload – Reduce communication – Balance workloadbetween partitions (due to data dependencies) – Reduce communication between – Reduce parti-communication between partitions (due to data dependencies) tions (due to data dependencies) Minimize communication between partitions – Reduce migration, i.e. communication – Reduce migration, i.e. communication when changing the partitioning when changing the partitioning 12 Minimize migration 13 14 Compute new partitions fast FFMK: Building an Exascale Operating System 16 Motivation: Di<usive Load Balanc DIFFUSION Fully distributed method • Local operations lead to global convergence The Hebrew University of Jerusalem LoadLoad perper node iterations nodeover over iterations Practical application is rare • Well described since the 1990's • Only few papers show real use in HPC FFMK: Building an Exascale Operating System 17 The Hebrew University of Jerusalem DIFFUSION RESULTS Performance Comparison: 1Ki-8Ki weak scaling = max(l v ) − 1 avg (l v ) Max tasks sent+received among all procs Max number of task mesh edges cut by partition borders among all procs Median run time of 61 runs on Taurus, Intel Haswell + In8niband FDR cluster with Intel MPI, error bars show 25/75 percentiles Iterations until Gows lead to 0.1% imbalance (before task selection) • Di<usion leads to smallest migration Diffusion leads to smallest migration • Di<usion achieves very good edge cut Diffusion achieves good • Di<usion run very time ca. 2 ms edge for 8192cut processes, Zoltan much slower Slide 14 Diffusion run time ~2 ms for 8192 processes, Zoltan much slower Matthias Lieber, Kerstin Gößner, Wolfgang E. Nagel, "The Potential of Diffusive Load Balancing at Large Scale", EuroMPI 2016, June 2016, Edinburgh, United Kingdom FFMK: Building an Exascale Operating System 18 The Hebrew University of Jerusalem NODE ARCHITECTURE Application Application Application Gossip Platform Management Service OS InfiniBand, TCP/IP Aries, … MPI Application Application Runtime Runtime Linux L4 Microkernel FFMK: Building an Exascale Operating System 19 The Hebrew University of Jerusalem FFMK: Building an Exascale Operating System GOSSIP 20 The Hebrew University of Jerusalem FFMK: Building an Exascale Operating System GOSSIP 21 The Hebrew University of Jerusalem STEP 1: GOSSIP FFMK: Building an Exascale Operating System 22 The Hebrew University of Jerusalem STEP 2: CORRECTION FFMK: Building an Exascale Operating System 23 The Hebrew University of Jerusalem FT COLLECTIVES Two-stage algorithm: gossip + correction Main advantage: scalability and resilience (continues to work in presence of failures) Works for: fault-tolerant broadcast Next step: extend to operations that include barrier semantics Future: use in MPI? Torsten Hoefler, Amnon Barak, Amnon Shiloh and Zvi Drezner, "Corrected Gossip Algorithms for Fast Reliable Broadcast on Unreliable Systems", Accepted for IPDPS'17, Orlando, FL, USA FFMK: Building an Exascale Operating System 24 SUMMARY The Hebrew University of Jerusalem Decoupled threads: reduced noise Checkpointing: Economic model Diffusion: may be efficient alternative Corrected Gossip: fault-tolerant broadcast Work in progress: integrate monitoring + gossip + decision making + migration German Priority Programme 1648 “Software for Exascale Computing” SPPEXA – Findings & Goals FFMK: Building an Exascale Operating System Massive parallelism (on- and cross-chip) requires fundamentally new concepts Not “racks without brains”, but software is SPPEXA – Chronology 25 2006: discussion in the German Research Foundation (DFG) on the necessity of a funding initiative for HPC software
© Copyright 2026 Paperzz