SPPEXA AnPleMeet 2017

The Hebrew University of Jerusalem
FFMK: A FAST AND FAULT-TOLERANT MICROKERNELBASED SYSTEM FOR EXASCALE COMPUTING
Amnon Barak
Hermann Härtig
Wolfgang E. Nagel
Alexander Reinefeld
Hebrew University Jerusalem (HUJI)
TU Dresden, Operating Systems Group (TUDOS)
TU Dresden, Center for Information Services and HPC (ZIH)
Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)
CARSTEN WEINHOLD, TU DRESDEN
b)
SYSTEM MODEL 7
kernel-based System for Exascale Computing
The Hebrew University of Jerusalem
a)
c)
b)
d)
M M M M M M M M
FFMK: Building an Exascale Operating System
2
The Hebrew University of Jerusalem
NODE ARCHITECTURE
Application
Application
Application
Platform Management
Service OS
InfiniBand,
TCP/IP
…
MPI
Application
Application
Runtime
Runtime
Linux
L4 Microkernel
FFMK: Building an Exascale Operating System
3
4
L4 + L LINUX
L4 microkernel controls the node
Light-weight and low-noise
Linux
App
Virtualization: L4Linux on L4 microkernel
Unmodified Linux programs (MPI, …)
Linux process = L4 task + L4 threads
Linux syscalls / exceptions:
generic forwarding to L4Linux kernel
L4Linux
L4 Microkernel / Hypervisor
Core
Core
Core
Core
FFMK: L4 Microkernel + L4Linux as an HPC Operating System
Core
4
DECOUPLED EXECUTION
Decoupling: move Linux thread to
new L4 thread on its own core
Linux
App
L4Linux
L4 Microkernel / Hypervisor
Core
Core
Core
Core
FFMK: L4 Microkernel + L4Linux as an HPC Operating System
Core
5
DECOUPLED EXECUTION
Decoupling: move Linux thread to
new L4 thread on its own core
Linux syscall: Move back to Linux
Linux
App
L4Linux
L4 Microkernel / Hypervisor
Core
Core
Core
Core
FFMK: L4 Microkernel + L4Linux as an HPC Operating System
Core
6
DECOUPLED EXECUTION
Decoupling: move Linux thread to
new L4 thread on its own core
Linux syscall: Move back to Linux
Linux
App
L4 syscalls:
Scheduling
Threads
Memory
L4Linux
Direct I/O device access
L4 Microkernel / Hypervisor
Core
Core
Core
Core
FFMK: L4 Microkernel + L4Linux as an HPC Operating System
Core
7
DECOUPLING: EP
The Hebrew University of Jerusalem
Standard Min
Standard Max
Decoupled Min
Decoupled Max
Run Time in Seconds
18,40
18,35
18,30
18,25
18,20
18,15
18,10
18,05
18,00
30
90
150
210
270
330
390
450
510
570
630
690
750
Number of Cores
Adam Lackorzynski, Carsten Weinhold, Hermann Härtig, "Decoupled: Low-Effort Noise-Free
Execution on Commodity Systems", ROSS 2016, June 2016, Kyoto, Japan
FFMK: Building an Exascale Operating System
8
The Hebrew University of Jerusalem
DECOUPLING: BSP
Standard
Decoupled
Run Time in Seconds
18,70
18,60
18,50
18,40
18,30
18,20
18,10
30
90
150
210
270
330
390
450
510
570
630
690
750
Number of Cores
Adam Lackorzynski, Carsten Weinhold, Hermann Härtig, "Decoupled: Low-Effort Noise-Free
Execution on Commodity Systems", ROSS 2016, June 2016, Kyoto, Japan
FFMK: Building an Exascale Operating System
9
The Hebrew University of Jerusalem
NODE ARCHITECTURE
Application
Application
Application
Checkpointing
Platform Management
Service OS
InfiniBand,
TCP/IP
Aries, …
MPI
Application
Application
Runtime
Runtime
Linux
L4 Microkernel
FFMK: Building an Exascale Operating System
10
eckpoint Scheduling
l Lifetime Tests
The Hebrew University of Jerusalem
CHECKPOINTING
Economic model to optimize checkpointing:
Application lifetime vs checkpoint interval
Scheduling of checkpoints to reduce
conflicts on SSD burst buffers
Online vs offline scheduling examined
FFMK: Building an Exascale Operating System
11
Figure: Changing jobs total lifetime by increasing SSD no. using online and
The Hebrew University of Jerusalem
Application
Application
Application
NODE ARCHITECTURE
Load Balancing
Platform Management
Service OS
InfiniBand,
TCP/IP
Aries, …
MPI
Application
Application
Runtime
Runtime
Linux
L4 Microkernel
FFMK: Building an Exascale Operating System
12
Dynamic Load Balancing
LOAD BALANCING
The Hebrew University of Jerusalem
Balance
workload
– Balance workload
Four objectives of dynamic load balancing
12
FFMK: Building an Exascale Operating System
13
LOAD BALANCING
Dynamic Load Balancing
The Hebrew University of Jerusalem
Four objectives of dynamic load balancing
– Balance workload
Balance workload
Dynamic Load Balancing
Four objectives of dynamic load balancing
Minimize communication
– Reduce
communication
between partibetween
partitions
tions (due to data dependencies)
– Balance workload
12
13
FFMK: Building an Exascale Operating System
14
Dynamic Load Balancing
LOAD BALANCING
Dynamic Load Balancing
The Hebrew University of Jerusalem
Four objectives of dynamic load balancing
– Balance workload
Four objectives of dynamic load balancing
– Balance workload
Dynamic Load
Balancing
Balance
workload
– Reduce communication between
Fourpartiobjectives of dynamic load balanc
Dynamic Load Balancing
Four objectives of dynamic load balancing
Dynamic
LoadtoBalancing
tions (due
data dependencies)
– Balance workload
communication between parti– Balance– Reduce
workload
tions (due to data dependencies)
– Reduce migration, i.e. communication
– Reduce communication between pa
Four
objectivescommunication
of dynamic load balancing
Minimize
when changing
the partitioning
tions (due to data dependencies)
– Balance workload
between partitions
– Reduce migration, i.e. communicat
– Reduce communication between partiwhen changing the partitioning
tions (due to data dependencies)
12
– Reduce migration, i.e. communication
Minimize
migration
when changing
the partitioning
13
14
FFMK: Building an Exascale Operating System
15
LOAD BALANCING
Dynamic Load Balancing
The Hebrew University of Jerusalem
Four objectives of dynamic load balancing
– Balance workload
Dynamic Load Balancing
Balance workload
Dynamic Load Balancing
Dynamic Load Balancing
Four objectives of dynamic load balancing
Four objectives of dynamic load
balancing
objectives
– Four
Balance
workload of dynamic load balancing
– Balance workload
– Reduce
communication
– Balance
workloadbetween partitions (due to data dependencies)
– Reduce communication between
– Reduce
parti-communication between partitions (due to data dependencies)
tions (due to data dependencies)
Minimize communication
between partitions
– Reduce migration, i.e. communication
– Reduce migration, i.e. communication
when changing the partitioning
when changing the partitioning
12
Minimize migration
13
14
Compute new partitions fast
FFMK: Building an Exascale Operating System
16
Motivation: Di<usive Load Balanc
DIFFUSION
Fully distributed method
• Local operations lead to global
convergence
The Hebrew University of Jerusalem
LoadLoad
perper
node
iterations
nodeover
over iterations
Practical application is rare
• Well described since the 1990's
• Only few papers show real use in HPC
FFMK: Building an Exascale Operating System
17
The Hebrew University of Jerusalem
DIFFUSION RESULTS
Performance Comparison: 1Ki-8Ki weak scaling
=
max(l v )
− 1
avg (l v )
Max tasks
sent+received
among all procs
Max number of
task mesh edges
cut by partition
borders among
all procs
Median run time of 61 runs on
Taurus, Intel Haswell + In8niband
FDR cluster with Intel MPI,
error bars show 25/75 percentiles
Iterations until
Gows lead to
0.1% imbalance
(before task
selection)
• Di<usion
leads
to smallest
migration
Diffusion
leads to
smallest
migration
• Di<usion achieves very good edge cut
Diffusion
achieves
good
• Di<usion
run very
time ca.
2 ms edge
for 8192cut
processes, Zoltan much slower
Slide 14
Diffusion run time ~2 ms for 8192 processes, Zoltan much slower
Matthias Lieber, Kerstin Gößner, Wolfgang E. Nagel, "The Potential of Diffusive Load Balancing at
Large Scale", EuroMPI 2016, June 2016, Edinburgh, United Kingdom
FFMK: Building an Exascale Operating System
18
The Hebrew University of Jerusalem
NODE ARCHITECTURE
Application
Application
Application
Gossip
Platform Management
Service OS
InfiniBand,
TCP/IP
Aries, …
MPI
Application
Application
Runtime
Runtime
Linux
L4 Microkernel
FFMK: Building an Exascale Operating System
19
The Hebrew University of Jerusalem
FFMK: Building an Exascale Operating System
GOSSIP
20
The Hebrew University of Jerusalem
FFMK: Building an Exascale Operating System
GOSSIP
21
The Hebrew University of Jerusalem
STEP 1: GOSSIP
FFMK: Building an Exascale Operating System
22
The Hebrew University of Jerusalem
STEP 2: CORRECTION
FFMK: Building an Exascale Operating System
23
The Hebrew University of Jerusalem
FT COLLECTIVES
Two-stage algorithm: gossip + correction
Main advantage: scalability and resilience
(continues to work in presence of failures)
Works for: fault-tolerant broadcast
Next step: extend to operations that
include barrier semantics
Future: use in MPI?
Torsten Hoefler, Amnon Barak, Amnon Shiloh and Zvi Drezner, "Corrected Gossip Algorithms for
Fast Reliable Broadcast on Unreliable Systems", Accepted for IPDPS'17, Orlando, FL, USA
FFMK: Building an Exascale Operating System
24
SUMMARY
The Hebrew University of Jerusalem
Decoupled threads: reduced noise
Checkpointing: Economic model
Diffusion: may be efficient alternative
Corrected Gossip: fault-tolerant broadcast
Work in progress: integrate monitoring +
gossip + decision making + migration
German Priority Programme 1648
“Software for Exascale Computing”
SPPEXA – Findings & Goals
FFMK: Building an Exascale Operating
System
Massive parallelism (on- and cross-chip)
requires fundamentally new concepts
Not “racks without brains”, but software is
SPPEXA – Chronology
25
2006: discussion in the German Research
Foundation (DFG) on the necessity of a
funding initiative for HPC software