
Accelerating Spark with RDMA for
Big Data Processing: Early Experiences
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA

Outline
•  Introduction
•  Problem Statement
•  Proposed Design
•  Performance Evaluation
•  Conclusion & Future Work
Digital Data Explosion in Society
Figure Source (http://www.csc.com/insights/flxwd/78931-big_data_growth_just_beginning_to_explode)
Big Data Technology - Hadoop
•  Apache Hadoop is one of the most popular Big Data technologies
   –  Provides frameworks for large-scale, distributed data storage and processing
   –  MapReduce, HDFS, YARN, RPC, etc.
[Figure: Hadoop 1.x stacks MapReduce (cluster resource management & data processing) on HDFS and Hadoop Common/Core (RPC, ...); Hadoop 2.x runs MapReduce and other data-processing models on YARN (cluster resource management & job scheduling), HDFS, and Hadoop Common/Core (RPC, ...)]
Big Data Technology - Spark
•  An emerging in-memory processing framework
   –  Iterative machine learning
   –  Interactive data analytics
   –  Scala-based implementation
   –  Master-Slave architecture; uses HDFS and Zookeeper
•  Scalable and communication intensive
   –  Wide dependencies between Resilient Distributed Datasets (RDDs) (illustrated below)
   –  MapReduce-like shuffle operations to repartition RDDs
   –  Same as Hadoop, sockets-based communication
[Figure: Spark deployment (Driver/SparkContext, Master, Workers, HDFS, Zookeeper) and RDD narrow vs. wide dependencies]
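As a concrete illustration of the wide-dependency shuffle mentioned above, the sketch below is a generic Spark RDD snippet (our own example, not code from this work); the data sizes and partition counts are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WideDependencyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WideDependencyExample"))

    // Narrow dependency: map transforms each partition locally, no network traffic.
    val pairs = sc.parallelize(1 to 1000000, 8).map(i => (i % 100, i))

    // Wide dependency: groupByKey repartitions the RDD, so every reducer pulls
    // shuffle blocks from every mapper over the sockets-based data path that
    // the RDMA design later replaces.
    val grouped = pairs.groupByKey(8)
    println(grouped.count())

    sc.stop()
  }
}
```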
Common Protocols Using OpenFabrics
[Figure: protocol stack options. Applications use either the sockets interface (over 1/10/40 GigE, 10/40 GigE with TOE, IPoIB, RSockets, or SDP) or the verbs interface (over iWARP, RoCE, or native IB verbs); the user-space RDMA protocols bypass the kernel TCP/IP stack, and the hardware layer consists of Ethernet, InfiniBand, iWARP, or RoCE adapters with the corresponding switches]
Previous Studies
•  Very good performance improvements for Hadoop (HDFS/MapReduce/RPC), HBase, and Memcached over InfiniBand
   –  Hadoop Acceleration with RDMA
      •  N. S. Islam, et al., SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC'14
      •  N. S. Islam, et al., High Performance RDMA-Based Design of HDFS over InfiniBand, SC'12
      •  M. W. Rahman, et al., HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS'14
      •  M. W. Rahman, et al., High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC'13
      •  X. Lu, et al., High-Performance Design of Hadoop RPC with RDMA over InfiniBand, ICPP'13
   –  HBase Acceleration with RDMA
      •  J. Huang, et al., High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12
   –  Memcached Acceleration with RDMA
      •  J. Jose, et al., Memcached Design on High Performance RDMA Capable Interconnects, ICPP'11
The High-Performance Big Data (HiBD) Project
•  RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
•  RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
•  RDMA for Memcached (RDMA-Memcached)
•  OSU HiBD-Benchmarks (OHB)
•  http://hibd.cse.ohio-state.edu
RDMA for Apache Hadoop
•  High-Performance Design of Hadoop over RDMA-enabled Interconnects
–  High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
–  Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
•  Current release: 0.9.9 (03/31/14)
–  Based on Apache Hadoop 1.2.1
–  Compliant with Apache Hadoop 1.2.1 APIs and applications
–  Tested with
•  Mellanox InfiniBand adapters (DDR, QDR and FDR)
•  RoCE support with Mellanox adapters
•  Various multi-core platforms
•  Different file systems with disks and SSDs
http://hibd.cse.ohio-state.edu
•  RDMA for Apache Hadoop 2.x 0.9.1 is released in HiBD!
Performance Benefits – RandomWriter & Sort in SDSC-Gordon
[Figure: Job execution time (sec) of RandomWriter and Sort, IPoIB (QDR) vs. OSU-IB (QDR), for 50/100/200/300 GB on clusters of 16/32/48/64 nodes; RandomWriter reduced by 20%, Sort reduced by 36%]
•  RandomWriter in a cluster with a single SSD per node
   –  16% improvement over IPoIB for 50 GB on a 16-node cluster
   –  20% improvement over IPoIB for 300 GB on a 64-node cluster
•  Sort in a cluster with a single SSD per node
   –  20% improvement over IPoIB for 50 GB on a 16-node cluster
   –  36% improvement over IPoIB for 300 GB on a 64-node cluster

Can RDMA benefit Apache Spark on High-Performance Networks also?
Outline
•  Introduction
•  Problem Statement
•  Proposed Design
•  Performance Evaluation
•  Conclusion & Future Work
Problem Statement
•  Is it worth it?
   –  Is the performance improvement potential high enough, if we can successfully adapt RDMA to Spark?
   –  A few percentage points or orders of magnitude?
•  How difficult is it to adapt RDMA to Spark?
   –  Can RDMA be adapted to suit the communication needs of Spark?
   –  Is it viable to rewrite portions of the Spark code with RDMA?
•  Can Spark applications benefit from an RDMA-enhanced design?
   –  What performance benefits can be achieved by using RDMA for Spark applications on modern HPC clusters?
   –  Can an RDMA-based design benefit applications transparently?
Assessment of Performance Improvement Potential
•  How much benefit can RDMA bring to Apache Spark compared to other interconnects/protocols?
•  Assessment Methodology
   –  Evaluation on primitive-level micro-benchmarks (see the sketch below)
      •  Latency, Bandwidth
      •  1GigE, 10GigE, IPoIB (32Gbps), RDMA (32Gbps)
      •  Java/Scala-based environment
   –  Evaluation on typical workloads
      •  GroupByTest
      •  1GigE, 10GigE, IPoIB (32Gbps)
      •  Spark 0.9.1
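As a rough indication of what a primitive-level latency test looks like in a Java/Scala environment, here is a minimal sockets-based ping-pong sketch (our own illustration, not the benchmark used in this work); for 1GigE/10GigE/IPoIB the interconnect is selected simply by the address the server host name resolves to, while RDMA numbers would come from a verbs-level benchmark instead.

```scala
import java.net.{ServerSocket, Socket}

object PingPongLatency {
  val Iterations = 1000

  // Server: echo every message back to the client.
  def server(port: Int): Unit = {
    val listener = new ServerSocket(port)
    val conn = listener.accept()
    val in = conn.getInputStream
    val out = conn.getOutputStream
    val buf = new Array[Byte](4 * 1024 * 1024)
    var n = in.read(buf)
    while (n >= 0) {
      out.write(buf, 0, n); out.flush()
      n = in.read(buf)
    }
    conn.close(); listener.close()
  }

  // Client: send a msgSize-byte message, wait for the echo, and report the
  // average one-way latency in microseconds.
  def client(host: String, port: Int, msgSize: Int): Double = {
    val sock = new Socket(host, port)
    sock.setTcpNoDelay(true)
    val in = sock.getInputStream
    val out = sock.getOutputStream
    val msg = new Array[Byte](msgSize)
    val start = System.nanoTime()
    for (_ <- 0 until Iterations) {
      out.write(msg); out.flush()
      var got = 0
      while (got < msgSize) got += in.read(msg, got, msgSize - got)
    }
    sock.close()
    (System.nanoTime() - start) / 1000.0 / Iterations / 2.0
  }
}
```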
Evaluation on Primitive-level Micro-benchmarks
[Figure: latency for small messages (1 B to 2 KB) and large messages (4 KB to 4 MB), and peak bandwidth (MBps), for 1GigE, 10GigE, IPoIB (32Gbps), and RDMA (32Gbps)]
•  Obviously, IPoIB and 10GigE are much better than 1GigE.
•  Compared to other interconnects/protocols, RDMA can significantly
   –  reduce the latencies for all message sizes
   –  improve the peak bandwidth

Can these benefits of High-Performance Networks be achieved in Apache Spark?
Evaluation on Typical Workloads
[Figure: GroupBy time (sec) for 4-10 GB data sizes (annotated improvements of 65% and 22%), and network throughput (MB/s) in send and receive sampled over 20 s, for 1GigE, 10GigE, and IPoIB (32Gbps)]
•  For High-Performance Networks,
   –  the execution time of GroupBy is significantly improved
   –  the network throughputs are much higher for both send and receive
•  Spark can benefit from high-performance interconnects to a greater extent!

Can RDMA further benefit Spark performance compared with IPoIB and 10GigE?
Outline
•  Introduction
•  Problem Statement
•  Proposed Design
•  Performance Evaluation
•  Conclusion & Future Work
Architecture Overview
•  Design Goals
   –  High performance
   –  Keep the existing Spark architecture and interface intact
   –  Minimal code changes
•  Approaches
   –  Plug-in based approach to extend the shuffle framework in Spark (see the sketch below)
      •  RDMA Shuffle Server + RDMA Shuffle Fetcher
      •  100 lines of code changes inside the original Spark files
   –  RDMA-based Shuffle Engine
[Figure: Spark applications (Scala/Java/Python) run tasks over Spark's BlockManager and BlockFetcherIterator; the shuffle layer provides Java NIO (default), Netty (optional), and RDMA (plug-in) Shuffle Servers and Fetchers; the sockets-based paths use Java Sockets over 1/10 GigE or IPoIB (QDR/FDR), while the RDMA-based Shuffle Engine (Java/JNI) runs over native InfiniBand (QDR/FDR)]
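A minimal sketch of what the plug-in approach amounts to, using hypothetical trait/class names and a hypothetical configuration key (not the actual Spark 0.9.1 internals that were modified): a configuration switch selects among the shuffle implementations while the rest of Spark stays unchanged.

```scala
import java.nio.ByteBuffer

// Hypothetical fetcher-side interface; the real plug-in hooks into Spark's
// BlockFetcherIterator instead.
trait ShuffleFetcher {
  def fetchBlocks(host: String, port: Int, blockIds: Seq[String]): Iterator[(String, ByteBuffer)]
}

class NioShuffleFetcher extends ShuffleFetcher {      // default, Java NIO sockets
  def fetchBlocks(host: String, port: Int, blockIds: Seq[String]) = ???
}

class RdmaShuffleFetcher extends ShuffleFetcher {     // plug-in, delegates to the JNI engine
  def fetchBlocks(host: String, port: Int, blockIds: Seq[String]) = ???
}

object ShuffleFetcherFactory {
  // "spark.shuffle.rdma.enabled" is an illustrative key, not a real Spark setting.
  def create(conf: Map[String, String]): ShuffleFetcher =
    if (conf.getOrElse("spark.shuffle.rdma.enabled", "false").toBoolean)
      new RdmaShuffleFetcher
    else
      new NioShuffleFetcher
}
```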
SEDA-based Data Shuffle Plug-ins
•  SEDA: Staged Event-Driven Architecture
   –  A set of stages connected by queues
   –  A dedicated thread pool is in charge of processing the events on the corresponding queue
   –  Stages: Listener, Receiver, Handlers, Responders
•  Performs admission control on these event queues
•  Achieves high throughput by maximally overlapping the different processing stages while maintaining the default task-level parallelism in Spark (see the sketch below)
[Figure: RDMA-based Shuffle Engine server pipeline. The Listener accepts connections into a connection pool; the Receiver receives requests and enqueues them on the request queue; Handlers dequeue requests, read the data blocks, and enqueue responses; Responders dequeue responses and send them]
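To make the staged structure concrete, the sketch below shows a generic SEDA stage in Scala (a queue plus a dedicated thread pool) and how Handler and Responder stages could be chained; the class and method names are our own, not the plug-in's actual code.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue}

// One SEDA stage: a dedicated thread pool drains the stage's event queue and
// forwards each result to the next stage.
class Stage[In, Out](threads: Int, handle: In => Out, next: Out => Unit) {
  private val queue = new LinkedBlockingQueue[In]()
  private val pool  = Executors.newFixedThreadPool(threads)

  def enqueue(event: In): Unit = queue.put(event)          // admission point

  def start(): Unit = for (_ <- 0 until threads)
    pool.execute(new Runnable {
      def run(): Unit =
        while (!Thread.currentThread().isInterrupted)
          next(handle(queue.take()))                       // take -> process -> hand off
    })

  def stop(): Unit = pool.shutdownNow()
}

object ShuffleServerSketch {
  case class Request(blockId: String)
  case class Response(blockId: String, bytes: Array[Byte])

  // Placeholders for the real block I/O and network send.
  def readBlock(r: Request): Response = Response(r.blockId, Array.empty[Byte])
  def sendResponse(r: Response): Unit = ()

  def main(args: Array[String]): Unit = {
    // Wiring that mirrors the figure: requests flow Receiver -> Handler -> Responder.
    val responder = new Stage[Response, Unit](2, sendResponse, _ => ())
    val handler   = new Stage[Request, Response](4, readBlock, responder.enqueue)
    responder.start(); handler.start()
    handler.enqueue(Request("shuffle_0_0_0"))              // a Receiver stage would do this
  }
}
```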
RDMA-based Shuffle Engine
•  Connection Management
   –  Alternative designs
      •  Pre-connection
         –  Hides the connection-setup overhead in the initialization stage
         –  Before the actual communication, processes pre-connect to each other
         –  Sacrifices more resources to keep all of these connections alive
      •  Dynamic Connection Establishment and Sharing (sketched below)
         –  A connection is established if and only if an actual data exchange is going to take place
         –  A naive dynamic connection design is not optimal, because resources must be allocated for every data block transfer
         –  An advanced dynamic connection scheme reduces the number of connection establishments
         –  Spark uses a multi-threading approach to run multiple tasks in a single JVM, so there is a good chance to share connections!
         –  How long should a connection be kept alive for possible re-use?
         –  Timeout mechanism for connection destruction
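The sketch below illustrates the dynamic connection establishment, sharing, and timeout idea; RdmaConnection is a stand-in for the JNI engine's native endpoint, and the 30-second idle timeout is an assumed value, not one taken from this work.

```scala
import java.util.concurrent.ConcurrentHashMap

// Stand-in for a natively managed RDMA endpoint (created through JNI in the
// real engine).
class RdmaConnection(val target: String) {
  @volatile var lastUsedMs: Long = System.currentTimeMillis()
  def send(chunk: Array[Byte]): Unit = { lastUsedMs = System.currentTimeMillis() }
  def close(): Unit = ()                                   // release native resources
}

class ConnectionPool(idleTimeoutMs: Long = 30000L) {
  private val conns = new ConcurrentHashMap[String, RdmaConnection]()

  // All tasks (threads) in one executor JVM that talk to the same target share
  // a single connection, created only when a transfer actually takes place.
  def get(target: String): RdmaConnection =
    conns.computeIfAbsent(target, t => new RdmaConnection(t))

  // Called periodically: destroy connections that have been idle longer than
  // the timeout so their resources can be reclaimed.
  def reapIdle(): Unit = {
    val now = System.currentTimeMillis()
    conns.forEach { (target, conn) =>
      if (now - conn.lastUsedMs > idleTimeoutMs) {
        conn.close()
        conns.remove(target)
      }
    }
  }
}
```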
RDMA-based Shuffle Engine
•  Data Transfer
   –  Each connection is used by multiple tasks (threads) to transfer data concurrently
   –  Packets over the same communication lane may go to different entities on both the server and fetcher sides
   –  Alternative designs
      •  Perform sequential transfers of complete blocks over a communication lane, which keeps the order
         –  Causes long wait times for tasks that are ready to transfer data over the same connection
      •  Non-blocking and out-of-order data transfer (sketched below)
         –  Chunk the data blocks
         –  Non-blocking sending over shared connections
         –  Out-of-order packet communication
         –  Guarantees both performance and ordering
         –  Works efficiently with the dynamic connection management and sharing mechanism
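The sketch below shows the chunking and out-of-order reassembly idea (request ID plus sequence number); the chunk size, field names, and the simplified synchronization are our assumptions, not details of the actual implementation.

```scala
import java.util.concurrent.ConcurrentHashMap

// Every chunk carries a request ID (which block transfer it belongs to) and a
// sequence number (its position within that block).
case class Chunk(requestId: Long, seqNo: Int, totalChunks: Int, payload: Array[Byte])

object ChunkedTransfer {
  val ChunkSize = 64 * 1024                                // assumed chunk size

  // Sender side: split one shuffle block into chunks so that chunks from
  // different tasks can be interleaved over the same shared connection.
  def split(requestId: Long, block: Array[Byte]): Seq[Chunk] = {
    val pieces = block.grouped(ChunkSize).toSeq
    pieces.zipWithIndex.map { case (p, i) => Chunk(requestId, i, pieces.size, p) }
  }

  // Fetcher side: chunks may arrive out of order; buffer them per request and
  // reassemble once all sequence numbers are present (per-request locking is
  // omitted for brevity).
  private val pending = new ConcurrentHashMap[Long, Array[Array[Byte]]]()

  def onChunk(c: Chunk): Option[Array[Byte]] = {
    val slots = pending.computeIfAbsent(c.requestId, _ => new Array[Array[Byte]](c.totalChunks))
    slots(c.seqNo) = c.payload
    if (slots.forall(_ != null)) {
      pending.remove(c.requestId)
      Some(slots.flatten)
    } else None
  }
}
```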
RDMA-based Shuffle Engine
•  Buffer Management
   –  On-JVM-Heap vs. Off-JVM-Heap buffer management
   –  Off-JVM-Heap (sketched below)
      •  High performance through native I/O
      •  Shadow buffers in Java/Scala
      •  Registered for RDMA communication
   –  Flexibility for upper-layer design choices
      •  Supports the connection sharing mechanism: request ID
      •  Supports in-order packet processing: sequence number
      •  Supports non-blocking send: buffer flag + callback
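The sketch below illustrates off-JVM-heap buffer management with direct ByteBuffers as the Java/Scala shadow of natively registered buffers; the pool size and the acquire/release protocol are our simplifications, and the actual RDMA registration happens inside the JNI engine.

```scala
import java.nio.ByteBuffer
import java.util.concurrent.ConcurrentLinkedQueue

class OffHeapBufferPool(bufferSize: Int, count: Int) {
  private val free = new ConcurrentLinkedQueue[ByteBuffer]()

  // Allocate all buffers off the JVM heap up front; in the real engine each
  // buffer would also be registered with the HCA for RDMA at this point.
  for (_ <- 0 until count) free.offer(ByteBuffer.allocateDirect(bufferSize))

  // Borrow a buffer for a send or receive; returns None when the pool is
  // exhausted so the caller can wait or fall back.
  def acquire(): Option[ByteBuffer] = Option(free.poll()).map { b => b.clear(); b }

  // Return a buffer once the non-blocking send completes; in the engine the
  // buffer flag + callback from the native side would trigger this.
  def release(buf: ByteBuffer): Unit = free.offer(buf)
}
```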
Outline
•  Introduction
•  Problem Statement
•  Proposed Design
•  Performance Evaluation
•  Conclusion & Future Work
Experimental Setup
•  Hardware
   –  Intel Westmere Cluster (A)
      •  Up to 9 nodes
      •  Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs and 24 GB main memory
      •  Mellanox QDR HCAs (32Gbps) + 10GigE
   –  TACC Stampede Cluster (B)
      •  Up to 17 nodes
      •  Intel Sandy Bridge (E5-2680) dual octa-core processors running at 2.70 GHz, 32 GB main memory
      •  Mellanox FDR HCAs (56Gbps)
•  Software
   –  Spark 0.9.1, Scala 2.10.4, and JDK 1.7.0
   –  GroupBy Test (see the sketch below)
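For reference, a GroupBy-style workload in Spark's RDD API looks roughly like the sketch below; the generator parameters (numMappers, numKVPairs, valSize) are placeholders and do not reproduce the exact configuration used in this evaluation.

```scala
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val numMappers  = 32        // map-side tasks (placeholder)
    val numKVPairs  = 100000    // key-value pairs generated per mapper (placeholder)
    val valSize     = 1000      // value size in bytes (placeholder)
    val numReducers = 32

    val sc = new SparkContext(new SparkConf().setAppName("GroupBySketch"))

    // Each mapper generates random key-value pairs; cache + count materializes
    // the map output before the GroupBy stage is timed.
    val pairs = sc.parallelize(0 until numMappers, numMappers).flatMap { _ =>
      val rand = new Random
      Array.fill(numKVPairs)((rand.nextInt(numKVPairs), new Array[Byte](valSize)))
    }.cache()
    pairs.count()

    // groupByKey repartitions the pairs across reducers: this is the shuffle
    // whose data path the RDMA plug-ins accelerate.
    println(pairs.groupByKey(numReducers).count())
    sc.stop()
  }
}
```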
Performance Evaluation on Cluster A
[Figure: GroupBy time (sec) for 10GigE, IPoIB (32Gbps), and RDMA (32Gbps), on 4 worker nodes / 32 cores (32M 32R) with 4-10 GB and on 8 worker nodes / 64 cores (64M 64R) with 8-20 GB]
•  For 32 cores, up to 18% improvement over IPoIB (32Gbps) and up to 36% over 10GigE
•  For 64 cores, up to 20% improvement over IPoIB (32Gbps) and up to 34% over 10GigE
Performance Evaluation on Cluster B
[Figure: GroupBy time (sec) for IPoIB (56Gbps) vs. RDMA (56Gbps), on 8 worker nodes / 128 cores (128M 128R) with 8-20 GB and on 16 worker nodes / 256 cores (256M 256R) with 16-40 GB]
•  For 128 cores, up to 83% improvement over IPoIB (56Gbps)
•  For 256 cores, up to 79% improvement over IPoIB (56Gbps)
Performance Analysis on Cluster B
•  Java-based micro-benchmark comparison
   –  Peak bandwidth: IPoIB (56Gbps) = 1741.46 MBps, RDMA (56Gbps) = 5612.55 MBps
•  Benefit of the RDMA Connection Sharing Design
   [Figure: GroupBy time (sec) for RDMA without vs. with connection sharing (56Gbps), 16-40 GB on 16 nodes]
   –  By enabling connection sharing, the design achieves a 55% performance benefit for the 40 GB data size on 16 nodes
Outline
•  Introduction
•  Problem Statement
•  Proposed Design
•  Performance Evaluation
•  Conclusion & Future Work
Conclusion and Future Work
•  Three major conclusions
   –  RDMA and high-performance interconnects can benefit Spark.
   –  The plug-in based approach with SEDA-/RDMA-based designs provides both performance and productivity.
   –  Spark applications can benefit from an RDMA-enhanced design.
•  Future Work
   –  Continuously update this package with newer designs and carry out evaluations with more Spark applications on systems with high-performance networks
   –  Will make this design publicly available through the HiBD project
Thank You!
{luxi, rahmanmd, islamn, shankard, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/