Spectrum Scale on the SoftLayer Cloud
IBM Reference Architecture
Version 1.4
December 6, 2016
Authors: Sam Bigger

Table of Contents

Document history
    Revision history
Introduction
Business problem and Business value
    Business problem
    Business value
Requirements
    Functional requirements
Architectural overview
IBM Spectrum Scale Storage Cloud Service
Component model
Operational model
    Deployment/management node
    Compute node
    Network connections
    Spectrum Scale Storage Clusters in the SoftLayer Cloud
    Description of Spectrum Scale Storage DMSS Server configuration
    The Spectrum Scale Storage Building Block
    Active File Management (AFM)
    Building Block Limitations
    Management software
Deployment considerations
Resources
    References
Notices
    Trademarks
Document history

Revision history

Date of this revision: Monday, December 5, 2016
Date of next revision: (date)

Revision Number: V1.0
Revision Date: 12/5/14
Summary of Changes: Updates from Spectrum Scale Storage to Spectrum Scale naming
Changes marked: No

Introduction

This document describes the reference architecture for IBM Spectrum Scale Storage (GPFS) on the SoftLayer Cloud Service, a remote cloud technical computing and cloud storage architecture that supports high-speed remote storage using a parallel file system, as well as small HPC compute-storage architectures in the SoftLayer cloud.

The target audience of this document includes:

• Architects in an IT or business organization who are responsible for designing, deploying, and/or operating a cluster that would benefit from remote computation and storage capabilities that allow their clients to pay as they go, removing the need for large upfront costs as well as the common problem of having to allocate compute and storage resources early on that far exceed current needs.

• Technically-skilled application users who are looking for an easy-to-deploy and easy-to-use high-performance computing (HPC) or technical cluster that increases the performance and capabilities available for general HPC and Enterprise applications.

• Business Partners who are using the building blocks described in this document to create customized reference architectures.

Business problem and Business value

Business problem

Both HPC and Enterprise technical computing users are faced with having to estimate how much storage they will need in the future. This usually means guessing how much storage they will need in five years and making a large capital investment upfront, only to use perhaps as little as 10% of that storage in the first year of production. In other words, 90% of that large upfront investment is essentially unused or wasted for the first year, and it may not be even 80% utilized by the fifth year. That is probably a generous estimate, since most organizations estimate on the high side of their storage needs so that they do not cross the 80% mark before they can buy additional storage. Because most operations do not want to come close to using all of their storage, there is a good chance that far more storage was initially purchased than was ever needed.

In today's atmosphere of exploding data creation, and with no one wanting to delete anything, it is also common to greatly underestimate the amount of storage needed several years down the road. Who can estimate, even close to accurately, the amount of storage needed several years in advance, given today's ever-increasing storage demands and the changing characteristics of the type and size of that data? One fact remains: whether the amount of storage needed several years ahead is underestimated or overestimated, buying storage in large upfront chunks means there will always be wasted, unused storage for many months after it is initially placed in production.

Perhaps the most significant problem Spectrum Scale Storage on the SoftLayer Cloud was meant to solve is the scalability of NFS in the cloud, in terms of both size and performance.
With the explosion of data use in the past few years and the projected 800% increase over the next couple of years, NAS filers and systems using NFS servers for data storage and backups are quickly approaching their maximum limits. NFS cannot handle much more than 100 TB effectively, so performance drops significantly as capacity approaches 100 TB. This leaves the customer needing to bring in another NAS filer to provide another 100 TB of storage, but each NAS filer must then be treated as a separate file system, when customers would much rather simply add another 100 TB, as needed, to their first file system. If their storage needs grow to 2 PB, most would rather have one big 2 PB global-namespace file system than 20 separate file systems and mount points to manage and balance data across. Figure 1 illustrates how NFS in the SoftLayer cloud maxes out, in terms of bandwidth, at about 34 servers and 500 cores.

Figure 1: Scalability Concerns with NFS in SoftLayer

Another pain point for many customers is the lack of physical space to grow. Crowded labs, or older labs unable to accommodate today's power and cooling requirements, can often be the most difficult limitations to overcome.

It would therefore be desirable to keep a very large amount of data under a single namespace. It would also be desirable not to have to make impossible guesses about the amount of storage needed many years in advance, nor to pay a large initial capital investment for the unused portion of that storage, much of which may not be needed for years to come. And that is the best-case scenario: running out of storage too soon is usually an even less desirable problem for most companies. A further problem is that most file systems' performance drops dramatically once the file system is more than 75-80% full. It would be desirable if storage could be allocated and de-allocated on an as-needed, pay-as-you-go basis, so that one pays only for the amount of storage in use at any particular point in time.

Business value

IBM Spectrum Scale Storage on the Cloud Service for general HPC and Enterprise storage needs provides an expertly-designed, tightly-integrated, and performance-optimized architecture for these environments and other applications, delivered as a managed cloud service by IBM. The following is the approach to deploying Spectrum Scale Storage on the SoftLayer Cloud as part of a managed service offering:

Spectrum Scale Storage as a cloud-based service

• Allows the customer to have little or no infrastructure on premises.
• Different sales approach – essentially selling a managed service.
• The customer can maintain persistent storage and compute capacity, and flexibly expand or shrink it as required.
• Customers pay as they go, adding capacity on a monthly basis. No more guessing at future storage needs; grow right when you need to. Pay for what you are using at the time, not for what you think you will need several years from now. No long-term contracts.
• No more concerns over providing enough space, cooling, or power for new compute nodes or storage nodes and disks. This is all included in the monthly price without a long-term contract.
• Customers can have one single namespace and do not have to implement another NFS NAS filer for every 100 TB used.
• Provides Platform managed scheduling services via LSF or Symphony, included in the monthly price. These services are an option and there is no requirement to use them.
• Unlimited server and client licenses are included in the monthly price. No more hassles with paying by core or socket counts.
• Grow storage on an as-needed basis in 100 TB usable chunks, currently up to 1 PB until higher amounts are validated.
• Storage is fully replicated and highly available.
• Storage performance grows linearly with each 100 TB chunk added – inside the cloud.
• Installation is included – installed, integrated, and administered by a skilled cloud operations team.
• Optimal security – deployed on isolated, dedicated bare metal resources, compute nodes, servers, and storage hardware at a named data center for optimal I/O performance and security. No noisy-neighbor problems such as those that exist in other clouds.
• Optimized for Technical Computing and Analytics workloads.
• Advanced features include Active File Management (AFM) for replication across multiple data centers.

The business value is in the infrastructure cost savings and the business agility provided. Using the cloud as a vehicle to deliver compute and/or Spectrum Scale Storage as a service gives customers a degree of financial flexibility heretofore not provided as a "traditional" option through corporate IT. A Spectrum Scale HPC cluster can now be stood up in the cloud and rented for a specific period of time, essentially using a pay-for-use model, and then subsequently dismantled. There is no longer a need to maintain and support a dedicated cluster and its associated costs.

Requirements

The following functional requirements provide a high-level overview of the desired characteristics of the cluster architecture.

Functional requirements

Architectural overview

In technical computing, there are three major use cases for centralized computing resources:

Workstations plus a cluster: Figure 2 demonstrates a typical use case for a back-end high-performance cluster. In this case, a user prepares data on the workstation, then submits a computational job to leverage the performance and throughput of the cluster. After the job has completed, the results are viewed on a client workstation. The Spectrum Scale Storage file system is shared across all workstations and the cluster so that the application running on the cluster can access any required data. This is typically a situation where all the hardware is located at the customer's site.

Figure 2. Technical computing use case 1

Thin clients with a cluster: Figure 3 represents this model. Instead of using workstations to prepare jobs and handle post-processing, users work on a thin client. Compute-intensive simulations and graphics-intensive visualizations run in the cluster environment, while terminal and graphic output is directed to the thin client.

Figure 3. Technical computing use case 2

Hybrid clusters: Figure 4 illustrates a hybrid cluster with external shared file system support, where the shared file system is not part of the customer's main cluster but serves as a backup system to the main system. This has traditionally been supported by NFS appliances, which may or may not be physically on site with the customer's main cluster.
This figure illustrates a local, on-premises GPFS file system being backed up to the SoftLayer cloud for replication and/or disaster recovery purposes. The representation of the SoftLayer Cloud in this illustration could include clusters in multiple SoftLayer data centers spread across the world for extended high-end disaster recovery models.

Hybrid clusters essentially mean that work or data is being shared. From the perspective of using LSF for scheduling jobs with Spectrum Scale Storage, hybrid means that the compute nodes at the customer's site and those inside SoftLayer can share jobs between them. If the compute nodes at the two sites cannot share jobs, it is not a hybrid model. With regard to Spectrum Scale Storage in SoftLayer, a hybrid is one where data can be shared between the customer's on-site GPFS storage and the Spectrum Scale Storage in the SoftLayer cloud, usually using replication, such as in a stretched cluster, or using AFM.

Figure 4: Archiving/Disaster Recovery Hybrid Cloud Storage

IBM Spectrum Scale Storage Cloud Service

The IBM Spectrum Scale Storage on the SoftLayer Cloud service will often be built on top of the IBM Platform LSF Cloud Service; that is, first the IBM Platform LSF cloud service is set up and then the additional storage components are deployed to the environment. However, while customers are encouraged to use the LSF or Symphony features, there is no requirement to do so. One can mount the Spectrum Scale Storage parallel file system directly if so desired.

Component model

Figure 5 illustrates the architecture of the cluster, which addresses the requirements of use case 3.

Figure 5. Hybrid Cluster components

The software components collectively support a wide variety of computationally-intensive applications running over a cluster. To support such applications, the software shown in Figure 5 must provide a number of services.

Before any application software can run, all of the nodes must be installed with the operating system and any application-specific software. This function is provided by the provisioning engine. The user creates or uses a predefined template that describes the desired characteristics of the compute node software. The provisioning engine listens for boot requests over a selected network and installs the system with the desired operating system and application software. After the installation is complete, the target systems are eligible to run applications.

Although the compute images are able to run application software, access to these images is generally controlled by the master scheduler. This scheduler function ensures that computational resources on the compute engines are not overused, by serializing access to them. The properties of the master scheduler are generally defined during installation setup. The scheduler can be configured to direct different workloads to a subset of the job placement agents. The job placement agent starts particular workloads at the request of the master scheduler; there are multiple job placement agents on the system, one on each of the operating system images. The monitoring and resource agents report the state of the system on every operating system image back to the provisioning engine and master scheduler. This provides a mechanism for raising alerts when there is a problem and ensures that jobs are scheduled only on operating system images that are available and have resources.
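To make the scheduling flow concrete, the following is a minimal sketch of how a user might submit a batch job through IBM Platform LSF on such a cluster. The job name, queue, slot count, output path, and application binary are illustrative assumptions, not values defined by this reference architecture.

    # Submit a 32-slot batch job from the management/login node.
    # Queue name, slot count, paths, and the application are placeholders.
    bsub -J sim_run01 -q normal -n 32 \
         -o /gpfs/fs1/results/sim_run01.%J.out \
         ./my_simulation --input /gpfs/fs1/data/case01

    # Check job and host status as reported by the monitoring and placement agents.
    bjobs -u all      # list pending and running jobs
    bhosts            # per-host job slot availability

Because the shared Spectrum Scale Storage file system is mounted on every compute node, the job can read its input and write its results to the same paths regardless of which nodes the scheduler selects.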
The administration portal provides an easy-to-use mechanism to control and monitor the overall cluster, while the user portal provides easy-to-use access to the system.

Operational model

The software components described in the previous sections are implemented on the following hardware components:

The management node provides the user and administration portal functions. These portal functions are supported by the monitoring, provisioning, and scheduling functions that also run on the same node. There are one or two management nodes.

Compute nodes provide the monitoring and job placement agents. When the user requests batch execution, the agents spawn the appropriate applications on these physical nodes. Spectrum Scale Storage on the cloud can support up to hundreds of compute nodes. This number varies between data centers, depending upon available space.

Figure 6. Implementation of software components on physical hardware

These nodes are connected by a high-speed network and a shared file system. The mapping of the software functions is shown in Figure 6. Each of the components shown in Figure 6 is described in detail in the subsequent sections. The management network and provisioning network are often the same network on smaller systems.

Deployment/management node

The management node functions as a deployment node at the user site and contains all of the software components that are required for running the application in the cluster. After the management node is connected to a cluster of nodes, it provisions and deploys the compute nodes with client software. The software installed on the management node includes:

• Operating system (OS): Red Hat Enterprise Linux (RHEL) Server
• Operating system repository: RHEL repository for deploying the cluster nodes
• Workload management for scheduling jobs and tasks in the cluster environment: IBM® Platform™ LSF or IBM® Platform™ Symphony

The head or management node is the base of a clustered system using the architecture defined in this document and is supported in single and high-availability (HA) configurations.

Compute node

Where compute nodes are part of the configuration, they are architected for computationally-intensive applications. They have some local disk for the OS and for temporary storage used by running applications. The compute nodes also mount the shared Spectrum Scale Storage parallel file system using the Spectrum Scale Storage protocols. In addition to the operating system, the application and runtime libraries are installed on all compute nodes. The monitoring and resource management agents are connected to the cluster management software, and the workload management software is installed.

Network connections

There are a number of networks used in a cluster. Each of them can run on a dedicated network or share a common physical network with the others.

Management network: This network allows the management software to manage and monitor the hardware independently of the operating system, over a 1-Gigabit Ethernet interconnect. Typical hardware-level management functions include power-cycling the node, hardware status monitoring, firmware configuration, and hardware console access.
Provisioning network: This network is used for provisioning the operating system and deploying software components and applications. It is also used for monitoring and workload management, and can use a 1- or 10-Gigabit Ethernet interconnect.

Application network: This network is used mainly by applications, particularly for communication among different tasks within an application across multiple nodes. It is also used as a data path for applications to access the shared storage. The application network uses a 10-Gigabit Ethernet interconnect.

Intersite network: This network represents a special network between sites, typically used for remote replication of the local data.

These networks can be combined into one or two physical networks to minimize the network cost and cabling in some configurations. A typical deployment could be:

• A combined management and provisioning network, plus a dedicated high-speed interconnect for applications and access to shared storage.
• A combined provisioning and application network using 10-Gigabit Ethernet, plus a dedicated management network.

Both deployment options are available in the IBM Spectrum Scale Storage Cloud Service.

Spectrum Scale Storage Clusters in the SoftLayer Cloud

Spectrum Scale Storage in the SoftLayer cloud is provided using a non-shared storage paradigm. This means that all of the hardware (disks, enclosures, servers, and any compute nodes) is physical bare metal; no virtual machines are used. The network is also non-shared, in that everything for each customer is isolated on a private VLAN. The servers and disks are isolated as well, so customers do not encounter the noisy-neighbor problems that exist in other popular cloud implementations. Spectrum Scale Storage on the SoftLayer cloud is a fully integrated solution that includes all server and client licenses, as well as installation, support, and maintenance of the Spectrum Scale Storage environment.

Figure 7: SoftLayer storage hardware is physical non-shared hardware

The Spectrum Scale Storage on the SoftLayer cloud service is a managed service that is provided as part of a complete hybrid or public cloud implementation. IBM® Platform™ LSF or IBM® Platform™ Symphony is provided for job scheduling and included in the monthly price. However, there is no requirement for customers to use either of these.

The Spectrum Scale Storage HPC Cluster on the SoftLayer cloud gives the customer the ability to have a fairly large-scale supercomputer completely contained and managed within the SoftLayer cloud. Such HPC clusters can support up to hundreds of compute nodes, which are connected directly via some form of private interconnect (10 GbE today, possibly InfiniBand in the future) to the Spectrum Scale Storage parallel file system. There is currently a soft limit of 1 PB on the maximum amount of storage supported, until a larger amount has been validated; this is simply the largest configuration that has been tested to date. Nevertheless, several hundred compute nodes and 1 PB of storage is still a reasonably large setup by any standard today, especially for enterprise applications.
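As a rough illustration of what an administrator or compute node sees in such a cluster, the commands below verify cluster membership, confirm the replication settings, and mount the shared file system. The file system device name ("fs1") and mount point are assumptions for this sketch; the actual names are set when the environment is provisioned.

    # Run from any node with the Spectrum Scale (GPFS) software installed.
    mmlscluster            # show the cluster name, manager nodes, and member nodes
    mmlsfs fs1 -m -r       # show default metadata (-m) and data (-r) replica counts
    mmmount fs1 -a         # mount the file system on all nodes in the cluster
    df -h /gpfs/fs1        # confirm the single namespace is visible locally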
Figure 8: Spectrum Scale Storage HPC Cluster on SoftLayer Cloud

The initial target customers of Spectrum Scale Storage on the SoftLayer cloud are those for whom NFS is becoming a significant bottleneck. NFS is not able to scale past 100 TB in size before performance starts to fall off dramatically. This leaves customers with the problem of having to install an additional NFS appliance every time they want to add another 100 TB, which forces them to maintain a separate file system and mount point for each appliance. Most customers want all of their storage under one single namespace or mount point; they do not want a separate file system and mount point for every 100 TB of storage they add. The problem is exacerbated for the many customers who must then also distribute and balance the existing 100 TB of data over the newly acquired, empty 100 TB of space. This is rarely a simple task, because data does not generally divide neatly into even chunks. So, if they eventually need 1 PB of storage, most are not going to be happy dealing with 10 different mount points and the constant headaches of keeping the data balanced across 10 different file systems.

Spectrum Scale Storage on the SoftLayer cloud solves these problems by allowing a single namespace and single mount point for very large amounts of data. Spectrum Scale Storage is also able to automatically redistribute the existing data evenly across newly added storage. "Automatically" here means that an administrator does have to issue a command to start the redistribution, but it then completes without any further input from the customer, as shown in the sketch following the building block description below.

Figure 9: Spectrum Scale Storage on SoftLayer Cloud (Storage only)

Description of Spectrum Scale Storage DMSS Server configuration

Figure 10: Spectrum Scale Storage Building Blocks

The Spectrum Scale Storage Building Block

The Spectrum Scale Storage on the SoftLayer cloud offering is made up of one or more of a strict set of building blocks. Each building block provides 100 TB of usable storage space; "usable" means after all RAID formatting and replication have been taken into consideration. There is actually about 240 TB of raw physical disk in each building block. Any time a customer wants to add storage, it is done by adding one or more 100 TB building blocks.

Each building block is made up of two Dynamic Mass Storage Servers (DMSS), which also act as NSD servers for the Spectrum Scale Storage (GPFS) parallel file system. Each DMSS server has internal DAS storage of 36 drives. Two drives are 300 GB 10K SAS drives, which are mirrored (RAID1) and used for the server's operating system, currently CentOS 6.x. In addition, there are 34 4 TB SATA drives, four of which are used as hot spares, leaving 30 data disks. The 30 data drives are configured into three 8+2 RAID6 volumes. So, each DMSS/NSD server presents three volumes, and when the file system is created across both servers, these six volumes are organized as RAID10 at the file system level. For further data protection, the file system is set up with double replication, meaning that for every write to one DMSS server, a replica of that write is also stored on the other server. This replication is synchronous, so applications are not notified that a write has completed until both the initial write and its replica have completed.
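The following shell sketch shows, in generic Spectrum Scale (GPFS) terms, how a file system with two-way data and metadata replication across a pair of NSD servers might be created, and how a later 100 TB building block could be added and the existing data rebalanced. The node names, NSD names, device paths, and file system name are illustrative assumptions only; in this offering the actual provisioning is performed by the cloud operations team.

    # Describe the RAID6 volumes of the first building block as NSDs.
    # Each DMSS server is placed in its own failure group so that the two data
    # replicas always land on different servers.  Names and devices are placeholders;
    # the third volume per server is omitted for brevity.
    cat > nsd_bb1.stanza <<'EOF'
    %nsd: device=/dev/sdb nsd=bb1_dmss1_v1 servers=dmss1 usage=dataAndMetadata failureGroup=1
    %nsd: device=/dev/sdc nsd=bb1_dmss1_v2 servers=dmss1 usage=dataAndMetadata failureGroup=1
    %nsd: device=/dev/sdb nsd=bb1_dmss2_v1 servers=dmss2 usage=dataAndMetadata failureGroup=2
    %nsd: device=/dev/sdc nsd=bb1_dmss2_v2 servers=dmss2 usage=dataAndMetadata failureGroup=2
    EOF

    mmcrnsd -F nsd_bb1.stanza            # register the NSDs

    # -m/-r request two copies of metadata/data by default; -M/-R allow a maximum
    # of two copies; -T sets the single mount point (namespace) for the file system.
    mmcrfs fs1 -F nsd_bb1.stanza -m 2 -r 2 -M 2 -R 2 -T /gpfs/fs1
    mmmount fs1 -a                       # mount on all nodes in the cluster

    # Growing the namespace later by another 100 TB building block:
    mmadddisk fs1 -F nsd_bb2.stanza      # add the new building block's NSDs
    mmrestripefs fs1 -b                  # rebalance existing data across all disks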
Active File Management (AFM)

Active File Management is a feature that allows asynchronous replication of a Spectrum Scale Storage (GPFS) file system across data centers. If a customer already has a GPFS cluster at their own site, that cluster can use AFM in conjunction with a Spectrum Scale Storage cluster in the cloud to replicate the on-premises data to the cloud. Furthermore, since SoftLayer has many data centers around the world, customers can also use AFM to replicate data to other SoftLayer data centers for disaster recovery protection.
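As a hedged illustration of how such a replication relationship might be defined from the customer's (cache) side, the commands below create a single-writer AFM fileset whose contents are replicated asynchronously to a remote target. The file system name, fileset name, AFM mode, and target URL are assumptions for the sketch; the actual values would be chosen during the engagement.

    # Create an AFM fileset in single-writer (sw) mode that replicates to a remote
    # target over NFS.  Names and the target URL are placeholders.
    mmcrfileset fs1 dr_cache \
        -p afmMode=sw,afmTarget=nfs://dr-gateway.example.com/gpfs/drfs1/dr_home \
        --inode-space new

    # Link the fileset into the namespace; files written under this path are queued
    # and pushed asynchronously to the target.
    mmlinkfileset fs1 dr_cache -J /gpfs/fs1/dr_cache

    # Check the replication queue and state for the fileset.
    mmafmctl fs1 getstate -j dr_cache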
Building Block Limitations

While the Spectrum Scale Storage on Cloud configuration can be termed highly available, it does have some limitations. An HA system is commonly described as having no single point of failure; however, the building block version of Spectrum Scale Storage in the Cloud does have one, and that is the DMSS servers themselves. Since the disks reside inside the servers, the loss of a server results in the loss of access to that server's drives. Read access to the data continues to work, because all of the current data is available on the other server that is still running. Writes to the surviving server also complete and return I/O success to the user, and the file system logs that any replicas written while the other server is down will need to be updated when it recovers; this is simply because the replica cannot be written while its server is unavailable. Therefore, while one DMSS server of a pair is down, no writes can be made to that server's portion of the file system until it is repaired or replaced. When the failed server is restored, the disks that need to be updated are marked as recovering, and only writes are allowed to the recovering server's disks while they are in that state. No reads are served from them in this state, because their data might be stale compared with the other server's copy of the replica.

Currently, 1 PB (20 DMSS servers) is the maximum configuration that has been validated, so it is really a soft limit. Only monthly contracts are supported; there are no hourly or weekly rentals.

Management software

The management software includes cluster management, workload management, a runtime library, and a file system.

Deployment considerations

Because the architecture is optimized for a specific application and most software deployments are fully automated, a cluster can be deployed in a day or two. However, some configurations are heavily dependent on the data center selected. Users can take advantage of the performance and capability that the cluster delivers with minimal training.

Resources

References

IBM Application Ready Solutions for Technical Computing: https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=stgweb&S_PKG=ov20169

Notices

References in this document to IBM products or services do not imply that IBM intends to make them available in every country. Information is provided "AS IS" without warranty of any kind. All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of the specific Statement of Direction.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Photographs shown are of engineering prototypes. Changes may be incorporated in production models.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.

Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

© Copyright IBM Corporation 2014