High-performance Linux clustering, Part 1: Clustering fundamentals

Introducing the basic concepts of High Performance Computing with Linux cluster technology

Aditya Narayan (aditya_pda@hotmail.com), Founder, QCD Microsystems

Summary:  High Performance Computing (HPC) has become easier, and two reasons are the adoption of open source software concepts and the introduction and refinement of clustering technology. This first of two articles discusses the types of clusters available, uses for those clusters, reasons clusters have become popular for HPC, some fundamentals of HPC, and the role of Linux® in HPC.


Date:  27 Sep 2005
Level:  Introductory
Also available in:  Japanese

Linux clustering is popular in many industries these days. With the advent of clustering technology and the growing acceptance of open source software, supercomputers can now be created for a fraction of the cost of traditional high-performance machines.

This two-part article introduces the concepts of High Performance Computing (HPC) with Linux cluster technology and shows you how to set up clusters and write parallel programs. This part introduces the different types of clusters, uses of clusters, some fundamentals of HPC, the role of Linux, and the reasons for the growth of clustering technology. Part 2 covers parallel algorithms, how to write parallel programs, how to set up clusters, and cluster benchmarking.

Types of HPC architectures

Most HPC systems use the concept of parallelism. Many software platforms are oriented for HPC, but first let's look at the hardware aspects.

HPC hardware falls into three categories:

  • Symmetric multiprocessors (SMP)
  • Vector processors
  • Clusters

Symmetric multiprocessors (SMP)

SMP is a type of HPC architecture in which multiple processors share the same memory. (In clusters, also known as massively parallel processors (MPPs), the processors do not share the same memory; we will look at this in more detail below.) SMPs are generally more expensive and less scalable than MPPs.

Vector processors

In vector processors, the CPU is optimized to perform well with arrays or vectors; hence the name. Vector processor systems deliver high performance and were the dominant HPC architecture in the 1980s and early 1990s, but clusters have become far more popular in recent years.

Clusters

Clusters are the predominant type of HPC hardware these days; a cluster is a set of MPPs. A processor in a cluster is commonly referred to as a node and has its own CPU, memory, operating system, and I/O subsystem and is capable of communicating with other nodes. These days it is common to use a commodity workstation running Linux and other open source software as a node in a cluster.

You'll see how these types of HPC hardware differ, but let's start with clustering.

Clustering, defined

The term "cluster" can take different meanings in different contexts. This article focuses on three types of clusters:

  • Fail-over clusters
  • Load-balancing clusters
  • High-performance clusters

Fail-over clusters

The simplest fail-over cluster has two nodes: one stays active and the other stays on stand-by but constantly monitors the active one. If the active node goes down, the stand-by node takes over, allowing a mission-critical system to continue functioning.

Load-balancing clusters

Load-balancing clusters are commonly used for busy Web sites where several nodes host the same site, and each new request for a Web page is dynamically routed to a node with a lower load.
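To make the "route to the least-loaded node" idea concrete, here is a toy dispatcher function (my own illustration, not from the article); production load balancers such as the Linux Virtual Server also weigh node health, capacity, and session affinity:

    #include <stddef.h>

    /* Toy least-load dispatcher: given the current load reported by each
     * node, return the index of the node that should get the next request. */
    size_t pick_node(const double *load, size_t nodes)
    {
        size_t best = 0;
        for (size_t i = 1; i < nodes; i++)
            if (load[i] < load[best])
                best = i;
        return best;
    }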

High-performance clusters

These clusters are used to run parallel programs for time-intensive computations and are of special interest to the scientific community. They commonly run simulations and other CPU-intensive programs that would take an inordinate amount of time to run on regular hardware.

Figure 1 illustrates a basic cluster. Part 2 of this series shows you how to create such a cluster and write programs for it.


Figure 1. The basic cluster

Grid computing is a broad term that typically refers to a service-oriented architecture (SOA) with collaboration among loosely coupled systems. Cluster-based HPC is a special case of grid computing in which the nodes are tightly coupled. A successful, well-known project in grid computing is SETI@home, the Search for Extraterrestrial Intelligence program, which used the idle CPU cycles of a million home PCs via screen savers to analyze radio telescope data. A similar successful project is the Folding@Home project for protein-folding calculations.

Common uses of high-performance clusters

Almost every industry needs fast processing power. With the increasing availability of cheaper and faster computers, more companies are interested in reaping the technological benefits. There is no upper bound on the need for processing power; even with the rapid increase in available power, demand continues to outstrip supply.

Life sciences research

Protein molecules are long, flexible chains that can take on a virtually infinite number of 3D shapes. In nature, when put in a solvent, they quickly "fold" to their native states. Incorrect folding is believed to lead to a variety of diseases such as Alzheimer's; therefore, the study of protein folding is of fundamental importance.

One way scientists try to understand protein folding is by simulating it on computers. In nature, folding occurs quickly (in about one millionth of a second), but it is so complex that its simulation on a regular computer could take decades. This focus area is a small one in an industry with many more such areas, but it needs serious computational power.

Other focus areas from this industry include pharmaceutical modeling, virtual surgery training, condition and diagnosis visualizations, total-record medical databases, and the Human Genome Project.

Oil and gas exploration

Seismograms convey detailed information about the characteristics of the interior of the Earth and seafloors, and an analysis of the data helps in detecting oil and other resources. Terabytes of data are needed to reconstruct even small areas; this analysis obviously needs a lot of computational power. The demand for computational power in this field is so great that supercomputing resources are often leased for this work.

Other geological efforts require similar computing power, such as designing systems to predict earthquakes and designing multispectral satellite imaging systems for security work.

Graphics rendering

Manipulating high-resolution interactive graphics in engineering, such as in aircraft engine design, has always been a challenge in terms of performance and scalability because of the sheer volume of data involved. Cluster-based techniques help here: the task of painting the display screen is split among the nodes of the cluster, each of which uses its graphics hardware to render its part of the screen and transmits the pixel information to a master node that assembles the consolidated image.

These industry examples are only the tip of the iceberg; many more applications, including astrophysical simulation, weather simulation, engineering design, financial modeling, portfolio simulation, and special effects in movies, demand extensive computational resources. The demand for ever more power is here to stay.

How Linux and clusters have changed HPC

Before cluster-based computing, the typical supercomputer was a vector processor that typically cost over a million dollars because of its specialized hardware and software.

With Linux and other freely available open source software components for clustering and improvements in commodity hardware, the situation now is quite different. You can build powerful clusters with a very small budget and keep adding extra nodes based on need.

The GNU/Linux operating system (Linux) has spurred the adoption of clusters on a large scale. Linux runs on a wide variety of hardware, and high-quality compilers and other software like parallel filesystems and MPI implementations are freely available for Linux. With Linux, users can also customize the kernel for their workload. Linux is a recognized favorite platform for building HPC clusters.

Understanding hardware: Vector vs. cluster

To understand HPC hardware, it is useful to contrast the vector computing paradigm with the cluster paradigm. These are competing technologies (Earth Simulator, a vector supercomputer, is still among the top 10 fastest supercomputers).

Fundamentally, both vector and scalar processors execute instructions based on a clock; what sets them apart is the ability of vector processors to parallelize computations involving vectors (such as matrix multiplication), which are commonly used in High Performance Computing. To illustrate this, suppose you have two double arrays a and b and want to create a third array x such that x[i]=a[i]+b[i].
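As a concrete reference point (my own sketch, not part of the original article), the scalar version of this computation is just a loop; a vector processor speeds it up by overlapping the internal stages of consecutive additions rather than finishing each one before starting the next:

    #include <stddef.h>

    /* Element-wise addition: x[i] = a[i] + b[i].
     * On a plain scalar CPU each addition completes before the next begins;
     * a vector unit pipelines the internal stages of successive additions. */
    void array_add(const double *a, const double *b, double *x, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            x[i] = a[i] + b[i];
    }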

Any floating-point operation like addition or multiplication is achieved in a few discrete steps:

  • Exponents are adjusted
  • Significands are added
  • The result is checked for rounding errors and so on

A vector processor parallelizes these internal steps by using a pipelining technique. Suppose there are six steps (as in IEEE arithmetic hardware) in a floating-point addition as shown in Figure 2.


Figure 2. A six-step pipeline with IEEE arithmetic hardware

A vector processor does these six steps in parallel -- if the ith array elements being added are in their 4th stage, the vector processor executes the 3rd stage for the (i+1)th element, the 2nd stage for the (i+2)th, and so on. As you can see, for a six-step floating-point addition, the speedup factor will be very close to six, because (except at the start and end, when not all six stages are busy) all stages are active at any given instant (shown in red in Figure 2). A big benefit is that this parallelization happens behind the scenes, and you need not explicitly code it in your programs.

For the most part, all six steps can be executed in parallel, leading to an almost six-fold improvement in performance. The arrows indicate the operations on the ith array index.
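A rough back-of-the-envelope model of that claim (my own sketch, in LaTeX notation): if a k-stage pipeline processes n element pairs, the unpipelined time is about nk cycles while the pipelined time is about k + (n - 1) cycles, so

    \[
      \text{speedup} \;=\; \frac{nk}{k + n - 1} \;\longrightarrow\; k
      \quad \text{as } n \to \infty,
    \]

which for k = 6 approaches the six-fold figure quoted above.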

Compared to vector processing, cluster-based computing takes a fundamentally different approach. Instead of using specially optimized vector hardware, it uses standard scalar processors, but in large numbers, to perform several computations in parallel.

Some features of clusters are as follows:

  • Clusters are built using commodity hardware and cost a fraction of what vector processors do. In many cases, the price is lower by more than an order of magnitude.
  • Clusters use a message-passing paradigm for communication, and programs have to be explicitly coded to make use of distributed hardware.
  • With clusters, you can add more nodes to the cluster based on need.
  • Open source software components and Linux lead to lower software costs.
  • Clusters have a much lower maintenance cost (they take up less space, draw less power, and need less cooling).

Parallel programming and Amdahl's Law

Software and hardware go hand in hand when it comes to achieving high performance on a cluster. Programs must be written to explicitly take advantage of the underlying hardware, and existing non-parallel programs must be re-written if they are to perform well on a cluster.

A parallel program does many things at once. Just how many depends on the problem at hand. Suppose 1/N of the total time taken by a program is in a part that cannot be parallelized, and the rest (1-1/N) is in the parallelizable part (see Figure 3).


Figure 3. Illustrating Amdahl's Law

In theory you could apply an infinite amount of hardware to do the parallel part in zero time, but the sequential part will see no improvement. As a result, the best you can achieve is to execute the program in 1/N of the original time, but no faster. In parallel programming, this fact is commonly referred to as Amdahl's Law.

Amdahl's Law governs the speedup of using parallel processors on a problem versus using only one serial processor. Speedup is defined as the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many processors):

    S = T(1) / T(j)

Where T(j) is the time it takes to execute the program when using j processors.

In Figure 3, with enough nodes for parallel processing, T'par can be made close to zero, but Tseq does not change. At best, the parallel version cannot be faster than the original by more than 1 + Tpar/Tseq.
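In symbols (my restatement, in LaTeX notation, using the Tseq and Tpar labels from Figure 3): a serial run takes Tseq + Tpar, and with j processors the parallelizable part shrinks to roughly Tpar/j, giving

    \[
      S(j) \;=\; \frac{T_{\text{seq}} + T_{\text{par}}}{T_{\text{seq}} + T_{\text{par}}/j}
      \;\le\; 1 + \frac{T_{\text{par}}}{T_{\text{seq}}},
    \]

with the bound approached as j grows without limit.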

The real hard work in writing a parallel program is to make N as large as possible. But there is an interesting twist: you normally attempt bigger problems on more powerful computers, and usually the proportion of time spent in the sequential parts of the code decreases with increasing problem size (as you tend to modify the program and increase the parallelizable portion to make use of the available resources). Therefore, the value of N automatically becomes large. (See the re-evaluation of Amdahl's Law in the Resources section later in this article.)

Approaches to parallel programming

Let's look at two parallel programming approaches: the distributed memory approach and the shared memory approach.

Distributed memory approach

It is useful to think of a master-slave model here:

  • The master node divides the work between several slave nodes.
  • Slave nodes work on their respective tasks.
  • Slave nodes intercommunicate among themselves if they have to.
  • Slave nodes return their results to the master.
  • The master node assembles the results, further distributes work, and so on.

Obvious practical problems in this approach stem from the distributed-memory organization. Because each node has access to only its own memory, data structures must be duplicated and sent over the network if other nodes want to access them, leading to network overhead. Keep these shortcomings and the master-slave model in mind in order to write effective distributed-memory programs.
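As a hedged illustration of this message-passing style (my own sketch, not code from the article; it assumes an MPI implementation such as MPICH or LAM/MPI is installed on the cluster), the following C program has rank 0 act as the master: it scatters an array across the nodes, each node sums its slice, and the partial sums travel back over the network to be combined:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define N 1000000                    /* total problem size; assume it divides evenly */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int chunk = N / nprocs;
        double *a = NULL;
        double *slice = malloc(chunk * sizeof(double));

        if (rank == 0) {                 /* the master prepares the full array */
            a = malloc(N * sizeof(double));
            for (int i = 0; i < N; i++)
                a[i] = 1.0;              /* dummy data */
        }

        /* The master hands an equal slice to every node (including itself). */
        MPI_Scatter(a, chunk, MPI_DOUBLE, slice, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        double partial = 0.0, total = 0.0;
        for (int i = 0; i < chunk; i++)  /* each node works on its own slice */
            partial += slice[i];

        /* Partial results are sent back and combined on the master. */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f\n", total);

        free(slice);
        free(a);
        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched with, say, mpirun -np 8, the same binary runs on every node; all data movement is explicit in the MPI_Scatter and MPI_Reduce calls.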

Shared memory approach

In the shared-memory approach, memory is common to all processors (such as SMP). This approach does not suffer from the problems mentioned in the distributed-memory approach. Also, programming for such systems is easier since all the data is available to all processors, and it is not much different from sequential programming. The big issue with these systems is scalability: it is not easy to add extra processors.
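For contrast, here is a minimal shared-memory counterpart (again my own sketch, not from the article), written with OpenMP, a common directive-based API for SMP systems; the loop looks almost like its sequential version because every thread can see the same array:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;
        int i;

        for (i = 0; i < N; i++)
            a[i] = 1.0;                  /* dummy data */

        /* All threads share the array; the runtime splits the iterations
           among them and combines the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }

With GCC this builds with the -fopenmp flag. Note that there is no explicit data movement, which is exactly why such code is easier to write but harder to scale beyond a single machine.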

Parallel programming (like all programming) is as much art as science, always leaving room for major design improvements and performance enhancements. Parallel programming has its own special place in computing; Part 2 of this series examines parallel programming platforms and examples.

When file I/O becomes a bottleneck

Some applications frequently need to read and write large amounts of data to the disk, which is often the slowest step in a computation. Faster hard drives help, but there are times when they are not enough.

The problem becomes especially pronounced if a physical disk partition is shared between all nodes (using NFS, for example), as is popular in Linux clusters. This is where parallel filesystems come in handy.

Parallel filesystems spread the data in a file over several disks attached to multiple nodes of the cluster, known as I/O nodes. When a program tries to read a file, small portions of that file are read from several disks in parallel. This reduces the load on any given disk controller and allows it to handle more requests. (PVFS is a good example of an open source parallel filesystem; disk performance of better than 1 GB/sec has been achieved on Linux clusters using standard IDE hard disks.)
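The core idea of striping can be sketched in a few lines (my own illustration of round-robin striping in general, not PVFS's exact layout policy): with a fixed stripe size and a set of I/O nodes, each successive stripe of the file lands on the next node in turn:

    #include <stdint.h>

    /* Map a byte offset in a striped file to an I/O node and the index of
     * the stripe within that node (illustrative round-robin layout only). */
    typedef struct { uint64_t io_node; uint64_t local_stripe; } stripe_loc;

    stripe_loc locate(uint64_t offset, uint64_t stripe_size, uint64_t io_nodes)
    {
        uint64_t stripe_index = offset / stripe_size;
        stripe_loc loc;
        loc.io_node      = stripe_index % io_nodes;   /* which I/O node      */
        loc.local_stripe = stripe_index / io_nodes;   /* which stripe on it  */
        return loc;
    }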

PVFS is available as a Linux kernel module and can also be built into the Linux kernel. The underlying concept is simple (see Figure 4):

  • A meta-data server stores information on where different parts of the file are located.
  • Multiple I/O nodes store pieces of the file (any regular underlying filesystem like ext3 can be used by PVFS).

Figure 4. How PVFS works

When a compute node in a cluster wants to access a file in this parallel filesystem, it goes through the following steps:

  • It requests a file as usual, and the request goes to the underlying PVFS filesystem.
  • PVFS sends a request to the meta-data server (steps 1 and 2 in Figure 4), which informs the requesting node about the location of the file among the various I/O nodes.
  • Using this information, the compute node communicates directly with all the relevant I/O nodes to get all the pieces of the file (step 3).

All this is transparent to the calling application; the underlying complexity of making requests to all the I/O nodes, ordering the file's contents, and so on is handled by PVFS.

A good thing about PVFS is that it allows binaries meant for regular filesystems to run on it without modification -- this is somewhat of an exception in the parallel programming arena. (Some other parallel filesystems are mentioned in Resources.)
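In other words, ordinary I/O code needs no changes. The fragment below (my own sketch; /mnt/pvfs is a hypothetical mount point) uses nothing but standard C buffered I/O, and whether the striping described above happens underneath depends only on where the path is mounted:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *path = "/mnt/pvfs/results.dat";  /* hypothetical PVFS mount */
        double data[1024] = { 0.0 };

        FILE *fp = fopen(path, "wb");                /* plain, portable I/O     */
        if (fp == NULL) {
            perror("fopen");
            return EXIT_FAILURE;
        }
        if (fwrite(data, sizeof(double), 1024, fp) != 1024)
            perror("fwrite");
        fclose(fp);
        return 0;
    }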


Resources

Learn

  • Build your own cluster:
    • Building a Linux HPC Cluster with xCAT (IBM Redbook, September 2002) guides system architects and engineers toward a basic understanding of cluster technology, terminology, and the installation of a Linux High Performance Computing (HPC) cluster.
    • Linux HPC Cluster Installation (IBM Redbook, June 2001) focuses on xCAT (xCluster Administration Tools) for installation and administration.
    • "Building an HPC Cluster with Linux, pSeries, Myrinet, and Various Open Source Software" (IBM Redpaper, July 2004) explains how to bring a collection of pSeries nodes and a Myrinet interconnect to a state where a true HPC production workload can be run in a Linux clustered environment.
    • "Build a heterogeneous cluster with coLinux and openMosix" (developerWorks, February 2005) demonstrates how to build a mixed or hybrid HPC Linux cluster.
    • Linux Clustering with CSM and GPFS (IBM Redbook, January 2004) focuses on the Cluster System Management and the General Parallel Filesystem for Linux clusters.

  • Visit the SETI Project, and learn about SETI@home.

  • For more on protein folding, visit the Folding@Home Project, and get an introduction to protein folding.

  • Learn about IBM's Deep Computing Capacity on Demand, computing centers designed to handle compute-intensive tasks for petroleum and other industries.

  • John L. Gustafson explains a re-evaluation of Amdahl's Law.

  • Learn how to achieve a gigabyte-per-second I/O throughput with commodity IDE disks using the PVFS.

  • The five-part series, "Build a digital animation system" (developerWorks, July 2005), showcases how to set up an HPC cluster you can use to run a professional animation system.

  • Find more resources for Linux developers in the developerWorks Linux zone.

  • To learn more about Grid computing, read New to Grid computing and check out the Grid zone on developerWorks.

Get products and technologies

  • Download the Parallel Virtual Filesystem.

  • Download xCAT on IBM alphaWorks, a tool kit for deploying and administering Linux clusters.

  • Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

  • Build your next development project on Linux with IBM trial software, available for download directly from developerWorks.


Discuss

  • Get involved in the developerWorks community by participating in developerWorks blogs.


About the author

Aditya Narayan holds a BS/MS in Physics from Indian Institute of Technology, Kanpur, and founded QCD Microsystems. He has expertise in Windows and Linux internals, WebSphere, and enterprise platforms like J2EE and .NET. Aditya spends most of his time in New York. You can reach him at aditya_pda@hotmail.com.