High-performance Linux clustering




by Aditya Narayan, founder, QCD Microsystems. First published by IBM at IBM developerWorks Linux (www.ibm.com/developer/Linux). All rights retained by IBM and the author.
High-performance computing (HPC) has become easier, and two reasons are the adoption of open source software concepts and the introduction and refinement of clustering technology. This first of two articles discusses the types of clusters available, uses for those clusters, the reasons clusters have become popular for HPC, some fundamentals of HPC, and the role of Linux in HPC.

Clustering fundamentals

Linux clustering is popular in many industries these days. With the advent of clustering technology and the growing acceptance of open source software, supercomputers can now be created for a fraction of the cost of traditional high-performance machines.

This two-part article introduces the concepts of high-performance computing (HPC) with Linux cluster technology and shows you how to set up clusters and write parallel programs. This part introduces the different types of clusters, uses of clusters, some fundamentals of HPC, the role of Linux, and the reasons for the growth of clustering technology. Part 2 covers parallel algorithms, how to write parallel programs, how to set up clusters, and cluster benchmarking.

Types of HPC architectures
Most HPC systems use the concept of parallelism. Many software platforms are oriented toward HPC, but first let's look at the hardware aspects.

HPC hardware falls into three categories:

  • Symmetric multiprocessors (SMP)
  • Vector processors
  • Clusters
Symmetric multiprocessors (SMP):
SMP is a type of HPC architecture in which multiple processors share the same memory. (In clusters, also known as massively parallel processors (MPPs), they don't share the same memory; we will look at this in more detail.) SMPs are generally more expensive and less scalable than MPPs.

Vector processors:
In vector processors, the CPU is optimized to perform well with arrays or vectors; hence the name. Vector processor systems deliver high performance and were the dominant HPC architecture in the 1980s and early 1990s, but clusters have become far more popular in recent years.

Clusters:
Clusters are the predominant type of HPC hardware these days; a cluster is a set of MPPs. A processor in a cluster is commonly referred to as a node and has its own CPU, memory, operating system, and I/O subsystem and is capable of communicating with other nodes. These days it is common to use a commodity workstation running Linux and other open source software as a node in a cluster.

You'll see how these types of HPC hardware differ, but let's start with clustering.

Clustering, defined
The term "cluster" can take different meanings in different contexts. This article focuses on three types of clusters:

  • Fail-over clusters
  • Load-balancing clusters
  • High-performance clusters
Fail-over clusters:
The simplest fail-over cluster has two nodes: one stays active and the other stays on stand-by but constantly monitors the active one. In case the active node goes down, the stand-by node takes over, allowing a mission-critical system to continue functioning.

Load-balancing clusters:
Load-balancing clusters are commonly used for busy Web sites where several nodes host the same site, and each new request for a Web page is dynamically routed to a node with a lower load.

High-performance clusters:
These clusters are used to run parallel programs for time-intensive computations and are of special interest to the scientific community. They commonly run simulations and other CPU-intensive programs that would take an inordinate amount of time to run on regular hardware.

Figure 1 illustrates a basic cluster.



Figure 1. The basic cluster


Grid computing is a broad term that typically refers to a service-oriented architecture (SOA) with collaboration among loosely coupled systems. Cluster-based HPC is a special case of grid computing in which the nodes are tightly coupled. A successful, well-known project in grid computing is SETI@home, the Search for Extraterrestrial Intelligence program, which used the idle CPU cycles of a million home PCs via screen savers to analyze radio telescope data. A similar successful project is the Folding@Home project for protein-folding calculations.


Common uses of high-performance clusters

Almost every industry needs fast processing power. With the increasing availability of cheaper and faster computers, more companies are interested in reaping the technological benefits. There is no upper boundary to the needs of computer processing power; even with the rapid increase in power, the demand is considerably more than what's available.

Life sciences research

Protein molecules are long flexible chains that can take on a virtually infinite number of 3D shapes. In nature, when put in a solvent, they quickly "fold" to their native states. Incorrect folding is believed to lead to a variety of diseases like Alzheimer's; therefore, the study of protein folding is of fundamental importance.

One way scientists try to understand protein folding is by simulating it on computers. In nature, folding occurs quickly (in about one millionth of a second), but it is so complex that its simulation on a regular computer could take decades. This focus area is a small one in an industry with many more such areas, but it needs serious computational power.

Other focus areas from this industry include pharmaceutical modeling, virtual surgery training, condition and diagnosis visualizations, total-record medical databases, and the Human Genome Project.

Oil and gas exploration

Seismograms convey detailed information about the characteristics of the interior of the Earth and seafloors, and an analysis of the data helps in detecting oil and other resources. Terabytes of data are needed to reconstruct even small areas; this analysis obviously needs a lot of computational power. The demand for computational power in this field is so great that supercomputing resources are often leased for this work.

Other geological efforts require similar computing power, such as designing systems to predict earthquakes and designing multispectral satellite imaging systems for security work.

Graphics rendering

Manipulating high-resolution interactive graphics in engineering, such as in aircraft engine design, has always been a challenge in terms of performance and scalability because of the sheer volume of data involved. Cluster-based techniques have been helpful in this area: the task of painting the display screen is split among various nodes of the cluster, which use their graphics hardware to do the rendering for their part of the screen and transmit the pixel information to a master node that assembles the consolidated image for the screen.

These industry examples are only the tip of the iceberg; many more applications including astrophysical simulation, weather simulation, engineering design, financial modeling, portfolio simulation, and special effects in movies demand extensive computational resources. The demand for increasingly more power is here to stay.


How Linux and clusters have changed HPC

Before cluster-based computing, the typical supercomputer was a vector processor that could typically cost over a million dollars due to the specialized hardware and software.

With Linux and other freely available open source software components for clustering and improvements in commodity hardware, the situation now is quite different. You can build powerful clusters with a very small budget and keep adding extra nodes based on need.

The GNU/Linux operating system (Linux) has spurred the adoption of clusters on a large scale. Linux runs on a wide variety of hardware, and high-quality compilers and other software like parallel filesystems and MPI implementations are freely available for Linux. Also with Linux, users have the ability to customize the kernel for their workload. Linux is a recognized favorite platform for building HPC clusters.


Understanding hardware: Vector vs. cluster

To understand HPC hardware, it is useful to contrast the vector computing paradigm with the cluster paradigm. These are competing technologies (Earth Simulator, a vector supercomputer, is still among the top 10 fastest supercomputers).

Fundamentally, both vector and scalar processors execute instructions based on a clock; what sets them apart is the ability of vector processors to parallelize computations involving vectors (such as matrix multiplication), which are commonly used in high-performance computing. To illustrate this, suppose you have two double arrays a and b and want to create a third array x such that x[i]=a[i]+b[i].

Any floating-point operation like addition or multiplication is achieved in a few discrete steps:

  • Exponents are adjusted
  • Significands are added
  • The result is checked for rounding errors and so on

A vector processor parallelizes these internal steps by using a pipelining technique. Suppose there are six steps (as in IEEE arithmetic hardware) in a floating-point addition as shown in Figure 2.


Figure 2. A six-step pipeline with IEEE arithmetic hardware

A vector processor does these six steps in parallel -- if the ith array elements being added are in their 4th stage, the vector processor will execute the 3rd stage for the (i+1)th element, the 2nd stage for the (i+2)th, and so on. As you can see, for a six-step floating-point addition, the speedup factor will be very close to six times (at start and end, not all six stages are active) as all stages are active at any given instant (shown in red in Figure 2). A big benefit is that parallelization is happening behind the scenes, and you need not explicitly code this in your programs.

For the most part, all six steps can be executed in parallel, leading to an almost six-fold improvement in performance. The arrows indicate the operations on the ith array index.
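To make this concrete, the loop below is the kind of computation being described (a minimal sketch written for illustration): on a vector machine, the floating-point pipeline overlaps the stages of successive additions automatically, while a scalar CPU steps through the iterations one at a time. The source code is identical in both cases.

/* vector_add.c -- the element-wise addition x[i] = a[i] + b[i] discussed above. */
#include <stdio.h>
#define N 1000

int main(void) {
    double a[N], b[N], x[N];
    int i;

    for (i = 0; i < N; i++) {   /* set up some sample data */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    for (i = 0; i < N; i++)     /* the loop a vector processor pipelines */
        x[i] = a[i] + b[i];

    printf("x[N-1] = %f\n", x[N - 1]);
    return 0;
}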

Compared to vector processing, cluster-based computing takes a fundamentally different approach. Instead of using specially optimized vector hardware, it uses standard scalar processors, but in large numbers to perform several computations in parallel.

Some features of clusters are as follows:

  • Clusters are built using commodity hardware and cost a fraction of what vector processors cost. In many cases, the price is lower by more than an order of magnitude.
  • Clusters use a message-passing paradigm for communication, and programs have to be explicitly coded to make use of distributed hardware.
  • With clusters, you can add more nodes to the cluster based on need.
  • Open source software components and Linux lead to lower software costs.
  • Clusters have a much lower maintenance cost (they take up less space, take less power, and need less cooling).

Parallel programming and Amdahl's Law

Software and hardware go hand in hand when it comes to achieving high performance on a cluster. Programs must be written to explicitly take advantage of the underlying hardware, and existing non-parallel programs must be re-written if they are to perform well on a cluster.

A parallel program does many things at once. Just how many depends on the problem at hand. Suppose 1/N of the total time taken by a program is in a part that cannot be parallelized, and the rest (1-1/N) is in the parallelizable part (see Figure 3).


Figure 3. Illustrating Amdahl's Law

In theory you could apply an infinite amount of hardware to do the parallel part in zero time, but the sequential part will see no improvements. As a result, the best you can achieve is to execute the program in 1/N of the original time, but no faster. In parallel programming, this fact is commonly referred to as Amdahl's Law.

Amdahl's Law governs the speedup of using parallel processors on a problem versus using only one serial processor. Speedup is defined as the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many processors):

    S = T(1) / T(j)

Where T(j) is the time it takes to execute the program when using j processors.

In Figure 3, with enough nodes for parallel processing, T'par can be made close to zero, but Tseq does not change. At best, the parallel version cannot be faster than the original by more than 1 + Tpar/Tseq.

The real hard work in writing a parallel program is to make N as large as possible. But there is an interesting twist to it. You normally attempt bigger problems on more powerful computers, and usually the proportion of the time spent on the sequential parts of the code decreases with increasing problem size (as you tend to modify the program and increase the parallelizable portion to optimize the available resources). Therefore, the value of N automatically becomes large. (See the re-evaluation of Amdahl's Law in the Resources section later in this article.)
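To see the arithmetic at work, here is a small sketch (an illustration for this discussion, with an assumed serial fraction): with serial fraction s and j processors, S = T(1)/T(j) = 1 / (s + (1 - s)/j), which approaches 1/s -- that is, 1 + Tpar/Tseq -- as j grows.

/* amdahl.c -- tabulate the Amdahl speedup bound for an assumed serial fraction. */
#include <stdio.h>

static double amdahl_speedup(double serial_fraction, int processors) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors);
}

int main(void) {
    double s = 0.10;   /* assume 10% of the run time is inherently sequential */
    int procs[] = { 1, 2, 4, 8, 16, 64, 1024 };
    int i;

    for (i = 0; i < 7; i++)
        printf("j = %4d   speedup = %6.2f\n", procs[i], amdahl_speedup(s, procs[i]));

    printf("upper bound as j grows: %.2f\n", 1.0 / s);
    return 0;
}

No matter how many processors you add, the speedup never exceeds 1/s, which is the whole point of the law.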


Approaches to parallel programming

Let's look at two parallel programming approaches: the distributed memory approach and the shared memory approach.

Distributed memory approach

It is useful to think of a master-slave model here:

  • The master node divides the work between several slave nodes.
  • Slave nodes work on their respective tasks.
  • Slave nodes intercommunicate among themselves if they have to.
  • Slave nodes return their results to the master.
  • The master node assembles the results, further distributes work, and so on.

Obvious practical problems in this approach stem from the distributed-memory organization. Because each node has access to only its own memory, data structures must be duplicated and sent over the network if other nodes want to access them, leading to network overhead. Keep these shortcomings and the master-slave model in mind in order to write effective distributed-memory programs.
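As an illustration of the master-slave model (a minimal sketch, assuming at least two MPI processes), the following program has rank 0 hand each slave a range of integers to sum and then assemble the partial results; note that the work descriptions and results travel over the network, which is exactly the overhead described above.

/* master_slave.c -- a minimal master-slave sketch using MPI (illustrative only).
   Compile with mpicc; run with, for example, 'mpirun -np 4 master_slave'. */
#include <stdio.h>
#include <mpi.h>

#define N 10000L   /* sum the integers 1..N */

int main(int argc, char **argv) {
    int rank, size, s;
    long range[2], part, grand_total, chunk;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                        /* this sketch needs a master and at least one slave */
        if (rank == 0) printf("run with at least two processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {                       /* master */
        chunk = N / (size - 1);
        grand_total = 0;
        /* Divide the work between the slave nodes. */
        for (s = 1; s < size; s++) {
            range[0] = (s - 1) * chunk + 1;
            range[1] = (s == size - 1) ? N : s * chunk;
            MPI_Send(range, 2, MPI_LONG, s, 0, MPI_COMM_WORLD);
        }
        /* Assemble the partial results returned by the slaves. */
        for (s = 1; s < size; s++) {
            MPI_Recv(&part, 1, MPI_LONG, s, 0, MPI_COMM_WORLD, &st);
            grand_total += part;
        }
        printf("sum of 1..%ld = %ld\n", N, grand_total);
    } else {                               /* slave */
        long i;
        part = 0;
        MPI_Recv(range, 2, MPI_LONG, 0, 0, MPI_COMM_WORLD, &st);
        for (i = range[0]; i <= range[1]; i++)
            part += i;
        MPI_Send(&part, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}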

Shared memory approach

In the shared-memory approach, memory is common to all processors (such as SMP). This approach does not suffer from the problems mentioned in the distributed-memory approach. Also, programming for such systems is easier since all the data is available to all processors and is not much different from sequential programming. The big issue with these systems is scalability: it is not easy to add extra processors.
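For comparison, here is what the shared-memory style can look like using OpenMP, one common way to program SMP systems in C (a sketch for illustration; OpenMP itself is not covered in this article). Every thread sees the same array, so nothing has to be copied or sent anywhere, and the code stays close to its sequential form.

/* omp_sum.c -- a shared-memory sketch using OpenMP (one common SMP approach).
   Compile with, for example, 'gcc -fopenmp omp_sum.c -o omp_sum'. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N];

int main(void) {
    double sum = 0.0;
    int i;

    for (i = 0; i < N; i++)
        a[i] = 0.5 * i;

    /* All threads share the array; the runtime splits the loop among them. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}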

Parallel programming (like all programming) is as much art as science, always leaving room for major design improvements and performance enhancements. Parallel programming has its own special place in computing; Part 2 of this series examines parallel programming platforms and examples.


When file I/O becomes a bottleneck

Some applications frequently need to read and write large amounts of data to the disk, which is often the slowest step in a computation. Faster hard drives help, but there are times when they are not enough.

The problem becomes especially pronounced if a physical disk partition is shared between all nodes (using NFS, for example), as is popular in Linux clusters. This is where parallel filesystems come in handy.

Parallel filesystems spread the data in a file over several disks attached to multiple nodes of the cluster, known as I/O nodes. When a program tries to read a file, small portions of that file are read from several disks in parallel. This reduces the load on any given disk controller and allows it to handle more requests. (PVFS is a good example of an open source parallel filesystem; disk performance of better than 1 GB/sec has been achieved on Linux clusters using standard IDE hard disks.)

PVFS is available as a Linux kernel module and can also be built into the Linux kernel. The underlying concept is simple (see Figure 4):

  • A meta-data server stores information on where different parts of the file are located.
  • Multiple I/O nodes store pieces of the file (any regular underlying filesystem like ext3 can be used by PVFS).

Figure 4. How PVFS works

When a compute node in a cluster wants to access a file in this parallel filesystem, it goes through the following steps:

  • It requests a file as usual, and the request goes to the underlying PVFS filesystem.
  • PVFS sends a request to the meta-data server (steps 1 and 2 in Figure 4), which informs the requesting node about the location of the file among the various I/O nodes.
  • Using this information, the compute node communicates directly with all the relevant I/O nodes to get all the pieces of the file (step 3).

All this is transparent to the calling application; the underlying complexity of making requests to all the I/O nodes, ordering the file's contents, and so on is taken care of by PVFS.

A good thing about PVFS is that it allows binaries meant for regular filesystems to run on it without modification -- this is somewhat of an exception in the parallel programming arena. (Some other parallel filesystems are mentioned in Resources.)
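For instance, an ordinary POSIX-style program like the sketch below (the mount point /mnt/pvfs and the file name are made up for the example) reads a file with the usual calls; if that path happens to live on a PVFS volume, the pieces are fetched from the I/O nodes behind the scenes and the binary needs no changes.

/* read_file.c -- plain buffered file I/O; nothing here is PVFS-specific.
   If /mnt/pvfs is a PVFS mount, this same unmodified binary reads a file
   that is actually striped across several I/O nodes. */
#include <stdio.h>

int main(void) {
    const char *path = "/mnt/pvfs/dataset.bin";  /* hypothetical file on a PVFS mount */
    char buf[65536];
    size_t n, total = 0;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += n;                              /* PVFS assembles the pieces for us */
    printf("read %lu bytes from %s\n", (unsigned long)total, path);
    fclose(fp);
    return 0;
}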


Resources

Learn
  • Build your own cluster:
    • Building a Linux HPC Cluster with xCAT (IBM Redbook, September 2002) guides system architects and engineers toward a basic understanding of cluster technology, terminology, and the installation of a Linux High Performance Computing (HPC) cluster.
    • Linux HPC Cluster Installation (IBM Redbook, June 2001) focuses on xCAT (xCluster Administration Tools) for installation and administration.
    • "Building an HPC Cluster with Linux, pSeries, Myrinet, and Various Open Source Software" (IBM Redpaper, July 2004) explains how to bring a collection of pSeries nodes and a Myrinet interconnect to a state where a true HPC production workload can be run in a Linux clustered environment.
    • "Build a heterogeneous cluster with coLinux and openMosix" (developerWorks, February 2005) demonstrates how to build a mixed or hybrid HPC Linux cluster.
    • Linux Clustering with CSM and GPFS (IBM Redbook, January 2004) focuses on the Cluster System Management and the General Parallel Filesystem for Linux clusters.

  • Visit the SETI Project, and learn about SETI@home.
  • For more on protein folding, visit the Folding@Home Project, and get an introduction to protein folding.
  • Learn about IBM's Deep Computing Capacity on Demand, computing centers designed to handle compute-intensive tasks for petroleum and other industries.
  • John L. Gustafson explains a re-evaluation of Amdahl's Law.
  • Learn how to achieve a gigabyte-per-second I/O throughput with commodity IDE disks using PVFS.
  • The five-part series, "Build a digital animation system" (developerWorks, July 2005), showcases how to set up an HPC cluster you can use to run a professional animation system.
  • Find more resources for Linux developers in the developerWorks Linux zone.
  • To learn more about Grid computing, read New to Grid computing and check out the Grid zone on developerWorks.

Get products and technologies
  • Download the Parallel Virtual Filesystem.
  • Download xCAT on IBM alphaWorks, a tool kit for deploying and administering Linux clusters.
  • Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2, Lotus, Rational, Tivoli, and WebSphere.
  • Build your next development project on Linux with IBM trial software, available for download directly from developerWorks.


Discuss
  • Get involved in the developerWorks community by participating in developerWorks blogs.

High-performance Linux clustering, Part 2: Build a working cluster -- writing parallel programs and configuring your system

by Aditya Narayan, founder, QCD Microsystems. First published by IBM at IBM developerWorks Linux (www.ibm.com/developer/linux). All rights retained by IBM and the author.
High-performance computing (HPC) has become easier, and two reasons are the adoption of open source software concepts and the introduction and refinement of clustering technology. This second of two articles discusses parallel programming using MPI and gives an overview of cluster management and benchmarking. It also shows you how to set up a Linux cluster using OSCAR, an open source project for setting up robust clusters.

Part 1 of this series, Clustering fundamentals, discusses the types of clusters, uses of clusters, HPC fundamentals, and reasons for the growth of clustering technology in high-performance computing.

This article covers parallel algorithms and shows you how to write parallel programs, set up clusters, and benchmark clusters. We look at parallel programming using MPI and the basics of setting up a Linux cluster. In this article, meet OSCAR, an open source project that helps you set up robust clusters. Also, get an overview of cluster management and benchmarking concepts, complete with detailed steps to run the standard LINPACK tests on a cluster.

If you have installed Linux, you will be able to install and troubleshoot a Linux cluster after you read this article. And the helpful links in Resources below will help you learn more about clustering.

Clusters and parallel programming platforms
As you saw in Part 1, HPC is mostly about parallel programming. Parallel programming is a fairly well-established field, and several programming platforms and standards have evolved around it over the past two decades.

The two common hardware platforms used in HPC are shared memory systems and distributed memory systems. For details, refer to Part 1.

On shared memory systems, High Performance Fortran is a language suited for parallel programming. It makes effective use of data parallelism and can act on entire arrays at once by executing instructions on different indexes of an array in different processors. Consequently, this provides automatic parallelization with minimal effort on your part. (The Jamaica project is an example where a standard Java program is re-factored using a special compiler to generate multithreaded code. The resulting code then automatically takes advantage of SMP architecture and executes in parallel.)

On distributed memory systems, the situation is radically different because the memory is distributed; you must write code that is aware of the underlying distributed nature of the hardware and use explicit message passing to exchange messages between different nodes. The Parallel Virtual Machine (PVM) was once a popular parallel programming platform for this, but lately MPI has become the de facto standard for writing parallel programs for clusters.

High-quality implementations for MPI are freely available for Fortran, C, and C++ for Linux. Two popular MPI implementations are

  • MPICH
  • LAM/MPI
Later in this article, we'll set up an MPICH-based Linux cluster and see an example of an MPI-based program.

Creating a simple Linux cluster
One of the most interesting things about clustering is that you can build Linux-based clusters with minimal effort if you have basic Linux installation and troubleshooting skills. Let's see how this is done.

For our cluster, we'll use MPICH and a set of regular Linux workstations. For simplicity, and to emphasize fundamentals, we'll build just the bare minimum system that you can use to run a parallel program in a clustered environment.

The seven steps in this section show how to build our bare-bones system. Building robust clusters and managing them involve more effort and will be covered later in this article.

Step 1:
You need at least two Linux machines if you want a real cluster. Two VMware images will also do just fine. (With VMware, obviously you do not expect any performance benefits. In fact, there will definitely be a performance hit because the CPU will be shared.) Make sure these machines can ping each other by name. If not, add appropriate entries in /etc/hosts.

Step 2:
Install the GNU C compiler and GNU FORTRAN compiler.

Step 3a:
Configure SSH for all your nodes to allow it to execute commands without being asked for a password. The aim is to get something like ssh -n host whoami to work without being asked for a password. SSH will be used as the way to communicate between different machines. (You can use rsh also for this purpose.)


Step 3b

ssh-keygen -f /tmp/key -t dsa will give you a private key in the file /tmp/key and a public key in the file /tmp/key.pub.

Step 3c

If you are building your cluster as root and will be running your programs as root (obviously do this only during experimentation), then copy the private key into the file /root/.ssh/id_dsa and the public key into the file /root/.ssh/authorized_keys on all nodes of your cluster.

To confirm that everything works fine, execute the following command: ssh -n hostname 'date' and see if the command executes without errors. You should test this for all nodes to be sure.

Note: You may have to configure your firewall to allow the nodes to communicate with each other.

Step 4a

Now we'll install MPICH. Download the UNIX version of MPICH from the anl.gov Web site (see Resources for a link). The following is a quick overview.

Step 4b

Let's say you downloaded mpich.tar.gz in /tmp:

cd /tmp
tar -xvzf mpich.tar.gz (Let's say you get a directory /tmp/mpich-1.2.6 as a result)
cd /tmp/mpich-1.2.6

Step 4c

./configure -rsh=ssh -- this tells MPICH to use ssh as its communication mechanism.

Step 4d

make -- at the end of this step you should have your MPICH installation ready for use.

Step 5

To make MPICH aware of all your nodes, edit the file /tmp/mpich-1.2.6/util/machines/machines.LINUX and add the hostnames of all nodes to this file. If you add nodes at a later stage, edit this file again.

Step 6

Copy the directory /tmp/mpich-1.2.6 to all nodes in your cluster.

Step 7

Run a few test programs from the examples directory:

  • cd /tmp/mpich-1.2.6/utils/examples
  • make cpi
  • /tmp/mpich-1.2.6/bin/mpirun -np 4 cpi -- instructs MPICH to run it on four processors; if your setup has fewer than four, don't worry: MPICH will create processes to compensate for the lack of physical hardware.

Your cluster is ready! As you can see, all the heavy lifting is being done by the MPI implementation. As noted earlier, this is a bare-bones cluster, and much of it is based on doing manual work by making sure machines can communicate with each other (ssh is configured, MPICH is manually copied, etc.).
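If you'd like to try a program of your own at this point, a minimal sanity check such as the following sketch can be compiled with the mpicc wrapper built in step 4 and launched with mpirun just like cpi; it simply reports which node each process landed on.

/* hello_mpi.c -- a minimal check that the cluster launches MPI processes.
   Compile: /tmp/mpich-1.2.6/bin/mpicc -o hello_mpi hello_mpi.c
   Run:     /tmp/mpich-1.2.6/bin/mpirun -np 4 hello_mpi */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes in total? */
    MPI_Get_processor_name(host, &len);      /* which node am I running on? */

    printf("process %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}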


Open source cluster application resources

Clearly, it will be hard to maintain the cluster above. It is not convenient to copy files to every node, set up SSH and MPI on every node that gets added, make appropriate changes when a node is removed, and so on.

Fortunately, excellent open source resources can help you set up and manage robust production clusters. OSCAR and Rocks are two examples. Most of the things we did to create our cluster are handled by these programs in an automated manner.

Figure 1 is a schematic of a simple cluster.


Figure 1. A simple sample cluster

OSCAR supports automated Linux installations using PXE (Preboot Execution Environment) and helps with these functions:

  • Automated Linux installation on compute nodes.
  • DHCP and TFTP configuration (for Linux installation using PXE). Most new computers have a BIOS that allows you to boot the machine using a DHCP server. The BIOS has a built-in DHCP client that requests an operating system image, which is transferred to it using TFTP. This Linux image is created by OSCAR, and the DHCP and TFTP installation and configuration are also handled by OSCAR.
  • SSH configuration.
  • Automatic NFS setup.
  • MPI installation and configuration (both MPICH and LAM/MPI).
  • PVM installation and configuration (if you want to use PVM instead of MPI).
  • Private subnet configuration between head node and compute nodes.
  • Scheduler (Open PBS and Maui) installation for automatic management of multiple users submitting jobs to the cluster.
  • Ganglia installation for performance monitoring.
  • Provision to add/remove nodes.

Currently OSCAR supports several versions of Red Hat Linux; for other supported distributions, visit the OSCAR Web site (see Resources for a link). You may have to tweak a few of the scripts based on the errors you get during setup.


Creating a Linux cluster using OSCAR

Let's start with OSCAR sources and build a fully functional cluster. Say you have two or more regular workstations at your disposal with network connectivity. Make one of these the head node and the rest the compute nodes.

As we did when building the Linux cluster, we'll go through the steps to be performed on the head node. OSCAR will configure all other nodes automatically, including the OS installation. (See the OSCAR installation guide; the following is a conceptual overview of the installation process.)

Step 1

Install Linux on the head node. Make sure you install an X server also.

Step 2

mkdir /tftpboot, mkdir /tftpboot/rpm. Copy all RPMs from the installation CD to the /tftpboot/rpm directory. These RPMs are used to build the client image. Not all RPMs will eventually be required, but it's good to have them to automatically build the image.

Step 3

It helps to ensure that MySQL is installed and correctly configured and that you can access MySQL from Perl, since OSCAR stores all its configuration information in MySQL and uses Perl to access it. This step is usually optional and OSCAR does this for you, but sometimes this step fails, especially if you are installing OSCAR on an unsupported distribution.

Step 4

Download OSCAR sources and compile:

configure
make install

Step 5

Start the OSCAR wizard. Assuming you want the cluster to use eth1 for connectivity between the cluster nodes, use /opt/oscar/install_cluster eth1.

Step 6

At this stage, go through the OSCAR screens step by step. Be sure to execute the steps in correct order:

  1. Select packages to customize your OSCAR install. If you are not familiar with these packages, ignore this for the moment.
  2. Install OSCAR packages.
  3. Build the client image. This is what the compute nodes will use.
  4. Define OSCAR clients. This defines the compute nodes. You need to specify the number of nodes you want to use and the subnet they will use. If you're not sure now how many nodes you'll have, you can modify this at a later stage.
  5. Map the MAC addresses of the different nodes to IP addresses. For this step, each node must be booted using the PXE network boot option in the BIOS.

Step 7

Finally, run the tests. If all goes well, each test should succeed. Sometimes the tests fail on the first try even when nothing is wrong. You can always run the tests by manually executing the test scripts from /opt/oscar.

If you want to add new nodes now, start the OSCAR wizard again and add nodes. The Linux OS will be installed on these nodes automatically by OSCAR using PXE.

Now that the cluster is ready, you can run parallel programs, add or remove nodes based on need, and monitor the status with Ganglia.


Managing the cluster

When it comes to managing a cluster in a production environment with a large user base, job scheduling and monitoring are crucial.

Job scheduling

MPI will start and stop processes on the various nodes, but this is limited to one program. On a typical cluster, multiple users will want to run their programs, and you must use scheduling software to ensure optimal use of the cluster.

A popular scheduling system is OpenPBS, which is installed automatically by OSCAR. Using it, you can create queues and submit jobs to them. Within OpenPBS, you can also create sophisticated job-scheduling policies.

OpenPBS also lets you view executing jobs, submit jobs, and cancel jobs. It also allows control over the maximum amount of CPU time available to a particular job, which is quite useful for an administrator.

Monitoring

An important aspect of managing clusters is monitoring, especially if your cluster has a large number of nodes. Several options are available, such as Ganglia (which comes with OSCAR) and Clumon.

Ganglia has a Web-based front end and provides real-time monitoring for CPU and memory usage; you can easily extend it to monitor just about anything. For example, with simple scripts you can make Ganglia report on CPU temperatures, fan speeds, etc. In the following sections, we will write some parallel programs and run them on this cluster.


A parallel algorithm

Parallel programming has its own set of parallel algorithms that take advantage of the underlying hardware. Let's take a look at one such algorithm. Let's say one node has a set of N integers to add. The time to do this in the regular way scales as O(N) (if it takes 1 ms to add 100 integers, it takes roughly 2 ms to add 200 integers, and so on).

It may seem hard to improve upon the linear scaling of this problem, but quite remarkably, there is a way. Let's say a program is executing in a cluster with four nodes, each node has an integer in its memory, and the aim is to add these four numbers.

Consider these steps:

  1. Node 2 sends its integer to node 1, and node 4 sends its integer to node 3. Now nodes 1 and 3 have two integers each.
  2. The integers are added on these 2 nodes.
  3. This partial sum is sent by node 3 to node 1. Now node 1 has two partial sums.
  4. Node 1 adds these partial sums to get the final sum.

As you can see, if you originally had 2^N numbers, this approach could add them in N steps. Therefore the algorithm scales as O(log2 N), which is a big improvement over O(N) for the sequential version. (If it takes 1 ms to add 128 numbers, it takes about 8/7 of that, roughly 1.14 ms, to add 256 numbers using this approach. The sequential approach would have taken 2 ms.)

This approach also frees up half the compute nodes at each step. This algorithm is commonly referred to as recursive halving and doubling and is the underlying mechanism behind the class of reduce function calls in MPI, which we discuss next.
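Before that, here is a sketch of the scheme above using explicit point-to-point messages (written purely for illustration): each node starts with one integer, and after roughly log2(size) steps rank 0 holds the total. In practice you would simply call MPI_Reduce, which performs this kind of combining for you.

/* tree_sum.c -- recursive halving/doubling with explicit messages (illustrative only). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, partial, incoming, step;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    partial = rank + 1;                    /* each node starts with one integer */

    /* At each step, half of the still-active nodes send their partial sum to a
       partner and drop out; the other half receive and accumulate. */
    for (step = 1; step < size; step *= 2) {
        if (rank % (2 * step) == step) {
            MPI_Send(&partial, 1, MPI_INT, rank - step, 0, MPI_COMM_WORLD);
            break;                         /* this node is now free */
        } else if (rank % (2 * step) == 0 && rank + step < size) {
            MPI_Recv(&incoming, 1, MPI_INT, rank + step, 0, MPI_COMM_WORLD, &st);
            partial += incoming;
        }
    }

    if (rank == 0)                         /* rank 0 ends up with the total */
        printf("sum of 1..%d = %d\n", size, partial);

    MPI_Finalize();
    return 0;
}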

Parallel algorithms exist for a lot of practical problems in parallel programming.


Parallel programming using MPI to multiply a matrix by a vector

Now that you know the fundamentals of parallel programming platforms and have a cluster ready, let's write a program that makes good use of the cluster. Instead of a traditional "hello world", let's jump straight to the real thing and write an MPI-based program to multiply a matrix by a vector.

We'll use the algorithm described in the section on parallel algorithms to solve this problem elegantly. Let's say we have a 4x4 matrix that we want to multiply with a vector (a 4x1 matrix). We'll make a slight change to the standard technique for multiplying matrices so that the previous algorithm can be applied here. See Figure 2 for an illustration of this technique.


Figure 2. Multiplying a matrix by a vector

Instead of multiplying the first row by the first column and so on, we'll multiply all elements in the first column by the first element of the vector, the second column elements by the second element of the vector, and so on. This way, you get a new 4x4 matrix as a result. After this, you add all four elements in a row for each row and get the 4x1 matrix, which is our result.

The MPI API has several functions that you can apply directly to solve this problem, as shown in Listing 1.


Listing 1. Matrix multiplication using MPI
/*
To compile: 'mpicc -g -o matrix matrix.c'
To run: 'mpirun -np 4 matrix'
"-np 4" specifies the number of processors.
*/
#include <stdio.h>
#include <mpi.h>
#define SIZE 4

int main(int argc, char **argv) {
    int j;
    int rank, size, root;
    float X[SIZE];
    float X1[SIZE];
    float Y1[SIZE];
    float Y[SIZE][SIZE];
    float Z[SIZE];
    float z;
    root = 0;

    /* Initialize MPI. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Initialize X and Y on root node. Note the row/col alignment. This is specific to C. */
    if (rank == root) {
        Y[0][0] = 1;  Y[1][0] = 2;  Y[2][0] = 3;  Y[3][0] = 4;
        Y[0][1] = 5;  Y[1][1] = 6;  Y[2][1] = 7;  Y[3][1] = 8;
        Y[0][2] = 9;  Y[1][2] = 10; Y[2][2] = 11; Y[3][2] = 12;
        Y[0][3] = 13; Y[1][3] = 14; Y[2][3] = 15; Y[3][3] = 16;
        Z[0] = 1;
        Z[1] = 2;
        Z[2] = 3;
        Z[3] = 4;
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* root scatters matrix Y in 'SIZE' parts to all nodes as the matrix Y1 */
    MPI_Scatter(Y, SIZE, MPI_FLOAT, Y1, SIZE, MPI_FLOAT, root, MPI_COMM_WORLD);

    /* Z is also scattered to all other nodes from root. Since one element is sent to
       all nodes, a scalar variable z is used. */
    MPI_Scatter(Z, 1, MPI_FLOAT, &z, 1, MPI_FLOAT, root, MPI_COMM_WORLD);

    /* This step is carried out on all nodes in parallel: each node scales its
       column of the matrix by its element of the vector. */
    for (j = 0; j < SIZE; j++)
        X1[j] = z * Y1[j];

    /* The partial products are summed element-wise on the root node; this reduce
       uses the recursive halving/doubling idea and yields the result vector X. */
    MPI_Reduce(X1, X, SIZE, MPI_FLOAT, MPI_SUM, root, MPI_COMM_WORLD);

    if (rank == root) {
        printf("The resulting vector is: ");
        for (j = 0; j < SIZE; j++)
            printf("%f ", X[j]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}


Measuring performance

Clusters are built to perform, and you need to know how fast they are. It's common to think that the processor frequency determines performance. While this is true to a certain extent, it is of little value in comparing processors from different vendors or even different processor families from the same vendor, because different processors do different amounts of work in a given number of clock cycles. This was especially obvious when we compared vector processors with scalar processors in Part 1.

A more natural way to compare performance is to run some standard tests. Over the years, a test known as the LINPACK benchmark has become a gold standard when comparing performance. It was written by Jack Dongarra more than a decade ago and is still used by top500.org (see Resources for a link).

This test involves solving a dense system of N linear equations, where the number of floating-point operations is known (of the order of N^3). This test is well suited to speed-test computers meant to run scientific applications and simulations because they tend to solve linear equations at some stage or another.

The standard unit of measurement is the number of floating-point operations or flops per second (in this case, a flop is either an addition or a multiplication of a 64-bit number). The test measures the following:

  • Rpeak, theoretical peak flops. In a June 2005 report, IBM Blue Gene/L clocks at 183.5 tflops (trillion flops).
  • Nmax, the matrix size N that gives the highest flops. For Blue Gene this number is 1277951.
  • Rmax, the flops attained for Nmax. For Blue Gene, this number is 136.8 tflops.
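As a rough back-of-the-envelope helper (a sketch, not part of HPL itself; it assumes the operation count HPL conventionally uses, 2/3*N^3 + 2*N^2, and a made-up problem size and run time), the reported flops figure is simply that operation count divided by the measured wall-clock time:

/* flops.c -- estimate the LINPACK flops figure from problem size and run time. */
#include <stdio.h>

static double linpack_gflops(double n, double seconds) {
    double ops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;   /* assumed HPL convention */
    return ops / seconds / 1e9;
}

int main(void) {
    /* Hypothetical example: a problem of size N = 20000 solved in 450 seconds. */
    printf("%.2f GFLOPS\n", linpack_gflops(20000.0, 450.0));
    return 0;
}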

To appreciate these numbers, consider this: what IBM Blue Gene/L can do in one second might take your home computer up to five days.

Now let's discuss how you can benchmark your own Linux cluster. Other tests besides LINPACK are the HPC Challenge Benchmark and the NAS benchmarks.


Benchmarking your Linux cluster

To run the LINPACK benchmark on your Linux cluster, you need to get the parallel version of LINPACK and configure it for the cluster. We'll go through this process step by step.

Warning: The following uses generic linear algebra libraries; use it only as a guideline. For a true test, get libraries that have been optimized for your environment.

Step 1

Download hpl.tgz, the parallel (MPI) version of the LINPACK benchmark, from netlib.org (see Resources for a link).

Step 2

Download blas_linux.tgz, pre-compiled BLAS (Basic Linear Algebra Subprograms), also from netlib.org. For simplicity, you can use a pre-compiled reference implementation of BLAS for Linux, but for better results, you should use BLAS provided by your hardware vendor or auto-tune it using the open source ATLAS project.

Step 3

mkdir /home/linpack; cd /home/linpack (we'll install everything in /home).

Step 4

Uncompress and expand blas_linux.tgz. You should get an archive file called blas_linux.a. If you can see the file, ignore any errors you may get.

Step 5

Uncompress and expand hpl.tgz. You'll get a directory called hpl.

Step 6

Copy any of the configuration files, such as the Make.Linux_PII_FBLAS file, from hpl/setup to the hpl directory and rename the copy in the hpl directory Make.LinuxGeneric.

Step 7

Edit the file Make.LinuxGeneric as follows, changing the values to suit your environment:

TOPdir = /home/linpack/hpl
MPdir = /tmp/mpich-1.2.6
LAlib = /home/linpack/blas_linux.a
CC = /tmp/mpich-1.2.6/bin/mpicc
LINKER = /tmp/mpich-1.2.6/bin/mpif77

These five places specify the top-level directory of LINPACK from step 1, the top-level directory of your MPICH installation, and the location of the reference BLAS archive from step 2.

Step 8

Compile HPL now:

make arch=LinuxGeneric

If there are no errors, you will get two files, xhpl and HPL.dat, in /home/linpack/hpl/bin/LinuxGeneric.

Step 9

Before you run the LINPACK benchmark on your cluster, copy the entire /home/linpack directory to all the machines in your cluster. (If you created the cluster using OSCAR and have NFS shares configured, skip this step.)

Now cd to the directory with the executable you created in step 8 and run some tests (such as /tmp/mpich-1.2.6/bin/mpirun -np 4 xhpl). You should see some tests being executed, with the results presented as GFLOPS.

Note: The above will run some standard tests based on default settings for matrix sizes. Use the file HPL.dat to tune your tests. Detailed information on tuning is in the file /home/linpack/hpl/TUNING.


The IBM Blue Gene/L

Now that you've built your own cluster, let's take a quick look at Blue Gene/L, the state of the art in cluster-based supercomputers. Blue Gene/L is an engineering feat of massive proportions, and a discussion that does justice to it would be clearly outside the scope of this article. We'll just touch the surface here.

Three years ago, when the Earth Simulator, a vector processor, became the fastest supercomputer, many people predicted the demise of the cluster as a supercomputing platform and the resurrection of the vector processor; this turned out to be premature. Blue Gene/L beat the Earth Simulator decisively on the standard LINPACK benchmark.

Blue Gene/L is not built from commodity workstations, but it does use the standard clustering concepts discussed in this series. It uses an MPI implementation derived from MPICH, which we used in our cluster. It also runs standard 32-bit PowerPC CPUs (although these are based on system-on-a-chip, or SoC, technology) at 700MHz, which leads to dramatically lower cooling and power requirements than any machine in its class.

Blue Gene/L is a big machine and has 65,536 compute nodes (each of which runs a custom operating system) and 1,024 dedicated I/O nodes (running the Linux kernel). With such a large number of nodes, networking takes on special importance, and Blue Gene/L uses five different types of networks, each optimized for a particular purpose.

A wealth of information on Blue Gene/L is available in a recent IBM Journal of Research and Development devoted entirely to it. In Blue Gene/L, we see a realization of a highly scalable MPP and clear proof that the cluster-based computing paradigm is here to stay.


Resources

Learn
  • Read the first article in this two-part series, "Clustering fundamentals" (developerWorks, September 2005).
  • The Jamaica Project explains parallelizing and dynamic compilers and supercomputer optimization using parallelism.
  • Parallel Virtual Machine (PVM)
    • PVM is a software package that permits a heterogeneous collection of UNIX and/or Windows computers hooked together by a network to be used as a single large parallel computer.
    • This downloadable PostScript article explains PVM's advantages over MPI.
    • This article is an excellent comparison of PVM and MPI.

  • Message Passing Interface (MPI)
    • The MPI Forum is an open group with representatives from many organizations that define and maintain the MPI standard.
    • MPICH is a freely available, portable implementation of MPI.
    • LAM/MPI is a high-quality open-source implementation of the MPI specification intended for production as well as research use.
    • This MPI installation guide can get you started.

  • OSCAR (Open Source Cluster Application Resources) is a snapshot of the best known methods for building, programming, and using HPC clusters.
  • Rocks Cluster is an open source Linux cluster implementation.
  • OpenPBS is the original version of the Portable Batch System.
  • Clumon is a cluster-monitoring system developed at the National Center for Supercomputing Applications to keep track of its Linux super-clusters.
  • Benchmarking
    • Top500.org has a good introduction to the LINPACK benchmark.
    • And here is an excellent paper by Jack Dongarra on the HPC Challenge benchmark and a thorough introduction to it.
    • The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers.
    • HPL is a portable implementation of the high-performance LINPACK benchmark for distributed-memory computers.
    • BLAS, the Basic Linear Algebra Subprograms, are routines that provide standard building blocks for performing basic vector and matrix operations.
    • The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focusing on applying empirical techniques in order to provide portable performance. At present, it provides C and Fortran77 interfaces to a portably efficient BLAS implementation.

  • This IBM Journal of Research and Development highlights BlueGene/L.
  • Build your own
    • Building a Linux HPC Cluster with xCAT (Redbook, September 2002) guides system architects and engineers toward a basic understanding of cluster technology, terminology, and the installation of a Linux High-Performance Computing (HPC) cluster.
    • Linux HPC Cluster Installation (Redbook, June 2001) focuses on xCAT (xCluster Administration Tools) for installation and administration.
    • Building an HPC Cluster with Linux, pSeries, Myrinet, and Various Open Source Software (Redpaper, July 2004) explains how to bring a collection of pSeries nodes and a Myrinet interconnect to a state where a true HPC production workload can be run in a Linux clustered environment.
    • Build a heterogeneous cluster with coLinux and openMosix (developerWorks, February 2005) demonstrates how to build a mixed or hybrid HPC Linux cluster.
    • Linux Clustering with CSM and GPFS (Redbook, January 2004) focuses on the Cluster System Management and the General Parallel Filesystem for Linux clusters.

  • Find more resources for Linux developers in the developerWorks Linux zone.

Get products and technologies
  • Download benchmarks
    • Download the latest LINPACK results.
    • Run your own Java-based LINPACK on your PC.
    • Get the HPL benchmark.
    • Get the BLAS subroutines.
    • Download ATLAS.

  • Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2, Lotus, Rational, Tivoli, and WebSphere.
  • Build your next development project on Linux with IBM trial software, available for download directly from developerWorks.


Discuss
  • Get involved in the developerWorks community by participating in developerWorks blogs.