
High performance Linux clustering, Part 2: Build a working cluster

Writing parallel programs and configuring your system

Aditya Narayan, Founder, QCD Microsystems

Summary:  High Performance Computing (HPC) has become easier, and two reasons are the adoption of open source software concepts and the introduction and refinement of clustering technology. This second of two articles discusses parallel programming using MPI and gives an overview of cluster management and benchmarking. It also shows you how to set up a Linux® cluster using OSCAR, an open source project for setting up robust clusters.


Date:  27 Oct 2005
Level:  Intermediate


Part 1 of this series, Clustering fundamentals, discusses the types of clusters, uses of clusters, HPC fundamentals, and reasons for the growth of clustering technology in High Performance Computing.

This article covers parallel algorithms, and shows you how to write parallel programs, set up clusters, and benchmark clusters. We look at parallel programming using MPI and the basics of setting up a Linux cluster. In this article, meet OSCAR, an open source project that helps you set up robust clusters. Also, get an overview of cluster management and benchmarking concepts, complete with detailed steps to run the standard LINPACK tests on a cluster.

If you have installed Linux, you will be able to install and troubleshoot a Linux cluster after you read this article. And the helpful links in Resources below will help you learn more about clustering.

Clusters and parallel programming platforms

As you saw in Part 1, HPC is mostly about parallel programming. Parallel programming is a fairly well-established field, and several programming platforms and standards have evolved around it over the past two decades.

The two common hardware platforms used in HPC are shared memory systems and distributed memory systems. For details, refer to Part 1.

On shared memory systems, High Performance FORTRAN is a language suited for parallel programming. It makes effective use of data parallelism and can act on entire arrays at once by executing instructions on different indexes of an array in different processors. Consequently, this provides automatic parallelization with minimal effort on your part. (The Jamaica project is an example where a standard Java program is re-factored using a special compiler to generate multithreaded code. The resulting code then automatically takes advantage of SMP architecture and executes in parallel.)

On distributed memory systems, the situation is radically different because the memory is distributed; you must write code that is aware of the underlying distributed nature of the hardware and use explicit message passing to exchange messages between different nodes. Parallel Virtual Machine (PVM) was once a popular parallel programming platform for this, but lately MPI has become the de facto standard for writing parallel programs for clusters.

High-quality implementations of MPI are freely available for FORTRAN, C, and C++ on Linux. Two popular MPI implementations are:

  • MPICH
  • LAM/MPI

Later in this article, we'll set up an MPICH-based Linux cluster and see an example of an MPI-based program.

Creating a simple Linux cluster

One of the most interesting things about clustering is that you can build Linux-based clusters with minimal effort if you have basic Linux installation and troubleshooting skills. Let's see how this is done.

For our cluster we'll use MPICH and a set of regular Linux workstations. For simplicity, and to emphasize fundamentals, we'll build just the bare minimum system that you can use to run a parallel program in a clustered environment.

The seven steps in this section show how to build our bare-bones system. Building robust clusters and managing them involve more effort and will be covered later in this article.

Step 1

You need at least two Linux machines if you want a real cluster. Two VMware images will also do just fine. (With VMware, obviously you do not expect any performance benefits. In fact, there will definitely be a performance hit because the CPU will be shared.) Make sure these machines can ping each other by name. If not, add appropriate entries in /etc/hosts.
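
For example, if your two machines are named node1 and node2 (hypothetical names and addresses; use whatever matches your own network), the entries in /etc/hosts on both machines might look like this:

192.168.1.101   node1
192.168.1.102   node2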

Step 2

Install the GNU C compiler and GNU FORTRAN compiler.

Step 3a

Configure SSH for all your nodes to allow it to execute commands without being asked for a password. The aim is to get something like ssh -n host whoami to work without being asked for a password. SSH will be used as the way to communicate between different machines. (You can use rsh also for this purpose.)

Step 3b

ssh-keygen -f /tmp/key -t dsa will give you a private key in a file called key and a public key in a file called key.pub.

Step 3c

If you are building your cluster as root and will be running your programs as root (obviously do this only during experimentation), then copy the private key into the file /root/.ssh/identity and the public key into the file /root/.ssh/authorized_keys in all nodes of your cluster.
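
As a rough sketch, assuming the key pair generated in /tmp above and a second node named node2 (a hypothetical hostname), the copying can be done with standard commands like these, run as root on the first node (the scp and ssh commands will still prompt for a password at this point):

cp /tmp/key /root/.ssh/identity
cp /tmp/key.pub /root/.ssh/authorized_keys
chmod 600 /root/.ssh/identity
scp /tmp/key /tmp/key.pub node2:/tmp/
ssh node2 'cp /tmp/key /root/.ssh/identity; chmod 600 /root/.ssh/identity; cat /tmp/key.pub >> /root/.ssh/authorized_keys'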

To confirm that everything works fine, execute the following command: ssh -n hostname 'date' and see if the command executes without errors. You should test this for all nodes to be sure.

Note: You may have to configure your firewall to allow the nodes to communicate with each other.

Step 4a

Now we'll install MPICH. Download the UNIX version of MPICH from the anl.gov Web site (see Resources for a link). The following is a quick overview.

Step 4b

Let's say you downloaded mpich.tar.gz in /tmp:

cd /tmp
tar -xvzf mpich.tar.gz (Let's say you get a directory /tmp/mpich-1.2.6 as a result)
cd /tmp/mpich-1.2.6

Step 4c

./configure -rsh=ssh -- this tells MPICH to use ssh as its communication mechanism.

Step 4d

make -- at the end of this step you should have your MPICH installation ready for use.

Step 5

To make MPICH aware of all your nodes, edit the file /tmp/mpich-1.2.6/util/machines/machines.LINUX and add the hostnames of all nodes to it. If you add nodes at a later stage, edit this file again.
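
For example, with the two hypothetical nodes used earlier, the file would simply list one hostname per line:

node1
node2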

Step 6

Copy the directory /tmp/mpich-1.2.6 to all nodes in your cluster.
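
Assuming again a second node named node2, a simple way to do this is:

scp -r /tmp/mpich-1.2.6 node2:/tmp/

Repeat for every other node in your cluster.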

Step 7

Run a few test programs from the examples directory:

  • cd /tmp/mpich-1.2.6/utils/examples
  • make cpi
  • /tmp/mpich-1.2.6/bin/mpirun -np 4 cpi -- instructs MPICH to run it on four processors; if your setup has fewer than four, don't worry; MPICH will create processes to compensate for the lack of physical hardware.

Your cluster is ready! As you can see, all the heavy lifting is being done by the MPI implementation. As noted earlier, this is a bare-bones cluster, and much of the setup is manual work: making sure the machines can communicate with each other, configuring ssh, copying MPICH to each node, and so on.

Open source cluster application resources

Clearly, it will be hard to maintain the cluster above. It is not convenient to copy files to every node, set up SSH and MPI on every node that gets added, make appropriate changes when a node is removed, and so on.

Fortunately, excellent open source resources can help you set up and manage robust production clusters. OSCAR and Rocks are two examples. Most of the things we did to create our cluster are handled by these programs in an automated manner.

Figure 1 is a schematic of a simple cluster.


Figure 1. A simple sample cluster

OSCAR supports automated Linux installations using PXE (Preboot Execution Environment). OSCAR also helps with these functions:

  • Automated Linux installation on compute nodes.

  • DHCP and TFTP configuration (for Linux installation using PXE). Most new computers have a BIOS that allows you to boot the machine using a DHCP server. The BIOS has a built-in DHCP client that requests an operating system image, which is then transferred to it using TFTP. This Linux image is created by OSCAR, and the DHCP and TFTP installation and configuration are also handled by OSCAR.

  • SSH configuration.

  • Automatic NFS setup.

  • MPI installation and configuration (both MPICH and LAM/MPI).

  • PVM installation and configuration (if you want to use PVM instead of MPI).

  • Private subnet configuration between head node and compute nodes.

  • Scheduler (Open PBS and Maui) installation for automatic management of multiple users submitting jobs to the cluster.

  • Ganglia installation for performance monitoring.

  • Provision to add/remove nodes.

Currently OSCAR supports several versions of Red Hat Linux; for other supported distributions, visit the OSCAR Web site (see Resources for a link). You may have to tweak a few of the scripts based on the errors you get during setup.

Creating a Linux cluster using OSCAR

Let's start with OSCAR sources and build a fully functional cluster. Say you have two or more regular workstations at your disposal with network connectivity. Make one of these the head node and the rest the compute nodes.

As we did when building the Linux cluster, we'll go through the steps to be performed on the head node. OSCAR will configure all other nodes automatically, including the OS installation. (See the OSCAR installation guide; the following is a conceptual overview of the installation process.)

Step 1

Install Linux on the head node. Make sure you install an X server also.

Step 2

mkdir /tftpboot; mkdir /tftpboot/rpm. Copy all RPMs from the installation CD to this directory. These RPMs are used to build the client image. Not all of the RPMs will eventually be required, but it's good to have them all so the image can be built automatically.
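
As a sketch, assuming a Red Hat installation CD mounted at /mnt/cdrom (the mount point and directory layout depend on your distribution and media), the copy might look like this:

mkdir -p /tftpboot/rpm
cp /mnt/cdrom/RedHat/RPMS/*.rpm /tftpboot/rpm/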

Step 3

It helps to ensure that MySQL is installed and correctly configured and that you can access MySQL from Perl, since OSCAR stores all its configuration information in MySQL and uses Perl to access it. This step is usually optional and OSCAR does this for you, but sometimes this step fails, especially if you are installing OSCAR on an unsupported distribution.

Step 4

Download OSCAR sources and compile:

configure
make install

Step 5

Start the OSCAR wizard. Assuming you want the cluster to use eth1 for connectivity between the cluster nodes, use /opt/oscar/install_cluster eth1.

Step 6

At this stage, go through the OSCAR screens step by step. Be sure to execute the steps in correct order:

  1. Select packages to customize your OSCAR install. If you are not familiar with these packages, ignore this for the moment.

  2. Install OSCAR packages.
  3. Build the client image. This is what the compute nodes will use.
  4. Define OSCAR clients. This defines the compute nodes. You need to specify the number of nodes you want to use and the subnet they will use. If you're not sure now how many nodes you'll have, you can modify this at a later stage.
  5. Map the MAC addresses of the different nodes to IP addresses. For this step, each node must be booted using the PXE network boot option in the BIOS.

Step 7

Finally, run the tests. If all goes well, each test should succeed. Sometimes the tests fail on the first try even when nothing is wrong. You can always run the tests by manually executing the test scripts from /opt/oscar.

If you want to add new nodes now, start the OSCAR wizard again and add nodes. The Linux OS will be installed on these nodes automatically by OSCAR using PXE.

Now that the cluster is ready, you can run parallel programs, add or remove nodes based on need, and monitor their status with Ganglia.

Managing the cluster

When it comes to managing a cluster in a production environment with a large user base, job scheduling and monitoring are crucial.

Job scheduling

MPI will start and stop processes on the various nodes, but this is limited to one program. On a typical cluster, multiple users will want to run their programs, and you must use scheduling software to ensure optimal use of the cluster.

A popular scheduling system is OpenPBS, which is installed automatically by OSCAR. Using it you can create queues and submit jobs to them. Within OpenPBS, you can also create sophisticated job-scheduling policies.

OpenPBS also lets you view executing jobs, submit jobs, and cancel jobs. It also allows control over the maximum amount of CPU time available to a particular job, which is quite useful for an administrator.
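
As an illustration, a minimal OpenPBS job script for a hypothetical MPI binary called myprog might look like the following (the resource requests, paths, and names here are assumptions; adjust them to your own queues and installation):

#!/bin/sh
#PBS -N myprog
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
/tmp/mpich-1.2.6/bin/mpirun -np 4 -machinefile $PBS_NODEFILE ./myprog

You would submit this script with qsub, watch it with qstat, and cancel it with qdel.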

Monitoring

An important aspect of managing clusters is monitoring, especially if your cluster has a large number of nodes. Several options are available, such as Ganglia (which comes with OSCAR) and Clumon.

Ganglia has a Web-based front end and provides real-time monitoring for CPU and memory usage; you can easily extend it to monitor just about anything. For example, with simple scripts you can make Ganglia report on CPU temperatures, fan speeds, etc. In the following sections, we will write some parallel programs and run them on this cluster.
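
For example, Ganglia includes a command-line tool called gmetric that injects arbitrary metrics into the monitoring system; a small script run from cron could report a temperature reading (the metric name and value below are made up for illustration) with something like:

gmetric --name=cpu_temp --value=48 --type=float --units=Celsius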

A parallel algorithm

Parallel programming has its own set of parallel algorithms that take advantage of the underlying hardware. Let's take a look at one such algorithm. Let's say one node has a set of N integers to add. The time to do this in the regular way scales as O(N) (if it takes 1ms to add 100 integers, it takes roughly 2ms to add 200 integers, and so on).

It may seem hard to improve upon the linear scaling of this problem, but quite remarkably, there is a way. Let's say a program is executing in a cluster with four nodes, each node has an integer in its memory, and the aim is to add these four numbers.

Consider these steps:

  1. Node 2 sends its integer to node 1, and node 4 sends its integer to node 3. Now nodes 1 and 3 have two integers each.
  2. The integers are added on these 2 nodes.
  3. This partial sum is sent by node 3 to node 1. Now node 1 has two partial sums.
  4. Node 1 adds these partial sums to get the final sum.

As you can see, if you originally had 2^N numbers, this approach could add them in ~N steps. Therefore the algorithm scales as O(log2 N), which is a big improvement over O(N) for the sequential version. (If it takes 1ms to add 128 numbers, it takes (8/7)ms, or roughly 1.14ms, to add 256 numbers using this approach. In the sequential approach, it would have taken 2ms.)

This approach also frees up half the compute nodes at each step. This algorithm is commonly referred to as recursive halving and doubling and is the underlying mechanism behind the class of reduce function calls in MPI, which we discuss next.

Parallel algorithms exist for a lot of practical problems in parallel programming.

Parallel programming using MPI to multiply a matrix by a vector

Now that you know the fundamentals of parallel programming platforms and have a cluster ready, let's write a program that makes good use of the cluster. Instead of a traditional "hello world", let's jump straight to the real thing and write an MPI-based program that multiplies a matrix by a vector.

We'll use the algorithm described in the section on parallel algorithms to solve this problem elegantly. Let's say we have a 4X4 matrix that we want to multiply with a vector (a 4X1 matrix). We'll make a slight change to the standard technique for multiplying matrices so that the previous algorithm can be applied here. See Figure 2 for an illustration of this technique.


Figure 2. Multiplying a matrix by a vector

Instead of multiplying the first row by the first column and so on, we'll multiply all elements in the first column by the first element of the vector, second column elements by the second element of the vector, and so on. This way, you get a new 4X4 matrix as a result. After this, you add all four elements in a row for each row and get the 4X1 matrix, which is our result.
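
To see this technique on a smaller 2X2 example (numbers chosen only for illustration): to multiply the matrix with rows (1 2) and (3 4) by the vector (5 6), multiply the first column (1, 3) by 5 to get (5, 15) and the second column (2, 4) by 6 to get (12, 24). Adding up each row of the resulting matrix gives (5+12, 15+24) = (17, 39), which is exactly what the usual row-by-column multiplication produces.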

The MPI API has several functions that you can apply directly to solve this problem, as shown in Listing 1.


Listing 1. Matrix multiplication using MPI
/*
  To compile: 'mpicc -g -o matrix matrix.c'
  To run: 'mpirun -np 4 matrix'
  "-np 4" specifies the number of processors.
*/
#include <mpi.h>
#include <stdio.h>

#define SIZE 4

int main(int argc, char **argv) {
    int j;
    int rank, size, root;
    float X[SIZE];        /* final result, assembled on the root node */
    float X1[SIZE];       /* partial result computed on each node */
    float Y1[SIZE];       /* one column of Y, as received by each node */
    float Y[SIZE][SIZE];
    float Z[SIZE];
    float z;

    root = 0;

    /* Initialize MPI. This program assumes it is run with SIZE (4) processes. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Initialize Y and Z on the root node. Note the row/col alignment. This is specific to C. */
    if (rank == root) {
        Y[0][0] = 1;  Y[1][0] = 2;  Y[2][0] = 3;  Y[3][0] = 4;
        Y[0][1] = 5;  Y[1][1] = 6;  Y[2][1] = 7;  Y[3][1] = 8;
        Y[0][2] = 9;  Y[1][2] = 10; Y[2][2] = 11; Y[3][2] = 12;
        Y[0][3] = 13; Y[1][3] = 14; Y[2][3] = 15; Y[3][3] = 16;
        Z[0] = 1;
        Z[1] = 2;
        Z[2] = 3;
        Z[3] = 4;
    }

    MPI_Barrier(MPI_COMM_WORLD);

    /* root scatters matrix Y in 'SIZE' parts to all nodes as the matrix Y1. */
    MPI_Scatter(Y, SIZE, MPI_FLOAT, Y1, SIZE, MPI_FLOAT, root, MPI_COMM_WORLD);

    /* Z is also scattered to all other nodes from root. Since one element is sent to
       each node, a scalar variable z is used. */
    MPI_Scatter(Z, 1, MPI_FLOAT, &z, 1, MPI_FLOAT, root, MPI_COMM_WORLD);

    /* This step is carried out on all nodes in parallel: each node scales its
       part of Y by its element of Z. */
    for (j = 0; j < SIZE; j++) {
        X1[j] = z * Y1[j];
    }

    /* Sum the partial results from all nodes into X on the root node. MPI_Reduce
       uses the recursive halving and doubling approach described earlier. */
    MPI_Reduce(X1, X, SIZE, MPI_FLOAT, MPI_SUM, root, MPI_COMM_WORLD);

    if (rank == root) {
        printf("The result is:\n");
        for (j = 0; j < SIZE; j++)
            printf("%8.2f\n", X[j]);
    }

    MPI_Finalize();
    return 0;
}

Measuring performance

Clusters are built to perform, and you need to know how fast they are. It's common to think that the processor frequency determines performance. While this is true to a certain extent, it is of little value in comparing processors from different vendors or even different processor families from the same vendor, because different processors do different amounts of work in a given number of clock cycles. This was especially obvious when we compared vector processors with scalar processors in Part 1.

A more natural way to compare performance is to run some standard tests. Over the years a test known as the LINPACK benchmark has become a gold standard when comparing performance. It was written by Jack Dongarra more than a decade ago and is still used by top500.org (see Resources for a link).

This test involves solving a dense system of N linear equations, where the number of floating-point operations is known (of the order of N^3). This test is well suited to speed-testing computers meant to run scientific applications and simulations because they tend to solve linear equations at some stage or another.

The standard unit of measurement is the number of floating-point operations or flops per second (in this case, a flop is either an addition or a multiplication of a 64-bit number). The test measures the following:

  • Rpeak, theoretical peak flops. In a June 2005 report, IBM Blue Gene/L clocks at 183.5 tflops (trillion flops).
  • Nmax, the matrix size N that gives the highest flops. For Blue Gene this number is 1277951.
  • Rmax, the flops attained for Nmax. For Blue Gene, this number is 136.8 tflops.

To appreciate these numbers, consider this: what IBM Blue Gene/L can do in one second might take your home computer up to five days.

Now let's discuss how you can benchmark your own Linux cluster. Other tests besides LINPACK are the HPC Challenge Benchmark and the NAS benchmarks.

Benchmarking your Linux cluster

To run the LINPACK benchmark on your Linux cluster, you need to get the parallel version of LINPACK and configure it for the cluster. We'll go through this process step by step.

Warning: The following uses generic linear algebra libraries; use it only as a guideline. For a true test, get libraries that have been optimized for your environment.

Step 1

Download hpl.tgz, the parallel (MPI) version of the LINPACK benchmark, from netlib.org (see Resources for a link).

Step 2

Download blas_linux.tgz, pre-compiled BLAS (Basic Linear Algebra Subprograms), also from netlib.org. For simplicity, you can use a pre-compiled reference implementation of BLAS for Linux, but for better results, you should use BLAS provided by your hardware vendor or auto-tune it using the open source ATLAS project.

Step 3

mkdir /home/linpack; cd /home/linpack (we'll install everything in /home).

Step 4

Uncompress and expand blas_linux.tgz. You should get an archive file called blas_linux.a. If you can see the file, ignore any errors you may get.

Step 5

Uncompress and expand hpl.tgz. You'll get a directory called hpl.

Step 6

Copy any of the configuration files, such as the Make.Linux_PII_FBLAS file, from hpl/setup to the hpl directory and rename the copy in the hpl directory Make.LinuxGeneric.

Step 7

Edit the file Make.LinuxGeneric as follows, changing the values to suit your environment:

TOPdir = /home/linpack/hpl
MPdir = /tmp/mpich-1.2.6
LAlib = /home/linpack/blas_linux.a
CC = /tmp/mpich-1.2.6/bin/mpicc
LINKER = /tmp/mpich-1.2.6/bin/mpif77

These five settings specify the top-level directory of LINPACK from step 1, the top-level directory of your MPICH installation (and its compiler and linker wrappers), and the location of the reference BLAS archive from step 2.

Step 8

Compile HPL now:

make arch=LinuxGeneric

If there are no errors, you will get two files, xhpl and HPL.dat, in /home/linpack/hpl/bin/LinuxGeneric.

Step 9

Before you run the LINPACK benchmark on your cluster, copy the entire /home/linpack directory to all the machines in your cluster. (If you created the cluster using OSCAR and have NFS shares configured, skip this step.)

Now cd to the directory with the executable you created in step 8 and run some tests (such as /tmp/mpich-1.2.6/bin/mpirun -np 4 xhpl). You should see some tests being executed with the results presented as GFLOPs.

Note: The above will run some standard tests based on default settings for matrix sizes. Use the file HPL.dat to tune your tests. Detailed information on tuning is in the file /home/linpack/hpl/TUNING.
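
The layout of HPL.dat is fixed; the lines you will most often change are the problem sizes (Ns), the block sizes (NBs), and the process grid (Ps and Qs). As a rough illustration only (the values below are arbitrary examples, several other lines are omitted, and the TUNING file explains every entry), the relevant lines look something like this:

1           # of problems sizes (N)
5000        Ns
1           # of NBs
128         NBs
1           # of process grids (P x Q)
2           Ps
2           Qs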

The IBM Blue Gene/L

Now that you've built your own cluster, let's take a quick look at Blue Gene/L, the state-of-the-art in cluster-based supercomputers. Blue Gene/L is an engineering feat of massive proportions, and a discussion that does justice to it would clearly be outside the scope of this article. We'll just touch the surface here.

Three years ago when the Earth Simulator, a vector processor, became the fastest supercomputer, many people predicted the demise of the cluster as a supercomputing platform and the resurrection of the vector processor; this turned out to be premature. Blue Gene/L beat the Earth Simulator decisively on the standard LINPACK benchmark.

Blue Gene/L is not built from commodity workstations, but it does use the standard clustering concepts discussed in this series. It uses an MPI implementation derived from MPICH, the same implementation we used in our cluster. It also runs standard 32-bit PowerPC® CPUs (although these are based on system-on-a-chip or SoC technology) at 700MHz, which leads to dramatically lower cooling and power requirements than any machine in its class.

Blue Gene/L is a big machine and has 65,536 compute nodes (each of which runs a custom operating system) and 1,024 dedicated I/O nodes (running the Linux kernel). With such a large number of nodes, networking takes on special importance, and Blue Gene/L uses five different types of networks, each optimized for a particular purpose.

A wealth of information on Blue Gene/L is available in a recent IBM Systems Journal devoted entirely to it. In Blue Gene/L, we see a realization of a highly scalable MPP and clear proof that the cluster-based computing paradigm is here to stay.


Resources

Learn

  • Read the first article in this two-part series, "Clustering fundamentals" (developerWorks, September 2005).

  • The Jamaica Project explains parallelizing and dynamic compilers and supercomputer optimization using parallelism.

  • Parallel Virtual Machine (PVM)
    • PVM is a software package that permits a heterogeneous collection of UNIX® and/or Windows® computers hooked together by a network to be used as a single large parallel computer.
    • This downloadable PostScript article explains PVM's advantages over MPI.
    • This article is an excellent comparison of PVM and MPI.

  • Message Passing Interface (MPI)
    • The MPI Forum is an open group with representatives from many organizations that define and maintain the MPI standard.
    • MPICH is a freely available, portable implementation of MPI.
    • LAM/MPI is a high-quality open-source implementation of the MPI specification intended for production as well as research use.
    • This MPI installation guide can get you started.

  • OSCAR (Open Source Cluster Application Resources) is a snapshot of the best known methods for building, programming, and using HPC clusters.

  • Rocks Cluster is an open source Linux cluster implementation.

  • OpenPBS is the original version of the Portable Batch System.

  • Clumon is a cluster-monitoring system developed at the National Center for Supercomputing Applications to keep track of its Linux super-clusters.

  • Benchmarking
    • Top500.org has a good introduction to the LINPACK benchmark.
    • And here is an excellent paper by Jack Dongarra on the HPC Challenge benchmark and a thorough introduction to it.
    • The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers.
    • HPL is a portable implementation of the high-performance LINPACK benchmark for distributed-memory computers.
    • BLAS, the Basic Linear Algebra Subprograms, are routines that provide standard building blocks for performing basic vector and matrix operations.
    • The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focusing on applying empirical techniques in order to provide portable performance. At present, it provides C and Fortran77 interfaces to a portably efficient BLAS implementation.

  • This IBM Journal of Research and Development highlights BlueGene/L.

  • Build your own
    • Building a Linux HPC Cluster with xCAT (Redbook, September 2002) guides system architects and engineers toward a basic understanding of cluster technology, terminology, and the installation of a Linux High-Performance Computing (HPC) cluster.
    • Linux HPC Cluster Installation (Redbook, June 2001) focuses on xCAT (xCluster Administration Tools) for installation and administration.
    • Building an HPC Cluster with Linux, pSeries, Myrinet, and Various Open Source Software (Redpaper, July 2004) explains how to bring a collection of pSeries nodes and a Myrinet interconnect to a state where a true HPC production workload can be run in a Linux clustered environment.
    • Build a heterogeneous cluster with coLinux and openMosix (developerWorks, February 2005) demonstrates how to build a mixed or hybrid HPC Linux cluster.
    • Linux Clustering with CSM and GPFS (Redbook, January 2004) focuses on the Cluster System Management and the General Parallel Filesystem for Linux clusters.

  • Find more resources for Linux developers in the developerWorks Linux zone.

Get products and technologies

  • Download benchmarks
    • Download the latest LINPACK results.
    • Run your own Java-based LINPACK on your PC.
    • Get the HPL benchmark.
    • Get the BLAS subroutines.
    • Download ATLAS.

  • Build your next development project on Linux with IBM trial software, available for download directly from developerWorks.


Discuss

  • Get involved in the developerWorks community by participating in developerWorks blogs.


About the author

Aditya Narayan holds a BS/MS in Physics from Indian Institute of Technology, Kanpur, and founded QCD Microsystems. He has expertise in Windows and Linux internals, WebSphere, and enterprise platforms like J2EE and .NET. Aditya spends most of his time in New York. You can reach him at aditya_pda@hotmail.com.