
Installing a large Linux cluster, Part 4: Node installation and GPFS cluster configuration

Large Linux cluster storage backend

Graham White (gwhite@uk.ibm.com), Systems Management Specialist, IBM, Software GroupMandie Quartly (mandie_quartly@uk.ibm.com), IT Specialist, IBM

Summary:  Create a working Linux cluster from many separate pieces of hardware and software, including System x and IBM TotalStorage systems. Part 4 provides the second half of the instructions you need to set up the storage backend, including installing General Parallel File System (GPFS) code on each node and configuring Qlogic adapters for storage nodes. Finally, this article takes you through the steps to create a GPFS cluster.


Date:  14 Jun 2007
Level:  Advanced
Also available in:  Russian


Introduction

This is the fourth and final part of a series that covers the installation and setup of a large Linux computer cluster. The purpose of the series is to bring together in one place up-to-date information from various sources in the public domain about the process required to create a working Linux cluster from many separate pieces of hardware and software. These articles are not intended to provide the basis for the complete design of a new large Linux cluster; refer to the relevant reference materials and IBM Redbooks mentioned throughout for general architecture pointers.

This series addresses systems architects and systems engineers planning and implementing a Linux cluster using the IBM eServer Cluster 1350 framework (see Resources for more information about the framework). Some parts might also be relevant to cluster administrators for educational purposes and during normal cluster operation. Each part of the series refers to the same example installation.

Part 1 of the series provides detailed instructions for setting up the hardware for the cluster. Part 2 takes you through the next steps after hardware configuration: software installation using the IBM systems management software, Cluster Systems Management (CSM), and node installation.

Part 3 and this article describe the storage backend of the cluster, covering the storage hardware configuration and the installation and configuration of the IBM shared file system, General Parallel File System (GPFS). Part 3 takes you through the architecture of the storage system, hardware preparation, and details about setting up a Storage Area Network. This, the fourth and final part of the series, provides details about CSM specifics related to the storage backend of the example cluster, notably performing node installation for the storage system, and GPFS cluster configuration.

Detailing node installation specifics

This section details the Cluster Systems Management (CSM) specifics related to the storage backend of the example cluster. These include the installation of the GPFS code on each node and the configuration of the Qlogic adapters for the storage nodes. Note that this configuration does not have to be performed using CSM; it can be done manually. The example in this article uses CSM to almost totally automate the installation of a new server, including a storage server.

Reviewing matters of architecture

Before reading this section, you can benefit from reviewing the General cluster architecture section in Part 1 of this series. You can also benefit from reading the section on storage architecture in Part 3. An understanding of the architecture will give you the context you need to make the best use of the information that follows.

Installing storage nodes in the correct order

Installing in the correct order is necessary to work around the ROM overflow issue described later, because the xSeries 346 systems used in this configuration do not have RAID 7K cards. Complete the steps in the following order:

  1. Run the command csmsetupks -vxn on the management server.

  2. Disconnect the storage server from the SAN to avoid installation of the operating system on SAN disks, which are discovered first.

  3. Run installnode -vn on the management server.

  4. Press F1 from the console when the storage node reboots to enter the BIOS.

  5. Go into Start Options and change PXEboot from disabled to enabled for planar ethernet 1.

  6. Let the node reboot, and installation starts.

  7. Monitor installation from the terminal server, letting the node boot fully.

  8. Log into the node and monitor the CSM installation log /var/log/csm/install.log.

  9. Reboot the node when the post reboot tasks have finished.

  10. Press F1 from the console when the node restarts to enter the BIOS.

  11. Go into Start Options and change PXEboot to disabled.

  12. Plug in the SAN cables and let the node boot fully.

  13. Configure the paths to disk using MSJ as explained under Configuring paths to disk and load balancing.

Providing passwordless root access between nodes

GPFS requires that all nodes in the GPFS cluster have the ability to access each other using the root ID with no password provided. GPFS uses this internode access to allow any node in the GPFS cluster to run relevant commands on other nodes. For the example here, secure shell (ssh) is used to provide access; however, you can also use remote shell (rsh). To do this, create a cluster-wide key and associated configuration files, and distribute them with CSM following these steps:

  1. Create two new directories at /cfmroot/root/.ssh and /cfmroot/etc/ssh.

  2. Create an RSA key pair, public and private keys for authentication, by typing
    ssh-keygen -b 1024 -t rsa -f /cfmroot/etc/ssh/ssh_host_rsa_key -N "" -C "RSA_Key"

  3. Create a DSA key pair, public and private keys for authentication, by typing
    ssh-keygen -b 1024 -t dsa -f /cfmroot/etc/ssh/ssh_host_dsa_key -N "" -C "DSA_Key"

  4. Create an authorization file containing the public keys, as shown below. This is the file SSH uses to determine whether to prompt for a password.
    cat /root/.ssh/id_rsa.pub > /cfmroot/root/.ssh/authorized_keys2
    cat /root/.ssh/id_dsa.pub >> /cfmroot/root/.ssh/authorized_keys2
    cat /cfmroot/etc/ssh/ssh_host_rsa_key.pub >> /cfmroot/root/.ssh/authorized_keys2
    cat /cfmroot/etc/ssh/ssh_host_dsa_key.pub >> /cfmroot/root/.ssh/authorized_keys2

  5. Stop CSM from maintaining the known_hosts file, as shown below. This is a file containing names of hosts. If a host is listed in the file, SSH does not prompt the user for connection confirmation. CSM attempts to maintain this file, but in a fixed cluster environment with passwordless root access, this can be a hindrance.
    stopcondresp NodeFullInstallComplete SetupSSHAndRunCFM
    startcondresp NodeFullInstallComplete RunCFMToNode
    perl -pe 's!(.*update_known_hosts.*)!#$1!' -i /opt/csm/csmbin/RunCFMToNode

  6. Generate a system-wide known_hosts file. This is best done by creating a script, as shown below. Run the script and direct the output to /cfmroot/root/.ssh/known_hosts.
    #!/bin/bash
    RSA_PUB=$(cat "/cfmroot/etc/ssh/ssh_host_rsa_key.pub")
    DSA_PUB=$(cat "/cfmroot/etc/ssh/ssh_host_dsa_key.pub")
    for node in $(lsnodes); do
        ip=$(grep $node /etc/hosts | head -n 1 | awk '{print $1}')
        short=$(grep $node /etc/hosts | head -n 1 | awk '{print $3}')
        echo $ip,$node,$short $RSA_PUB
        echo $ip,$node,$short $DSA_PUB
    done

    This example script works for a single interface. You can modify it trivially to allow passwordless connection across multiple interfaces. The format of the known_hosts file is beyond the scope of this article, but it is useful to take advantage of the comma-separated host names for each line.

  7. Allow passwordless root access by linking in the generated keys, as shown below.
    cd /cfmroot/root/.ssh
    ln -s ../../etc/ssh/ssh_host_dsa_key id_dsa
    ln -s ../../etc/ssh/ssh_host_dsa_key.pub id_dsa.pub
    ln -s ../../etc/ssh/ssh_host_rsa_key id_rsa
    ln -s ../../etc/ssh/ssh_host_rsa_key.pub id_rsa.pub

  8. You might want to ensure this configuration is installed onto each system at installation time, before the operating system is rebooted for the first time. CSM makes no guarantee about the order in which post-installation tasks run, so any task that relies on this configuration being present could fail; it could also succeed intermittently, giving the impression of an inconsistent failure. For example, a GPFS post-installation script might need to add a node into the GPFS cluster and mount the GPFS file systems. One way to achieve this is to create a tar archive of all the files created here and unpack it using a CSM post-installation pre-reboot script.
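The tar-archive approach can be sketched as follows. This is a hedged illustration, not the article's exact script: the scratch directories stand in for /cfmroot (on the management server) and / (on the node), and the archive name ssh-config.tar.gz is hypothetical.

```shell
# Hedged sketch of packing the SSH files so a CSM pre-reboot script
# can unpack them before the node's first reboot. Scratch directories
# are used here so the commands can be tried safely.
CFMROOT=$(mktemp -d)     # stands in for /cfmroot
NODEROOT=$(mktemp -d)    # stands in for / on the newly installed node

# Files as created in the steps above (dummy content for the sketch)
mkdir -p "$CFMROOT/root/.ssh" "$CFMROOT/etc/ssh"
echo "dummy-public-key" > "$CFMROOT/root/.ssh/authorized_keys2"

# On the management server: pack the files with their paths preserved
tar -C "$CFMROOT" -czf "$CFMROOT/ssh-config.tar.gz" root/.ssh etc/ssh

# In a pre-reboot script on the node: unpack over the filesystem root
tar -C "$NODEROOT" -xzf "$CFMROOT/ssh-config.tar.gz"

ls "$NODEROOT/root/.ssh"
```

Because tar is run with -C and relative paths, the archive unpacks cleanly over any target root, which is what makes it suitable for a pre-reboot script.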

Defining GPFS-related CSM groups

For this example, two main CSM groups are defined for use during the GPFS configuration, as shown below.

  • StorageNodes, which includes only those nodes attached directly to the SAN, for example: nodegrp -w "Hostname like 'stor%'" StorageNodes.

  • NonStorageNodes, which includes all other nodes that are part of the GPFS cluster, for example: nodegrp -w "Hostname not like 'stor%'" NonStorageNodes.

These groups are used during installation to ensure servers that perform storage node roles receive specific binary files and configuration files, which are detailed below. Note that this section does not cover the detailed process of installation as performed by CSM. See Part 1 and Part 2 of this series for instructions for this process.

To summarize, the installation process goes through the following stages:

  1. PXE boot/DHCP from installation server
  2. NFS installation from installation server
  3. Pre-reboot scripts
  4. Reboot
  5. Post-reboot scripts
  6. CFM file transfer
  7. CSM post installation configuration

The configuration changes in this article occur during the pre-reboot and CFM file transfer stages.

Installing GPFS RPMs

GPFS requires each cluster member to have a base set of GPFS RPMs installed. The level of GPFS used for the example installation was 2.3.0-3. The installation of these RPMs is a two-stage process: installing the 2.3.0-1 base level and then updating to 2.3.0-3. The RPMs used for this installation are:

  • gpfs.base
  • gpfs.docs
  • gpfs.gpl
  • gpfs.msg.en_US

Note: Because the example uses GPFS Version 2.3, installation of Reliable Scalable Cluster Technology (RSCT) and creation of a peer domain are not required. Versions of GPFS before Version 2.3 do require those manual steps.

CSM can install the GPFS RPMs in a variety of ways. This article recommends installing the RPMs during the base operating system installation phase. CSM provides an installation and update directory structure to contain customized RPMs; however, this might not work well for an initial RPM installation followed by an upgrade to the same RPMs, as required by GPFS 2.3.0-3.

One alternative method is to write pre-reboot post-installation scripts for CSM to install the RPMs as required. In this case, copy all the GPFS RPMs, including the update RPMs, to a directory under /csminstall/Linux on the management server. The directory CSM usually reserves for script data is /csminstall/csm/scripts/data, which is mounted on the node during installation, making the needed RPMs available using NFS.

Write the installation script /csminstall/csm/scripts/installprereboot/install-gpfs.sh to install GPFS. Here is an example installation script:

#!/bin/bash
# Installs GPFS filesets and updates to latest levels
# CSMMOUNTPOINT environment variable is set by CSM
DATA_DIR=$CSMMOUNTPOINT/csm/scripts/data
cd $DATA_DIR
rpm -ivh gpfs.*2.3.0-1*.rpm
rpm -Uvh gpfs.*2.3.0-3*.rpm
echo 'export PATH=$PATH:/usr/lpp/mmfs/bin' > /etc/profile.d/gpfs.sh

Once you install GPFS on the storage servers, you might also want to automatically install the FAStT MSJ utility, which can be done in silent (non-interactive) mode. MSJ is used for configuration of the Qlogic adapters, failover, and multipathing, which is described in detail under HBA configuration. The installation is not RPM based, so it is not easily integrated into CSM by default. To accomplish the installation, you can add a script to the end of the GPFS installation to check whether the node is a storage server and install MSJ. To install in silent mode, use the command FAStTMSJ*_install.bin -i silent.
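A minimal sketch of that check, appended to the end of the GPFS installation script, might look like the following. It assumes storage hostnames start with "stor" (matching the StorageNodes group defined earlier); the installer invocation is left commented out here.

```shell
# Hedged sketch: install MSJ only on storage servers. Assumes storage
# hostnames start with "stor", matching the StorageNodes node group.
is_storage_node() {
    case "$1" in
        stor*) return 0 ;;
        *)     return 1 ;;
    esac
}

node=$(hostname -s)
if is_storage_node "$node"; then
    echo "$node: storage node, installing MSJ"
    # cd $DATA_DIR && ./FAStTMSJ*_install.bin -i silent
else
    echo "$node: not a storage node, skipping MSJ"
fi
```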

Configuring Qlogic failover

The example cluster uses the Qlogic qla2300 driver, version 7.01.01, for the Qlogic QLA 2342 adapters. Each of the nodes in the storage node group has two of these PCI adapters. The qla2300 driver comes standard with the Red Hat Enterprise Linux 3 update 4 distribution. However, you need to make the following changes to suit the purposes of the example cluster:

  • Change the qla2300 driver to perform failover. This enables you to take advantage of more than one path to disk and allow failover to occur if the preferred path fails. This is not set by default.

    Make the first change using a script that is run before reboot during installation by CSM. The script that does this is in the directory /csminstall/csm/scripts/installprereboot/. The script contains the following commands:

    #!/bin/bash
    # Adds lines to /etc/modules.conf to enable failover for the qla2300 drivers
    echo "options qla2300 ql2xfailover=1 ql2xmaxsectors=512 ql2xmaxsgs=128" >> /etc/modules.conf
    echo "Updating initrd with new modules.conf set up"
    mkinitrd -f /boot/initrd-`uname -r`.img `uname -r`

  • Set the preferred path to disk on each host to match those set on each DS4500. Use odd-numbered arrays seen through HBA0, and use even-numbered arrays seen through HBA1.

    The second change needs to be made manually whenever a storage node is reinstalled. The details are covered in the Defining HBA configuration on storage servers section.

Tuning the GPFS network

The example offers several lines to add to the /etc/sysctl.conf file on each node to tune the network for GPFS. This is done using a post-reboot installation script using CSM. The script is in the directory /csminstall/csm/scripts/installpostreboot and contains the following lines:

FILE=/etc/sysctl.conf
# Adds lines to /etc/sysctl.conf for GPFS network tuning
echo "# CSM added the next 8 lines to the post installation script for GPFS network tuning" >> $FILE
echo "# increase Linux TCP buffer limits" >> $FILE
echo "net.core.rmem_max = 8388608" >> $FILE
echo "net.core.wmem_max = 8388608" >> $FILE
echo "# increase default and maximum Linux TCP buffer sizes" >> $FILE
echo "net.ipv4.tcp_rmem = 4096 262144 8388608" >> $FILE
echo "net.ipv4.tcp_wmem = 4096 262144 8388608" >> $FILE
echo "# increase max backlog to avoid dropped packets" >> $FILE
echo "net.core.netdev_max_backlog=2500" >> $FILE
# Following lines are not related to GPFS tuning
echo "# Allow Alt-SysRq" >> $FILE
echo "kernel.sysrq = 1" >> $FILE
echo "# Increase ARP cache size" >> $FILE
echo "net.ipv4.neigh.default.gc_thresh1 = 512" >> $FILE
echo "net.ipv4.neigh.default.gc_thresh2 = 2048" >> $FILE
echo "net.ipv4.neigh.default.gc_thresh3 = 4096" >> $FILE
echo "net.ipv4.neigh.default.gc_stale_time = 240" >> $FILE
# Reset the current kernel parameters
sysctl -p /etc/sysctl.conf
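One caveat with an append-only script like this: if it ever runs a second time against a preserved /etc/sysctl.conf, every line is duplicated. A small hedged variation (the add_once helper is not part of the original script, and FILE points at a scratch file here rather than /etc/sysctl.conf) avoids that:

```shell
# Hedged sketch: append each tuning line only if it is not already
# present, so repeated runs of the script are harmless.
FILE=$(mktemp)    # stands in for /etc/sysctl.conf
add_once() {
    grep -qxF "$1" "$FILE" || echo "$1" >> "$FILE"
}
add_once "net.core.rmem_max = 8388608"
add_once "net.core.wmem_max = 8388608"
add_once "net.core.rmem_max = 8388608"   # repeat run: no duplicate added
grep -c "rmem_max" "$FILE"               # -> 1
```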

Distributing the GPFS portability layer

The GPFS portability layer (PL) is kernel-specific, and it must be created separately for each operating system level within your cluster. The purpose of the PL and the details of creation for the example cluster are described in the Producing and installing the portability layer section. CSM manages the distribution of the PL binaries using the CFM file transfer facility. Copy the PL binaries into the /cfmroot/usr/lpp/mmfs/bin directory on the management server and name them so that they are only distributed to the nodes with specific kernel versions in the relevant groups. For example:

/cfmroot/usr/lpp/mmfs/bin/dumpconv._
/cfmroot/usr/lpp/mmfs/bin/lxtrace._
/cfmroot/usr/lpp/mmfs/bin/mmfslinux._
/cfmroot/usr/lpp/mmfs/bin/tracedev._

Note that in a large cluster, in order to reduce load on CFM, it is possible to add these four files into a custom RPM and install them with GPFS using the method outlined above for installing the GPFS RPMs.
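Staging the four files for one node group can be sketched as below. This is a hedged illustration: "KernelGroup1" is a hypothetical CSM group name (substitute the group matching each kernel level), and the scratch directories stand in for the build output directory and /cfmroot/usr/lpp/mmfs/bin.

```shell
# Hedged sketch: copy the PL binaries into the CFM tree with a
# per-group suffix so CFM distributes each file only to that group.
BUILD=$(mktemp -d)     # stands in for where the PL binaries were built
CFMBIN=$(mktemp -d)    # stands in for /cfmroot/usr/lpp/mmfs/bin
GROUP=KernelGroup1     # hypothetical CSM node group name

for f in tracedev mmfslinux lxtrace dumpconv; do
    touch "$BUILD/$f"                      # dummy binaries for the sketch
    cp "$BUILD/$f" "$CFMBIN/$f._$GROUP"    # group-suffixed CFM naming
done

ls "$CFMBIN"
```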

Automating the addition of new nodes to a GPFS cluster

Simply installing the GPFS RPMs and portability layer is not enough to mount and configure file systems within the GPFS cluster on the newly installed nodes. In a small cluster, this could be managed manually. However, scaling up to larger cluster sizes makes it worth automating this step. This can be done using the CSM monitoring capabilities by monitoring for completed new node installations and kicking off a script to configure and mount GPFS on the new node in the cluster.

Listing 1 shows an example script that can be used to configure GPFS. You might need to modify the script slightly for your configuration. Listing 1 provides the basics. The script takes the name of a node (as passed by CSM monitors), adds this to the GPFS cluster, and attempts to start GPFS on that node with some trivial error handling.


Listing 1. Example script for configuring GPFS
#!/bin/bash
# CSM condition/response script to be used as a response to the InstallComplete
# condition.  This will attempt to add the node to the GPFS cluster, dealing
# with some common failure conditions along the way.  Only trivial attempts are
# made at problem resolution; advanced problems are left for manual
# intervention.
# Note: requires the GPFS gpfs-nodes.list file.  This file should contain a list
# of all nodes in the GPFS cluster with client/manager and
# quorum/non-quorum details suitable for passing to the mmcrcluster command.
# Output is sent to /var/log/csm/
# Returned error codes:
#   1 - GPFS is already active
#   2 - unable to read the gpfs-nodes.list file
#   3 - node name not present in the gpfs-nodes.list file
#   4 - node is a quorum node
#   5 - unable to add node to cluster (mmaddnode failed)
#   6 - unable to start GPFS on the node (mmstartup failed)

# set this to the location of your node list file
gpfs_node_list=/etc/gpfs-nodes.list
# set this to the interface GPFS is using for communication
gpfs_interface=eth1

PATH=$PATH:/usr/lpp/mmfs/bin            # ensure GPFS binaries are in the PATH
log_file=/var/log/csm/`basename $0`.log # log to /var/log/csm/
touch $log_file

# Get the node short name as set by RSCT condition ENV var ERRM_RSRC_NAME
node=`echo $ERRM_RSRC_NAME | cut -d. -f1`

(
[ ! -r "$gpfs_node_list" ] && echo "** error: cannot read GPFS node list $gpfs_node_list" && exit 2

echo
echo "--- Starting run of `basename $0` for $node at `date`"

# Is the node a quorum node? If so exit.
quorum_status=`grep $node $gpfs_node_list | cut -d: -f2 | cut -d- -f2`
if [ -n "$quorum_status" ]; then
    if [ "$quorum_status" = "quorum" ]; then
        echo "** error: this is a quorum node, stopping..."
        exit 4
    else
        node_s=`grep $node $gpfs_node_list | cut -d: -f1`
    fi
else
    echo "** error: could not find node $node in GPFS node list $gpfs_node_list"
    exit 3
fi

# Find out if the node is already part of the cluster
if mmlscluster | grep $node >/dev/null; then
    # check the node status
    status=`mmgetstate -w $node | grep $node | awk '{print $3}'`
    if [ "$status" = "active" ]; then
        echo "** error: this node already appears to have GPFS active!"
        exit 1
    fi
    # attempt to remove node from cluster
    echo "Node $node is already defined to cluster, removing it"
    # attempt to disable storage interface on node
    if ssh $node ifdown $gpfs_interface; then
        mmdelnode $node
        ssh $node ifup $gpfs_interface
    else
        echo "** error: could not ssh to $node, or ifdown $gpfs_interface failed"
    fi
fi

# try to add node to GPFS cluster
if mmaddnode $node; then
    echo "Successfully added $node to GPFS cluster, starting GPFS on $node"
    if mmstartup -w $node; then
        echo "mmstartup -w $node succeeded"
    else
        echo "** error: cannot start GPFS on $node, please investigate"
        exit 6
    fi
else
    echo "** error: could not add $node to GPFS cluster"
    exit 5
fi
) >> $log_file 2>&1

You can use CSM to automatically run the script shown in Listing 1 when a new node has finished the base operating system installation so that when it boots, the GPFS file systems are automatically mounted. First you need to define the script as a response mechanism in the CSM monitor. For example: mkresponse -n SetupGPFS -s /path/to/script/SetupGPFS.sh SetupGPFS.

You now have a response called SetupGPFS that will run your script. Next you should associate this response with the default CSM condition NodeFullInstallComplete, as follows: startcondresp NodeFullInstallComplete SetupGPFS.

Now CSM will automatically run the script from the management server any time you install a new node. On the CSM management server you should now be able to see the NodeFullInstallComplete condition associated with the SetupGPFS response when you run the lscondresp command. The condition and response should be listed as Active.

Addressing ROM overflow

There is a known issue with the amount of ROM space available on an xSeries 346 that creates PCI allocation errors during boot. Messages indicate that the system ROM space is full and that there is no more room for additional adapters that use ROM space (see Resources for more details).

This problem affects the storage nodes where, if PXE boot is enabled, there is not sufficient space for the Qlogic PCI adapters to initialize properly. One workaround is the following:

  1. Disable PXE boot on the Broadcom PCI card used for the GPFS network. Using the downloadable diag facility b57udiag -cmd, choose the device and then disable PXE boot.

  2. Use PXE boot to install the node using CSM, and then disable PXE boot for both onboard adapters using the BIOS (hence the order described in the Installing storage nodes in the correct order section).

Another workaround to avoid this issue is to use a RAID 7K card in each xSeries 346. This reduces the amount of ROM the SCSI BIOS uses and allows the Qlogic BIOS to load successfully, even with PXE boot enabled.

Defining HBA configuration on storage servers

The HBAs used on the xSeries 346 storage servers in the example cluster are the IBM DS4000 FC2-133 Host Bus Adapter (HBA) models. These are also known as Qlogic 2342 adapters. The example uses firmware version 1.43 and, as mentioned in the previous section, the v7.01.01-fo qla2300 driver. The -fo on this driver denotes failover, which is not the default option for this driver. It is enabled by changing the settings in /etc/modules.conf on each storage node. This is set using CSM during install and is described in the Configuring Qlogic failover section.

The next section describes the steps needed to update firmware and settings on the HBAs on each storage server and the manual process required on each reinstall to enable load balancing between the two HBAs.

Downloading HBA firmware

You can download the firmware for the FC2-133 HBAs from the IBM System x support Web site (see Resources). The firmware can be updated using IBM Management Suite Java or using a bootable diskette and the flasutil program.

Configuring HBA settings

For the example cluster, the following settings were changed from the default on the HBAs. These values are in the README provided with the driver download. You can make these changes using the Qlogic BIOS, which you can reach by pressing Ctrl-Q when prompted at boot, or using the MSJ utility. Here are the settings:

  • Host adapter settings
    • Loop reset delay: 8
  • Advanced adapter settings
    • LUNs per target: 0
    • Enable target reset: Yes
    • Port down retry count: 12

Installing IBM Management Suite Java

IBM FAStT Management Suite Java (MSJ) is a Java-based GUI application that manages the HBAs in the storage servers. It can be used for configuration and diagnostics. See Resources for a link to download the software.

The example setup uses CSM to install MSJ on every storage node as part of the GPFS installation. The binary is part of the tar file containing the GPFS RPMs, which CFM distributes during CSM node installation. A post-installation script uncompresses this tar file and then runs the installation script contained inside it. The example uses the 32-bit FAStT MSJ in this installation to avoid potential problems installing the 64-bit version. The example script uses the following command to install MSJ: FAStTMSJ*_install.bin -i silent.

This installs both the application and the agent. Note that because this is a 32-bit version of MSJ, even though the example uses the silent installation, the code looks for and loads 32-bit versions of some libraries. Therefore, you need the 32-bit version of XFree86-libs installed, as well as the 64-bit version included with the base 64-bit installation. The 32-bit libraries are contained in XFree86-libs-4.3.0-78.EL.i386.rpm, which is included in the tar file. The installation of this RPM is handled by the install.sh script, which then installs MSJ.

Configuring paths to disk and load balancing

MSJ is required on each storage node to manually configure paths to the arrays on the DS4500s and load balancing between the two HBAs on each computer. If this configuration were not performed, by default the arrays would all be accessed through the first adapter on each computer, HBA0, and consequently through controller A on each DS4500. By spreading the disks between the HBAs, and hence the controllers on the DS4500s, you balance the load and enhance the performance of the back end.

Note that configuration of load balancing is a manual step that must be performed on each storage node each time it is reinstalled. For the example cluster, here are the steps to configure load balancing:

  1. From a local computer, open a session to the newly installed server with X forwarding set up (ssh -X).

  2. In one session, run # qlremote.

  3. In another session, run # /usr/FAStT MSJ & to launch the MSJ GUI.

  4. From the MSJ GUI, highlight one of the adapters under the HBA tab and choose Configure. A window similar to that shown in Figure 1 appears.

    Figure 1. View of MSJ when selecting a DS4500




  5. To enable load balancing, highlight the storage subsystem by right-clicking the node name, and choose LUNs > Configure LUNs from the menu. The LUN configuration window appears. You can automatically configure load balancing by choosing Tools > Load Balance. You'll then see a window similar to that shown in Figure 2.

    Figure 2. View of MSJ when configuring failover




  6. When the logical drives are configured, the LUN configuration window closes, saving the configuration to the host system in the Port Configuration window (which has a default password of config). If the configuration is saved successfully, you see a confirmation. The configuration is saved as a file called /etc/qla2300.conf. New options should have been added to the qla2300 driver line in /etc/modules.conf to indicate that this file exists and should be used.

  7. Switch back to the window where the qlremote process was started and stop it using Ctrl-C. This is an important step.

  8. To enable the new configuration, reload the driver module qla2300. This cannot be done if the disk is mounted on the Fibre Channel subsystem attached to an adapter that uses this driver. Configure the host adapter driver to be loaded through an initial RAM disk, which applies the configuration data for redundant disks when loading the adapter module at boot time. Note that whenever the configuration of the logical drives changes, this procedure must be followed to save a valid configuration to the system.

One of the most efficient ways to use MSJ in a setup where more than one storage node needs load balancing configured is to keep MSJ open on one node, run qlremote on each of the other nodes, and then use the one MSJ session to connect to the others in the same half.

Configuring a GPFS cluster

This section covers in detail the steps taken in the creation of a GPFS cluster. It assumes that all nodes have been installed and configured as described earlier in this article, or that the following configuration has been performed manually:

  • GPFS RPMs are installed on each computer.

  • PATH has been changed to include the GPFS binary directory.

  • A storage interface is configured.

  • Root can ssh between nodes without a password.

  • Network tuning settings in sysctl are complete.

  • NSD servers can see a SAN disk.
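The prerequisites above lend themselves to a quick pre-flight check before creating the cluster. The following is a hedged sketch, not part of the original article: the probed commands are the standard ones (rpm, mmlscluster, sysctl.conf), but adjust the labels and tests to your cluster. It is purely informational and changes nothing.

```shell
# Hedged sketch: report whether each prerequisite appears to be in place.
check() {
    # $1 = description, $2 = command that should succeed if the
    # prerequisite is met
    if eval "$2" >/dev/null 2>&1; then
        echo "OK   $1"
    else
        echo "FAIL $1"
    fi
}
check "GPFS RPMs installed"     "rpm -q gpfs.base"
check "GPFS binaries in PATH"   "command -v mmlscluster"
check "network tuning applied"  "grep -q rmem_max /etc/sysctl.conf"
check "passwordless root ssh"   "ssh -o BatchMode=yes localhost true"
```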

You can find a detailed description of the GPFS architecture for the example cluster in the "Storage architecture" section in Part 3 of this series.

Read this section of the article in parallel with the GPFS documentation (see Resources), in particular the following:

  • GPFS V2.3 Administration and Programming Reference, which contains details of many administration tasks and the GPFS commands.

  • GPFS V2.3 Concepts, Planning, and Installation Guide, which details planning considerations for a GPFS cluster and steps to take during installation of a new cluster.

  • GPFS V2.3 Problem Determination Guide, which contains troubleshooting steps and GPFS error messages.

Producing and installing the portability layer

The GPFS portability layer (PL) is a set of binaries that need to be built locally from source code to match the Linux kernel and configuration on a computer that is to be part of a GPFS cluster. For the example cluster, this was done on one of the storage nodes. The resulting files were copied to each node using CSM and CFM. (See the Distributing the GPFS portability layer section for more details.) This is a valid method, because all computers are the same architecture and use the same kernel. The instructions to build the GPFS PL can be found in /usr/lpp/mmfs/src/README. The process for the example cluster is as follows:

  1. Export SHARKCLONEROOT=/usr/lpp/mmfs/src.

  2. Type cd /usr/lpp/mmfs/src/config, then cp site.mcr.proto site.mcr.

  3. Edit the new file site.mcr to match the configuration to be used. Leave the following lines uncommented:
    • #define GPFS_LINUX
    • #define GPFS_ARCH_X86_64
    • LINUX_DISTRIBUTION = REDHAT_AS_LINUX
    • #define LINUX_DISTRIBUTION_LEVEL 34
    • #define LINUX_KERNEL_VERSION 2042127
    (Note that a # does not indicate a comment.)

  4. Type cd /usr/lpp/mmfs/src.

  5. Create the GPFS PL using make World.

  6. Copy the GPFS PL to the /usr/lpp/mmfs/bin directory using make InstallImages. The GPFS PL consists of the following four files:
    • tracedev
    • mmfslinux
    • lxtrace
    • dumpconv

  7. Copy a set of these files, one for each of the relevant kernels used, into the CSM structure for distribution using CFM.
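The build steps above can be collected into a single shell sequence. This is a sketch only: it assumes the GPFS 2.3 source package has installed its source under /usr/lpp/mmfs/src (as on the example cluster's storage node) and is not runnable elsewhere:

```shell
# Sketch of the portability-layer build; requires the GPFS 2.3 source package.
export SHARKCLONEROOT=/usr/lpp/mmfs/src

cd /usr/lpp/mmfs/src/config
cp site.mcr.proto site.mcr
# ... edit site.mcr here so the #define lines match your kernel and distribution ...

cd /usr/lpp/mmfs/src
make World          # builds tracedev, mmfslinux, lxtrace, and dumpconv
make InstallImages  # copies the four binaries into /usr/lpp/mmfs/bin
```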

Creating a GPFS cluster

You create the GPFS cluster for this example in several distinct steps. While not all of the steps are strictly necessary, this approach is a good way to handle the different types of nodes in the cluster (storage nodes versus others).

The first step is to create a cluster containing only the storage nodes and the quorum node: five nodes in total. When creating the cluster, use node descriptor files that contain the short hostnames of the storage interface of all the nodes to be included, each followed by this information:

  • Manager or client: Defines whether the node should form part of the pool from which the configuration and file system managers are picked. The example cluster includes only the storage nodes in this pool.

  • Quorum or nonquorum: Defines whether the node should be counted as a quorum node. The quorum nodes in the example cluster are the storage nodes and the tiebreaker node quor001.

The command to create the cluster is the following:

mmcrcluster -n stor.nodes -C gpfs1 -p stor001_s -s stor002_s -r /usr/bin/ssh -R /usr/bin/scp

  • The -C flag sets the name of the cluster.

  • The -p sets the primary configuration server node.

  • The -s sets the secondary configuration server node.

  • The -r sets the full path for the remote shell program to be used by GPFS.

  • The -R sets the remote file copy program to be used by GPFS.

Here is the stor.nodes node descriptor file used in the example:

stor001_s:manager-quorum
stor002_s:manager-quorum
stor003_s:manager-quorum
stor004_s:manager-quorum
quor001_s:client-quorum

Use entries of the form <hostname>_s:client-nonquorum in later stages for all the other nodes to be added to the cluster, such as compute nodes, user nodes, and management nodes.
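For the later stages, a descriptor file for a batch of client nodes can be generated rather than written by hand. A minimal sketch, assuming hypothetical compute node names of the form comp001_s:

```shell
# Generate client-nonquorum descriptor entries for compute nodes.
# The comp*_s names are hypothetical placeholders for your own naming scheme.
: > compute.nodes
for i in 1 2 3 4; do
    printf 'comp%03d_s:client-nonquorum\n' "$i" >> compute.nodes
done
cat compute.nodes
```

A file built this way can then be passed to the GPFS mmaddnode command when the remaining nodes are added to the cluster.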

Enabling unmountOnDiskFail on quorum node

The next step is to enable the unmountOnDiskFail option on the tiebreaker node using mmchconfig unmountOnDiskFail=yes quor001_s. This prevents false disk errors in the SAN configuration from being reported to the file system manager.

Defining network shared disks

The next step is to create the disks used by GPFS using the command mmcrnsd -F disk#.desc. Running this command creates a global name for each disk, which is a necessary step, because disks might have different /dev names on each node in the GPFS cluster. Run this command on all disks to be used for the GPFS file system. At this point, define the primary and secondary NSD servers for each disk; these perform I/O operations on behalf of the NSD clients, which have no local access to the SAN storage.

The -F flag is used to point to a file containing disk descriptors for disks to be defined as NSDs. For manageability in the example cluster, complete this process once on the LUNs presented by each DS4500 and once on the tiebreaker disk. Each array or LUN on each DS4500 has a descriptor in the files used. Following is an example line from disk1.desc:

sdc:stor001_s:stor002_s:dataAndMetadata:1:disk01_array01S

Following are the fields in this line, in order:

  • Local disk name on primary NSD server

  • Primary NSD server

  • Secondary NSD server

  • Type of data

  • Failure group

  • Name of resulting NSD

By using the above descriptor files, define the following three failure groups in this configuration:

  • The disks in the first DS4500, that is disk01.

  • The disks in the second DS4500, that is disk02.

  • The tiebreaker disk on the quorum node.
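Putting these rules together, a combined descriptor file might look like the following sketch. Only the first line comes from the article; the remaining device names, server pairings, and the descOnly tiebreaker entry are illustrative assumptions:

```
sdc:stor001_s:stor002_s:dataAndMetadata:1:disk01_array01S
sdd:stor002_s:stor001_s:dataAndMetadata:1:disk01_array02S
sde:stor003_s:stor004_s:dataAndMetadata:2:disk02_array01S
sdf:stor004_s:stor003_s:dataAndMetadata:2:disk02_array02S
sda:quor001_s::descOnly:3:quorum_tiebreaker
```

The failure group field (1, 2, or 3) is what places each disk's replicas on independent hardware, and descOnly marks a disk that holds only a file system descriptor copy, which suits the tiebreaker role.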

Starting GPFS

The next step is to start GPFS cluster-wide following these steps:

  1. Start GPFS on all of the NSD servers at the same time to prevent NSDs from being marked as down. Use the following command: mmstartup -w stor001_s,stor002_s,stor003_s,stor004_s.

  2. Start GPFS on all other nodes that are not NSD servers (including the tiebreaker node). Use the following command: mmstartup -w quor001_s,mgmt001_s,...

  3. Start GPFS on all compute nodes from the management node. Use the following command: dsh -N ComputeNodes /usr/lpp/mmfs/bin/mmstartup.

  4. Check the status of all nodes by monitoring the /var/adm/log/mmfs.log.latest file on the current file system manager (found using the command mmlsmgr) and by checking the output of mmgetstate -w <node list> for the non-compute nodes and dsh -N ComputeNodes /usr/lpp/mmfs/bin/mmgetstate for the compute nodes.

This method might seem overly cautious, but it was chosen as a scalable method that works for a very large cluster. An alternative to the steps above is to use the command mmstartup -a. This works for smaller clusters, but it can take a long time to return on a larger cluster where nodes might be unreachable for various reasons, such as network issues.
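The phased startup and verification steps above can be gathered into one script run from the management node. The node lists are those of the example cluster; extend them to match your own:

```shell
# Phase 1: start all NSD servers together so no NSD is marked down.
mmstartup -w stor001_s,stor002_s,stor003_s,stor004_s
# Phase 2: start the remaining non-compute nodes (extend the list as needed).
mmstartup -w quor001_s,mgmt001_s
# Phase 3: start the compute nodes through CSM's dsh.
dsh -N ComputeNodes /usr/lpp/mmfs/bin/mmstartup
# Verify: identify the file system manager, then check node states.
mmlsmgr
mmgetstate -w stor001_s,stor002_s,stor003_s,stor004_s,quor001_s,mgmt001_s
dsh -N ComputeNodes /usr/lpp/mmfs/bin/mmgetstate
```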

Creating the GPFS file system

For the example, one large GPFS file system is created using all the NSDs defined to GPFS. Note that the command used takes as an argument the altered disk descriptor files from the mmcrnsd command above. This requires that you concatenate the output from each step in the creation of the NSDs into one file.

The example cluster uses the following settings:

  • All NSDs (set using -F)

  • Mountpoint: /gpfs

  • Automount: yes (set using -A)

  • Blocksize: 256KB (set using -B)

  • Replication: two copies of both data and metadata (set using -m, -M, -r, -R)

  • Estimated number of nodes mounting the file system: 1200 (set using -n)

  • Quotas enabled (set using -Q)

Here is the complete command:

mmcrfs /gpfs /dev/gpfs -F disk_all.desc -A yes -B 256K -m 2 -M 2 -r 2 -R 2 -n 1200 -Q yes

After creating /gpfs, it is mounted manually for the first time. After this, with automount enabled, it mounts automatically when a node starts GPFS.

Enabling quotas

The -Q flag on the above mmcrfs command enables quotas on the /gpfs file system. Quotas can be defined for individual users or groups of users. A default quota level has also been set that applies to any new user or group. Default quotas are turned on using the command mmdefquotaon and edited using the command mmdefedquota, which opens an edit window in which you can specify the limits. Following is an example of setting limits for the quota:

gpfs: blocks in use: 0K, limits (soft = 1048576K, hard = 2097152K)
      inodes in use: 0, limits (soft = 0, hard = 0)

You can edit specific quotas for a user or group using the command mmedquota -u. A user can display his or her quota using the command mmlsquota. The superuser can display the status of the quotas for the file system using the command mmrepquota gpfs.
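As a summary, the quota commands discussed above fit together as sketched below (someuser is a placeholder, and exact flags may vary between GPFS releases):

```shell
mmdefquotaon gpfs        # turn on default quotas for the file system
mmdefedquota -u gpfs     # edit the default user quota limits in an editor window
mmedquota -u someuser    # set a specific quota for one user
mmlsquota                # a user displays his or her own quota and usage
mmrepquota gpfs          # the superuser reports quota status for all users
```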

Tuning

This cluster is configured so that GPFS starts automatically whenever a server boots by adding an entry in /etc/inittab using the command mmchconfig autoload=yes.

Use the GPFS pagepool to cache user data and file system metadata. The pagepool mechanism allows GPFS to implement read, as well as write, requests asynchronously. Increasing the size of pagepool increases the amount of data or metadata that GPFS can cache without requiring synchronous I/O. The default value for pagepool is 64 MB. The maximum GPFS pagepool size is 8 GB. The minimum allowed value is 4 MB. On Linux systems, the maximum pagepool size is half of the physical memory in the computer.

The optimal size of the pagepool depends on the needs of the application and effective caching of its re-accessed data. For systems with applications that access large files, reuse data, benefit from GPFS prefetching of data, or have a random I/O pattern, increasing the value for pagepool might prove beneficial. However, if the value is set too high, GPFS will not start.

For the example cluster, use the value of 512 MB for pagepool on all nodes in the cluster.
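A sketch of applying this setting with mmchconfig; the new value takes effect the next time GPFS is started on each node:

```shell
# Set pagepool to 512 MB for all nodes in the cluster.
mmchconfig pagepool=512M
# Confirm the configured value.
mmlsconfig
```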

Optimizing with network settings

To optimize the performance of the network and, hence, GPFS, enable jumbo frames by setting the MTU size for the adapter for the storage network to 9000. Keep /proc/sys/net/ipv4/tcp_window_scaling enabled, because it is the default setting. The TCP window settings are tuned using CSM scripts at installation time, which add the following lines to the /etc/sysctl.conf file on both the NSD servers and NSD clients:

# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# increase default and maximum Linux TCP buffer sizes
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
# increase max backlog to avoid dropped packets
net.core.netdev_max_backlog=2500
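A small script in the spirit of those CSM installation scripts is sketched below. So that it can be run safely anywhere, it writes the fragment to a local file, sysctl-gpfs.conf; on a real NSD server or client you would append the same lines to /etc/sysctl.conf and apply them with sysctl -p:

```shell
# Write the TCP tuning fragment to a local file (stand-in for /etc/sysctl.conf).
cat > sysctl-gpfs.conf <<'EOF'
# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# increase default and maximum Linux TCP buffer sizes
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
# increase max backlog to avoid dropped packets
net.core.netdev_max_backlog=2500
EOF

cat sysctl-gpfs.conf
# On a real node: cat sysctl-gpfs.conf >> /etc/sysctl.conf && sysctl -p
```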

Configuring DS4500 settings

Storage server cache settings can impact GPFS performance if they are not set correctly. The example uses the following settings on the DS4500s, as recommended in the GPFS documentation:

  • Read cache: enabled
  • Read ahead multiplier: 0
  • Write cache: disabled
  • Write cache mirroring: disabled
  • Cache block size: 16K

Conclusion

That is it! You should have successfully installed a large Linux cluster following the example in this series of articles. Apply the principles to your own installation for another successful large Linux cluster installation.


Resources

Learn

  • Explore the first three parts of this series:
    • Installing a large Linux cluster, Part 1: Introduction and hardware configuration
    • Installing a large Linux cluster, Part 2: Management server configuration and node installation
    • Installing a large Linux cluster, Part 3: Storage and shared file systems

  • See Retain Tip H183415 at the IBM PC Support Web site for more details on the ROM overflow problem.

  • Refer to the IBM GPFS documentation library.

  • See the IBM TotalStorage DS4500 system reference materials:
    • IBM TotalStorage DS4500 Web page.
    • IBM DS4500 support page.

  • Check out the IBM TotalStorage DS4000 EXP710 fiber channel storage expansion unit reference materials:
    • General IBM EXP710 product page.
    • IBM EXP710 support page.

  • Find the IBM TotalStorage SAN Switch H16 switch reference materials at:
    • General IBM SAN Switch H16 product page.
    • IBM SAN Switch H16 support page .

  • Want more? The developerWorks IBM Systems zone hosts hundreds of informative articles and introductory, intermediate, and advanced tutorials.

  • Stay current with developerWorks technical events and webcasts.

Get products and technologies

  • Get firmware for the FC2-133 HBAs from the IBM System x support Web site.

  • Download IBM FAStT Management Suite Java (MSJ) from the IBM DS4500 download page.

  • Get the latest version of Storage Manager for your hardware from the DS4500 download page.

  • Build your next development project with IBM trial software for download directly from developerWorks.

Discuss

  • Participate in the discussion forum.

  • Exchange information with other developers on the IBM Systems forums and developerWorks blogs.

About the authors

Graham White is a systems management specialist in the Linux Integration Centre within Emerging Technology Services at the IBM Hursley Park office in the United Kingdom. He is a Red Hat Certified Engineer, and he specializes in a wide range of open-source, open-standard, and IBM technologies. Graham's areas of expertise include LAMP, Linux, security, clustering, and all IBM Systems hardware platforms. He received a BSc with honors in Computer Science with Management Science from Exeter University in 2000.

Mandie Quartly is an IT specialist with the IBM UK Global Technology Services team. Mandie performs a cross-brand role, with current experience in both Intel and POWER platform implementations as well as AIX and Linux (Red Hat and SUSE). She specializes in the IBM product General Parallel File System (GPFS). She received a PhD in astrophysics from the University of Leicester in 2001.