Posts Tagged ‘ROCKS’

Export share in ROCKS

March 14, 2012

In my previous post I described how you can present an iSCSI LUN to a Linux host. I moved all home directories to this NAS share, but later came to the conclusion that making a separate share would be better. Users should be able to quickly compile applications in their home directories. If home directories are also used as the target storage for computational data, then during computation the iSCSI network link can become a bottleneck and slow everything down. That's why I decided to separate them. It requires exporting an additional share, which can be done very easily in ROCKS.

1. Mount the LUN to, say, /export/scratch
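
For example, assuming the new LUN appears as /dev/sdd1 (the device name here is just an illustration):

mkdir -p /export/scratch
mount /dev/sdd1 /export/scratch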

2. Make the export by adding the following (all on one line) to /etc/exports

/export/scratch 192.168.111.128(rw,async,no_root_squash) 192.168.111.0/255.255.255.0(rw,async)

3. Restart nfs

/etc/rc.d/init.d/nfs restart

4. Add a line to /etc/auto.share

scratch master.local:/export/&

5. Update 411 config

make -C /var/411

Now the share is accessible to all compute nodes at /share/scratch.
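
A quick check from the frontend (the node name compute-0-0 is just an example) that triggers the automount on a compute node:

rocks run host compute-0-0 "ls /share/scratch"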

The same process is described in the ROCKS FAQ here.


Present NetApp iSCSI LUN to Linux host

March 7, 2012

Consider the following scenario (which is in fact a real case). You have a High Performance Computing (HPC) cluster where users generate a hell of a lot of research data. Local hard drives on a frontend node are almost always insufficient. There are two options. The first is presenting an NFS share to both the frontend and all compute nodes. Since compute nodes usually connect only to the private network for communication with the frontend and don't have public IP addresses, this means a lot of reconfiguration. Not to mention possible security implications.

The simpler solution here is to use iSCSI. Unlike NFS, which requires direct communication, with iSCSI you can mount a LUN on the frontend, and the compute nodes will then work with it as an ordinary NFS share through the private network. This implies configuring an iSCSI LUN on a NetApp filer and bringing up an iSCSI initiator in Linux.

iSCSI configuration consists of several steps. First of all you need to create a FlexVol volume where your LUN will reside, and then create a LUN inside of it. The second step is creating an initiator group, which will enable connectivity between the NetApp and a particular host. And as a last step you will need to map the LUN to the initiator group, which lets the Linux host see this LUN. In case you disabled iSCSI, don't forget to enable it on the required interface.

vol create scratch aggrname 1024g
lun create -s 1024g -t linux /vol/scratch/lun0
igroup create -i -t linux hpc
igroup add hpc linux_host_iqn
lun map /vol/scratch/lun0 hpc
iscsi interface enable if_name

Linux host configuration is simple. Install the iscsi-initiator-utils package and add the iscsi service to init on startup. The iSCSI IQN which the OS uses for connecting to iSCSI targets is read from /etc/iscsi/initiatorname.iscsi upon startup. After the iSCSI initiator is up and running you need to initiate the discovery process, and if everything goes fine you will see a new hard drive in the system (I had to reboot). Then you just create a partition, make a file system and mount it.

iscsiadm -m discovery -t sendtargets -p nas_ip
fdisk /dev/sdc
mke2fs -j /dev/sdc1
mount /dev/sdc1 /state/partition1/home
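
To make this survive a reboot, a minimal sketch (the device name and mount point are from my setup, adjust to yours; using a label or UUID in fstab would be more robust):

chkconfig iscsi on
echo "/dev/sdc1 /state/partition1/home ext3 defaults,_netdev 0 0" >> /etc/fstab

The _netdev option makes sure the mount is attempted only after the network and the iSCSI service are up.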

I use it for the home directories in the ROCKS cluster suite. ROCKS automatically exports /home through NFS to the compute nodes, which in their turn mount it via autofs. If you intend to use this volume for other purposes, then you will need to configure your own custom NFS export.

Reinstalling ROCKS compute cluster node

December 1, 2011

If you have a faulty HPC node and want to reinstall it, for instance after a hard drive replacement, you should bear in mind several things:

  • Make sure xinetd is listening on port 69 for tftpd requests on the frontend.
  • Check the firewall rules, or simply switch the firewall off during the install. Otherwise you'll get PXE-E32: TFTP open timeout.
  • Then you should configure your frontend to force compute node reinstallation. If you don't do that, you'll just see PXE-M0F: Exiting HP PXE ROM or similar. Execute the following command on the frontend: rocks set host boot <nodename> action=install (see the short sketch after this list).
  • In case you get an unable to read package metadata error during installation, go to /export/rocks/install/, remove the rocks-dist folder and recreate the installation tree by running rocks create distro.
  • After host installation put all additional packages (like IB, MVAPICH, etc.) into /share/apps and run rocks run host <nodename> "rpm -Uvh /share/apps/*.rpm". Set the necessary services (like openibd and/or opensmd) to run upon startup via chkconfig and start them up. You may also need to copy some manually installed packages to the compute node's /opt directory.
  • In case you commented out the faulty node earlier in /opt/torque/server_priv/nodes, uncomment it and restart the pbs_server service.
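
Putting the reinstall trigger together, a short sketch (compute-0-1 is just an example node name):

rocks set host boot compute-0-1 action=install
ssh compute-0-1 reboot

If the node is already dead, power-cycle it instead of rebooting over SSH, and it will PXE boot into the installer.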

This is it. Now you should be good to go.

HP BladeSystem c3000

October 29, 2011

We have a High Performance Computing (HPC) cluster I'd like to show. It has 72 cores and 152GB of RAM in total. We use ROCKS as the cluster middleware. The interconnect is DDR InfiniBand.

We have two groups of servers. The first group is two BL2x220c blades. Since they are double-sided, it's actually four servers, each with two 4-core CPUs and 16GB of RAM. The second group consists of five BL280c blades. Each of them also has two 4-core CPUs but 24GB of RAM. The eighth blade is a BL260c, which serves as the master server.

Click pictures to enlarge.

BL280c blade server. This dude has 8 Xeon cores and 24GB of RAM.

Every component of the HP BladeSystem c3000 is hot-swappable. Here I show how I disconnect the Onboard Administrator on a fully operational system.

Fans, power supplies and all interconnects are on the back.

Here is the 16-port DDR InfiniBand switch. Each port's throughput is 80Gb/s FDX.

Uplink ports for Onboard Administrator.

16 ports of Ethernet pass-through for blade servers.


Six power supplies in N+1 redundant configuration. Each is capable of 1200 Watts. 7200 Watts in total.

Inside blade server.

InfiniBand mezzanine. One such module is capable of 80Gb/s FDX.

If you are interested in benchmarking results, find them here for pure IB and here for IPoIB.

ROCKS post install configuration

February 20, 2011

If you use ROCKS to deploy a cluster, you still need to do some manual configuration. What I personally did was:

1. Disable hyperthreading in the BIOS. HT is not helpful for compute-intensive tasks; it's better suited for I/O-intensive applications.

2. Add compute nodes to /opt/torque/server_priv/nodes in the following format:

compute-1-2 np=8

Where compute-1-2 is the node's host name and np is the number of processors (cores).

3. Add master.local to /etc/hosts.equiv (for Torque).

4. Install the InfiniBand stack. Most of the IB RPMs are already installed, but I also added:

libibumad, opensm-libs, opensm, ibutils, infiniband-diags, ibutils-libs, libibmad, libmlx4, swig

I just downloaded them by

# yumdownloader package-name

then put them into /share/apps and installed by

# rocks run host compute "rpm -Uvh /share/apps/*.rpm"

Then you will need to set the openibd service to run upon startup on all compute nodes and the opensmd service on any ONE of the compute nodes.
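
For example (compute-0-0 as the node running the subnet manager is just an illustration):

# rocks run host compute "chkconfig openibd on"
# rocks run host compute-0-0 "chkconfig opensmd on"
# rocks run host compute-0-0 "service opensmd start"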

5. Then you need to set up queues. There is a 'default' queue which is decent for a simple setup. But I have nodes of two types, and it's handy to have separate queues. qmgr is the binary which controls Torque queues. I made a simple qmgr.in text file which contained:

create queue srail queue_type=execution
set queue srail started=true
set queue srail enabled=true
set queue srail resources_default.neednodes=srail

create queue mrail queue_type=execution
set queue mrail started=true
set queue mrail enabled=true
set queue mrail resources_default.neednodes=mrail

Then I fed this file to qmgr:

# qmgr < qmgr.in

To check qmgr configuration call:

# qmgr -c 'p s'

Finally, change the node records in /opt/torque/server_priv/nodes from

compute-1-2 np=8

to

compute-1-2 np=8 mrail
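
Torque reads this file on startup, so restart pbs_server after editing it:

# service pbs_server restart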

6. ROCKS doesn't support LVM. It installs everything on the first hard drive it finds. If you have two hard drives, just mount the second drive at /state/partition1/home and add the home partition to /etc/exports.
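
A minimal sketch, assuming the second drive shows up as /dev/sdb (the device name and file system type are assumptions, adjust to your setup):

# fdisk /dev/sdb
# mke2fs -j /dev/sdb1
# mount /dev/sdb1 /state/partition1/home
# echo "/dev/sdb1 /state/partition1/home ext3 defaults 0 0" >> /etc/fstab

Then add an export line for /state/partition1/home to /etc/exports and restart nfs, similar to the scratch share export above.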