Archive for the ‘HPC’ Category

Scalasca and trace conversion tools

October 8, 2012

In my UPC-related project I worked closely with a tool called Scalasca. I used the Cube GUI (as well as its APIs) from Scalasca and converted my own tracing format into Cube's (or Scalasca's) EPILOG format to be able to use Cube.

Scalasca has several trace format conversion tools integrated into it, which are handy for analyzing EPILOG traces in other tools like Jumpshot, Vampir, Paraver and others. Here I'll describe how to install Scalasca with converter support. I worked with Scalasca 1.3, so keep in mind that this blog entry may be outdated.

Basic Scalasca installation

I prefer to install Scalasca without Cube and then install Cube separately, because Scalasca usually bundles an outdated version of Cube.

# ./configure --prefix=/opt/scalasca-1.3 --disable-gui

You need to install binutils-devel; otherwise the build will fail with the following error:

cannot find -liberty

Cube installation

You need the qt4 and qt4-devel RPMs for the Cube installation. Then go with:

# PATH=$PATH:/usr/lib64/qt4/bin ./configure --prefix=~fred/cube-3.3
# make
# make install

Scalasca with OTF, SLOG2 and Paraver support

The most interesting part starts here. To make the conversion functionality available in Scalasca you need to perform several additional steps. You don't need to install extra packages to get Paraver support: just get and install Paraver from the official site. You will also need to install the wxWidgets library (wxGTK, for example), wxPropertyGrid and Boost.

OTF and VTF3 are trace formats of the Vampir performance analysis tool. VTF3 is the predecessor of OTF and isn't needed. You can get OTF at the official site.

To get SLOG2 support, download the Oracle Java JDK and JRE from the Oracle site and install the RPMs. I also needed to uninstall Compat Java:

yum remove openoffice.org-core-2.3.0-6.5.el5.x86_64 \
tomcat5-servlet-2.4-api-5.5.23-0jpp.7.el5.x86_64 \
xml-commons-apis-1.3.02-0.b2.7jpp.10.x86_64 \
hsqldb-1.8.0.9-1jpp.2.x86_64 \
tomcat5-jsp-2.0-api-5.5.23-0jpp.7.el5.x86_64 \
gjdoc-0.7.7-12.el5.x86_64 antlr-2.7.6-4jpp.2.x86_64 \
xerces-j2-2.7.1-7jpp.2.x86_64 xalan-j2-2.7.0-6jpp.1.x86_64 \
bsf-2.3.0-11jpp.1.x86_64 bsh-1.3.0-9jpp.1.x86_64 \
xml-commons-resolver-1.1-1jpp.12.x86_64 java java-gcj-compat

After OTF and Java are installed, invoke the configure script with the following flags and install Scalasca:

# CFLAGS="-I/usr/java/latest/include -I/usr/java/latest/include/linux" ./configure --prefix=/opt/scalasca-1.3 --with-otf=/opt/otf-1.7 --with-slog2=~/install/slog2sdk-1.2.6

As the last step, go to the SLOG2 sources and type:

# ./configure --prefix=/opt/slog2sdk-1.2.6 --with-java=/usr/java/latest --with-trace-libdir=`scalasca --config --prefix`/lib --with-trace-name=elg --disable-build CC=`scalasca --config --cc`
# make
# make install

Errors during installation

If you try to compile the code right away you may get this error:

gcc -m64 -I. -I../../config -I../../utils -I../../episode -OpenMPnotAvailable -c epk_omplock.c
cc1: error: invalid option argument ‘-OpenMPnotAvailable’

To get rid of it, go to scalasca-1.3.0/build-linux-gnu-nompi and change line 63 in Makefile.defs from

OMPFLAG    = -OpenMPnotAvailable

to

OMPFLAG    = -fopenmp

If you get an error:

gcc gen_slogconf.c -o gen_slogconf
gen_slogconf.c: In function ‘main’:
gen_slogconf.c:14: error: ‘__int64’ undeclared (first use in this function)
gen_slogconf.c:14: error: (Each undeclared identifier is reported only once
gen_slogconf.c:14: error: for each function it appears in.)

then add the following two lines to gen_slogconf.c:

#include <sys/types.h>
#define __int64 int64_t

Another possible error I got has the following output:

/usr/java/latest/include -I/usr/java/latest/include/linux ../../../utils/epk_archive.c -c -o so_epk_archive.o
gcc gen_slogconf.c -o gen_slogconf
./gen_slogconf > trace_impl.h
make[5]: *** [trace_impl.h] Error 25
make[5]: Leaving directory `/root/Install/scalasca-1.3.0/build-linux-gnu-nompi/build/epik/epilog/tools/elg2slog2'
make[4]: *** [all] Error 2
make[4]: Leaving directory `/root/Install/scalasca-1.3.0/build-linux-gnu-nompi/build/epik/epilog/tools'
make[3]: *** [all] Error 2
make[3]: Leaving directory `/root/Install/scalasca-1.3.0/build-linux-gnu-nompi/build/epik'
make[2]: *** [all-epik] Error 2
make[2]: Leaving directory `/root/Install/scalasca-1.3.0/build-linux-gnu-nompi/build'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/root/Install/scalasca-1.3.0/build-linux-gnu-nompi'
make: *** [all] Error 2

The solution here is to replace gen_slogconf.c from the distribution with the following code:

#include <stdio.h>
#include <stdint.h>

/* Prints the configuration header for the SLOG2 converter build.
 * sizeof yields a size_t, so %zu is the correct conversion specifier. */
int main() {
  printf("/* -- DO NOT CHANGE ---------------------------- */\n");
  printf("/* -- AUTOMATICALLY GENERATED BY gen_slogconf -- */\n\n");
  printf("#define HAVE_JNI_H 1\n");
  printf("#define STDC_HEADERS 1\n");
  printf("#define SIZEOF_INT %zu\n", sizeof(int));
  printf("#define SIZEOF_LONG %zu\n", sizeof(long));
  printf("#define SIZEOF_LONG_LONG %zu\n", sizeof(long long));
  printf("#define SIZEOF_VOID_P %zu\n", sizeof(void*));
  printf("#define SIZEOF___INT64 %zu\n", sizeof(int64_t));

  return 0;
}

After that, enter the directory /root/Install/scalasca-1.3.0/build-linux-gnu-nompi/build/epik/epilog/tools/elg2slog2 and run make manually:
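
# cd /root/Install/scalasca-1.3.0/build-linux-gnu-nompi/build/epik/epilog/tools/elg2slog2
# make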

Usage examples

To convert EPILOG trace gathered by Scalasca to OTF use the following command:

# LD_LIBRARY_PATH=/opt/otf-1.7/lib elg2otf trace.elg

Conversion to Paraver format is done using elg2prv:

# elg2prv trace.elg

Then to view converted trace in Paraver GUI run:

# LD_LIBRARY_PATH=/opt/boost-1.43.0/lib:/opt/wxGTK-2.8.11/lib:/opt/wxpropgrid-1.4.12/lib:/opt/paraver-kernel-1.0/lib/paraver-kernel /opt/wxparaver-3.99/bin/wxparaver trace.elg


UPC performance analysis

October 8, 2012

GASP is an interface for instrumenting UPC applications. It allows performance analysis tools to gather information on application execution. If you want to gather such data you implement a number of callback functions, which receive all the information when the application is run; then you compile your implementation along with the user code. In a nutshell, GASP is simply a collection of functions which are inserted at the beginning and at the end of UPC library functions. Each time a function is called you receive data through the callbacks. Then it's up to you what you do with that data: count how many times a function has been called, calculate how much data has been transferred between threads, find out how much time has been spent in barriers and so on. GASP is described in detail in its specification, and there are also several articles on the topic; you can find that information at the official GASP site. GASP is implemented in Berkeley UPC, and there is also limited support in GCC UPC.
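
As an illustration, here is a minimal sketch of the callback side of such a tool. It assumes the entry points from the GASP specification (the exact header, type and constant names may vary between versions, and the spec defines a few more entry points, such as gasp_control, omitted here). As described above, this C part would go into a library, with the UPC upcalls kept in a separate file like gaspu.upc:

/* Minimal GASP tool sketch -- verify names against the gasp.h
 * shipped with your Berkeley UPC build. */
#include "gasp.h"

static unsigned long events = 0;

/* Called once per UPC thread at startup; returns an opaque tool context. */
gasp_context_t gasp_init(gasp_model_t srcmodel, int *argc, char ***argv)
{
  return NULL; /* a real tool would allocate per-thread state here */
}

/* Called on entry (GASP_START) and exit (GASP_END) of each instrumented
 * operation; evttag identifies the operation, varargs carry its arguments. */
void gasp_event_notify(gasp_context_t context, unsigned int evttag,
                       gasp_evttype_t evttype, const char *filename,
                       int linenum, int colnum, ...)
{
  if (evttype == GASP_START)
    events++; /* a real tool would report this count when the program ends */
}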

upc_dump

There are several tools that can use GASP to analyze UPC applications. The simplest tool is upc_dump, which is included in the Berkeley UPC compiler. It doesn't actually analyze the application; it simply dumps all tracing information. Here is how you compile a UPC application with upc_dump:

> /opt/bupc-runtime-2.8.0/opt_inst/bin/upcc -network=udp --inst-toolname=dump /opt/bupc-runtime-2.8.0/opt_inst/bin/gasp-dump/gaspu.upc -L/opt/bupc-runtime-2.8.0/opt_inst/bin/gasp-dump -lgasp-dump -T=4 --inst bin_file

First of all, you need to compile your code with the instrumented version of UPC from the opt_inst subdirectory (I describe how to build opt_inst in my post on advanced UPC installation notes below). With --inst-toolname you provide the name of the instrumentation tool (it seems that it doesn't matter what you specify here). Then you compile your application along with gaspu.upc. Developers of performance analysis tools usually put the GASP callback implementations into a library and all upcalls to UPC code into a separate source file (such as gaspu.upc for upc_dump), since the library is compiled with a C compiler and cannot include UPC code. You provide the path to the library with the -L flag and the name of the library with the -l flag. The last flag, --inst, instructs the compiler to instrument all UPC library functions. --inst-functions also instruments all user-defined functions. Additionally, --inst-local instruments all local accesses to shared memory within a thread.

upc_trace

upc_trace is another tool distributed with Berkeley UPC, and it has some real performance analysis functionality. Unlike with upc_dump, you don't need to manually pass all the flags to the compiler. What you need to do is compile an opt_trace subbuild and run your application through upcrun with the -trace flag. When execution completes you will have several trace files, which you pass to the upc_trace tool. The performance analysis data is presented as an output text file. You can control which types of events are gathered using the GASNET_TRACEMASK environment variable.
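
A typical session might look like this (thread count and the trace file placeholder are just examples):

> upcrun -trace -n 8 ./a.out
> upc_trace <trace files produced by the run>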

Parallel Performance Wizard

PPW is probably the only performance analysis tool for UPC with rich functionality. PPW installation is simple:

# ./configure --prefix=/home/fred/ppw-2.6 --with-upc=/home/fred/bupc-runtime-2.10.0
# make
# make install

Then, to perform simple tests, use the PPW wrappers for upcc and upcrun:

> ppwupcc -network=udp --inst-functions -T=32 upc_code.c
> UPC_NODES="n1 n2 n3 n4" ppwrun --output=upc_code.par upcrun -n 32 -nodes=4 bin_file

PPW has a GUI where you open the .par file and see what happened in the application.

Advanced notes on Unified Parallel C installation

October 8, 2012

I already described the basic Berkeley UPC compiler installation here. Now let's go into more detail.

Backend Compilers

Basically, the UPC compiler is a translator from the UPC language to C. After translation is done, a backend C compiler is invoked to actually compile the code. On Linux clusters GCC is used by default; if you have Intel, Sun or any other high performance compiler installed, use the CC and CXX flags at the UPC runtime configure step:

./configure CC=icc CXX=icpc --prefix=/opt/bupc-runtime-2.12.1-icc
./configure CC=suncc CXX=sunCC --prefix=/opt/bupc-runtime-2.10.0-suncc

Optional UPC builds

By default Berkeley UPC is installed in two configurations: debug (with GASNet assertions enabled and debugging info compiled in) and opt (an optimized version for everyday use). You will see the debug and opt subdirectories in your working UPC runtime build. But you can install additional versions of the runtime for other uses.

Berkeley UPC has an integrated tracing facility. If you run your application through upcrun with the -trace flag, tracing data is collected and you can analyze it with the upc_trace utility. The tracing build can be compiled by using the opt_trace multiconf option:

./configure --prefix=/opt/bupc-runtime-2.12.1 --with-multiconf=+opt_trace

Berkeley UPC has integrated callbacks (called GASP) for third-party instrumenting utilities. Instrumentation allows developers of performance analysis tools to gather all sorts of information about UPC program execution, like the functions called, their arguments, etc. If you want to develop your own UPC performance analysis tool you can use this feature during development and instruct users to build the opt_inst version of UPC to be able to use your tool later.

./configure --prefix=/opt/bupc-runtime-2.12.1 --with-multiconf=+opt_inst

You can debug UPC applications with the dbg build; if you are a developer who uses an instrumented build of UPC and needs to debug it, build a dbg_inst version. There was a dbg_inst.patch (find the link below) adding dbg_inst functionality to UPC, but as far as I remember it has already been integrated into the compiler.

./configure --prefix=/opt/bupc-runtime-2.12.1 --with-multiconf=+dbg_inst

There was also another bug which broke dbg_inst in 2.12.1 (dbg_inst was originally implemented in 2.10.0), producing the following errors:

/root/install/berkeley_upc-2.12.1/gasnet/gasnet_trace.c: In function ‘gasneti_trace_finish’:
/root/install/berkeley_upc-2.12.1/gasnet/gasnet_trace.c:988: error: ‘gasneti_mallocreport_filename’ undeclared (first use in this function)
/root/install/berkeley_upc-2.12.1/gasnet/gasnet_trace.c:988: error: (Each undeclared identifier is reported only once
/root/install/berkeley_upc-2.12.1/gasnet/gasnet_trace.c:988: error: for each function it appears in.)

To resolve this issue apply mallocreport.patch00 (find the link below). If you use a recent Berkeley UPC build you won't see this bug.

Block size

If you work with huge matrices and want to distribute them in large chunks of consecutive rows, you will run into UPC's block size limitation. UPC packs its pointer-to-shared representation into one 64-bit integer. By default 34 bits are allocated for the memory address, 10 bits for the thread and 20 bits for the phase (or block size). 2^20 is 1,048,576 elements, which is a very small number. You can redistribute the bits with the --with-sptr-packed-bits=value configure option, where value='phase,thread,addr', but then you will have either a small address space or a small number of threads.
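
For example, a hypothetical repacking that trades thread bits for a larger block size (the three values are the phase, thread and address bit widths and must sum to 64) could look like:

./configure --prefix=/opt/bupc-runtime-2.12.1 --with-sptr-packed-bits=24,6,34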

Another option is the --enable-sptr-struct configure flag, which changes the shared pointer representation from an int to a struct. It increases the maximum block size to 2^31 − 1, which is 2,147,483,647. But even that can be too small if you conduct performance measurements and need to run your code on 1 thread: then the whole matrix is one huge block, and a 50000×50000 matrix (2.5 × 10^9 elements) already exceeds the limit.
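
To illustrate, here is a sketch of the kind of row-block distribution in question (the size is illustrative; it needs a static THREADS environment, i.e. compiling with upcc -T=<threads>):

#include <upc_relaxed.h>

#define N 1024 /* illustrative; the matrix from the text is N = 50000 */

/* Row-block distribution: each thread owns one contiguous block of rows,
 * so the block size is N*N/THREADS elements. With the default pointer
 * packing this must fit in 20 bits (2^20 elements); with
 * --enable-sptr-struct, in 2^31 - 1. For N = 50000 and THREADS = 1 the
 * single block is 2.5e9 elements, which exceeds both limits. */
shared [N*N/THREADS] double A[N][N];

int main(void) { return 0; }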

If 2^31 − 1 is not enough, the last option is to use a row-distributed algorithm instead of a row-block-distributed one.

POSIX shared memory problems with InfiniBand

UPC supports two types of same-node inter-thread shared memory communication: POSIX shared memory and SysV shared memory. POSIX is configured by default. If you register large amounts of shared memory for many PSHM processes using the --shared-heap key, you can see errors like these:

*** FATAL ERROR: Unexpected error Bad address (rc=1 errno=14) when registering the segment
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
*** Caught a fatal signal: SIGABRT(6) on node 29/32

To solve this problem, reinstall the runtime with the following options:

./configure --prefix=/opt/bupc-runtime-2.12.1 --enable-pshm --disable-pshm-posix --enable-pshm-sysv

Bug when building translator

For some vendor-built GCC releases, such as Red Hat's, older versions of the translator fail to compile with an error like:

/usr/bin/ld: ipl_summarize_util.o: relocation R_X86_64_PC32 against `Phi_To_Idx_Map' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: ld returned 1 exit status

It's bug number 2202 in the UPC Bugzilla and is described here. The solution and patch are described in post 17. Find a copy of the patch below.

UPC I/O support for large files

UPC has a parallel I/O extension. In version 2.14.0 and earlier, UPC I/O by default supported files up to 2GB in length, which led to upc_all_fread_shared() returning -1 "Invalid argument" for data beyond the 2GB limit. To switch from 32-bit to 64-bit file offsets, use the BUPC_IO_64 define at the runtime configure step:

./configure CC="gcc -DBUPC_IO_64" CXX="g++ -DBUPC_IO_64" --prefix=/opt/bupc-runtime-2.12.1

Replace GCC with your own compiler.

SUN compiler issues

If you run into an error (I had it in version 2.10.0):

"/home/fred/install/berkeley_upc-2.10.0/upcr_profile.c", line 36: left operand must be modifiable lvalue: op "="
cc: acomp failed for /home/fred/install/berkeley_upc-2.10.0/upcr_globfiles.c

Apply patch sun_const_field.patch00 (find the link below). Additional info can be found in the Berkeley UPC Bugzilla, bug number 2696.

Another bug (not an error, but an annoyance) shows up as numerous warnings throughout compilation:

"/home/fred/install/berkeley_upc-2.10.0/upcr_atomic.h", line 876: warning: result of paste undefined and not portable: 64_ (E_PASTE_RESULT_NOT_TOKEN)
"/home/fred/install/berkeley_upc-2.10.0/upcr_atomic.h", line 876: warning: result of paste undefined and not portable: 64_cswap (E_PASTE_RESULT_NOT_TOKEN)

To get rid of it, apply patch not_token.patch00 (find the link below). It's described in the same bug 2696.

Links to patches

Unfortunately, WordPress doesn't allow uploading .txt files for security reasons, and other formats, such as .doc or .pdf, break the lines. So I decided to give direct links when possible and to provide the contents of each patch as text converted to .jpg, in case a direct link breaks in the future. The drawback is that you will have to type it in yourself or OCR it.

Basic UPC compiler installation

October 8, 2012

There was a time when I worked heavily on a UPC-related project and had several issues with the installation of the Berkeley UPC compiler. I don't want that information to be wasted, so I will share it here with everyone in several posts. I worked with Berkeley UPC versions up to 2.14.0, so this post may already be obsolete for you.

Compilation

The Berkeley UPC compiler consists of a runtime and a translator (you can use the online translator if you want); they are installed separately. I used several flags at the configure stage that I'd like to explain.

The first flag is --without-mpi-cc. UPC supports several underlying transports to exchange messages between threads. The most basic is udp; I worked primarily on ibv (InfiniBand). UPC also installs the mpi transport by default. It's slow and it requires an MPI installation, so I never used it and preferred to disable it.

The --disable-aligned-segments flag is usually a must in Linux environments. There is a security feature which randomizes the virtual address space, and this prevents UPC threads from using the same base memory address on all nodes, which introduces some additional pointer arithmetic in the dereference of a UPC pointer-to-shared. So you either disable the Linux virtual address space randomization feature or use this flag.

It is stated that UPC can have issues with GCC 4.0.x through 4.2.x as a backend compiler: GCC can misoptimize a shared-local access such that it deterministically reads or writes an incorrect value. So with these GCC versions you cannot install UPC without the --enable-allow-gcc4 flag. I never had any issues with GCC, so you can safely use it.
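
Putting these together, my typical runtime configure line looked roughly like this (the prefix is just an example):

./configure --prefix=/opt/bupc-runtime-2.14.0 --without-mpi-cc --disable-aligned-segments --enable-allow-gcc4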

Post-installation tasks

After the installation is completed you need to point the UPC runtime to your locally installed translator, otherwise it will try to use the online translator on the Berkeley web site. Under each UPC build subdirectory (opt, dbg, etc.) set the translator directive in etc/upcc.conf to:

translator = /opt/translator-installation-dir/targ

You need to correctly configure NFS and SSH on your nodes, so that they can access and run your application binaries without a password. If you use a firewall you need to open all the necessary ports. For me they were:

111 tcp, udp for portmapper
2049 tcp for nfs
892 tcp, udp for mountd
32803 tcp, 32769 udp for lockd
662 tcp, udp for statd

Since lockd uses dynamic ports, uncomment static port configuration in /etc/sysconfig/nfs:

LOCKD_TCPPORT=32803
LOCKD_UDPPORT=32769
MOUNTD_PORT=892
STATD_PORT=662
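
With the ports fixed, opening them with iptables could look something like this (a sketch; adapt it to your firewall setup):

# iptables -A INPUT -p tcp -m multiport --dports 111,2049,892,662,32803 -j ACCEPT
# iptables -A INPUT -p udp -m multiport --dports 111,892,662,32769 -j ACCEPT
# service iptables save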

SSH is also just a walk in the park:

# su - fred
> ssh-keygen -t rsa
> cp /home/fred/.ssh/id_rsa.pub /home/fred/.ssh/authorized_keys
> chmod 600 /home/fred/.ssh/authorized_keys
> chown fred:fred /home/fred/.ssh/authorized_keys

Usage example

> upcc --network=udp source_code.c
> UPC_NODES="node1 node2 node3 node4" upcrun -n 32 bin_file

You choose the conduit with the --network flag, the UPC_NODES environment variable sets the hosts which will run the code, and -n sets the number of threads.

Possible problems

You can encounter the following error when you run a UPC application:

*** FATAL ERROR: Got an xSocket while spawning slave process: connect() failed while creating a connect socket (111:Connection refused)
bash: line 1: 10535 Aborted './a.out' '__AMUDP_SLAVE_PROCESS__' 'node1:49655'

This can happen if you use a firewall and didn't uncomment the static port configuration for the lockd daemon: each time it takes a random port which doesn't match what you entered in the firewall configuration, and communication fails.

If you get an error which starts with:

Address node1_ip_address maps to node1, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
AMUDP sendPacket returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
from function sendPacket
at /root/install/berkeley_upc-2.8.0/gasnet/other/amudp/amudp_reqrep.cpp:99
reason: Invalid argument

or

AMUDP sendPacket returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
from function sendPacket
at /root/install/berkeley_upc-2.8.0/gasnet/other/amudp/amudp_reqrep.cpp:99
reason: Invalid argument

then you have an /etc/hosts misconfiguration. Don't add the compute node's hostname to the 127.0.0.1 line in /etc/hosts; there should be only the real address line. /etc/hosts on each node should look something like this:

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.0.0.1 node1
10.0.0.2 node2
10.0.0.3 node3

Export share in ROCKS

March 14, 2012

In my previous post I described how to present an iSCSI LUN to a Linux host. I moved all home directories to this NAS share, but later I came to the conclusion that making a separate share would be better. Users should be able to quickly compile applications in their home directories, and if home directories are also used as target storage for computational data, the iSCSI network link can become a bottleneck during computation and slow everything down. That's why I decided to separate them. It requires exporting an additional share, which can be done very easily in ROCKS.

1. Mount the LUN to, say, /export/scratch

2. Make the export by adding (all on one line) to /etc/exports

/export/scratch 192.168.111.128(rw,async,no_root_squash) 192.168.111.0/255.255.255.0(rw,async)

3. Restart nfs

/etc/rc.d/init.d/nfs restart

4. Add a line to /etc/auto.share

scratch master.local:/export/&

5. Update 411 config

make -C /var/411

Now the share is accessible by all compute nodes at /share/scratch.
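
To verify the export, you can query the frontend from a compute node with showmount (assuming the frontend is master.local, as in the auto.share line above):

# showmount -e master.local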

The same process is described in the ROCKS FAQ here.

Reinstalling ROCKS compute cluster node

December 1, 2011

If you have a faulty HPC node and want to reinstall it, for instance after a hard drive replacement, you should bear in mind several things:

  • Make sure xinetd is listening on port 69 for tftpd requests on the frontend.
  • Check the firewall rules, or simply switch the firewall off during the install; otherwise you'll get PXE-E32: TFTP open timeout.
  • Configure the frontend to force compute node reinstallation; if you don't, you'll just see PXE-M0F: Exiting HP PXE ROM or similar. Execute the following command on the frontend: rocks set host boot <nodename> action=install.
  • In case you get an "unable to read package metadata" error during installation, go to /export/rocks/install/, remove the rocks-dist folder and recreate the installation tree by running rocks create distro.
  • After host installation, put all additional packages (like IB, MVAPICH, etc.) into /share/apps and run rocks run host <nodename> "rpm -Uvh /share/apps/*.rpm". Set the necessary services (like openibd and/or opensmd) to run upon startup via chkconfig and start them. You may also need to copy some manually installed packages to the compute node's /opt directory.
  • If you commented out the faulty node earlier in /opt/torque/server_priv/nodes, uncomment it and restart the pbs_server service.

This is it. Now you should be good to go.

HP BladeSystem c3000

October 29, 2011

We have a High Performance Computing (HPC) cluster I'd like to show. It has 72 cores and 152GB of RAM in total. We use ROCKS as the cluster middleware. The interconnect is DDR InfiniBand.

We have two groups of servers. The first group is two BL2x220c blades; since they are double-sided, it's actually four servers, each with two 4-core CPUs and 16GB of RAM. The second group consists of five BL280c blades, each of which also has two 4-core CPUs but 24GB of RAM. The eighth blade is a BL260c, which serves as the master server.


BL280c blade server. This dude has 8 Xeon cores and 24GB of RAM.

Every component of the HP BladeSystem c3000 is hot-swappable. Here I show how I disconnect the Onboard Administrator on a fully operational system.

Fans, power supplies and all interconnects are on the back.

Here is the 16-port DDR InfiniBand switch. Each port runs at 20 Gb/s per direction (40 Gb/s FDX).

Uplink ports for Onboard Administrator.

16 ports of Ethernet pass-through for blade servers.


Six power supplies in N+1 redundant configuration. Each is capable of 1200 Watts. 7200 Watts in total.

Inside blade server.

InfiniBand mezzanine. One such module is capable of 80 Gb/s FDX.

If you are interested in benchmarking results, find them here for pure IB and here for IPoIB.

Benchmarking InfiniBand 2

March 2, 2011

Instead of benchmarking IPoIB as I did in my previous post, it's much more interesting to test pure InfiniBand. There is the OSU Micro-Benchmarks suite from Ohio State. It has several tests: unidirectional bandwidth (bw), bidirectional bandwidth (bibw), unidirectional one-sided get (get_bw), unidirectional one-sided put (put_bw), bidirectional one-sided put (put_bibw) and corresponding latency tests. Here is what I got on my Mellanox ConnectX cards:
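
The benchmarks themselves are MPI programs, so a run looks roughly like this (host names are examples, and the exact launcher syntax depends on your MPI implementation):

> mpirun -np 2 -host node1,node2 ./osu_bw
> mpirun -np 2 -host node1,node2 ./osu_latency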

OSCAR or ROCKS?

February 20, 2011

My recent task was the deployment of an HPC cluster. It doesn't make any sense to install each node separately and manually install all the cluster software; there is a simpler solution. Several software suites exist which allow you to automatically deploy the head node and all compute nodes via PXE.

OSCAR

My first shot was OSCAR. Since I was actually redeploying, I had some old version of OSCAR installed. The first surprise: OSCAR doesn't support upgrades, which effectively means you need to reinstall your head node. I tried to upgrade OSCAR manually, but with no luck: it has a complicated installation process and non-obvious dependencies. After I reinstalled the operating system and installed the latest OSCAR package I couldn't install it via its GUI. When I clicked the button for the fourth step I saw the following error:

Tk::Error: Can't set -options to `ARRAY(0xd34c710)' for Tk::Optionmenu=HASH(0xd360660): No -label at /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/Tk/Widget.pm line 256

And you can do nothing with that. Period.

Luckily a new version of OSCAR had just come out, and after another full OS reinstall I had the GUI working just fine. But when I tried to select Torque PBS for installation I ran into another error with a broken dependency:

opkg-torque-server-2.1.13-1.noarch from unstable_rhel-5-x86_64 has depsolving problems
--> Missing Dependency: torque-modulefile is needed by package opkg-torque-server-2.1.13-1.noarch (unstable_rhel-5-x86_64)

Then I tried to contact the developers. They issued several patches, but for some weird reason their patches changed nothing. I decided to unselect Torque and install it manually later, but then I ran into another dependency error with MAUI, since MAUI depends on Torque. I unselected MAUI. Then I ran into yet another dependency resolution error with LAM. I unselected LAM, and after that I was at least able to install the OSCAR packages on the head node.

The next step was compute node deployment. After setting up the correct PXE boot environment and booting one test compute node, the installation failed due to lack of support for the compute node's NIC. Luckily OSCAR has a so-called "Use Your Own Kernel (YOUK)" option which substitutes the head node's kernel for the original one. That made it possible to start the installation.

Unfortunately, another error showed up: for some reason the rsync connection between the head and compute node broke at a random point in time. Rsync is the only way to deploy the OS onto compute nodes, which means the story ends at this point. I spent 3 weeks on OSCAR and was not able to deploy a high performance computing cluster with it.

ROCKS

ROCKS was a breath of fresh air. One day was enough to install everything. The only issue I had was a compute node deployment delay: some nodes hung for 10-15 minutes before downloading the distro via TFTP. But that is almost a non-issue.

ROCKS post install configuration

February 20, 2011

If you use ROCKS to deploy a cluster you still need to do some manual configuration. What I personally did was:

1. Disable hyperthreading in the BIOS. HT is not helpful for compute-intensive tasks; it's better suited for I/O-intensive applications.

2. Add compute nodes to /opt/torque/server_priv/nodes in the following format:

compute-1-2 np=8

Where compute-1-2 is the node's hostname and np is the number of processors (cores).

3. Add master.local to /etc/hosts.equiv (for Torque).

4. Install the InfiniBand stack. Most of the IB RPMs are already installed, but I also added:

libibumad, opensm-libs, opensm, ibutils, infiniband-diags, ibutils-libs, libibmad, libmlx4, swig

I just downloaded them with

# yumdownloader package-name

then put them into /share/apps and installed them with

# rocks run host compute “rpm -Uvh /share/apps/*.rpm”

Then set the openibd service to run upon startup on all compute nodes, and the opensmd service on any ONE of the compute nodes.
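
For example (the node name in the second command is illustrative):

# rocks run host compute "chkconfig openibd on; service openibd start"
# ssh compute-0-0 "chkconfig opensmd on; service opensmd start"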

5. Then you need to set up queues. There is a 'default' queue which is decent for a simple setup, but I have nodes of two types, and it's handy to have separate queues. qmgr is the binary which controls Torque queues. I made a simple qmgr.in text file which contained:

create queue srail queue_type=execution
set queue srail started=true
set queue srail enabled=true
set queue srail resources_default.neednodes=srail

create queue mrail queue_type=execution
set queue mrail started=true
set queue mrail enabled=true
set queue mrail resources_default.neednodes=mrail

Then I fed this file to qmgr:

# qmgr < qmgr.in

To check the qmgr configuration, call:

# qmgr -c 'p s'

Finally, change the node records in /opt/torque/server_priv/nodes from

compute-1-2 np=8

to

compute-1-2 np=8 mrail

6. ROCKS doesn't support LVM. It installs everything on the first hard drive it finds. If you have two hard drives, just mount the second drive to /state/partition1/home and add the home partition to /etc/exports.
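
For example (the device name, filesystem type and network are assumptions; adjust them to your setup), the /etc/fstab and /etc/exports entries could look like:

/dev/sdb1 /state/partition1/home ext3 defaults 0 0
/state/partition1/home 10.0.0.0/255.255.255.0(rw,async)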