Posts Tagged ‘PXE’

Reinstalling ROCKS compute cluster node

December 1, 2011

If you have a faulty HPC node and need to reinstall it, for instance after a hard drive replacement, you should bear in mind several things:

  • Make sure xinetd is listening on UDP port 69 for TFTP requests on the frontend.
  • Check the firewall rules on the frontend, or simply switch the firewall off during the install. Otherwise you’ll get PXE-E32: TFTP open timeout.
  • Then configure the frontend to force reinstallation of the compute node. If you don’t, you’ll just see PXE-M0F: Exiting HP PXE ROM or similar. Execute the following command on the frontend: rocks set host boot <nodename> action=install.
  • If you get an “unable to read package metadata” error during installation, go to /export/rocks/install/, remove the rocks-dist folder and recreate the installation tree by running rocks create distro.
  • After the host is installed, put all additional packages (IB, MVAPICH, etc.) into /share/apps and run rocks run host <nodename> "rpm -Uvh /share/apps/*.rpm". Enable the necessary services (such as openibd and/or opensmd) at startup via chkconfig and start them. You may also need to copy some manually installed packages to the compute node’s /opt directory.
  • If you commented out the faulty node earlier in /opt/torque/server_priv/nodes, uncomment it and restart the pbs_server service.
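The steps above can be collected into a small helper script. This is only a sketch under assumptions: the function names, the DRY_RUN variable and the node name compute-0-1 are made up for illustration, and openibd is just the example service named in the list. By default the script only prints the commands; nothing is executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the frontend-side steps for reinstalling one compute node.
# Function names and DRY_RUN are hypothetical, not part of ROCKS.

# Print a command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Pre-check: list listening UDP sockets; look for ":69" (TFTP) in the output.
check_tftp() {
    run netstat -lnu
}

# Force node $1 to reinstall on its next PXE boot.
force_reinstall() {
    run rocks set host boot "$1" action=install
}

# Only for the "unable to read package metadata" error: rebuild the tree.
rebuild_distro() {
    run cd /export/rocks/install &&
    run rm -rf rocks-dist &&
    run rocks create distro
}

# After node $1 has finished installing: extra RPMs, services, Torque.
post_install() {
    # The glob is quoted so that it expands on the compute node.
    run rocks run host "$1" "rpm -Uvh /share/apps/*.rpm"
    run rocks run host "$1" "chkconfig openibd on && service openibd start"
    # Needed only if the node was commented out in server_priv/nodes.
    run service pbs_server restart
}
```

To try it, source the file on the frontend and call, for example, check_tftp and force_reinstall compute-0-1, reboot the node, then call post_install compute-0-1 once the installation has finished. The script does not wait for the installation itself, which is why the steps are kept as separate functions.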

That’s it. Now you should be good to go.

OSCAR or ROCKS?

February 20, 2011

My recent task was the deployment of an HPC cluster. It doesn’t make sense to install each node separately and then install all the cluster software by hand. There is a simpler solution: several software suites can automatically deploy the head node and all compute nodes via PXE.

OSCAR

My first shot was OSCAR. Since I was actually redeploying, I had an old version of OSCAR installed. The first surprise: OSCAR doesn’t support upgrades, which effectively means you need to reinstall your head node. I tried to upgrade OSCAR manually, but with no luck; it has a complicated installation process and non-obvious dependencies. After I reinstalled the operating system and installed the latest OSCAR package, I couldn’t complete the installation via its GUI. When I clicked the button for the fourth step, I saw the following error:

Tk::Error: Can't set -options to `ARRAY(0xd34c710)' for Tk::Optionmenu=HASH(0xd360660): No -label at /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/Tk/Widget.pm line 256

And there is nothing you can do about it. Period.

Luckily, a new version of OSCAR had just come out, and after another full OS reinstall the GUI worked fine. But when I tried to select Torque PBS for installation, I ran into another error, this time a broken dependency:

opkg-torque-server-2.1.13-1.noarch from unstable_rhel-5-x86_64 has depsolving problems

--> Missing Dependency: torque-modulefile is needed by package opkg-torque-server-2.1.13-1.noarch (unstable_rhel-5-x86_64)

Then I contacted the developers. They issued several patches, but for some weird reason the patches changed nothing. I decided to unselect Torque and install it manually later, but then ran into another dependency error with MAUI, since MAUI depends on Torque. I unselected MAUI. Then I ran into yet another dependency resolution error with LAM. I unselected LAM, and after that I was at least able to install the OSCAR packages on the head node.

The next step was deploying the compute nodes. After setting up a correct PXE boot environment and booting one test compute node, the installation failed because the install kernel did not support the compute node’s NIC. Luckily, OSCAR has a so-called “Use Your Own Kernel” (UYOK) option, which substitutes the head node’s kernel for the original one. That made it possible to start the installation.

Unfortunately, another error showed up. For some reason the rsync connection between the head node and the compute node broke at a random point in time. Rsync is the only way to deploy the OS onto compute nodes, so the story ends here. I spent three weeks on OSCAR and was not able to deploy a high performance computing cluster with it.

ROCKS

ROCKS was a breath of fresh air. One day was enough to install everything. The only issue I had was a delay in compute node deployment: some nodes hung for 10-15 minutes before downloading the distro via TFTP. But that is hardly an issue.
