If you have a faulty HPC node and want to reinstall it, for instance after a hard drive replacement, you should bear several things in mind:
- Make sure xinetd is listening on UDP port 69 for TFTP requests on the frontend.
- Check the firewall rules on the frontend, or simply switch the firewall off for the duration of the install. If TFTP requests are blocked, you'll get PXE-E32: TFTP open timeout.
- Then configure your frontend to force reinstallation of the compute node by executing the following command on the frontend: rocks set host boot <nodename> action=install. If you don't do that, you'll just see PXE-M0F: Exiting HP PXE ROM or a similar message.
- If you get an "unable to read package metadata" error during installation, go to /export/rocks/install/, remove the rocks-dist folder, and recreate the installation tree by running rocks create distro.
- After the node is installed, put all additional packages (IB, MVAPICH, etc.) into /share/apps and run rocks run host <nodename> "rpm -Uvh /share/apps/*.rpm". Use chkconfig to make the necessary services (such as openibd and/or opensmd) run at startup, then start them. You may also need to copy some manually installed packages to the compute node's /opt directory.
- If you commented out the faulty node earlier in /opt/torque/server_priv/nodes, uncomment it and restart the pbs_server service.
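
The commands from the steps above can be collected into a small helper script. This is only a rough sketch, assuming it runs as root on the frontend. The names DRY_RUN, run, prepare_reinstall, rebuild_distro and finish_reinstall are made up for illustration and are not part of Rocks, and compute-0-0 is a hypothetical node name. The service names (openibd, pbs_server) are the ones mentioned above and may differ on your cluster. By default the script only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the reinstall steps, meant to be run on the frontend as root.
# DRY_RUN and all function names below are illustrative, not part of Rocks.

DRY_RUN="${DRY_RUN:-1}"   # 1 (default) only prints the commands; 0 executes them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Before booting the node: flag it for reinstallation on its next PXE boot.
prepare_reinstall() {
    run rocks set host boot "$1" action=install
}

# Only needed if the installer reports "unable to read package metadata".
rebuild_distro() {
    run rm -rf /export/rocks/install/rocks-dist
    run sh -c 'cd /export/rocks/install && rocks create distro'
}

# After the node is back up: install the extra RPMs, enable and start the
# InfiniBand service, and restart Torque (only needed if the node had been
# commented out in /opt/torque/server_priv/nodes and is now uncommented).
finish_reinstall() {
    run rocks run host "$1" "rpm -Uvh /share/apps/*.rpm"
    run rocks run host "$1" "chkconfig openibd on && service openibd start"
    run service pbs_server restart
}

# Dry-run demonstration with a hypothetical node name.
prepare_reinstall compute-0-0
finish_reinstall compute-0-0
```

Review the printed commands first. To actually execute them, set DRY_RUN=0, and call finish_reinstall only once the node has finished installing and is reachable again.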
That's it. Now you should be good to go.
Tags: cluster, high performance computing, HP, HPC, IB, InfiniBand, PXE, ROCKS, TFTP