Posts Tagged ‘IB’

EMC Isilon Overview

February 20, 2014

isilon_logo_188x110OneFS Overview

EMC Isilon OneFS is a storage OS which was built from the ground up as a clustered system.

NetApp’s Clustered ONTAP for example has evolved from being an OS for HA-pair of storage controllers to a clustered system as a result of integration with Spinnaker intellectual property. It’s not necessarily bad, because cDOT shows better performance on SPECsfs2008 than Isilon, but these systems still have two core architectural differences:

1. Isilon doesn’t have RAIDs and complexities associated with them. You don’t choose RAID protection level. You don’t need to think about RAID groups and even load distribution between them. You don’t even have spare drives per se.

2. All data on Isilon system is kept on one volume, which is a one big distributed file system. cDOT use concept of infinite volumes, but bear in mind that each NetApp filer has it’s own file system beneath. If you have 24 NetApp nodes in a cluster, then you have 24 underlying file systems, even though they are viewed as a whole from the client standpoint.

This makes Isilon very easy to configure and operate. But its simplicity comes at a price of flexibility. Isilon web interface has few options to configure and not very feature rich.

Isilon Nodes and Networking

In a nutshell Isilon is a collection of a certain number of nodes connected via 20Gb/s DDR InfiniBand back-end network and either 1GB/s or 10GB/s front-end network for client connections. There are three types of Isilon nodes S-Series (SAS + SSD drives) for transactional random access I/O, X-Series (SATA + SSD drives) for high throughput applications and NL-series (SATA drives) for archival or not frequently used data.

If you choose to have two IB switches at the back-end, then you’ll have three subnets configured for internal network: int-a, int-b and failover. You can think of a failover network as a virtual network in front of int-a and int-b. So when the packet comes to failover network IP address, the actual IB interface that receives the packet is chosen dynamically. That helps to load-balance the traffic between two IB switches and makes this set up an active/active network.

131_22

On the front-end you can have as many subnets as you like. Subnets are split between pools of IP addresses. And you can add particular node interfaces to the pool. Each pool can have its own SmartConnect zone configured. SmartConnect is a way to load-balance connections between the nodes. Basically SmartConnect is a DNS server which runs on the Isilon side. You can have one SmartConnect service on a subnet level. And one SmartConnect zone (which is simply a domain) on each of the subnet pools. To set up SmartConnect you’ll need to assign an IP address to the SmartConnect service and set a SmartConnect zone name on a pool level. Then you create an “A” record on DNS for the SmartConnect service IP address and delegate SmartConnect DNS zone to this IP. That way each time you refer to the SmartConnect zone to get access to a file share you’ll be redirected to dynamically picked up node from the pool.

SmartPools

Each type of node is automatically assigned to what is called a “Node Pool”. Nodes are grouped to the same pool if they are of the same series, have the same amount of memory and disks of the same type and size. Node Pool level is one of the spots where you can configure protection level. We’ll talk about that later. Node Pools are grouped within Tiers. So you can group NL node pool with 1TB drives and NL node pool with 3TB drives into an archive tier if you wish. And then you have File Pool Policies which you can use to manage placement of files within the cluster. For example, you can redirect files with specific extension or file size or last access time to be saved on a specific node pool or tier. File pool policies also allow you to configure data protection and override the default node pool protection setting.

SmartPools is a concept that Isilon use to name Tier/Node Pool/File Pool Policy approach. File placement is not applied automatically, otherwise it would cause high I/O overhead. It’s implemented as a job on the cluster instead which runs at 22:00 every day by default.

Data Layout and Protection

Instead of using RAIDs, Isilon uses FEC (Forward Error Correction) and more specifically a Reed-Solomon algorithm to protect data on a cluster. It’s similar to RAID5 in how it generates a protection block (or blocks) for each stripe. But it happens on a software level, instead of hardware as in storage arrays. So when a file comes in to a node, Isilon splits the file in stripe units of 128KB each, generates one FEC protection unit and distributes all of them between the nodes using back-end network. This is what is called “+1” protection level, where Isilon can sustain one disk or one node failure. Then you have “+2”, “+3” and “+4”. In “+4” you have four FECs per stripe and can sustain four disk or node failures. Note however that there is a rule that the number of data stripe units in a stripe has to be greater than number of FEC units. So the minimum requirement for “+4” protection level is 9 nodes in a cluster.

dp2

The second option is to use mirroring. You can have from 2x to 8x mirrors of your data. And the third option is “+2:1” and “+3:1” protection levels. These protection levels let you balance between the data protection and amount of the FEC overhead. For example “+2:1” setting compared to “+2” can sustain two drive failures or one node failure, instead of two node failure protection that “+2” offers. And it makes sense, since simultaneous two node failure is unlikely to happen. There is also a difference in how the data is laid out. In “+2” for each stripe Isilon uses one disk on each node and in “+2:1” it uses two disks on each node. And first FEC in this case goes to first subset of disks and second goes to second.

One benefit of not having RAID is that you can set protection level with folder or even file granularity. Which is impossible with conventional RAIDs. And what’s quite handy, you can change protection levels without recreation of storage volumes, as you might have to do while transitioning between some of the RAID levels. When you change protection level for any of the targets, Isilon creates a low priority job which redistributes data within the cluster.

Advanced notes on Unified Parallel C installation

October 8, 2012

I already described basic Berkeley UPC compiler installation here. So now lets go deeper in details.

Backend Compilers

Basically UPC compiler is a translator from UPC language to C. After translation is done, backend C compiler is invoked to actually compile the code. On Linux clusters GCC is used by default, if you have Intel, Sun or any other high performance compiler installed, then use CC and CXX flags in UPC runtime configure step:

./configure CC=icc CXX=icpc --prefix=/opt/bupc-runtime-2.12.1-icc
./configure CC=suncc CXX=sunCC --prefix=/opt/bupc-runtime-2.10.0-suncc

Optional UPC builds

By default Berkeley UPC is installed in two configurations: debug (with GASnet assertions enabled and debugging info compiled in) and opt (optimized version for everyday use). You will see debug and opt subdirectories in your working UPC runtime build. But you can install additional versions of runtime for other uses.

Berkeley UPC has integrated tracing facility. If you upcrun application with the -trace flag, tracing data is collected and you can analyze it with upc_trace utility. Tracing build can be compiled by using opt_trace multiconf option:

./configure --prefix=/opt/bupc-runtime-2.12.1 --with-multiconf=+opt_trace

Berkeley UPC has integrated callbacks (called GASP) for third-party instrumenting utilities. Instrumentation allows developers of performance analysis tools to gather all sorts of information about UPC program execution. Like functions called, their arguments, etc. If you want to develop your own UPC performance analysis tool you can use this feature during development and instruct users to build opt_trace version of UPC to be able to use your tool later.

./configure --prefix=/opt/bupc-runtime-2.12.1 --with-multiconf=+opt_inst

You can debug UPC applications with dbg build, if you are a developer and use instrumented build of UPC and need to debug it, then build a dbg_inst version. There was a dbg_inst.patch (find link below) to add dbg_inst functionality to UPC, but it’s already integrated into compiler as far as I remember.

./configure --prefix=/opt/bupc-runtime-2.12.1 --with-multiconf=+dbg_inst

There was also another bug which broke dbg_inst in 2.12.1 (which was originally implemented in 2.10.0) with the following errors:

/root/install/berkeley_upc-2.12.1/gasnet/gasnet_trace.c: In function ‘gasneti_trace_finish’:

/root/install/berkeley_upc-2.12.1/gasnet/gasnet_trace.c:988: error: ‘gasneti_mallocreport_filename’ undeclared (first use in this function)

/root/install/berkeley_upc-2.12.1/gasnet/gasnet_trace.c:988: error: (Each undeclared identifier is reported only once

/root/install/berkeley_upc-2.12.1/gasnet/gasnet_trace.c:988: error: for each function it appears in.)

To resolve this issue apply mallocreport.patch00 (find link below). But if you use recent Berkeley UPC build you won’t see this bug.

Block size

If you work with huge matrixes and want to distribute them in large chunks of consecutive rows, then you will run into UPC limitation of block size. UPC pack pointer representation into one 64 bit integer. By default 34 bits are allocated for memory address, 10 bits for threads and 20 bits for phase (or block size). 2^20 is basically a 1048576 elements which is a very small number. You can redistribute bits with --with-sptr-packed-bits=value value=’phase,thread,addr’ configure option, but then you will either have small address space or small number of threads.

Another option is to use --enable-sptr-struct configure flag which changes shared pointer representation from int to struct. It will increase block size to 2^(32-1) which is 2147483647. But it could also be too small if you conduct performance measurement and need to run your code for 1 thread. Then the whole matrix is a one huge block. 50000×50000 matrix is already hit the limit.

If 2^(32-1) is not enough, then the last option for you is to use row distributed algorithm instead of row-block distributed.

POSIX shared memory problems with InfiniBand

UPC support two one-node inter-thread shared memory communication types: POSIX shared memory and SYSV shared memory. POSIX is configured by default. If you want to register large amounts of shared memory with many PSHM processes using --shared-heap key you can see errors like these:

*** FATAL ERROR: Unexpected error Bad address (rc=1 errno=14) when registering the segment

NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.

*** Caught a fatal signal: SIGABRT(6) on node 29/32

To solve this problem reinstall runtime using the following options:

./configure --prefix=/opt/bupc-runtime-2.12.1 --enable-pshm --disable-pshm-posix --enable-pshm-sysv

Bug when building translator

For some vendor-build GCC releases, like Red Hat, older versions of translator fail to compile with error like:

/usr/bin/ld: ipl_summarize_util.o: relocation R_X86_64_PC32 against `Phi_To_Idx_Map’ can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: ld returned 1 exit status

It’s a bug number 2202 in UPC Bugzilla and is described here. Solution and patch are described in post 17. Find copy of patch below.

UPC I/O support for large files

UPC have parallel I/O extension. In version 2.14.0 and earlier by default UPC I/O supported files 2GB in length. It led to upc_all_fread_shared() returning -1 “Invalid argument” for data above the 2GB limit. To change defaults from 2^(32-1) bits size to 2^(64-1) use BUPC_IO_64 variable during runtime configure step:

./configure CC=”gcc -DBUPC_IO_64″ CXX=”g++ -DBUPC_IO_64″ --prefix=/opt/bupc-runtime-2.12.1

Replace GCC with your own compiler.

SUN compiler issues

If you run into an error (I had it in version 2.10.0):

“/home/fred/install/berkeley_upc-2.10.0/upcr_profile.c”, line 36: left operand must be modifiable lvalue: op “=”
cc: acomp failed for /home/fred/install/berkeley_upc-2.10.0/upcr_globfiles.c

Apply patch sun_const_field.patch00 (find link below).  Additional info can be found in Berkeley UPC Bugzilla, bug number 2696.

Another bug (not an error, but an annoying warning) looks like numerous warnings throughout compilation:

“/home/fred/install/berkeley_upc-2.10.0/upcr_atomic.h”, line 876: warning: result of paste undefined and not portable: 64_ (E_PASTE_RESULT_NOT_TOKEN)
“/home/fred/install/berkeley_upc-2.10.0/upcr_atomic.h”, line 876: warning: result of paste undefined and not portable: 64_cswap (E_PASTE_RESULT_NOT_TOKEN)

To get rid of it apply patch not_token.patch00 (find link below). It’s described in the same 2696.

Links to patches

Unfortunately WordPress doesn’t allow to upload .txt files due to security reasons. Other formats, such as .doc or .pdf will break the lines. So I decided to give direct links when possible and provide contents of patch in text converted to .jpg format in case direct link will break in future. The drawback is that you will have to type it yourself or OCR it.

Reinstalling ROCKS compute cluster node

December 1, 2011

If you have any faulty HPC node and want to reinstall it for instance in case of hard drive replacement you should bare in mind several things:

  • Make sure xinetd is listening on 65 for tftpd requests on frontend.
  • Check for firewall rules. But you can simply switch it off during install. Otherwise you’ll get PXE-E32: TFTP open timeout.
  • Then you should configure your frontend to force compute node reinstallation. If you won’t do that you’ll just see PXE-M0F: Exiting HP PXE ROM or similar. Execute the following command on frontend: rocks set host boot <nodename> action=install.
  • In case you get an unable to read package metadata error during installation then go to /export/rocks/install/, remove rocks-dist folder and recreate installation tree by running rocks create distro.
  • After host installation put all  additional packages (like IB, MVAPICH, etc) into /share/apps and run rocks run host <nodename> “rpm -Uvh /share/apps/*.rpm”. Make necessary packages (like openibd and/or opensmd) to run upon startup via chkconfig and start them up. You may also need to copy some manually installed packages to compute node’s /opt directory.
  • In case you commented out faulty node earlier in /opt/torque/server_priv/nodes uncomment it and restart pbs_server service.

This is it. Now you should be good to go.

HP BladeSystem c3000

October 29, 2011

We have High Performace Computing (HPC) cluster I’d like to show. It has 72 cores and 152GB of RAM in total. We use ROCKS as cluster middleware. Interconnect is DDR InfiniBand.

We have two groups of servers. First group is two BL2x220c  blades. Since they are double-sided it’s actually four servers. Each with two 4-core CPUs and 16GB of RAM. Second group consists of five BL280c. Each of them also has two 4-core CPUs but 24 GB of RAM. Eighth server is BL260c. This blade serves as master server.

Click pictures to enlarge.

BL280c blade server. This dude has 8 Xeon cores and 24GB of RAM.

Every component of HP BladeSystem c3000 is hot-swappable. Here I show how I disconnect Onboard Adminstrator on fully operational system.

Fans, power supplies and all interconnects are on the back.

Here is the 16-port DDR InfiniBand switch. Each port’s throughput is 80GB/s FDX.

Uplink ports for Onboard Administrator.

16 ports of Ethernet pass-through for blade servers.


Six power supplies in N+1 redundant configuration. Each is capable of 1200 Watts. 7200 Watts in total.

Inside blade server.

InfiniBand mezzanine. One such module is capable of 80Gb/s FDX.

If you are interested in benchmarking results find them here for pure IB and here for IBoIP.

Benchmarking InfiniBand 2

March 2, 2011

Instead of benchmarking IBoIP as I did in my previous post it’s much more interesting to test pure InfiniBand. There is a so-called OSU Micro-Benchmarks suite from Ohio State. It has several tests: unidirectional bandwidth (bw), bidirectional bandwidth (bibw), unidirectional one-sided get (get_bw), unidirectional one-sided put (put_bw), bidirectional one-sided put (put_bibw) and same latency tests. Here is what I got on my Mellanox ConnectX cards:

ROCKS post install configuration

February 20, 2011

If you use ROCKS to deploy cluster you still need to do some manual configuration. What I personally did was:

1. Disable hyperthreading in BIOS. HT is not helpful for compute intensive tasks. It’s better suited for I/O intensive applications.

2. Add compute nodes to /opt/torque/server_priv/nodes in the following format:

compute-1-2 np=8

Where compute-1-2 is node’s domain name and np is the number of processors (cores).

3. Add master.local to /etc/hosts.equiv (for Torque).

4. Install InfiniBand stack. Most of IB RPMs are already installed but I also added:

libibumad, opensm-libs, opensm, ibutils, infiniband-diags, ibutils-libs, libibmad, libmlx4, swig

I just downloaded them by

# yumdownloader package-name

then put them into /share/apps and installed by

# rocks run host compute “rpm -Uvh /share/apps/*.rpm”

Then you will need to set openibd service to run upon startup on all compute nodes and opensmd service on any ONE of compute nodes.

5. Then you need to set up queues. There is a ‘default’ queue which is decent for simple setup. But I have nodes of two types and it’s handy to have separate queues. ‘qmgr’ is a binary which controls Torque queues. I made simple qmgr.in text file which contained:

create queue srail queue_type=execution
set queue srail started=true
set queue srail enabled=true
set queue srail resources_default.neednodes=srail

create queue mrail queue_type=execution
set queue mrail started=true
set queue mrail enabled=true
set queue mrail resources_default.neednodes=mrail

Then I fed this file to qmgr:

# qmgr < qmgr.in

To check qmgr configuration call:

# qmgr -c ‘p s’

At last change node records in /opt/torque/server_priv/nodes from

compute-1-2 np=8

to

compute-1-2 np=8 mrail

6. ROCKS don’t support LVM. It installs everything on first hard drive it finds. If you have two hard drives just mount second drive to /state/partition1/home and add home partition to /etc/exportfs.

Benchmarking InfiniBand

February 2, 2011

As I’ve already mentioned in my previous post called “Activating InfiniBand stack in Linux” there is a perftest package which has simple tests for benchmarking IB bandwidth and latency. Here go my results for default ib_write_bw and ib_write_lat tests. Write, read and send tests results don’t differ much, that’s why I’m posting only write results.

------------------------------------------------------------------
                    RDMA_Write BW Test
Number of qp's running 1
Connection type : RC
Each Qp will post up to 100 messages each time
Inline data is used up to 0 bytes message
  local address:  LID 0x04, QPN 0x18004a, PSN 0xcf8a2e
RKey 0x2c042529 VAddr 0x002af439bf2000
  remote address: LID 0x01, QPN 0x12004a, PSN 0xb446fe,
RKey 0x440428db VAddr 0x002b46ea9b5000
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    BW peak[MB/sec]    BW average[MB/sec]
  65536        5000            1350.34               1350.27
------------------------------------------------------------------

------------------------------------------------------------------
                    RDMA_Write Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
   local address: LID 0x04 QPN 0x16004a PSN 0x5d05e8
RKey 0x2a042529 VAddr 0x00000017f88002
  remote address: LID 0x01 QPN 0x10004a PSN 0xb8cade
RKey 0x420428db VAddr 0x00000000ae2002
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           1.16           6.93             1.22
------------------------------------------------------------------

Activating InfiniBand stack in Linux

February 2, 2011

I did this in CentOS 5. So some steps can differ if you work with another flavor of Linux.

Major implementation of InfiniBand stack is called OFED. It’s a collaborative development of several vendors to standardize APIs. Allience HQ site is http://www.openfabrics.org/. You can take vanilla OFED version there. Vendors also release their own tweaked versions of original OFED, like Mellanox OFED.

OFED stack provides several interfaces for underlying hardware:

  1. First and simplest way of working with InfiniBand is IPoIB. In this case IP stack is put above IB. You don’t need to rewrite your applications while you can utilize high throughput. On the other hand you will kill IB low latencies and won’t be able to utilize whole IB throughput capabilities.
  2. Second way is Sockets Direct Protocol (SDP) which is designed to utilize IB RDMA capabilities  and bypass TCP/IP stack. SDP can be used transparently w/o recompiling your application. It’s not that fast as native IB API but os better than IPoIB.
  3. Third and hardest way is using IB Verbs which is the lowest API, User Direct Access Programming Library (uDAPL) which is based on IB Verbs, Message Passing Interface (MPI) or Unified Parallel C. Different versions of MPI and UPC can be based on either IB Verbs or uDAPL. I personally work with MPI and UPC so I will describe their installation over InfiniBand.

To bring up your InfiniBand hardware you need to do the following things:

1. Install kernel level and user level HCA drivers.  Interesting thing about IB is that driver is split into two parts. Kernel part is usually already compiled as modules. You just need an openibd package which will run as a service upon startup and will load all necessary kernel modules. User level part you should install by your self. Since I’m working with Mellanox ConnectX MT25418 I had to install libmlx4 package.

2. Next thing is Subnet Manager. The InfiniBand subnet manager (OpenSM) assigns Local IDentifiers (LIDs) to each port connected to the InfiniBand fabric, and develops a routing table based off of the assigned LIDs. opensmd package performs Subnet Manager role. You need to set this service up on any one IB node and it will initiate IB fabric upon system startup.

3. Install IB API.  Depending on what software you going to use install appropriate API. I use MPI and UPC parallel applications compilers which use IB Verbs.  libibverbs package is responsible for that.

Additionally  you would probably like to use some diagnostic tools to check IB connectivity. Found them in ibutils and infiniband-diags RPMs:

  • ibv_devinfo, ibstat, ibstatus show device information;
  • sminfo, ibnodes, ibhosts, ibswitches, ibnetdiscover, ibchecknet show fabric information.

Use perftest package for IB performance testing. Simply run ib_write_bw in server mode on one node by just calling # ib_write_bw and run client on the other node # ib_write_bw node_name. Use ib_write_lat for latency testing.

You will probably also need libibcommon, libibumad and libibmad libraries which is needed for opensm, infinband-diags and ibutils to operate.

Links I find useful:

Ethernet vs. InfiniBand

January 18, 2010

Have you ever tried to compare scalability of Ethernet versus InfiniBand on HPC cluster? I was shocked.

It’s a solution of three-dimensional partial differential equation using the Fast Fourier transform. Comparison isn’t absolutely fair. Because I used gcc compiler for Ethernet version and Sun compiler for InfiniBand. Hence the difference between versions on small number of CPUs. But regardess of the fact that gcc is a bit faster Ethernet shows no scalability at all! It’s unbelievable how Ethernet is slow.