Posts Tagged ‘cluster’

RecoverPoint VE: Common Deployment Issues

April 19, 2016

fixIn one of my previous posts I discussed iSCSI connectivity considerations when deploying RecoverPoint VE. In this post I want to describe common issues you may encounter when deploying RecoverPoint clusters, most of which are applicable to both physical appliance and virtual editions.

VNX MirrorView ports

I already touched on that briefly in my previous post. But it’s worth mentioning again that you can NOT use MirrorView ports for iSCSI connectivity between RPAs and VNX arrays. When you try to use a MirrorView iSCSI port for RecoverPoint, it gets upset and doesn’t communicate with the array.

If you make a mistake of connecting one port per SP and this port is a MirrorView port, you will have no communication with the array at all and get the following error in Unisphere for RecoverPoint:

Error Splitter ARRAYNAME-A is down
Error Splitter ARRAYNAME-B is down

splitter_error

If you connect two ports per SP, one of which is MirrorView port and use two iSCSI network subnets you may get the following error when running a SAN connectivity test from the RPA boxmgmt interface. In this case RPA can communicate with the array only over one subnet:

On array ABCD1234567890, all paths for device with UID=0x1234567890abcdef go through RPA Ethernet port eth2 …

multipathing_issue

The solution is as simple as moving the link from port 0 to port 1 on a 10Gb I/O module. And from port 0 to port 1,2 or 3 on a 1Gb I/O module.

If you don’t want to lose two iSCSI ports (1 per SP), especially if it’s 10Gb, and you’re not using MirrorView, you can uninstall MirrorView enabler from the array. Just keep in mind that it will require an array reboot. Service processors will be rebooted one at a time, so there is no downtime. But if it’s a heavily used storage array it’s recommended to schedule uninstallation out of hours to minimize the impact.

Error when redeploying a cluster

If you’ve made configuration mistakes while deploying a RecoverPoint cluster and want to blow the whole thing away and redeploy it from scratch you may encounter the following error when deploying for the second time:

VNX path set with IP 10.10.10.1 already exists in a different path set (RP_0x123abc456def789g_0_iSCSI1)

rpa_redeploy

The cause of the issue is iSCSI sessions which stayed on the VNX after you deleted RPA VMs. You need to connect to the VNX and delete them in Unisphere manually by right-clicking on the storage array name on the dashboard and selecting iSCSI > Connections Between Storage Systems. This is what duplicate sessions look like:

duplicate_rp

As you can see there’re three sets of RecoverPoint cluster iSCSI connections after three unsuccessful attempts.

You will need to delete old sessions before you are able to proceed with the deployment in RecoverPoint Deployment Manager.

Wrong initiator names

I’ve seen this on multiple occasions when RecoverPoint registers initiators on VNX with inconsistent hostnames.

As you’ve seen on the screenshots above, hostname field of every initiator consists of the cluster ID and RPA ID (not sure what the third field means), such as this:

RP_0x123abc456def789g_1_0

In this example you can see that RPA1 has two hostnames with suffixes _0_0 and _1_0.

wrong_initiators

This issue is purely cosmetic and doesn’t affect RecoverPoint operation, but if you want to fix it you will need to restart Management Servers on VNX service processors. It’s a non-disruptive procedure and can be performed by opening the following link http://SP_IP/setup and clicking on “Restart Management Server” button.

After a restart, array will update hostnames to reflect the actual configuration.

Joining two clusters with the licences already applied

This is just not going to work. Make sure to join production and DR clusters before applying RecoverPoint licences or Deployment Manager “Connect Cluster” wizard will fail.

It’s one of the prerequisites specified in RecoverPoint “Installation and Deployment Guide”:

If you plan to connect the new cluster immediately after preparing it for connection,
ensure:

  • You do not install a license in, or modify the settings of, the new cluster before
    connecting it to the existing system.

Conclusion

There’re always much more things that can potentially go wrong. But if any of the above helped you to solve your RecoverPoint deployment issues make sure to let me know in the comments below!

Advertisements

RecoverPoint VE: iSCSI Network Design

March 29, 2016

recoverpointRecoverPoint is a great storage replication product, which supports Continuous Data Protection (CDP) and gives you RPO figures measured in second compared to a standard asynchronous storage-based replication solutions, where RPO is measured in minutes or even hours.

RecoverPoint comes in three flavours:

  • RecoverPoint SE/EX/CL – physical appliance for replication between VNX (RecoverPoint/SE), VNX/VMAX/VPLEX (RecoverPoint/EX) or EMC and non-EMC (RecoverPoint CL) storage arrays.
  • RecoverPoint VE – virtual edition of RecoverPoint which is installed as a VM and supports the same SE/EX/CL versions.
  • RecoverPoint for Virtual Machines – also a virtual appliance but is array-agnostic and works at a hypervisor level by replicating VMs instead of LUNs.

In this blog post we will be discussing connectivity options for RecoverPoint VE (SE edition). Make sure to not confuse RecoverPoint VE and RecoverPoint for Virtual Machines as it’s two completely different products.

VNX MirrorView ports

MirrorView is an another EMC replication solution integrated into VNX arrays. If there’s a MirrorView enabler installed, it will claim itself the first FC port and the first iSCSI port. When patching VNX iSCSI ports make sure to NOT use the ports claimed by MirrorView.

mirrorview_ports

If you use 1GbE (4-port) I/O modules you can use three ports per SP (all except port 0) and if you have 10GbE (2-port) I/O modules you can use one port per SP. I will talk about workarounds for this in the next blog post.

RPA appliance iSCSI vNICs

Each RecoverPoint appliance has two iSCSI NICs, which can be configured on either one or two subnets. If you use one 10Gb port on each SP as in the example above, then you’re forced to use one subnet. Because you obviously need at least two ports on each SP to have two networks.

If you have 1Gb modules in your VNX array, then you will most likely have two 1Gb iSCSI ports connected on each SP. In that case you can use two iSCSI subnets to reduce the number of iSCSI sessions between RPAs and a VNX.

On the vSphere side you will need to create one or two iSCSI port groups, depending on how many subnets you’ve decided to allocate and connect RPA vNICs accordingly.

rpa_iscsi

VNX iSCSI Connections

RecoverPoint clusters are deployed and connected using a special tool called Deployment Manager. It assigns all IP addresses, connects RecoverPoint clusters to VNX arrays and joins sites together.

Once deployment is finished you will have iSCSI connections created on the VNX array. Depending on how many iSCSI subnets you’re using, iSCSI connections will be configured accordingly.

1. One Subnet Example

Lets look at the one subnet topology first. In this example you have one 10Gb port per VNX SP and two ports on each of the two RPAs all on one subnet. When you right click on the storage array in Unisphere and select iSCSI > Connections Between Storage Systems you should see something similar to this.

iscsi_connections

As you can see ports iSCSI1 and iSCSI2 on RPA0 and RPA1 are mapped to two ports on the storage array A-5 and B-5. Four RPA ports are connected to two VNX ports which gives you eight iSCSI initiator records on the VNX.

iscsi_initiators

2. Two Subnets Example

If you connect two 1Gb ports per VNX SP and decide to use two subnets, then each SP will have one port on each of the two subnets. Same goes for the RPAs. Each RPA will have one vNIC connected to each subnet.

iSCSI connections will be set up a little bit differently now. Because only the VNX and RPA ports which are on the same subnet should be able to talk to each other.

iscsi_connections2

Every RPA in this example has one IP on the xxx.xxx.46.0/255.255.255.192 subnet (iSCSI A) and one IP on the xxx.xxx.46.64/255.255.255.192 subnet (iSCSI B). Similarly, ports A-10 and B-10 on the VNX are configured on iSCSI A subnet. And ports A-11 and B-11 are configured on iSCSI B subnet. Because of that, iSCSI1 ports are mapped to ports A-10/B-10 and iSCSI2 ports are mapped to ports A-11/B-11.

As we are using two subnets in this example instead of 4 RPA ports by 4 VNX ports = 16 iSCSI connections, we will have 2 RPA ports by 2 VNX ports (subnet iSCSI A) + 2 RPA ports by 2 VNX ports (subnet iSCSI B) = 8 iSCSI connections.

iscsi_initiators2

Conclusion

The goal of this post was to discuss the points which are not very well explained in RecoverPoint documentation. It’s not a comprehensive guide by any means. You can find the full deployment procedure with prerequisites, installation and configuration steps in EMC RecoverPoint Installation and Deployment Guide.

Upgrading Cisco UCS Fabric Interconnects

March 17, 2016

I have to do this first, as this is a high-risk change for any environment:

disclaimerDISCLAMER: I ACCEPT NO RESPONSIBILITY FOR ANY DAMAGE OR CORRUPTION OF DATA THAT MAY OCCUR AS A RESULT OF CARRYING OUT STEPS DESCRIBED BELOW. YOU DO THIS AT YOUR OWN RISK.

And now to the point. Cisco has two generations of Fabric Interconnects with the third generation released just recently. There is 6100 series, which includes 6120XP and 6140XP. Second generation is 6200 series, which introduced unified ports and also has two models in its range – 6248UP and 6296UP. And there is now a third generation of 40Gb fabric interconnects with 6324, 6332 and 6332-16UP models.

We are yet to see mass adoption of 40Gb FIs. And some of the customers are still upgrading from the first to the second generation.

In this blog post we will go through the process of upgrading 6100 fabric interconnects to 6200 by using 6120 and 6248 as an example.

Prerequisites

Cisco UCS has a pair of fabric interconnects which work in an active/passive mode from a control plane perspective. This lets us do an in-place upgrade of a FI cluster by upgrading interconnects one at a time without any further reconfiguration needed in UCS Manager in most cases.

For a successful upgrade old and new interconnects MUST run on the same firmware revision. That means you will need to upgrade the first new FI to the same firmware before you can join it to the cluster to replace the first old FI.

This can be done by booting the FI in a standalone mode, giving it an IP address and installing firmware via UCS Manager.

The second FI won’t need a manual firmware update, because when a FI of the same hardware model is joined to a cluster it’s upgraded automatically from the other FI.

Preparation tasks

It’s a good idea to make a record of all connections from the current fabric interconnects and make a configuration backup before an upgrade.

ucs_backup

If you have any unused connections which you’re not planning to move, it’s a good time to disconnect the cables and disable these ports.

Cisco strongly suggests to also upgrade the firmware on all software and hardware components of the existing UCS to the latest recommended version first.

Upgrading firmware on the first new FI

Steps to upgrade firmware on the first new fabric interconnect are as follows:

  • Rack and stack the new FI close enough to the old interconnects to make sure all cables can reach it.
  • Connect a console cable to the new FI, boot it up and when you are asked “Is this Fabric interconnect part of a cluster”, select NO to boot the FI in a standalone mode.
  • Assign an IP address to the FI and connect to it using UCS Manager.
  • Upgrade the firmware, which will reboot the fabric interconnect.
  • Reset the configuration on the FI, which will cause another reboot:
    • # connect local-mgmt
      # erase config

  • Once the FI is upgraded and reset to factory defaults you can proceed with joining it to the cluster.

Replacing the first FI

  • Determine which old FI is in the subordinate mode (upgrade a FI only if it’s in subordinate mode!) and disable server ports on it.
  • Shut down the old subordinate FI.
  • Move L1/L2, management, server and Ethernet/FC/FCoE uplink ports to the new FI.
  • Boot the new FI. This time the new FI will detect the presence of the peer FI. When you see the following prompt type YES:
    • Installer has detected the presence of a peer Fabric interconnect. This Fabric interconnect will be added to the cluster. Continue (y/n) ?

  • Follow the console prompts and assign an IP address to the new FI. The rest of the settings will be pulled from the peer FI.

Once the new FI joins the cluster you should see the following equipment topology in UCS Manager (This screenshot was made after the primary role had been moved to the new FI. Initially you should see the new FI as subordinate.):

two_fis

  • At this stage make sure that all configuration has been applied to the new FI and you can see all LAN and SAN uplinks and port channels.
  • Enable server ports on the new FI and reacknowledge all chassis.

Reacknowledging a chassis might be disruptive to the traffic flow from the blades. So make sure you don’t have any production workloads running on it. If you have two chassis and enough capacity to run all VMs on either of them, you can temporarily move VMs between the chassis and reacknowledge one chassis at a time.

Replacing the second FI

You will need to promote the new FI to be the primary, before proceeding with an upgrade of the second FI. To change the roles, use SSH to log in to the old FI, which is currently the primary (you can’t change roles from the subordinate FI) and run the following commands:

# connect local-mgmt
# cluster lead b
# show cluster state

The rest of the process is exactly the same.

After the upgrade, if needed, reconfigure any of the links which may have had their port numbers changed, such as if you had an expansion module in the old FIs, but not on the new FIs.

References

Cisco has a guide which has a step by step procedures for upgrading fabric interconnects, I/O modules, VIC cards as well as rack-mount servers. Refer to this guide for any further clarifications:

 

EMC Isilon Overview

February 20, 2014

isilon_logo_188x110OneFS Overview

EMC Isilon OneFS is a storage OS which was built from the ground up as a clustered system.

NetApp’s Clustered ONTAP for example has evolved from being an OS for HA-pair of storage controllers to a clustered system as a result of integration with Spinnaker intellectual property. It’s not necessarily bad, because cDOT shows better performance on SPECsfs2008 than Isilon, but these systems still have two core architectural differences:

1. Isilon doesn’t have RAIDs and complexities associated with them. You don’t choose RAID protection level. You don’t need to think about RAID groups and even load distribution between them. You don’t even have spare drives per se.

2. All data on Isilon system is kept on one volume, which is a one big distributed file system. cDOT use concept of infinite volumes, but bear in mind that each NetApp filer has it’s own file system beneath. If you have 24 NetApp nodes in a cluster, then you have 24 underlying file systems, even though they are viewed as a whole from the client standpoint.

This makes Isilon very easy to configure and operate. But its simplicity comes at a price of flexibility. Isilon web interface has few options to configure and not very feature rich.

Isilon Nodes and Networking

In a nutshell Isilon is a collection of a certain number of nodes connected via 20Gb/s DDR InfiniBand back-end network and either 1GB/s or 10GB/s front-end network for client connections. There are three types of Isilon nodes S-Series (SAS + SSD drives) for transactional random access I/O, X-Series (SATA + SSD drives) for high throughput applications and NL-series (SATA drives) for archival or not frequently used data.

If you choose to have two IB switches at the back-end, then you’ll have three subnets configured for internal network: int-a, int-b and failover. You can think of a failover network as a virtual network in front of int-a and int-b. So when the packet comes to failover network IP address, the actual IB interface that receives the packet is chosen dynamically. That helps to load-balance the traffic between two IB switches and makes this set up an active/active network.

131_22

On the front-end you can have as many subnets as you like. Subnets are split between pools of IP addresses. And you can add particular node interfaces to the pool. Each pool can have its own SmartConnect zone configured. SmartConnect is a way to load-balance connections between the nodes. Basically SmartConnect is a DNS server which runs on the Isilon side. You can have one SmartConnect service on a subnet level. And one SmartConnect zone (which is simply a domain) on each of the subnet pools. To set up SmartConnect you’ll need to assign an IP address to the SmartConnect service and set a SmartConnect zone name on a pool level. Then you create an “A” record on DNS for the SmartConnect service IP address and delegate SmartConnect DNS zone to this IP. That way each time you refer to the SmartConnect zone to get access to a file share you’ll be redirected to dynamically picked up node from the pool.

SmartPools

Each type of node is automatically assigned to what is called a “Node Pool”. Nodes are grouped to the same pool if they are of the same series, have the same amount of memory and disks of the same type and size. Node Pool level is one of the spots where you can configure protection level. We’ll talk about that later. Node Pools are grouped within Tiers. So you can group NL node pool with 1TB drives and NL node pool with 3TB drives into an archive tier if you wish. And then you have File Pool Policies which you can use to manage placement of files within the cluster. For example, you can redirect files with specific extension or file size or last access time to be saved on a specific node pool or tier. File pool policies also allow you to configure data protection and override the default node pool protection setting.

SmartPools is a concept that Isilon use to name Tier/Node Pool/File Pool Policy approach. File placement is not applied automatically, otherwise it would cause high I/O overhead. It’s implemented as a job on the cluster instead which runs at 22:00 every day by default.

Data Layout and Protection

Instead of using RAIDs, Isilon uses FEC (Forward Error Correction) and more specifically a Reed-Solomon algorithm to protect data on a cluster. It’s similar to RAID5 in how it generates a protection block (or blocks) for each stripe. But it happens on a software level, instead of hardware as in storage arrays. So when a file comes in to a node, Isilon splits the file in stripe units of 128KB each, generates one FEC protection unit and distributes all of them between the nodes using back-end network. This is what is called “+1” protection level, where Isilon can sustain one disk or one node failure. Then you have “+2”, “+3” and “+4”. In “+4” you have four FECs per stripe and can sustain four disk or node failures. Note however that there is a rule that the number of data stripe units in a stripe has to be greater than number of FEC units. So the minimum requirement for “+4” protection level is 9 nodes in a cluster.

dp2

The second option is to use mirroring. You can have from 2x to 8x mirrors of your data. And the third option is “+2:1” and “+3:1” protection levels. These protection levels let you balance between the data protection and amount of the FEC overhead. For example “+2:1” setting compared to “+2” can sustain two drive failures or one node failure, instead of two node failure protection that “+2” offers. And it makes sense, since simultaneous two node failure is unlikely to happen. There is also a difference in how the data is laid out. In “+2” for each stripe Isilon uses one disk on each node and in “+2:1” it uses two disks on each node. And first FEC in this case goes to first subset of disks and second goes to second.

One benefit of not having RAID is that you can set protection level with folder or even file granularity. Which is impossible with conventional RAIDs. And what’s quite handy, you can change protection levels without recreation of storage volumes, as you might have to do while transitioning between some of the RAID levels. When you change protection level for any of the targets, Isilon creates a low priority job which redistributes data within the cluster.

Overview of NetApp Replication and HA features

August 9, 2013

NetApp has quite a bit of features related to replication and clustering:

  • HA pairs (including mirrored HA pairs)
  • Aggregate mirroring with SyncMirror
  • MetroCluster (Fabric and Stretched)
  • SnapMirror (Sync, Semi-Sync, Async)

It’s easy to get lost here. So lets try to understand what goes where.

Simple-Metrocluster

SnapMirror

SnapMirror is a volume level replication, which normally works over IP network (SnapMirror can work over FC but only with FC-VI cards and it is not widely used).

Asynchronous version of SnapMirror replicates data according to schedule. SnapMiror Sync uses NVLOGM shipping (described briefly in my previous post) to synchronously replicate data between two storage systems. SnapMirror Semi-Sync is in between and synchronizes writes on Consistency Point (CP) level.

SnapMirror provides protection from data corruption inside a volume. But with SnapMirror you don’t have automatic failover of any sort. You need to break SnapMirror relationship and present data to clients manually. Then resynchronize volumes when problem is fixed.

SyncMirror

SyncMirror mirror aggregates and work on a RAID level. You can configure mirroring between two shelves of the same system and prevent an outage in case of a shelf failure.

SyncMirror uses a concept of plexes to describe mirrored copies of data. You have two plexes: plex0 and plex1. Each plex consists of disks from a separate pool: pool0 or pool1. Disks are assigned to pools depending on cabling. Disks in each of the pools must be in separate shelves to ensure high availability. Once shelves are cabled, you enable SyncMiror and create a mirrored aggregate using the following syntax:

> aggr create aggr_name -m -d disk-list -d disk-list

HA Pair

HA Pair is basically two controllers which both have connection to their own and partner shelves. When one of the controllers fails, the other one takes over. It’s called Cluster Failover (CFO). Controller NVRAMs are mirrored over NVRAM interconnect link. So even the data which hasn’t been committed to disks isn’t lost.

MetroCluster

MetroCluster provides failover on a storage system level. It uses the same SyncMirror feature beneath it to mirror data between two storage systems (instead of two shelves of the same system as in pure SyncMirror implementation). Now even if a storage controller fails together with all of its storage, you are safe. The other system takes over and continues to service requests.

HA Pair can’t failover when disk shelf fails, because partner doesn’t have a copy to service requests from.

Mirrored HA Pair

You can think of a Mirrored HA Pair as HA Pair with SyncMirror between the systems. You can implement almost the same configuration on HA pair with SyncMirror inside (not between) the system. Because the odds of the whole storage system (controller + shelves) going down is highly unlike. But it can give you more peace of mind if it’s mirrored between two system.

It cannot failover like MetroCluster, when one of the storage systems goes down. The whole process is manual. The reasonable question here is why it cannot failover if it has a copy of all the data? Because MetroCluster is a separate functionality, which performs all the checks and carry out a cutover to a mirror. It’s called Cluster Failover on Disaster (CFOD). SyncMirror is only a mirroring facility and doesn’t even know that cluster exists.

Further Reading

NetApp NVRAM and Write Caching

July 19, 2013

388375Overview

NetApp storage systems use several types of memory for data caching. Non-volatile battery-backed memory (NVRAM) is used for write caching (whereas main memory and flash memory in forms of either extension PCIe card or SSD drives is used for read caching). Before going to hard drives all writes are cached in NVRAM. NVRAM memory is split in half and each time 50% of NVRAM gets full, writes are being cached to the second half, while the first half is being written to disks. If during 10 seconds interval NVRAM doesn’t get full, it is forced to flush by a system timer.

To be more precise, when data block comes into NetApp it’s actually written to main memory and then journaled in NVRAM. NVRAM here serves as a backup, in case filer fails. When data has been written to disks as part of so called Consistency Point (CP), write blocks which were cached in main memory become the first target to be evicted and replaced by other data.

Caching Approach

NetApp is frequently criticized for small amounts of write cache. For example FAS3140 has only 512MB of NVRAM, FAS3220 has a bit more 1,6GB. In mirrored HA or MetroCluster configurations NVRAM is mirrored via NVRAM interconnect adapter. Half of the NVRAM is used for local operations and another half for the partner’s. In this case the amount of write cache becomes even smaller. In FAS32xx series NVRAM has been integrated into main memory and is now called NVMEM. You can check the amount of NVRAM/NVMEM in your filer by running:

> sysconfig -a

The are two answers to the question why NetApp includes less cache in their controllers. The first one is given in white paper called “Optimizing Storage Performance and Cost with Intelligent Caching“. It states that NetApp uses different approach to write caching, compared to other vendors. Most often when data block comes in, cache is used to keep the 8KB data block, as well as 8KB inode and 8KB indirect block for large files. This way, write cache can be thought as part of the physical file system, because it mimics its structure. NetApp on the other hand uses journaling approach. When data block is received by the filer, 8KB data block is cached along with 120B header. Header contains all the information needed to replay the operation. After each cache flush Consistency Point (CP) is created, which is a special type of consistent file system snapshot. If controller fails, the only thing which needs to be done is reverting file system to the latest consistency point and replaying the log.

But this white paper was written in 2010. And cache journaling is not a feature unique to NetApp. Many vendors are now using it. The other answer, which makes more sense, was found on one of the toaster mailing list archives here: NVRAM weirdness (UNCLASSIFIED). I’ll just quote the answer:

The reason it’s so small compared to most arrays is because of WAFL. We don’t need that much NVRAM because when writes happen, ONTAP writes out single complete RAID stripes and calculates parity in memory. If there was a need to do lots of reads to regenerate parity, then we’d have to increase the NVRAM more to smooth out performance.

NVLOG Shipping

A feature called NVLOG shipping is an integral part of sync and semi-sync SnapMirror. NVLOG shipping is simply a transfer of NVRAM writes from the primary to a secondary storage system.  Writes on primary cannot be transferred directly to NVRAM of the secondary system, because in contrast to mirrored HA and MetroCluster, SnapMirror doesn’t have any hardware implementation of the NVRAM mirroring. That’s why the stream of data is firstly written to the special files on the volume’s parent aggregate on the secondary system and then are read to the NVRAM.

nvram

Documents I found useful:

WP-7107: Optimizing Storage Performance and Cost with Intelligent Caching

TR-3326: 7-Mode SnapMirror Sync and SnapMirror Semi-Sync Overview and Design Considerations

TR-3548: Best Practices for MetroCluster Design and Implementation

United States Patent 7730153: Efficient use of NVRAM during takeover in a node cluster

Present NetApp iSCSI LUN to Linux host

March 7, 2012

Consider the following scenario (which is in fact a real case). You have a High Performance Computing (HPC) cluster where users usually generate hellova research data. Local hard drives on a frontend node are almost always insufficient. There are two options. First is presenting a NFS share both to frontend and all compute nodes. Since usually compute nodes  connect only to private network for communication with the frontend and don’t have public ip addresses it means a lot of reconfiguration. Not to mention possible security implications.

The simpler solution here is to use iSCSI.  Unlike NFS, which requires direct communication, with iSCSI you can mount a LUN to the frontend and then compute nodes will work with it as ordinary NFS share through the private network. This implies configuration of iSCSI LUN on a NetApp filer and bringing up iSCSI initiator in Linux.

iSCSI configuration consists of several steps. First of all you need to create FlexVol volume where you LUN will reside and then create a LUN inside of it. Second step is creation of initiator group which will enable connectivity between NetApp and a particular host.  And as a last step you will need to map the LUN to the initiator group. It will let the Linux host to see this LUN. In case you disabled iSCSI, don’t forget to enable it on a required interface.

vol create scratch aggrname 1024g
lun create -s 1024g -t linux /vol/scratch/lun0
igroup create -i -t linux hpc
igroup add hpc linux_host_iqn
lun map /vol/scratch/lun0 hpc
iscsi interface enable if_name

Linux host configuration is simple. Install iscsi-initiator-utils packet and add it to init on startup. iSCSI IQN which OS uses for connection to iSCSI targets is read from /etc/iscsi/initiatorname.iscsi upon startup. After iSCSI initiator is up and running you need to initiate discovery process, and if everything goes fine you will see a new hard drive in the system (I had to reboot). Then you just create a partition, make a file system and mount it.

iscsiadm -m discovery -t sendtargets -p nas_ip
fdisk /dev/sdc
mke2fs -j /dev/sdc1
mount /dev/sdc1 /state/partition1/home

I use it for the home directories in ROCKS cluster suite. ROCKS automatically export /home through NFS to compute nodes, which in their turn mount it via autofs. If you intend to use this volume for other purposes, then you will need to configure you custom NFS export.

Random DC pictures

January 19, 2012

Several pictures of server room hardware with no particular topic.

Click pictures to enlarge.

10kVA APC UPS.

UPS’s Network Management Card (NMC) (with temperature sensor) connected to LAN.

Here you can see battery extenders (white plugs). They allow UPS to support 5kVA of load for 30 mins.

Two Dell PowerEdge 1950 server with 8 cores and 16GB RAM each configured as VMware High Availability (HA) cluster.

Each server has 3 virtual LANs. Each virtual LAN has its own NIC which in its turn has multi-path connection to Cisco switch by two cables, 6 cables in total.

Two Cisco switches which maintain LAN connections for NetApp filers, Dell servers, Sun tape library and APC NMC card. Two switches are tied together by optic cable. Uplink is a 2Gb/s trunk.

HP rack with 9 HP ProLiant servers, HP autoloader and MSA 1500 storage.

HP autoloader with 8 cartridges.

HP MSA 1500 storage which is completely FC.

Hellova cables.

Reinstalling ROCKS compute cluster node

December 1, 2011

If you have any faulty HPC node and want to reinstall it for instance in case of hard drive replacement you should bare in mind several things:

  • Make sure xinetd is listening on 65 for tftpd requests on frontend.
  • Check for firewall rules. But you can simply switch it off during install. Otherwise you’ll get PXE-E32: TFTP open timeout.
  • Then you should configure your frontend to force compute node reinstallation. If you won’t do that you’ll just see PXE-M0F: Exiting HP PXE ROM or similar. Execute the following command on frontend: rocks set host boot <nodename> action=install.
  • In case you get an unable to read package metadata error during installation then go to /export/rocks/install/, remove rocks-dist folder and recreate installation tree by running rocks create distro.
  • After host installation put all  additional packages (like IB, MVAPICH, etc) into /share/apps and run rocks run host <nodename> “rpm -Uvh /share/apps/*.rpm”. Make necessary packages (like openibd and/or opensmd) to run upon startup via chkconfig and start them up. You may also need to copy some manually installed packages to compute node’s /opt directory.
  • In case you commented out faulty node earlier in /opt/torque/server_priv/nodes uncomment it and restart pbs_server service.

This is it. Now you should be good to go.

HP BladeSystem c3000

October 29, 2011

We have High Performace Computing (HPC) cluster I’d like to show. It has 72 cores and 152GB of RAM in total. We use ROCKS as cluster middleware. Interconnect is DDR InfiniBand.

We have two groups of servers. First group is two BL2x220c  blades. Since they are double-sided it’s actually four servers. Each with two 4-core CPUs and 16GB of RAM. Second group consists of five BL280c. Each of them also has two 4-core CPUs but 24 GB of RAM. Eighth server is BL260c. This blade serves as master server.

Click pictures to enlarge.

BL280c blade server. This dude has 8 Xeon cores and 24GB of RAM.

Every component of HP BladeSystem c3000 is hot-swappable. Here I show how I disconnect Onboard Adminstrator on fully operational system.

Fans, power supplies and all interconnects are on the back.

Here is the 16-port DDR InfiniBand switch. Each port’s throughput is 80GB/s FDX.

Uplink ports for Onboard Administrator.

16 ports of Ethernet pass-through for blade servers.


Six power supplies in N+1 redundant configuration. Each is capable of 1200 Watts. 7200 Watts in total.

Inside blade server.

InfiniBand mezzanine. One such module is capable of 80Gb/s FDX.

If you are interested in benchmarking results find them here for pure IB and here for IBoIP.