Posts Tagged ‘RAID’

EMC Isilon Overview

February 20, 2014

isilon_logo_188x110OneFS Overview

EMC Isilon OneFS is a storage OS which was built from the ground up as a clustered system.

NetApp’s Clustered ONTAP for example has evolved from being an OS for HA-pair of storage controllers to a clustered system as a result of integration with Spinnaker intellectual property. It’s not necessarily bad, because cDOT shows better performance on SPECsfs2008 than Isilon, but these systems still have two core architectural differences:

1. Isilon doesn’t have RAIDs and complexities associated with them. You don’t choose RAID protection level. You don’t need to think about RAID groups and even load distribution between them. You don’t even have spare drives per se.

2. All data on Isilon system is kept on one volume, which is a one big distributed file system. cDOT use concept of infinite volumes, but bear in mind that each NetApp filer has it’s own file system beneath. If you have 24 NetApp nodes in a cluster, then you have 24 underlying file systems, even though they are viewed as a whole from the client standpoint.

This makes Isilon very easy to configure and operate. But its simplicity comes at a price of flexibility. Isilon web interface has few options to configure and not very feature rich.

Isilon Nodes and Networking

In a nutshell Isilon is a collection of a certain number of nodes connected via 20Gb/s DDR InfiniBand back-end network and either 1GB/s or 10GB/s front-end network for client connections. There are three types of Isilon nodes S-Series (SAS + SSD drives) for transactional random access I/O, X-Series (SATA + SSD drives) for high throughput applications and NL-series (SATA drives) for archival or not frequently used data.

If you choose to have two IB switches at the back-end, then you’ll have three subnets configured for internal network: int-a, int-b and failover. You can think of a failover network as a virtual network in front of int-a and int-b. So when the packet comes to failover network IP address, the actual IB interface that receives the packet is chosen dynamically. That helps to load-balance the traffic between two IB switches and makes this set up an active/active network.


On the front-end you can have as many subnets as you like. Subnets are split between pools of IP addresses. And you can add particular node interfaces to the pool. Each pool can have its own SmartConnect zone configured. SmartConnect is a way to load-balance connections between the nodes. Basically SmartConnect is a DNS server which runs on the Isilon side. You can have one SmartConnect service on a subnet level. And one SmartConnect zone (which is simply a domain) on each of the subnet pools. To set up SmartConnect you’ll need to assign an IP address to the SmartConnect service and set a SmartConnect zone name on a pool level. Then you create an “A” record on DNS for the SmartConnect service IP address and delegate SmartConnect DNS zone to this IP. That way each time you refer to the SmartConnect zone to get access to a file share you’ll be redirected to dynamically picked up node from the pool.


Each type of node is automatically assigned to what is called a “Node Pool”. Nodes are grouped to the same pool if they are of the same series, have the same amount of memory and disks of the same type and size. Node Pool level is one of the spots where you can configure protection level. We’ll talk about that later. Node Pools are grouped within Tiers. So you can group NL node pool with 1TB drives and NL node pool with 3TB drives into an archive tier if you wish. And then you have File Pool Policies which you can use to manage placement of files within the cluster. For example, you can redirect files with specific extension or file size or last access time to be saved on a specific node pool or tier. File pool policies also allow you to configure data protection and override the default node pool protection setting.

SmartPools is a concept that Isilon use to name Tier/Node Pool/File Pool Policy approach. File placement is not applied automatically, otherwise it would cause high I/O overhead. It’s implemented as a job on the cluster instead which runs at 22:00 every day by default.

Data Layout and Protection

Instead of using RAIDs, Isilon uses FEC (Forward Error Correction) and more specifically a Reed-Solomon algorithm to protect data on a cluster. It’s similar to RAID5 in how it generates a protection block (or blocks) for each stripe. But it happens on a software level, instead of hardware as in storage arrays. So when a file comes in to a node, Isilon splits the file in stripe units of 128KB each, generates one FEC protection unit and distributes all of them between the nodes using back-end network. This is what is called “+1” protection level, where Isilon can sustain one disk or one node failure. Then you have “+2”, “+3” and “+4”. In “+4” you have four FECs per stripe and can sustain four disk or node failures. Note however that there is a rule that the number of data stripe units in a stripe has to be greater than number of FEC units. So the minimum requirement for “+4” protection level is 9 nodes in a cluster.


The second option is to use mirroring. You can have from 2x to 8x mirrors of your data. And the third option is “+2:1” and “+3:1” protection levels. These protection levels let you balance between the data protection and amount of the FEC overhead. For example “+2:1” setting compared to “+2” can sustain two drive failures or one node failure, instead of two node failure protection that “+2” offers. And it makes sense, since simultaneous two node failure is unlikely to happen. There is also a difference in how the data is laid out. In “+2” for each stripe Isilon uses one disk on each node and in “+2:1” it uses two disks on each node. And first FEC in this case goes to first subset of disks and second goes to second.

One benefit of not having RAID is that you can set protection level with folder or even file granularity. Which is impossible with conventional RAIDs. And what’s quite handy, you can change protection levels without recreation of storage volumes, as you might have to do while transitioning between some of the RAID levels. When you change protection level for any of the targets, Isilon creates a low priority job which redistributes data within the cluster.


Overview of NetApp Replication and HA features

August 9, 2013

NetApp has quite a bit of features related to replication and clustering:

  • HA pairs (including mirrored HA pairs)
  • Aggregate mirroring with SyncMirror
  • MetroCluster (Fabric and Stretched)
  • SnapMirror (Sync, Semi-Sync, Async)

It’s easy to get lost here. So lets try to understand what goes where.



SnapMirror is a volume level replication, which normally works over IP network (SnapMirror can work over FC but only with FC-VI cards and it is not widely used).

Asynchronous version of SnapMirror replicates data according to schedule. SnapMiror Sync uses NVLOGM shipping (described briefly in my previous post) to synchronously replicate data between two storage systems. SnapMirror Semi-Sync is in between and synchronizes writes on Consistency Point (CP) level.

SnapMirror provides protection from data corruption inside a volume. But with SnapMirror you don’t have automatic failover of any sort. You need to break SnapMirror relationship and present data to clients manually. Then resynchronize volumes when problem is fixed.


SyncMirror mirror aggregates and work on a RAID level. You can configure mirroring between two shelves of the same system and prevent an outage in case of a shelf failure.

SyncMirror uses a concept of plexes to describe mirrored copies of data. You have two plexes: plex0 and plex1. Each plex consists of disks from a separate pool: pool0 or pool1. Disks are assigned to pools depending on cabling. Disks in each of the pools must be in separate shelves to ensure high availability. Once shelves are cabled, you enable SyncMiror and create a mirrored aggregate using the following syntax:

> aggr create aggr_name -m -d disk-list -d disk-list

HA Pair

HA Pair is basically two controllers which both have connection to their own and partner shelves. When one of the controllers fails, the other one takes over. It’s called Cluster Failover (CFO). Controller NVRAMs are mirrored over NVRAM interconnect link. So even the data which hasn’t been committed to disks isn’t lost.


MetroCluster provides failover on a storage system level. It uses the same SyncMirror feature beneath it to mirror data between two storage systems (instead of two shelves of the same system as in pure SyncMirror implementation). Now even if a storage controller fails together with all of its storage, you are safe. The other system takes over and continues to service requests.

HA Pair can’t failover when disk shelf fails, because partner doesn’t have a copy to service requests from.

Mirrored HA Pair

You can think of a Mirrored HA Pair as HA Pair with SyncMirror between the systems. You can implement almost the same configuration on HA pair with SyncMirror inside (not between) the system. Because the odds of the whole storage system (controller + shelves) going down is highly unlike. But it can give you more peace of mind if it’s mirrored between two system.

It cannot failover like MetroCluster, when one of the storage systems goes down. The whole process is manual. The reasonable question here is why it cannot failover if it has a copy of all the data? Because MetroCluster is a separate functionality, which performs all the checks and carry out a cutover to a mirror. It’s called Cluster Failover on Disaster (CFOD). SyncMirror is only a mirroring facility and doesn’t even know that cluster exists.

Further Reading

Storwize V7000 with vSphere 5 storage configuration

December 1, 2012

storwizeInformation on how to configure Storwize for optimal performance is very scarce. I’ll try to build some understanding of it from bits an pieces gathered throughout the Internet and redbooks.

Barry Whyte gave many insights on Storwize internals in his blog. Particularly his “Configuring IBM Storwize V7000 and SVC for Optimal Performance” series of posts. I’ll quote him here. The main Storwize redbook “Implementing the IBM Storwize V7000 V6.3” is mostly an administration guide and gives no useful information on the topic. I find “SAN Volume Controller Best Practices and Performance Guidelines” way more helpful (Storwize firmware is built on SVC code).

Total Number of MDisks

That’s what Barry says:

… At the heart of each V7000 controller canister is an Intel Jasper Forrest (Sandy Bridge) based quad core CPU. … When we added the tried and trusted (SSA) DS8000 RAID functionality in 2010 (6.1.0) we therefore assigned RAID processing on a per mdisk basis to a single core. That means you need at least 4 arrays per V7000 to get maximal CPU core performance. …

Number of MDisks per Storage Pool

SVC Redbook:

The capability to stripe across disk arrays is the single most important performance advantage of the SVC; however, striping across more arrays is not necessarily better. The objective here is to only add as many arrays to a single Storage Pool as required to meet the performance objectives.

If the Storage Pool is already meeting its performance objectives, we recommend that, in most cases, you add the new MDisks to new Storage Pools rather than add the new MDisks to existing Storage Pools.

Table 5-1 shows the recommended number of arrays per Storage Pool that is appropriate for general cases.

Controller type       Arrays per Storage Pool
DS4000/DS5000         4 - 24
DS6000/DS8000         4 - 12
IBM Storwise V7000    4 - 12

The development recommendations for Storwize V7000 are summarized below:

  • One MDisk group per storage subsystem
  • One MDisk group per RAID array type (RAID 5 versus RAID 10)
  • One MDisk and MDisk group per disk type (10K versus 15K RPM, or 146 GB versus 300 GB)

There are situations where multiple MDisk groups are desirable:

  • Workload isolation
  • Short-stroking a production MDisk group
  • Managing different workloads in different groups

We recommend that you have at least two MDisk groups, one for key applications, another for everything else.

Number of LUNs per Storage Pool

SVC Redbook:

We generally recommend that you configure LUNs to use the entire array, which is especially true for midrange storage subsystems where multiple LUNs configured to an array have shown to result in a significant performance degradation. The performance degradation is attributed mainly to smaller cache sizes and the inefficient use of available cache, defeating the subsystem’s ability to perform “full stride writes” for Redundant Array of Independent Disks 5 (RAID 5) arrays. Additionally, I/O queues for multiple LUNs directed at the same array can have a tendency to overdrive the array.

Table 5-2 provides our recommended guidelines for array provisioning on IBM storage subsystems.

Controller type                     LUNs per array
IBM System Storage DS4000/DS5000    1
IBM System Storage DS6000/DS8000    1 - 2
IBM Storwize V7000                  1

General considerations

vsphere5-logoLets take a look at vSphere use case scenario on top of Storwize with 16 x 600GB SAS drives in control enclosure and 10 x 2TB NL-SAS in extension enclosure (our personal case).

First of all we need to decide how many arrays we need. Do we have different workloads? No. All storage will be assigned to virtual machines which have in general the same random read/write access pattern. Do we need to isolate workloads? Probably yes, it’s generally a good idea to separate highly critical production VMs from everything else. Do we have different drive types? Yes. Obviously we don’t want to mix drive types in one RAID. Are we going to make different RAID types? Again, yes. RAID 10 is appropriate on SAS and RAID 5 on NL-SAS. So two MDisks – one RAID 10 on SAS and one RAID 5 on NL-SAS would be enough. Storwize nodes have 4 cores each. It may seem that you would benefit from 4 MDisks, but in fact you won’t. Here what Barry says:

In the case where you only have 1 or 2 HDD arrays, then the core stuff doesn’t really come into play. Its only when you get to larger systems, where you are driving more I/O than a single RAID core can handle that you need to spread them.

This is also true if you are running all SSD arrays, so 24x SSD would be best split into 4 arrays to get maximum IOPs, whereas 24x HDD are not going to saturate a single core, so (if you could create a 23+P! [ you can’t 15+P is largest we support ] then it would perform as well as 2x 11+P etc

To storage pools. In our example we have two MDisks, so you simply make two storage pools. In future if you hit performance limit, you create additional MDisks and then you have two options. If each MDisk separately is able to sustain your performance requirements, you make additional storage pools and redistribute workload between them. If you have huge load on storage and even redistribution of VMs between two arrays doesn’t help, then you better combine two MDisks of each type in its own storage pool for striping between MDisks.

Same story for number of LUNs. IBM recommends one to one LUN to MDisk relationship. But read carefully. Recommendation comes from the fact that different workloads can clash and degrade array performance. But if we have generally the same I/O patterns coming to the array it’s safe to make several LUNs on it, until latency is in the acceptable range. Moreover, when it comes to vSphere and VMFS, it’s beneficial to have at least two volumes in terms of manageability. With several LUNs you will at least have an ability to move VMs between LUNs for reconfiguration purposes. Also keep in mind that ESXi 5 hypervisor limit each host to storage queue of depth 32 per LUN. It means that if you have one big LUN and many VMs running on the host, you can quickly reach queue limit. On the other hand do not create too many LUNs or you will oversubscribe storage processors (SPs).

Sample configuration

IBM recommends constructing both RAID 10 and RAID 5 arrays from 8 drives + 1 spare drive. But since we have 16 SAS and 10 NL-SAS I would launch CLI and create two arrays: one 14 drives + 2 spares RAID 10 and one 8 drives + 2 spares RAID 5 (or 9 drives + 1 spare, but it’s not a good idea to create RAID with uneven number of drives). Each RAID in its own pool. Several LUNs in each pool. I would go for 2TB LUNs.

IBM DS4700 copyback failed

August 27, 2012

If you have a global hot spare (GHS) drive when one of the active hard drives failes, your data is reconstructed to a GHS. Then, when you replace the failed drive, storage system automatically initiates a copyback, which gets the data from the GHS back to the replacement drive. Sometimes it doesn’t happen and replacement drive stays in an Unassigned state. If it is the case go to the DS Storage Manager, right click on the RAID array and select Replace Drives. There you should see the failed drive. Choose replacement from unassigned drives and click Replace Drive. Copyback will start immediately.

Take into consideration that copyback can be long-lasting, depending on the array size. If it is a production system and its performance is critical, right click on the logical drive, choose Change -> Modification Priority. There you can set how much resources will be allocated for modification (such as copyback, reconstruction, etc) and performance. Change it to Low for maximum performance.

Migrating physical Linux host to the VMware ESXi

May 2, 2012

Well, perhaps the easiest way to accomplish that is using VMware Converter from the start. I believe there is a Linux version. However, I took another route. I already had an Acronis backup image. So my solution was to use this image as a source, which I fed into Windows version of VMware Converter, which in its turn converted it to VMware format and created VMware virtual machine on ESXi server automatically.

Using this simple procedure you can get a working system. Not in my case. Original OS used a software RAID of two hard drives. So I had to boot from a live CD. Then I changed fstab and GRUB’s menu.lst and set /dev/sda1 (root volume) instead of /dev/md0 and /dev/sda2 (swap) in place of /dev/md1. Additionally, I had to reinject GRUB’s boot files:

grub-install –root-directory=/media/sda1/boot /dev/sda

Then, if it’s SUSE you will have to change “resume” switch in GRUB’s boot menu line to /dev/null. Then after you boot into the system, recreate swap partition and point to it in “resume” switch. If you won’t do that, you will end up with the following error during boot process:

Kernel panic – not syncing: I/O error reading memory image

One tricky issue I had in all this story was related to kernel. As I’ve already mentioned original operating system worked on top of software RAID. And its initrd image won’t detect ordinary virtual SCSI hard drive during boot. So I had to boot from the SUSE installation CD and install standard kernel on top of original system. It solved the issue. Additionally I had to choose Russian language as primary during kernel installation, otherwise I ended up with unreadable symbols inside the system. But it’s not necessary for majority of cases.

I hope my experience will be helpful for other sysadmins.

NetApp storage architecture

October 9, 2011

All of us are get used to SATA disk drives connected to our workstations and we call it storage. Some organizations has RAID arrays. RAID is one level of logical abstraction which combine several hard drives to form logical drive with greater size/reliability/speed. What would you say if I’d tell you that NetApp has following terms in its storage architecture paradigm: disk, RAID group, plex, aggregate, volume, qtree, LUN, directory, file. Lets try to understand how all this work together.

RAID in NetApp terminology is called RAID group. Unlike ordinary storage systems NetApp works mostly with RAID 4 and RAID-DP. Where RAID 4 has one separate disk for parity and RAID-DP has two. Don’t think that it leads to performance degradation. NetApp has very efficient implementation of these RAID levels.

Plex is collection of RAID groups and is used for RAID level mirroring. For instance if you have two disk shelves and SyncMirror license then you can create plex0 from first shelf drives and plex1 from second shelf.  This will protect you from one disk shelf failure.

Aggregate is simply a highest level of hardware abstraction in NetApp and is used to manage plexes, raid groups, etc.

Volume is a logical file system. It’s a well-known term in Windows/Linux/Unix realms and serves for the same goal. Volume may contain files, directories, qtrees and LUNs. It’s the highest level of abstraction from the logical point of view. Data in volume can be accessed by any of protocols NetApp supports: NFS, CIFS, iSCSI, FCP, WebDav, HTTP.

Qtree can contain files and directories or even LUNs and is used to put security and quota rules on contained objects with user/group granularity.

LUN is necessary to access data via block-level protocols like FCP and iSCSI. Files and directories are used with file-level protocols NFS/CIFS/WebDav/HTTP.

Intel RAID 1

January 8, 2011

When I was overclocking my system I needed to update my BIOS to the latest version. After an update my system didn’t boot. I guess because BIOS configuration options saved in ROM weren’t compatible with new BIOS. I erased configuration and system booted just fine. But I had RAID 1 configuration and it was also erased. I entered BIOS, changed SATA mode from IDE to RAID. Then rebooted and entered Intel Matrix Storage Manager Option ROM Utility. In fact RAID data seemed to be fine. Both hard drives had Member Disk(0) marks and RAID status was Normal. None the less I couldn’t boot my system. After OS selection I saw just black screen.

Solution was simple.  In Option ROM Utility I reseted disks to non-RAID (I have no idea how it’s different from just deleting a RAID volume). Then I loaded Intel Matrix Storage Console from Windows and went Actions -> Create RAID Volume from Existing Hard Drive. In configuration wizard I just chose RAID 1, one drive as source and the other as destination and voila! After an hour I got working RAID 1 configuration without any data loss.