Posts Tagged ‘aggregate’

How to move aggregates between NetApp controllers

September 25, 2013





We had an issue with high CPU usage on one of the NetApp controllers serving a couple of NFS datastores to a VMware ESX cluster. The HA pair of FAS2050s had two shelves, both of them owned by the first controller. The obvious solution for us was to reassign the disks from one of the shelves to the other controller to balance the load. But how do you do this non-disruptively? Here is the plan.

In our setup we had two controllers (filer1, filer2) and two shelves (shelf1, shelf2), both assigned to filer1, with two aggregates, each on its own shelf (aggr0 on shelf1, aggr1 on shelf2). Say we want to reassign the disks from shelf2 to filer2.

The first step is to migrate all of the VMs from shelf2 to shelf1, because the operation is obviously disruptive to any host accessing data on the shelf being moved. Once all VMs are evacuated, take all volumes and the aggregate offline to prevent any data corruption (you can’t take an aggregate offline directly from the online state, so change it to restricted first).

If you prefer to reassign disks in two steps, as described in NetApp Professional Services Tech Note #021: Changing Disk Ownership, don’t forget to disable automatic ownership assignment on both controllers, otherwise disks will be assigned back to the same controller again, right after you unown them:

> options disk.auto_assign off

It’s not necessary if you change ownership in one step as shown below.

The next step is to actually reassign the disks. Since they are already part of an aggregate, you will need to force the ownership change:

filer1> disk assign 1b.01.00 -o filer2 -f

filer1> disk assign 1b.01.01 -o filer2 -f

filer1> disk assign 1b.01.nn -o filer2 -f

If you do not force disk reassignment you will get an error:

Assign request failed for disk 1b.01.0. Reason:Disk is part of a failed or offline aggregate or volume. Changing its owner may prevent aggregate or volume from coming back online. Ownership may be changed only by using the appropriate force option.
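Typing one command per disk gets tedious for a full shelf. Here is a minimal Python sketch that only prints the forced `disk assign` lines for you to paste into the console. The function name and the bay range are made up for illustration, and the two-digit `adapter.shelf.bay` naming is copied from the examples above, so adjust it to whatever `disk show` reports on your system.

```python
def disk_assign_commands(adapter, shelf, bays, new_owner):
    """Build one forced `disk assign` line per disk bay.

    Disk names follow the adapter.shelf.bay pattern used in this post
    (for example 1b.01.00). This is an assumption about your naming.
    """
    return [
        f"disk assign {adapter}.{shelf:02d}.{bay:02d} -o {new_owner} -f"
        for bay in bays
    ]


if __name__ == "__main__":
    # range(3) is a placeholder; use the real bay numbers of your shelf.
    for line in disk_assign_commands("1b", 1, range(3), "filer2"):
        print(line)
```

The script does not talk to the filer at all, so you can review the list before running anything.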

When all disks have been moved across to filer2, the aggregate will show up in the list of aggregates on filer2 and you’ll be able to bring it online. If you can’t see the aggregate, force the filer to rescan the drives by running:

filer2> disk show

The old aggregate will still be seen in the list on filer1. You can safely remove it:

filer1> aggr destroy aggr1


Overview of NetApp Replication and HA features

August 9, 2013

NetApp has quite a few features related to replication and clustering:

  • HA pairs (including mirrored HA pairs)
  • Aggregate mirroring with SyncMirror
  • MetroCluster (Fabric and Stretched)
  • SnapMirror (Sync, Semi-Sync, Async)

It’s easy to get lost here, so let’s try to understand what goes where.



SnapMirror

SnapMirror is volume-level replication, which normally works over an IP network (SnapMirror can work over FC, but only with FC-VI cards, and this is not widely used).

The asynchronous version of SnapMirror replicates data according to a schedule. SnapMirror Sync uses NVLOG shipping (described briefly in my previous post) to synchronously replicate data between two storage systems. SnapMirror Semi-Sync is in between and synchronizes writes at the Consistency Point (CP) level.

SnapMirror provides protection from data corruption inside a volume, but it doesn’t give you automatic failover of any sort. You need to break the SnapMirror relationship and present the data to clients manually, then resynchronize the volumes when the problem is fixed.


SyncMirror

SyncMirror mirrors aggregates and works at the RAID level. You can configure mirroring between two shelves of the same system to prevent an outage in case of a shelf failure.

SyncMirror uses the concept of plexes to describe mirrored copies of data. You have two plexes: plex0 and plex1. Each plex consists of disks from a separate pool: pool0 or pool1. Disks are assigned to pools depending on cabling. The disks of each pool must be in separate shelves to ensure high availability. Once the shelves are cabled, you enable SyncMirror and create a mirrored aggregate using the following syntax:

> aggr create aggr_name -m -d disk-list -d disk-list
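To make the two `-d` lists less abstract, here is a small Python sketch that builds the command string from two disk lists, one per pool. The helper name and the disk names are hypothetical. The checks (equal list lengths, no disk in both lists) reflect my understanding of what a mirrored aggregate needs, so treat them as assumptions rather than documented rules.

```python
def mirrored_aggr_command(aggr_name, pool0_disks, pool1_disks):
    """Return an `aggr create ... -m -d ... -d ...` line for a mirrored aggregate."""
    if not pool0_disks or not pool1_disks:
        raise ValueError("each plex needs at least one disk")
    # Assumption: both plexes are built from the same number of disks.
    if len(pool0_disks) != len(pool1_disks):
        raise ValueError("both disk lists must have the same length")
    # A disk cannot be part of both plexes.
    overlap = set(pool0_disks) & set(pool1_disks)
    if overlap:
        raise ValueError(f"disks listed in both pools: {sorted(overlap)}")
    return (
        f"aggr create {aggr_name} -m "
        f"-d {' '.join(pool0_disks)} -d {' '.join(pool1_disks)}"
    )


if __name__ == "__main__":
    # Hypothetical disk names; take the real ones from your own system.
    print(mirrored_aggr_command("aggr_m", ["0a.16", "0a.17"], ["0b.32", "0b.33"]))
```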

HA Pair

An HA pair is basically two controllers, each of which has connections to both its own and its partner’s shelves. When one of the controllers fails, the other one takes over. This is called Cluster Failover (CFO). The controllers’ NVRAMs are mirrored over the NVRAM interconnect link, so even data that hasn’t been committed to disk isn’t lost.


MetroCluster

MetroCluster provides failover at the storage system level. It uses the same SyncMirror feature beneath it to mirror data between two storage systems (instead of two shelves of the same system, as in a pure SyncMirror implementation). Now even if a storage controller fails together with all of its storage, you are safe. The other system takes over and continues to service requests.

An HA pair can’t fail over when a disk shelf fails, because the partner doesn’t have a copy of the data to service requests from.

Mirrored HA Pair

You can think of a Mirrored HA Pair as an HA pair with SyncMirror between the systems. You can implement almost the same configuration on an HA pair with SyncMirror inside (not between) the systems, because the odds of a whole storage system (controller + shelves) going down are very low. But mirroring between the two systems can give you more peace of mind.

It cannot fail over like MetroCluster when one of the storage systems goes down; the whole process is manual. The reasonable question here is why it cannot fail over if it has a copy of all the data. The answer is that MetroCluster is separate functionality, which performs all the checks and carries out a cutover to the mirror. This is called Cluster Failover on Disaster (CFOD). SyncMirror is only a mirroring facility and doesn’t even know that the cluster exists.
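To keep all of this straight, it may help to put the features side by side. The Python sketch below is just my summary of this post as a lookup table. The field names and the one-word recovery labels are my own shorthand, not NetApp terminology.

```python
# What each feature covers and how recovery happens, paraphrasing this post.
FEATURES = {
    "SnapMirror": {"scope": "volume", "recovery": "manual"},
    # A mirrored aggregate keeps serving from the surviving plex.
    "SyncMirror": {"scope": "aggregate", "recovery": "none needed"},
    # Partner takeover on controller failure (CFO).
    "HA pair": {"scope": "controller", "recovery": "automatic"},
    # Has a copy of the data, but cutover to the mirror is manual.
    "Mirrored HA pair": {"scope": "storage system", "recovery": "manual"},
    # Cutover to the mirror on disaster (CFOD).
    "MetroCluster": {"scope": "storage system", "recovery": "automatic"},
}


def features_by_recovery(recovery):
    """Return the feature names with the given recovery label, sorted."""
    return sorted(
        name for name, info in FEATURES.items() if info["recovery"] == recovery
    )


if __name__ == "__main__":
    for label in ("automatic", "manual"):
        print(label, "->", ", ".join(features_by_recovery(label)))
```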


Resize limits for SAN LUNs

July 1, 2013

If you run the lun resize command on a NetApp you might run into the following error:

lun resize: New size exceeds this LUN’s initial geometry

The reason behind it is that each SAN LUN has a head/cylinder/sector geometry. It’s not an actual physical mapping to the underlying disks and has no meaning these days. It’s simply a SCSI protocol artifact. But it imposes a limit on how far a LUN can be resized. The geometry is chosen at initial LUN creation and cannot be changed. Roughly, you can resize a LUN to about 10 times its size at the time of creation. For example, a 50GB LUN can be extended to a maximum of 502GB. See the table below for the maximum sizes:

Initial Size   Maximum Size
< 50g          502g
51-100g        1004g
101-150g       1506g
151-200g       2008g
201-251g       2510g
252-301g       3012g
302-351g       3514g
352-401g       4016g
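If you want to check a planned resize before touching the filer, the table translates into a simple lookup. This is a Python sketch with function names of my own. It treats each row as "up to and including the upper bound" (so 50g falls into the first row, matching the example above) and returns `None` for initial sizes the table does not cover.

```python
from bisect import bisect_left

# Upper bound of each "Initial Size" row and the matching "Maximum Size", in GB.
_INITIAL_UPPER_GB = [50, 100, 150, 200, 251, 301, 351, 401]
_MAXIMUM_GB = [502, 1004, 1506, 2008, 2510, 3012, 3514, 4016]


def max_lun_size_gb(initial_gb):
    """Return the maximum resize target in GB, or None if not in the table."""
    if initial_gb <= 0:
        raise ValueError("initial size must be positive")
    row = bisect_left(_INITIAL_UPPER_GB, initial_gb)
    if row == len(_INITIAL_UPPER_GB):
        return None  # larger than the last row of the table
    return _MAXIMUM_GB[row]


def can_resize(initial_gb, target_gb):
    """True if the table allows growing a LUN created at initial_gb to target_gb."""
    limit = max_lun_size_gb(initial_gb)
    return limit is not None and target_gb <= limit


if __name__ == "__main__":
    print(max_lun_size_gb(50))    # first row of the table
    print(can_resize(120, 2000))  # a 120g LUN is limited to 1506g
```

The lookup is only as good as the table, so for a real LUN the geometry check described next is the authoritative answer.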

To check the maximum size for a particular LUN, use the following commands:

> priv set diag
> lun geometry lun_path
> priv set

If you run into this issue, unfortunately you will need to create a new LUN, copy all the data (using robocopy, for example) and make a cutover, because features such as volume-level SnapMirror or ndmpcopy will recreate the LUN’s geometry together with the data.

Replacing hard drives in a NetApp aggregate

May 30, 2013

NetApp uses certain rules to assign hot spares in case of a failure. It always tries to use an exact match, but if there isn’t one, the best available spare is used. “The best” means that if you have an aggregate which consists of 1TB hard drives and you have only a 2TB spare left, then this 2TB spare will be downsized to 1TB and used as a data disk. After that, when you receive a correctly sized replacement from NetApp, you need to swap the downsized 2TB hard drive for the delivered 1TB spare. To accomplish that, use the following command:

> disk replace start disk_name spare_disk_name

It will take a considerable amount of time to copy the data. In my case it was 6.5 hours for a 1TB drive.
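For planning purposes, that figure works out to roughly 43 MB/s on average. The Python sketch below does the arithmetic and extrapolates to other drive sizes. It assumes decimal units and a copy time that scales linearly with capacity, which is only a rough guess: the real rate depends on the drive type and on how busy the filer is.

```python
def copy_rate_mb_per_s(size_tb, hours):
    """Average copy rate in MB/s, using decimal units (1 TB = 10**12 bytes)."""
    return size_tb * 1e12 / (hours * 3600) / 1e6


def estimated_hours(size_tb, rate_mb_per_s):
    """Hours needed to copy size_tb at a constant rate (a linear assumption)."""
    return size_tb * 1e12 / (rate_mb_per_s * 1e6) / 3600


if __name__ == "__main__":
    rate = copy_rate_mb_per_s(1, 6.5)  # the 1TB / 6.5 hours observed above
    print(f"observed rate: {rate:.1f} MB/s")
    print(f"2TB at the same rate: {estimated_hours(2, rate):.1f} hours")
```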

When the process finishes, the replaced drive becomes a new spare. It’s wise to zero it out right away, so that it can be used as a spare again without delay. Otherwise, when the time comes, you’ll be waiting hours before it can be added in place of a failed drive:

> disk zero spares

As a side note, I want to mention that you cannot take disks out of a RAID group. There is no way to shrink an aggregate. The only thing you can do is replace a hard drive with another one.

Unexpected Deduplication Impact on VMware I/O Latency

May 28, 2013

NetApp deduplication is a postponed process. During normal operation Data ONTAP only calculates hashes for the data blocks. Actual deduplication is carried out off-hours, as per the configured schedule. Hash calculation doesn’t affect performance in most cases. I talked about that in my previous post. NetApp states in its documentation that deduplication is a low-priority process:

When one deduplication process is running, there is 0% to 15% performance degradation on other applications.

Once I faced a situation where deduplication was configured to run during business hours on one of the volumes. No one noticed that at some point the volume ran out of space and Data ONTAP hadn’t been able to perform deduplication since then. The situation became worse when Data ONTAP was upgraded from version 7.3.2 to 8.1.0. Now, during deduplication at 15:00 every day, the filer tried to upgrade the fingerprint metadata to a new version with the message “Fingerprint is being upgraded”, and failed. It seems that the metadata upgrade is a very resource-intensive process and heavily affects I/O latency.

This volume was not a VMware datastore, but it sat on the same aggregate as several VMFS LUNs. Here is what happened to the VMware I/O latency every day at 15:00:


I removed the host name and the datastore names from the graph. You can see the large latency spike, which won’t send your VMs into a kernel panic, but it’s not something you would want your production environment to experience every day.

The solution was simple. After space was increased on this volume, the deduplication metadata upgrade completed successfully and the problem went away. Additionally, deduplication was shifted to off-hours.

The simple lesson to learn: don’t schedule deduplication during the day; you never know what could possibly go wrong.

NetApp storage architecture

October 9, 2011

All of us are used to SATA disk drives connected to our workstations, and we call that storage. Some organizations have RAID arrays. RAID is one level of logical abstraction, which combines several hard drives to form a logical drive with greater size, reliability or speed. What would you say if I told you that NetApp has the following terms in its storage architecture paradigm: disk, RAID group, plex, aggregate, volume, qtree, LUN, directory, file? Let’s try to understand how all this works together.

RAID in NetApp terminology is called a RAID group. Unlike ordinary storage systems, NetApp works mostly with RAID 4 and RAID-DP, where RAID 4 has one separate disk for parity and RAID-DP has two. Don’t think that this leads to performance degradation: NetApp has a very efficient implementation of these RAID levels.

A plex is a collection of RAID groups and is used for RAID-level mirroring. For instance, if you have two disk shelves and a SyncMirror license, then you can create plex0 from the first shelf’s drives and plex1 from the second shelf’s. This will protect you from the failure of one disk shelf.

An aggregate is simply the highest level of hardware abstraction in NetApp and is used to manage plexes, RAID groups, etc.

A volume is a logical file system. It’s a well-known term in the Windows/Linux/Unix realms and serves the same purpose. A volume may contain files, directories, qtrees and LUNs. It’s the highest level of abstraction from the logical point of view. Data in a volume can be accessed by any of the protocols NetApp supports: NFS, CIFS, iSCSI, FCP, WebDAV, HTTP.

A qtree can contain files and directories or even LUNs and is used to apply security and quota rules to the contained objects with user/group granularity.

A LUN is necessary to access data via block-level protocols like FCP and iSCSI. Files and directories are used with the file-level protocols NFS/CIFS/WebDAV/HTTP.
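One way to see how the layers nest is to model them as plain data structures. The Python sketch below is only an illustration of the terms above, not of how Data ONTAP stores anything internally. The class and field names are mine, and the example sizes (14-disk RAID groups) are arbitrary.

```python
from dataclasses import dataclass, field
from typing import List

# Parity disks per RAID group, as described above.
PARITY_DISKS = {"raid4": 1, "raid_dp": 2}


@dataclass
class RaidGroup:
    raid_type: str   # "raid4" or "raid_dp"
    disk_count: int  # data + parity disks in the group

    def data_disks(self) -> int:
        return self.disk_count - PARITY_DISKS[self.raid_type]


@dataclass
class Plex:
    name: str
    raid_groups: List[RaidGroup] = field(default_factory=list)


@dataclass
class Volume:
    name: str
    qtrees: List[str] = field(default_factory=list)
    luns: List[str] = field(default_factory=list)


@dataclass
class Aggregate:
    name: str
    plexes: List[Plex] = field(default_factory=list)
    volumes: List[Volume] = field(default_factory=list)

    def data_disks(self) -> int:
        # Mirrored plexes hold copies of the same data, so count plex0 only.
        return sum(rg.data_disks() for rg in self.plexes[0].raid_groups)


if __name__ == "__main__":
    aggr = Aggregate(
        name="aggr1",
        plexes=[
            Plex("plex0", [RaidGroup("raid_dp", 14)]),
            Plex("plex1", [RaidGroup("raid_dp", 14)]),
        ],
        volumes=[Volume("vol1", qtrees=["qtree1"], luns=["lun0"])],
    )
    print(aggr.data_disks())  # data disks in one plex
```

Reading it bottom-up gives the same picture as the text: disks form RAID groups, RAID groups form plexes, plexes form an aggregate, and volumes with their qtrees and LUNs live on top of the aggregate.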