Posts Tagged ‘igroup’

Magic behind NetApp VSC Backup/Restore

June 12, 2013

NetApp Virtual Storage Console is a plug-in for VMware vCenter that provides instant backup and restore using NetApp snapshots. It relies on several underlying NetApp features to accomplish its tasks, which I want to describe here.

Backup Process

When you configure a backup job in VSC, it simply creates a NetApp snapshot of the target volume on the filer. Interestingly, if you have two VMFS datastores inside one volume, both LUNs will be snapshotted, since snapshots are taken at the volume level. During a datastore restore, however, the second LUN is left intact. You would think that if VSC reverted the volume to the earlier snapshot, both datastores would be affected, but that’s not the case, because VSC uses Single File SnapRestore to restore only the LUN in question (this will be explained below). Creating several VMFS LUNs inside one volume is not a best practice, but it’s good to know that VSC works correctly in this case.
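The difference between a volume-level revert and a single-LUN revert can be illustrated with a toy model. This is only a sketch of the idea, not how Data ONTAP implements snapshots; the `Volume` class and the LUN names are made up for illustration.

```python
class Volume:
    """Toy model of a volume holding several LUNs (name -> contents)."""

    def __init__(self, luns):
        self.luns = dict(luns)
        self.snapshots = {}

    def snapshot(self, name):
        # A snapshot captures every LUN in the volume at once.
        self.snapshots[name] = dict(self.luns)

    def restore_volume(self, name):
        # Volume-level revert: every LUN goes back to the snapshot state.
        self.luns = dict(self.snapshots[name])

    def restore_lun(self, name, lun):
        # Single-file style revert: only the named LUN is rolled back.
        self.luns[lun] = self.snapshots[name][lun]


vol = Volume({"ds1_lun": "v1", "ds2_lun": "v1"})
vol.snapshot("nightly.0")
vol.luns["ds1_lun"] = "v2"
vol.luns["ds2_lun"] = "v2"

vol.restore_lun("nightly.0", "ds1_lun")
print(vol.luns)  # {'ds1_lun': 'v1', 'ds2_lun': 'v2'}
```

Only `ds1_lun` goes back to its snapshot contents, while `ds2_lun` keeps the changes made after the snapshot. Calling `restore_volume` instead would roll back both.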

The same goes for VMs. There is no point in backing up a single VM in a datastore, because VSC will take a volume snapshot anyway. Back up the whole datastore in that case.

Datastore Restore

After a backup is done, you have three restore options. The first and least useful kind is a datastore restore. The only use case for such a restore that I can think of is disaster recovery. But disaster recovery procedures are usually separate from backups and are based on replication to a disaster recovery site.

VSC uses NetApp’s Single File SnapRestore (SFSR) feature to restore a datastore. In a SAN implementation, SFSR reverts only the required LUN from the snapshot to its previous state instead of the whole volume. My guess is that SnapRestore uses LUN clone/split functionality in the background to create a new LUN from the snapshot, swap the old LUN with the new one, and then delete the old one. But I haven’t found a clear answer to that question.

For that functionality to work, you need a SnapRestore license. In fact, you can do the same trick manually by issuing a SnapRestore command:

> snap restore -t file -s nightly.0 /vol/vol_name/vmfs_lun_name

If you have only one LUN in the volume (as you should), then you can simply restore the whole volume with the same effect:

> snap restore -t vol -s nightly.0 /vol/vol_name

VM Restore

VM restore is also a somewhat controversial way of restoring data, because it completely removes the old VM. There is no way to keep the old .vmdks. You can restore particular virtual hard drives to another datastore, but the old .vmdks are not kept even in this case.

VSC uses a different mechanism to perform a VM restore. It creates a LUN clone (not to be confused with FlexClone, which is a volume cloning feature) from a snapshot. A LUN clone doesn’t use any additional space on the filer, because its data is mapped to the blocks that sit inside the snapshot. VSC then maps the new LUN to the ESXi host that you specify in the restore job wizard. Once the datastore is accessible to the ESXi host, VSC simply removes the old VMDKs and performs a Storage vMotion from the clone to the active datastore (or the one you specify in the job). The clone is then removed as part of the cleanup process.

The equivalent CLI command is (both paths are LUN paths, not volume paths):

> lun clone create /vol/vol_name/clone_lun_name -o noreserve -b /vol/vol_name/vmfs_lun_name nightly.0

Backup Mount

Probably the most useful way of recovery. VSC allows you to mount the backup to a particular ESXi host and do whatever you want with the .vmdks. After the mount you can connect a virtual disk to the same or another virtual machine and recover the data you need.

If you want to connect the disk to the original VM, make sure you change the disk UUID first, otherwise the VM won’t boot. Connect to the ESXi console and run:

# vmkfstools -J setuuid /vmfs/volumes/datastore/VM/vm.vmdk

Backup mount uses the same LUN cloning feature. The LUN is cloned from a snapshot and connected as a datastore. After an unmount, the LUN clone is destroyed.

Some Notes

VSC doesn’t do a good cleanup after a restore. As part of the LUN mapping to the ESXi hosts, VSC creates new igroups on the NetApp filer, which it doesn’t delete after the restore is completed.

What’s more interesting, when you restore a VM, VSC deletes the .vmdks of the old VM but leaves all the other files (.vmx, .log, .nvram, etc.) in place. Instead of completely replacing the VM’s folder, it creates a new folder vmname_1 and copies everything into it. So if you use VSC now and then, you will end up with these old folders left behind.

Connecting VMware ESXi Hosts to NetApp: MPIO Configuration

May 23, 2013


NetApp filers are active/active ALUA arrays. It means that you can access LUNs configured on one controller via the second one. But access to the partner’s LUNs is provided through the internal interconnect and is always slower. That’s why the paths to the controller through the partner are called “unoptimized”. Their primary usage is to provide backup paths in case of a failover.

Fixed path selection

VMware hosts by default use the “VMW_SATP_DEFAULT_AA” Storage Array Type Policy (SATP) and the “Fixed” Path Selection Policy (PSP) for active/active arrays. If an ESXi host is configured with this SATP and PSP, it will access each LUN via one particular path, even if you have two FC ports on each of the controllers.

With this configuration a VMware host can’t automatically identify the optimized path. So you can either set it manually or use the NetApp Virtual Storage Console (VSC) plug-in for VMware. Just go to the Monitoring and Host Configuration -> Overview section of VSC, right click on the ESXi host and click “Set Recommended Values”. If you don’t do that, ESXi hosts will run I/O traffic through an arbitrarily chosen path, which could turn out to be unoptimized. It means you will push heaps of I/O through the partner node and experience higher latencies.

You can check if you’re using non-optimized paths by looking for warnings like these on the NetApp filers:

filer_01> Mon May 6 10:30:45 EST [filer_01: ems.engine.inputSuppress:error]: Event ‘scsitarget.partnerPath.misconfigured’ suppressed 327 times since Mon May 6 09:30:48 EST 2013.
Mon May 6 10:30:45 EST [filer_01: scsitarget.partnerPath.misconfigured:error]: FCP Partner Path Misconfigured – Host I/O access through a non-primary and non-optimal path was detected.

Or run “lun stats -o” and look for huge numbers under “Partner Ops” and “Partner KBytes”.
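If you have the filer's log saved as text, a few lines of Python can count these warnings for you. This is a minimal sketch that assumes the messages look like the two lines quoted above; the function name `partner_path_hits` is my own, and the sample lines are adapted from that excerpt.

```python
import re

EVENT = "scsitarget.partnerPath.misconfigured"
SUPPRESSED = re.compile(r"suppressed (\d+) times")


def partner_path_hits(lines):
    """Return (logged_events, suppressed_events) for the partner-path warning."""
    logged = 0
    suppressed = 0
    for line in lines:
        if EVENT not in line:
            continue
        match = SUPPRESSED.search(line)
        if match:
            # Summary line: the event was suppressed N times in the interval.
            suppressed += int(match.group(1))
        else:
            # The warning itself was written to the log.
            logged += 1
    return logged, suppressed


sample = [
    "Mon May 6 10:30:45 EST [filer_01: ems.engine.inputSuppress:error]: "
    "Event 'scsitarget.partnerPath.misconfigured' suppressed 327 times "
    "since Mon May 6 09:30:48 EST 2013.",
    "Mon May 6 10:30:45 EST [filer_01: scsitarget.partnerPath.misconfigured:error]: "
    "FCP Partner Path Misconfigured - Host I/O access through a non-primary "
    "and non-optimal path was detected.",
]
print(partner_path_hits(sample))  # (1, 327)
```

A large suppressed count over a short interval suggests that hosts are steadily sending I/O through the partner controller rather than hitting it occasionally.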

ALUA configuration

If you want to use both links to the controller in a round robin fashion, you need to do some additional configuration. You should enable ALUA for the initiator group of your VMware ESXi hosts on the NetApp filer:

igroup set <group> alua yes

Now you need to reboot the ESXi host. After the reboot it will see that the storage is ALUA-capable and change the SATP to VMW_SATP_ALUA and the PSP to “Most Recently Used”. To load balance between the two paths to the controller you need to change the PSP to “Round Robin”. Again, you can do that either manually or via VSC.
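For the manual route, the PSP can be changed with esxcli from the ESXi shell. The sketch below only builds the command lines so you can review them before running anything. It assumes the ESXi 5.x esxcli syntax (the `storage nmp` namespace); the helper name and the device IDs are placeholders, so substitute the real naa identifiers of your NetApp LUNs.

```python
def round_robin_commands(device_ids, set_default=False):
    """Build esxcli command lines that switch devices to Round Robin."""
    commands = []
    if set_default:
        # Make Round Robin the default PSP for devices claimed by VMW_SATP_ALUA.
        commands.append(
            "esxcli storage nmp satp set --satp VMW_SATP_ALUA --default-psp VMW_PSP_RR"
        )
    for device in device_ids:
        # Change the PSP of a device that has already been claimed.
        commands.append(
            f"esxcli storage nmp device set --device {device} --psp VMW_PSP_RR"
        )
    return commands


# Placeholder device IDs, not real LUN identifiers.
for line in round_robin_commands(["naa.EXAMPLE0001", "naa.EXAMPLE0002"], set_default=True):
    print(line)
```

You can check the current SATP and PSP of each device with "esxcli storage nmp device list" before and after the change.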

Note: Don’t ever use ALUA and VMW_SATP_ALUA if you have a Windows Server 2003 MSCS or Windows Server 2008 Failover Cluster with shared RDM LUNs. It’s an unsupported configuration and you can run into a cluster failure. This is described in many places.

In this case leave the SATP as “VMW_SATP_DEFAULT_AA” and the PSP as “Fixed”, and make sure that you use optimized paths.