Posts Tagged ‘vCenter’

Magic behind NetApp VSC Backup/Restore

June 12, 2013

NetApp Virtual Storage Console is a plug-in for VMware vCenter which provides capabilities to perform instant backup/restore using NetApp snapshots. It uses several underlying NetApp features to accomplish its tasks, which I want to describe here.

Backup Process

When you configure a backup job in VSC, it simply creates a NetApp snapshot of the target volume on a NetApp filer. Interestingly, if you have two VMFS datastores inside one volume, then both LUNs will be snapshotted, since snapshots are taken at the volume level. But during a datastore restore, the second LUN will be left intact. You would think that if VSC reverts the volume to the previously made snapshot, then both datastores should be affected, but that’s not the case, because VSC uses Single File SnapRestore to restore the LUN (this is explained below). Creating several VMFS LUNs inside one volume is not a best practice. But it’s good to know that VSC works correctly in this case.

The same goes for VMs. There is no sense in backing up a single VM in a datastore, because VSC will make a volume snapshot anyway. Back up the whole datastore in that case.

Datastore Restore

After a backup is done, you have three restore options. The first and least useful kind is a datastore restore. The only use case for such a restore that I can think of is disaster recovery. But disaster recovery procedures are usually separate from backups and are based on replication to a disaster recovery site.

VSC uses NetApp’s Single File SnapRestore (SFSR) feature to restore a datastore. In a SAN implementation, SFSR reverts only the required LUN to its state in the snapshot instead of reverting the whole volume. My guess is that SnapRestore uses LUN clone/split functionality in the background to create a new LUN from the snapshot, then swaps the old LUN with the new one and deletes the old one. But I haven’t found a clear answer to that question.

For that functionality to work, you need a SnapRestore license. In fact, you can do the same trick manually by issuing a SnapRestore command:

> snap restore -t file -s nightly.0 /vol/vol_name/vmfs_lun_name

If you have only one LUN in the volume (and you should), then you can simply restore the whole volume with the same effect:

> snap restore -t vol -s nightly.0 /vol/vol_name
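The difference between the two commands above can be illustrated with a toy model. This is only a sketch of the behaviour described in this post, not NetApp code: the `ToyVolume` class and the LUN names are made up for illustration.

```python
import copy


class ToyVolume:
    """Toy stand-in for a volume holding several LUNs (hypothetical, not a NetApp API)."""

    def __init__(self, luns):
        self.luns = dict(luns)   # LUN name -> current contents
        self.snapshots = {}      # snapshot name -> frozen copy of all LUNs

    def snap_create(self, name):
        # Snapshots are taken at the volume level, so every LUN is captured.
        self.snapshots[name] = copy.deepcopy(self.luns)

    def restore_file(self, snap, lun):
        # Mimics "snap restore -t file": only the named LUN is reverted.
        self.luns[lun] = copy.deepcopy(self.snapshots[snap][lun])

    def restore_vol(self, snap):
        # Mimics "snap restore -t vol": every LUN in the volume is reverted.
        self.luns = copy.deepcopy(self.snapshots[snap])


vol = ToyVolume({"vmfs_lun_a": "a-old", "vmfs_lun_b": "b-old"})
vol.snap_create("nightly.0")
vol.luns["vmfs_lun_a"] = "a-new"
vol.luns["vmfs_lun_b"] = "b-new"

# Single-file restore: the second LUN keeps its newer data.
vol.restore_file("nightly.0", "vmfs_lun_a")
print(vol.luns)
```

With two LUNs in the volume, only the file-level restore leaves the second datastore untouched. A volume-level restore would roll both back.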

VM Restore

VM restore is also a somewhat controversial way of restoring data, because it completely removes the old VM. There is no way to keep the old .vmdks. You can restore particular virtual hard drives to another datastore, but even then the old .vmdks are not kept.

VSC uses another mechanism to perform a VM restore. It creates a LUN clone (not to be confused with FlexClone, which is a volume cloning feature) from a snapshot. A LUN clone doesn’t use any additional space on the filer, because its data is mapped to the blocks which sit inside the snapshot. Then VSC maps the new LUN to the ESXi host which you specify in the restore job wizard. Once the datastore is accessible to the ESXi host, VSC simply removes the old VMDKs and performs a storage vMotion from the clone to the active datastore (or the one you specify in the job). Then the clone is removed as part of the cleanup process.

The equivalent CLI command for that is:

> lun clone create /vol/vol_name/clone_lun_name -o noreserve -b /vol/vol_name/vmfs_lun_name nightly.0

Backup Mount

This is probably the most useful way of recovery. VSC allows you to mount the backup to a particular ESXi host and do whatever you want with the .vmdks. After the mount you can connect a virtual disk to the same or another virtual machine and recover the data you need.

If you want to connect the disk to the original VM, make sure you change the disk UUID, otherwise the VM won’t boot. Connect to the ESXi console and run:

# vmkfstools -J setuuid /vmfs/volumes/datastore/VM/vm.vmdk

Backup mount uses the same LUN cloning feature. The LUN is cloned from a snapshot and connected as a datastore. After an unmount the LUN clone is destroyed.

Some Notes

VSC doesn’t do a good cleanup after a restore. As part of the LUN mapping to the ESXi hosts, VSC creates new igroups on the NetApp filer, which it doesn’t delete after the restore is completed.

What’s more interesting, when you restore a VM, VSC deletes the .vmdks of the old VM, but leaves all the other files (.vmx, .log, .nvram, etc.) in place. Instead of completely replacing the VM’s folder, it creates a new folder vmname_1 and copies everything into it. So if you use VSC now and then, you will have these old folders left behind.
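One way to spot such leftovers is to look for folders that no longer contain any .vmdk files. The sketch below is not part of VSC. It assumes the datastore is reachable as an ordinary directory from wherever Python runs, and the function name is made up.

```python
from pathlib import Path


def folders_without_vmdk(datastore_root):
    """Return names of top-level folders that contain no .vmdk files."""
    root = Path(datastore_root)
    leftovers = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        # A folder left behind by a VM restore still holds .vmx, .log, etc.,
        # but its .vmdk files are gone.
        if not any(folder.glob("*.vmdk")):
            leftovers.append(folder.name)
    return leftovers
```

Treat the result only as a list of candidates. Other folders without virtual disks will match too, so check each one by hand before deleting anything.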

Mounting VMware Virtual Disks

June 11, 2013

There are millions of posts on that topic all over the Internet. Just another repetition mostly for myself.

VMware has a Virtual Disk Development Kit (VDDK), which is more of an API for backup software vendors. But it includes a handy tool called vmware-mount, which lets you mount VMware virtual disks (.vmdk) from wherever you want.

Download VDDK from the VMware site. It’s free. Then run vmware-mount with the following keys:

> vmware-mount driveletter: "[vmfs_datastore] vmname/diskname.vmdk" /i:"datacentername/vm/vmname" /h:vcname /u:username /s:password

Choose a drive letter, then specify the vmdk path, the inventory path to the VM (put ‘vm’ in lowercase between the datacenter and the VM name, upper case will give you an error) and the vCenter or ESXi host name.

Note, however, that you can mount only vmdks from powered-off VMs. But there is a workaround. You can mount a vmdk from an online VM in read-only mode if you make a VM snapshot. Then the original vmdk won’t be locked by the ESXi server and you will be able to mount it.
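If you mount disks often, the argument list can be assembled by a small helper. This is a sketch only: the function name and its parameters are made up, and it uses nothing beyond the keys shown in the command above.

```python
def build_vmware_mount_args(drive, datastore, vmdk_path, datacenter, vm_name,
                            host, user, password):
    """Assemble the argument list for the vmware-mount call shown above."""
    # The inventory path needs a literal lowercase "vm" between the
    # datacenter name and the VM name.
    inventory_path = f"{datacenter}/vm/{vm_name}"
    return [
        "vmware-mount",
        f"{drive}:",
        f"[{datastore}] {vmdk_path}",
        f"/i:{inventory_path}",
        f"/h:{host}",
        f"/u:{user}",
        f"/s:{password}",
    ]


args = build_vmware_mount_args("Z", "vmfs_datastore", "vmname/diskname.vmdk",
                               "datacentername", "vmname",
                               "vcname", "username", "password")
print(args)
```

The list can be passed to subprocess.run() as is. Building the arguments as a list leaves the quoting of the vmdk path, which contains a space, to Python.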

To unmount a vmdk run:

> vmware-mount driveletter: /d

There are also several GUI tools to mount vmdks. But vmware-mount is enough for me.

Limiting the number of concurrent storage vMotions

June 6, 2013

VMware vCenter allows several concurrent storage vMotions on a datastore. But this can negatively impact your production environment by hammering the underlying storage. If you want to migrate several virtual machines to another datastore, it’s much safer to do that one by one. But that’s too much manual work.

There is a simple way to limit the number of concurrent storage vMotions by configuring vCenter advanced settings. A group of resource management parameters sets network, host and datastore limits for vMotion and Storage vMotion. They are called limits and costs. For ESXi 4.1 the default datastore limit for migration with Storage vMotion is 128, and the datastore resource cost of a Storage vMotion is 16 (defaults for other versions of ESXi can be found here: Limits on Simultaneous Migrations). It basically means that 128 / 16 = 8 concurrent storage vMotions are allowed for each datastore. So to allow only one storage vMotion at a time you can either change the limit to 16 or the cost to 128.

Let’s say we choose to change the cost to 128. There are two ways of doing it. The first one is to edit the vCenter vpxd.cfg file and add the following stanza between the <vpxd></vpxd> tags:
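The limit/cost arithmetic is easy to check with a few lines of Python. This is just the division described above, and the function name is made up.

```python
def max_concurrent_svmotions(datastore_limit, svmotion_cost):
    """How many storage vMotions fit into the datastore limit at the given cost."""
    if svmotion_cost <= 0:
        raise ValueError("cost must be a positive number")
    return datastore_limit // svmotion_cost


print(max_concurrent_svmotions(128, 16))   # ESXi 4.1 defaults
print(max_concurrent_svmotions(16, 16))    # limit lowered to 16
print(max_concurrent_svmotions(128, 128))  # cost raised to 128
```

The defaults give 8, and both changes bring the result down to 1.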

<ResourceManager>
<CostPerEsx41SVmotion>128</CostPerEsx41SVmotion>
</ResourceManager>

The second, simpler way is to go to vCenter -> Administration -> vCenter Server Settings -> Advanced Settings and add the config.vpxd.ResourceManager.CostPerEsx41SVmotion key with a value of 128. You will probably need to restart vCenter after that.

There is one caveat, however. If you migrate VMs from, say, 3 source datastores to 1 destination datastore, then 3 concurrent storage vMotions will kick off. I do not know the reason for that, but that’s what I found in practice.
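The first option, editing vpxd.cfg, can also be scripted. Below is a minimal sketch using Python’s standard XML library. It assumes that the file is plain XML with a <vpxd> element either at the root or directly under it. The function name and the sample input are made up.

```python
import xml.etree.ElementTree as ET


def set_svmotion_cost(cfg_xml, cost):
    """Return cfg_xml with ResourceManager/CostPerEsx41SVmotion set to cost."""
    root = ET.fromstring(cfg_xml)
    # Assumption: <vpxd> is the root element or a direct child of it.
    vpxd = root if root.tag == "vpxd" else root.find("vpxd")
    if vpxd is None:
        raise ValueError("no <vpxd> element found")
    manager = vpxd.find("ResourceManager")
    if manager is None:
        manager = ET.SubElement(vpxd, "ResourceManager")
    key = manager.find("CostPerEsx41SVmotion")
    if key is None:
        key = ET.SubElement(manager, "CostPerEsx41SVmotion")
    key.text = str(cost)
    return ET.tostring(root, encoding="unicode")


sample = "<config><vpxd><other>1</other></vpxd></config>"  # made-up input
updated = set_svmotion_cost(sample, 128)
print(updated)
```

Keep a copy of the original file before overwriting it. This parser does not preserve XML comments or the original formatting.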

Unexpected Deduplication Impact on VMware I/O Latency

May 28, 2013

NetApp deduplication is a post-process operation. During normal operation Data ONTAP only calculates hashes for the data blocks. Actual deduplication is carried out off-hours according to the configured schedule. Hash calculation doesn’t affect performance in most cases. I talked about that in my previous post. NetApp states in its documentation that deduplication is a low-priority process:

When one deduplication process is running, there is 0% to 15% performance degradation on other applications.

I once faced a situation where deduplication was configured to run during business hours on one of the volumes. No one noticed that at some point the volume ran out of space, and from then on Data ONTAP wasn’t able to perform deduplication. The situation became worse when Data ONTAP was upgraded from version 7.3.2 to 8.1.0. Now, during deduplication at 15:00 every day, the filer tried to upgrade the fingerprint metadata to a new version with the message “Fingerprint is being upgraded”, and failed. It seems that the metadata upgrade is a very resource-intensive process and heavily affects I/O latency.

This volume was not a VMware datastore, but it sat on the same aggregate as several VMFS LUNs. Here is what happened to the VMware I/O latency every day at 15:00:

[Graph: VMware I/O latency with a spike at 15:00]

I deleted the host name and the datastore names from the graph. You can see the large latency spike. It won’t send your VMs into a kernel panic, but it’s not something you would want your production environment to experience every day.

The solution was simple. After the space on this volume was increased, the deduplication metadata upgrade completed successfully and the problem went away. Additionally, deduplication was shifted to off-hours.

The simple lesson to learn: don’t schedule deduplication during the day. You never know what could possibly go wrong.