Posts Tagged ‘disaster recovery’

Zerto Overview

March 6, 2014

zerto-logoZerto is a VM replication product which works on a hypervisor level. In contrast to array level replication, which SRM has been using for a long time, it eliminates storage array from the equation and all the complexities which used to come along with it (SRAs, splitting the LUNs for replicated and non-replicated VMs, potential incompatibilities between the orchestrated components, etc).

Basic Operation

Zerto consists of two components: ZVM (Zerto Virtual Manger) and VRA (Virtual Replication Appliance). VRAs are VMs that need to be installed on each ESXi host within the vCenter environment (performed in automated fashion from within ZVM console). ZVM manages VRAs and all the replication settings and is installed one per vCenter. VRA mirrors protected VMs I/O operations to the recovery site. VMs are grouped in VPGs (Virtual Protection Groups), which can be used as a consistency group or just a container.

Protected VMs can be preseeded  to DR site. But what Zerto essentially does is it replicates VM disks to any datastore on recovery site where you point it to and then tracks changes in what is called a journal volume. Journal is created for each VM and is kept as a VMDK within the “ZeRTO volumes” folder on a target datastore. Every few seconds Zerto creates checkpoints on a journal, which serve as crash consistent recovery points. So you can recover to any point in time, with a few seconds granularity. You can set the journal length in hours, depending on how far you potentially would want to go back. It can be anywhere between 1 and 120 hours.Data-Replication-over-WAN

VMs are kept unregistered from vCenter on DR site and VM configuration data is kept in Zerto repository. Which essentially means that if an outage happens and something goes really wrong and Zerto fails to bring up VMs on DR site you will need to recreate VMs manually. But since VMDKs themselves are kept in original format you will still be able to attach them to VMs and power them on.

Failover Scenarios

There are four failover scenarios within Zerto:

  • Move Operation – VMs are shut down on production site, unregistered from inventory, powered on at DR site and protection is reversed if you decide to do so. If you choose not to reverse protection, VMs are completely removed from production site and VPG is marked as “Needs Configuration”. This scenario can be seen as a planned migration of VMs between the sites and needs both sites to be healthy and operational.
  • Failover Operation – is used in disaster scenario when production site might be unavailable. In this case Zerto brings up protected VMs on DR site, but it does not try to remove VMs from production site inventory and leave them as is. If production site is still accessible you can optionally select to shutdown VMs. You cannot automatically reverse protection in this scenario, VPG is marked as “Needs Configuration” and can be activated later. And when it is activated, Zerto does all the clean up operations on the former production site: shuts down VMs (if they haven’t been already), unregister them from inventory and move to VRA folder on the datastore.
  • Failover Test Operation – this is for failover testing and brings up VMs on DR site in a configured bubble network which is normally not uplinked to any physical network. VMs continue to run on both sites. Note that VMs disk files in this scenario are not moved to VMs folders (as in two previous scenarios) and are just connected from VRA VM folder. You would also notice that Zerto created second journal volume which is called “scratch” journal. Changes to the VM that is running on DR site are saved to this journal while it’s being tested.
  • Clone Operation – VMs are cloned on DR site and connected to network. VMs are not automatically powered on to prevent potential network conflicts. This can be used for instance in DR site testing, when you want to check actual networking connectivity, instead of connecting VMs to an isolated network. Or for implementing backups, cloned environment for applications testing, etc.

Zerto Journal Sizing

By default journal history is configured as 4 hours and journal size is unlimited. Depending on data change rate within the VM journal can be smaller or larger. 15GB is approximately enough storage to support a virtual machine with 1TB of storage, assuming a 10% change rate per day with four hours of journal history saved. Zerto has a Journal Sizing Tool which helps to size journals. You can create a separate journal datastore as well.

Zerto compared to VMware Replication and SRM

There are several replication products in the market from VMware. Standalone VMware replication, VMware replication + SRM orchestraion and SRM array-based replication. If you want to know more on how they compare to Zerto, you can read the articles mentioned in references below. One apparent Zerto advantage, which I want to mention here, is integration with vCloud Director, which is essential for cloud providers who offer DRaaS solutions. SRM has no vCloud Director support.



Magic behind NetApp VSC Backup/Restore

June 12, 2013

netapp_dpNetApp Virtual Storage Console is a plug-in for VMware vCenter which provides capabilities to perform instant backup/restore using NetApp snapshots. It uses several underlying NetApp features to accomplish its tasks, which I want to describe here.

Backup Process

When you configure a backup job in VSC, what VSC does, is it simply creates a NetApp snapshot for a target volume on a NetApp filer. Interestingly, if you have two VMFS datastores inside one volume, then both LUNs will be snapshotted, since snapshots are done on the volume level. But during the datastore restore, the second volume will be left intact. You would think that if VSC reverts the volume to the previously made snapshot, then both datastores should be affected, but that’s not the case, because VSC uses Single File SnapRestore to restore the LUN (this will be explained below). Creating several VMFS LUNs inside one volume is not a best practice. But it’s good to know that VSC works correctly in this case.

Same thing for VMs. There is no sense of backing up one VM in a datastore, because VSC will make a volume snapshot anyway. Backup the whole datastore in that case.

Datastore Restore

After a backup is done, you have three restore options. The first and least useful kind is a datastore restore. The only use case for such restore that I can think of is disaster recovery. But usually disaster recovery procedures are separate from backups and are based on replication to a disaster recovery site.

VSC uses NetApp’s Single File SnapRestore (SFSR) feature to restore a datastore. In case of a SAN implementation, SFSR reverts only the required LUN from snapshot to its previous state instead of the whole volume. My guess is that SnapRestore uses LUN clone/split functionality in background, to create new LUN from the snapshot, then swap the old with the new and then delete the old. But I haven’t found a clear answer to that question.

For that functionality to work, you need a SnapRestore license. In fact, you can do the same trick manually by issuing a SnapRestore command:

> snap restore -t file -s nightly.0 /vol/vol_name/vmfs_lun_name

If you have only one LUN in the volume (and you have to), then you can simply restore the whole volume with the same effect:

> snap restore -t vol -s nightly.0 /vol/vol_name

VM Restore

VM restore is also a bit controversial way of restoring data. Because it completely removes the old VM. There is no way to keep the old .vmdks. You can use another datastore for particular virtual hard drives to restore, but it doesn’t keep the old .vmdks even in this case.

VSC uses another mechanism to perform VM restore. It creates a LUN clone (don’t confuse with FlexClone,which is a volume cloning feature) from a snapshot. LUN clone doesn’t use any additional space on the filer, because its data is mapped to the blocks which sit inside the snapshot. Then VSC maps the new LUN to the ESXi host, which you specify in the restore job wizard. When datastore is accessible to the ESXi host, VSC simply removes the old VMDKs and performs a storage vMotion from the clone to the active datastore (or the one you specify in the job). Then clone is removed as part of a clean up process.

The equivalent cli command for that is:

> lun clone create /vol/clone_vol_name -o noreserve -b /vol/vol_name nightly.0

Backup Mount

Probably the most useful way of recovery. VSC allows you to mount the backup to a particular ESXi host and do whatever you want with the .vmdks. After the mount you can connect a virtual disk to the same or another virtual machine and recover the data you need.

If you want to connect the disk to the original VM, make sure you changed the disk UUID, otherwise VM won’t boot. Connect to the ESXi console and run:

# vmkfstools -J setuuid /vmfs/volumes/datastore/VM/vm.vmdk

Backup mount uses the same LUN cloning feature. LUN is cloned from a snapshot and is connected as a datastore. After an unmount LUN clone is destroyed.

Some Notes

VSC doesn’t do a good cleanup after a restore. As part of the LUN mapping to the ESXi hosts, VSC creates new igroups on the NetApp filer, which it doesn’t delete after the restore is completed.

What’s more interesting, when you restore a VM, VSC deletes .vmdks of the old VM, but leaves all the other files: .vmx, .log, .nvram, etc. in place. Instead of completely substituting VM’s folder, it creates a new folder vmname_1 and copies everything into it. So if you use VSC now and then, you will have these old folders left behind.

NetApp SnapMirror Optimization

May 31, 2013

gzipSnapMirroring to disaster recovery site requires huge amount of data to be transferred over the WAN link. In some cases replication can significantly lag from the defined schedule. There are two ways to reduce the amount of traffic and speed up replication: deduplication and compression.

If you apply deduplication to the replicated volumes, you simply reduce the amount of data you need to be transferred. You can read how to enable deduplication in my previous post.

Compression is a less known feature of SnapMirror. What it does is compression of the data being transferred on the source and decompression on the destination. Data inside the volume is left intact.

To enable SnapMirror compression you first need to make sure, that all your connections in snapmirror.conf file have names, like:


Then use ‘compression=enable’ configuration option to enable it for particular SnapMirror:

connection_name:src_vol dst_system:dst_vol compression=enable 0 2 * *

To check the compression ration after the transfer has been finished run:

> snapmirror status -l

And look at ‘Compression Ratio’ line:

Source: fas1:src
Destination: fas2:dest
Status: Transferring
Progress: 24 KB
Compression Ratio: 3.5 : 1

The one drawback of compression is an increased CPU load. Monitor your CPU load and if it’s too high, use compression selectively.