Posts Tagged ‘SRM’

vRealize Automation Disaster Recovery

January 14, 2018

Introduction

VMware has invested a lot of time and effort in vRealize Automation high availability. For medium and large deployment scenarios VMware recommends using a load balancer (Citrix, F5, NSX) to distribute traffic between vRA appliance and infrastructure components, as well as database clustering (such as MS SQL availability groups) for database high availability. Additionally, in vRA 7.3 VMware added support for automatic failover of vRA appliance’s embedded PostgreSQL database, which was a manual process prior to that.

There is a clear distinction, however, between high availability and disaster recovery. Generally speaking, HA covers redundancy within the site and is not intended to protect from full site failure. Site Recovery Manager (or another replication product) is required to protect vRA in a DR scenario, which is described in more detail in the following document:

In my opinion, there are two important aspects that are missing from the aforementioned document, which I want to cover in this blog post: restoring VM UUIDs and changing vRA IP address. I will cover them in the order that these tasks would usually be performed if you were to fail over vRA to DR:

  1. Exporting VM UUIDs
  2. Changing IP addresses
  3. Importing VM UUIDs

I will also only touch on how to change VM reservations. Which is also an important step, but very well covered in VMware documentation already.

Note: this blog post does not provide configuration guidelines for VM replication software, such as Site Recovery Manager, Zerto or RecoverPoint and is focused only on DR aspects related to vRA itself. Refer to official documentation of corresponding products to determine how to set up VM replication to your disaster recovery site.

Exporting VM UUIDs

VMware uses two UUIDs to identify a VM. BIOS UUID (uuid.bios in .vmx file) was the original VM identifier implemented to identify a VM and is derived from the hardware VM is provisioned on. But it’s not unique. If VM is cloned, the clone will have the same BIOS UUID. So the second identifier was introduced called Instance UUID (vc.uuid in .vmx file), which is generated by vCenter and is unique within a single vCenter (two VMs in different vCenters can have the same Instance UUID).

When VMs are failed over, Instance UUIDs change. Compare VirtualMachine.Admin.AgentID (Instance UUID) and VirtualMachine.Admin.UUID (BIOS UUID) on original and failed over VMs.

Why does this matter? Because vRA uses Instance UUIDs to keep track of managed VMs.  If Instance UUIDs change, vRA will show the corresponding VMs as missing under Infrastructure > Managed Machines. And you won’t be able to manage them.

So it’s important to export VM Instance UUIDs before failover, which can then be used to restore the original values. This is how you can get the Instance UUID of a given VM using PowerCLI:

> (Get-VM vm_name).extensiondata.config.InstanceUUID

Here, on my GitHub page, you can find a script that I have put together to export Instance UUIDs of all VMs in CSV format.

Changing IP addresses

Once you’ve saved the Instance UUIDs, you can move on to failover. vRA components should be started in the following order:

  1. MS SQL database
  2. vRA appliance
  3. IaaS server

If network subnets, that all components are connected to, are stretched between two sites, when VMs are brought up at DR, there are no additional reconfiguration required. But usually it’s not the case and servers need to be re-IP’ed. IaaS server network setting are changed the same as on any other Windows server machine.

vRealize Appliance network settings are changed in vRA appliance management interface, that can be accessed at https://vra-appliance-hostname:5480, under Network > Address tab. The problem is, if IP addresses change at DR, it will be challenging to reach vRA appliance over the network. To work around that, connect to vRA VM console and run the following script from CLI to change appliance’s network settings:

# /opt/vmware/share/vami/vami_config_net

Don’t forget to update the DNS record for vRA appliance in DNS. For IaaS server it’s not needed, as long as you allow Dynamic DNS (DDNS) updates.

Importing VM UUIDs

After the failover all of your VMs will have missing status in vRA. To make vRA recognize failed over VMs you will need to revert Instance UUIDs back to the original values. In PowerCLI this can be done in the following way:

> $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
> $spec.instanceUuid = ’52da9b14-0060-dc51-4733-3b01e912edd2′
> $vm = Get-VM -Name vm_name
> $vm.Extensiondata.ReconfigVM_Task($spec)

I’ve written another script, that will perform this task for you, which you can find on my GitHub page.

You will need two files to make the script work. The vm_vc_uuids.csv file you generated before, with the list of original VM Instance UUIDs. As well as the list of missing VMs in CSV format, that you can export from vRA after the failover on the Infrastructure > Managed Machines page:

This is an example of the script command line options and the output:

You will need to run an inventory data collection from the Infrastructure > Compute Resources > Compute Resources page. vRA will discover VMs and update their status to “On”.

Updating reservations

If you try to run any Day 2 operation on a VM with the old reservation in place, you will get an error similar to this:

Error processing [Shutdown], error details:
Error getting property ‘runtime.powerState’ from managed object (null)
Inner Exception: Object reference not set to an instance of an object.

To manually update VM reservation, on Infrastructure > Managed Machines page hover over the VM and select Change Reservation:

This process is obviously not scalable, as it can take hours, if you have hundreds of VMs. VMware offers an alternative solution that lets you update all VMs by using Bulk Import feature available from Infrastructure > Administration > Bulk Imports. The idea is that you can export all VM configuration details in a CSV file, update compute and storage reservation columns and import back to vRA. vRealize Suite 7.0 Disaster Recovery by Using Site Recovery Manager 6.1 gives very detailed instruction on how to do that in “Bulk Import, Update, or Migrate Virtual Machines” section.

Conclusion

I hope this blog post helped to cover some gaps in VMware documentation. If you have any questions or comments, as always, feel free to leave them in the comments sections below.

References

Advertisement

Zerto Overview

March 6, 2014

zerto-logoZerto is a VM replication product which works on a hypervisor level. In contrast to array level replication, which SRM has been using for a long time, it eliminates storage array from the equation and all the complexities which used to come along with it (SRAs, splitting the LUNs for replicated and non-replicated VMs, potential incompatibilities between the orchestrated components, etc).

Basic Operation

Zerto consists of two components: ZVM (Zerto Virtual Manger) and VRA (Virtual Replication Appliance). VRAs are VMs that need to be installed on each ESXi host within the vCenter environment (performed in automated fashion from within ZVM console). ZVM manages VRAs and all the replication settings and is installed one per vCenter. VRA mirrors protected VMs I/O operations to the recovery site. VMs are grouped in VPGs (Virtual Protection Groups), which can be used as a consistency group or just a container.

Protected VMs can be preseeded  to DR site. But what Zerto essentially does is it replicates VM disks to any datastore on recovery site where you point it to and then tracks changes in what is called a journal volume. Journal is created for each VM and is kept as a VMDK within the “ZeRTO volumes” folder on a target datastore. Every few seconds Zerto creates checkpoints on a journal, which serve as crash consistent recovery points. So you can recover to any point in time, with a few seconds granularity. You can set the journal length in hours, depending on how far you potentially would want to go back. It can be anywhere between 1 and 120 hours.Data-Replication-over-WAN

VMs are kept unregistered from vCenter on DR site and VM configuration data is kept in Zerto repository. Which essentially means that if an outage happens and something goes really wrong and Zerto fails to bring up VMs on DR site you will need to recreate VMs manually. But since VMDKs themselves are kept in original format you will still be able to attach them to VMs and power them on.

Failover Scenarios

There are four failover scenarios within Zerto:

  • Move Operation – VMs are shut down on production site, unregistered from inventory, powered on at DR site and protection is reversed if you decide to do so. If you choose not to reverse protection, VMs are completely removed from production site and VPG is marked as “Needs Configuration”. This scenario can be seen as a planned migration of VMs between the sites and needs both sites to be healthy and operational.
  • Failover Operation – is used in disaster scenario when production site might be unavailable. In this case Zerto brings up protected VMs on DR site, but it does not try to remove VMs from production site inventory and leave them as is. If production site is still accessible you can optionally select to shutdown VMs. You cannot automatically reverse protection in this scenario, VPG is marked as “Needs Configuration” and can be activated later. And when it is activated, Zerto does all the clean up operations on the former production site: shuts down VMs (if they haven’t been already), unregister them from inventory and move to VRA folder on the datastore.
  • Failover Test Operation – this is for failover testing and brings up VMs on DR site in a configured bubble network which is normally not uplinked to any physical network. VMs continue to run on both sites. Note that VMs disk files in this scenario are not moved to VMs folders (as in two previous scenarios) and are just connected from VRA VM folder. You would also notice that Zerto created second journal volume which is called “scratch” journal. Changes to the VM that is running on DR site are saved to this journal while it’s being tested.
  • Clone Operation – VMs are cloned on DR site and connected to network. VMs are not automatically powered on to prevent potential network conflicts. This can be used for instance in DR site testing, when you want to check actual networking connectivity, instead of connecting VMs to an isolated network. Or for implementing backups, cloned environment for applications testing, etc.

Zerto Journal Sizing

By default journal history is configured as 4 hours and journal size is unlimited. Depending on data change rate within the VM journal can be smaller or larger. 15GB is approximately enough storage to support a virtual machine with 1TB of storage, assuming a 10% change rate per day with four hours of journal history saved. Zerto has a Journal Sizing Tool which helps to size journals. You can create a separate journal datastore as well.

Zerto compared to VMware Replication and SRM

There are several replication products in the market from VMware. Standalone VMware replication, VMware replication + SRM orchestraion and SRM array-based replication. If you want to know more on how they compare to Zerto, you can read the articles mentioned in references below. One apparent Zerto advantage, which I want to mention here, is integration with vCloud Director, which is essential for cloud providers who offer DRaaS solutions. SRM has no vCloud Director support.

References