
vRealize Automation Disaster Recovery

January 14, 2018

Introduction

VMware has invested a lot of time and effort in vRealize Automation high availability. For medium and large deployment scenarios, VMware recommends using a load balancer (Citrix, F5, NSX) to distribute traffic between vRA appliances and infrastructure components, as well as database clustering (such as MS SQL availability groups) for database high availability. Additionally, in vRA 7.3 VMware added support for automatic failover of the vRA appliance’s embedded PostgreSQL database, which was a manual process prior to that.

There is a clear distinction, however, between high availability and disaster recovery. Generally speaking, HA covers redundancy within a site and is not intended to protect from a full site failure. Site Recovery Manager (or another replication product) is required to protect vRA in a DR scenario, which is described in more detail in VMware’s vRealize Suite 7.0 Disaster Recovery by Using Site Recovery Manager 6.1 document.

In my opinion, there are two important aspects missing from the aforementioned document, which I want to cover in this blog post: restoring VM UUIDs and changing vRA IP addresses. I will cover them in the order these tasks would usually be performed if you were to fail over vRA to DR:

  1. Exporting VM UUIDs
  2. Changing IP addresses
  3. Importing VM UUIDs

I will also briefly touch on how to change VM reservations, which is an important step as well, but one that is already very well covered in VMware documentation.

Note: this blog post does not provide configuration guidelines for VM replication software, such as Site Recovery Manager, Zerto or RecoverPoint, and focuses only on the DR aspects related to vRA itself. Refer to the official documentation of the corresponding products to determine how to set up VM replication to your disaster recovery site.

Exporting VM UUIDs

VMware uses two UUIDs to identify a VM. The BIOS UUID (uuid.bios in the .vmx file) was the original VM identifier and is derived from the hardware the VM is provisioned on. But it’s not unique: if a VM is cloned, the clone will have the same BIOS UUID. So a second identifier was introduced, called the Instance UUID (vc.uuid in the .vmx file), which is generated by vCenter and is unique within a single vCenter (two VMs in different vCenters can have the same Instance UUID).

When VMs are failed over, Instance UUIDs change. Compare VirtualMachine.Admin.AgentID (Instance UUID) and VirtualMachine.Admin.UUID (BIOS UUID) on original and failed over VMs.

Why does this matter? Because vRA uses Instance UUIDs to keep track of managed VMs. If the Instance UUIDs change, vRA will show the corresponding VMs as missing under Infrastructure > Managed Machines, and you won’t be able to manage them.

So it’s important to export VM Instance UUIDs before failover, which can then be used to restore the original values. This is how you can get the Instance UUID of a given VM using PowerCLI:

> (Get-VM vm_name).extensiondata.config.InstanceUUID

Here, on my GitHub page, you can find a script that I have put together to export Instance UUIDs of all VMs in CSV format.
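
A minimal sketch of such an export (not the exact script from GitHub) could look like this, assuming an existing Connect-VIServer session; the column names are my own choice and the real script’s CSV format may differ:

> Get-VM | Select-Object Name, @{N='InstanceUUID';E={$_.ExtensionData.Config.InstanceUuid}} | Export-Csv -Path vm_vc_uuids.csv -NoTypeInformation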

Changing IP addresses

Once you’ve saved the Instance UUIDs, you can move on to failover. vRA components should be started in the following order:

  1. MS SQL database
  2. vRA appliance
  3. IaaS server

If the network subnets that all components are connected to are stretched between the two sites, no additional reconfiguration is required when the VMs are brought up at DR. But usually that’s not the case and the servers need to be re-IP’ed. IaaS server network settings are changed the same way as on any other Windows server.
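
As an illustration, re-IP’ing an IaaS server from an elevated PowerShell session (best done from the VM console, since the network connection will drop mid-way) might look like the sketch below; the interface alias, addresses and DNS servers are placeholders:

> Remove-NetIPAddress -InterfaceAlias "Ethernet0" -Confirm:$false
> Remove-NetRoute -InterfaceAlias "Ethernet0" -DestinationPrefix 0.0.0.0/0 -Confirm:$false
> New-NetIPAddress -InterfaceAlias "Ethernet0" -IPAddress 10.20.1.15 -PrefixLength 24 -DefaultGateway 10.20.1.1
> Set-DnsClientServerAddress -InterfaceAlias "Ethernet0" -ServerAddresses 10.20.1.10, 10.20.1.11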

vRealize Appliance network settings are changed in the vRA appliance management interface, which can be accessed at https://vra-appliance-hostname:5480, under the Network > Address tab. The problem is that if IP addresses change at DR, it will be challenging to reach the vRA appliance over the network in the first place. To work around that, connect to the vRA VM console and run the following script from the CLI to change the appliance’s network settings:

# /opt/vmware/share/vami/vami_config_net

Don’t forget to update the DNS record for the vRA appliance. For the IaaS server it’s not needed, as long as you allow dynamic DNS (DDNS) updates.
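
If the zone doesn’t accept dynamic updates for the appliance record, a minimal sketch of updating it on a Windows DNS server (run where the DnsServer PowerShell module is available; the zone name, record name and IP address below are placeholders) could be:

> Remove-DnsServerResourceRecord -ZoneName "corp.local" -Name "vra" -RRType A -Force
> Add-DnsServerResourceRecordA -ZoneName "corp.local" -Name "vra" -IPv4Address 10.20.1.20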

Importing VM UUIDs

After the failover, all of your VMs will have a missing status in vRA. To make vRA recognize the failed-over VMs, you will need to revert the Instance UUIDs back to their original values. In PowerCLI this can be done for a single VM in the following way:

> $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
> $spec.InstanceUuid = '52da9b14-0060-dc51-4733-3b01e912edd2'
> $vm = Get-VM -Name vm_name
> $vm.Extensiondata.ReconfigVM_Task($spec)

I’ve written another script that will perform this task for you, which you can find on my GitHub page.

You will need two files to make the script work: the vm_vc_uuids.csv file you generated before, with the list of original VM Instance UUIDs, and the list of missing VMs in CSV format, which you can export from vRA after the failover on the Infrastructure > Managed Machines page.

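For reference, here is a simplified sketch of the import logic (not the exact script from GitHub, which additionally filters on the list of missing machines exported from vRA; the CSV column names assume the export sketch shown earlier):

# Restore the saved Instance UUIDs for VMs whose current value differs
$saved = Import-Csv .\vm_vc_uuids.csv
foreach ($row in $saved) {
    $vm = Get-VM -Name $row.Name -ErrorAction SilentlyContinue
    if ($vm -and $vm.ExtensionData.Config.InstanceUuid -ne $row.InstanceUUID) {
        $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
        $spec.InstanceUuid = $row.InstanceUUID
        $vm.ExtensionData.ReconfigVM_Task($spec) | Out-Null
    }
}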

Once the UUIDs are restored, run an inventory data collection from the Infrastructure > Compute Resources > Compute Resources page. vRA will discover the VMs and update their status to "On".

Updating reservations

If you try to run any Day 2 operation on a VM with the old reservation in place, you will get an error similar to this:

Error processing [Shutdown], error details:
Error getting property 'runtime.powerState' from managed object (null)
Inner Exception: Object reference not set to an instance of an object.

To manually update a VM’s reservation, hover over the VM on the Infrastructure > Managed Machines page and select Change Reservation.

This process is obviously not scalable, as it can take hours if you have hundreds of VMs. VMware offers an alternative solution that lets you update all VMs using the Bulk Import feature, available from Infrastructure > Administration > Bulk Imports. The idea is that you export all VM configuration details to a CSV file, update the compute and storage reservation columns, and import the file back into vRA. vRealize Suite 7.0 Disaster Recovery by Using Site Recovery Manager 6.1 gives very detailed instructions on how to do that in the "Bulk Import, Update, or Migrate Virtual Machines" section.
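
If you have hundreds of machines, the exported CSV can also be edited in bulk with PowerShell rather than by hand. A minimal sketch, assuming the export is saved as BulkImport.csv with a header row and a reservation column named Reservation (hypothetical names; check the actual format of your export, it may differ):

# Point every machine at the DR reservation (placeholder reservation name)
$rows = Import-Csv .\BulkImport.csv
foreach ($row in $rows) {
    $row.Reservation = 'DR-Reservation-01'
}
$rows | Export-Csv .\BulkImport-updated.csv -NoTypeInformation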

Conclusion

I hope this blog post helped to cover some gaps in the VMware documentation. If you have any questions or comments, as always, feel free to leave them in the comments section below.


Error When Deploying VCSA or PSC

October 31, 2017

Recently, when helping a customer deploy a new greenfield VMware 6.5 environment, I ran into an issue where a brand new vCenter Server Appliance and Platform Services Controller 6.5 build 5973321 failed to deploy to an ESXi host build 5969303.

Stage 1 (install) of the deployment completes successfully. In Stage 2 (setup), the VCSA installer, both for vCenter and PSC, first shows a prompt asking for credentials.

PSC Issue Description

After providing credentials, when installing an external PSC, installation fails with the following error:

Error:
Unable to connect to vCenter Single Sign-On: Failed to connect to SSO; uri:https://psc-hostname/sts/STSService/vsphere.local
Failed to register vAPI Endpoint Service with CM
Failed to configure vAPI Endpoint Service at the firstboot time

Resolution:
Please file a bug against VAPI

The installation wizard then shows the following error:

Failure:
A problem occurred during setup. Refresh this page and try again.

A problem occurred during setup. Services might not be working as expected.

A problem occurred while – Starting VMware vAPI Endpoint…

The appliance console shows the following error:

Failed to start services. Firstboot Error.

Alternatively, the PSC can fail with the following error:

Error:
Unexpected failure: }
Failed to register vAPI Endpoint Service with CM
Failed to configure vAPI Endpoint Service at the firstboot time

Resolution:
Please file a bug against VAPI

VCSA Issue Description

After providing credentials, when installing vCenter with embedded PSC, installation fails with the following error:

Error:
Unable to start the Service Control Agent.

Resolution:
Search for these symptoms in the VMware knowledge base for any known issues and possible workarounds. If none can be found, collect a support bundle and open a support request.

The installation wizard then shows the following error:

Failure:
A problem occurred during setup. Refresh this page and try again.

A problem occurred during setup. Services might not be working as expected.

A problem occurred while – Starting VMware Service Control Agent…

The appliance console shows the same error.

Alternatively, the VCSA can fail with the following error:

Error:
Encountered an internal error.
Traceback (most recent call last):
  File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 1852, in main
    vmidentityFB.boot()
  File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 359, in boot
    self.checkSTS(self.__stsRetryCount, self.__stsRetryInterval)
  File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 1406, in checkSTS
    raise Exception('Failed to initialize Secure Token Server.')
Exception: Failed to initialize Secure Token Server.

Resolution:
This is an unrecoverable error, please retry install. If you run into this error again, please collect a support bundle and open a support request.

Issue Workaround

This issue happens when a VCSA or PSC installation has been cancelled and is then attempted a second time on the same ESXi host.

The identified workaround is to use another ESXi host, one that has never had a PSC or VCSA deployed to it.

Issue Resolution

VMware is aware of the bug and is working on a resolution.

Overview of NetApp Replication and HA features

August 9, 2013

NetApp has quite a few features related to replication and clustering:

  • HA pairs (including mirrored HA pairs)
  • Aggregate mirroring with SyncMirror
  • MetroCluster (Fabric and Stretched)
  • SnapMirror (Sync, Semi-Sync, Async)

It’s easy to get lost here, so let’s try to understand what goes where.


SnapMirror

SnapMirror is volume-level replication, which normally works over an IP network (SnapMirror can work over FC, but only with FC-VI cards, and that is not widely used).

The asynchronous version of SnapMirror replicates data according to a schedule. SnapMirror Sync uses NVLOGM shipping (described briefly in my previous post) to synchronously replicate data between two storage systems. SnapMirror Semi-Sync is in between and synchronizes writes at the Consistency Point (CP) level.

SnapMirror provides protection from data corruption inside a volume. But with SnapMirror you don’t have automatic failover of any sort: you need to break the SnapMirror relationship and present the data to clients manually, then resynchronize the volumes when the problem is fixed.

SyncMirror

SyncMirror mirrors aggregates and works at the RAID level. You can configure mirroring between two shelves of the same system and prevent an outage in case of a shelf failure.

SyncMirror uses the concept of plexes to describe mirrored copies of data. You have two plexes: plex0 and plex1. Each plex consists of disks from a separate pool: pool0 or pool1. Disks are assigned to pools depending on cabling. Disks in each of the pools must be in separate shelves to ensure high availability. Once the shelves are cabled, you enable SyncMirror and create a mirrored aggregate using the following syntax:

> aggr create aggr_name -m -d disk-list -d disk-list

HA Pair

An HA pair is basically two controllers, both of which have connections to their own and their partner’s shelves. When one of the controllers fails, the other one takes over. This is called Cluster Failover (CFO). Controller NVRAMs are mirrored over the NVRAM interconnect link, so even data which hasn’t been committed to disk isn’t lost.

MetroCluster

MetroCluster provides failover at the storage system level. It uses the same SyncMirror feature beneath it to mirror data between two storage systems (instead of two shelves of the same system, as in a pure SyncMirror implementation). Now even if a storage controller fails together with all of its storage, you are safe: the other system takes over and continues to service requests.

An HA pair, by contrast, can’t fail over when a disk shelf fails, because the partner doesn’t have a copy of the data to serve requests from.

Mirrored HA Pair

You can think of a Mirrored HA Pair as an HA pair with SyncMirror running between the two systems. You can implement almost the same configuration on an HA pair with SyncMirror inside (not between) the systems, because the odds of a whole storage system (controller plus shelves) going down are very low. But mirroring between two systems can give you more peace of mind.

A Mirrored HA Pair cannot fail over like MetroCluster when one of the storage systems goes down; the whole process is manual. The reasonable question here is: why can’t it fail over if it has a copy of all the data? Because MetroCluster is separate functionality, which performs all the checks and carries out the cutover to the mirror. This is called Cluster Failover on Disaster (CFOD). SyncMirror is only a mirroring facility and doesn’t even know that the cluster exists.


VMware ESXi Core Processes

July 12, 2013

There is not much information on VMware ESXi architecture out there. There is an old whitepaper, “The Architecture of VMware ESXi”, which dates back to 2007. In fact, from the management perspective there are only two OS processes critical to ESXi management: hostd (vmware-hostd in ESXi 4) and vpxa (vmware-vpxa in ESXi 4), which are called the “Management Agents”.

The hostd process is a communication layer between the vmkernel and the outside world. It invokes all management operations on VMs, storage, network, etc. by talking directly to the OS kernel. If hostd dies, you won’t be able to connect to the host from either vCenter or the VI client, but you will still have the option to connect directly to the host console via SSH.

vpxa is an intermediate layer between vCenter and hostd and is called the “vCenter Server Agent”. If you have a communication failure between vCenter and an ESXi host, it’s the first thing to blame.

Say you have a storage LUN failure and hostd can’t release the datastore. You can restart hostd, or both of the processes, using these scripts:

# vmware-hostd restart
# vmware-vpxa restart

DFS Replication Troubleshooting

June 25, 2013

The DFS Replication service doesn’t give you much information on how it’s replicating. It’s good to know some general commands to troubleshoot communication and data transfer issues.

Useful Commands

In Windows Server 2008 a new command was introduced to check what DFSR is doing at the moment. You won’t find it in Windows Server 2003:

> dfsrdiag replicationstate

If the replication link isn’t healthy, you get lots of files in the backlog. To check whether you have a backlog, run:

> dfsrdiag backlog /rgname:rgroup_name /rfname:folder_name /sendingmember:sending_server /receivingmember:receiving_server

If there are heaps of files in the backlog, the best way to find the reason is to check the logs. DFSR logs are located in C:\Windows\debug. To get the most verbose information, change the log severity level:

> wmic /namespace:\\root\microsoftdfs path dfsrmachineconfig set debuglogseverity=5

DFSR uses GUIDs to identify the replicated files, which look like AC759213-00AF-4578-9C6E-EA0764FDC9AC. To get meaningful data from a GUID, use:

> dfsrdiag guid2name /guid:guid_identifier /rgname:group_name

There is one more command which allows you to find the exact path to the file in question. Feed it the uid field from the DFSR debug log, which looks like {9EBE0A27-8AA9-4263-B942-DA9A92F30671}-v240880:

> wmic.exe /namespace:\\root\microsoftdfs path dfsridrecordinfo.Uid="uid_identifier" call getfullfilepath

Sample Errors

1. When replicating between Windows Server 2008 R2 and Windows Server 2003 R2, the source reports “Ghosting is not enabled” and the destination reports “A failure was reported by the remote partner”.

I solved this error by applying the following patch: KB2462352. The reason for the issue is incompatibilities between protocol implementations.

2. The following error pops up in logs: “The system cannot find the file specified”.

The solution is described in KB951010. In Windows Server 2003, the ConflictAndDeleted folder sometimes fills up above the 660MB quota and the ConflictAndDeletedManifest.xml file may get corrupted. To solve the problem, you need to clean up the folder and delete the file by issuing:

> wmic /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo where "replicatedfolderguid='<GUID>'" call cleanupconflictdirectory

To get the GUIDs of replicated folders run:

> wmic /namespace:\\root\microsoftdfs path dfsrreplicatedfolderconfig get replicatedfolderguid,replicatedfoldername

3. Near-100% CPU usage, and the same error is written millions of times to the log files: “Failed to create stage file for GVSN gvsn_identifier”.

I solved this issue by looking for the file specified by gvsn_identifier, which looks like {2ED37126-12C7-4617-AE6B-34509F467FEB}-v20748, and deleting it. These files are located in the staging folder.

Other Helpful Tools

You can create a Health Report from the DFS Management console to see how many files have been transferred between replication members since the DFS service started, and whether there are any DFS errors in the members’ event logs.

You can also use the DFSRMon tool, but I personally don’t find it very useful.

Replacing hard drives in a NetApp aggregate

May 30, 2013

NetApp uses certain rules to assign hot spares in case of a failure. It always tries to use an exact match, but if one isn’t available, the best available spare is used. “The best” means that if you have an aggregate which consists of 1TB hard drives and you only have a 2TB spare left, this 2TB spare will be downsized to 1TB and used as a data disk. After that, when you receive a correct-size replacement from NetApp, you need to exchange the downsized 2TB hard drive for the delivered 1TB spare. To accomplish that, use the following command:

> disk replace start disk_name spare_disk_name

It will take a considerable amount of time to copy the data. In my case it was 6.5 hours for a 1TB drive.

When the process finishes, the replaced drive becomes a new spare. It’s wise to zero it out right away, so that it can easily be used again as a spare. Otherwise, when the time comes, you’ll be waiting hours before it can be added in place of a failed drive:

> disk zero spares

As a side note, I want to mention that you cannot take disks out of a RAID group. There is no way to shrink aggregates. The only thing you can do is replace a hard drive with another one.