Posts Tagged ‘error’

HCX Perftest Issue

May 9, 2021

Introduction

VMware HCX is a great tool, which simplifies VM migrations between on-prem to on-prem or on-prem to cloud at scale. I’ve worked with many different VM migration tools before and what I particularly like about HCX is it’s ability to stretch network subnets between source and destination environments. It reduces (or completely removes) the need to re-IP VMs, which simplifies the migration and reduces the risk of inadvertently introducing issues into migrated applications.

Perftest Tool

HCX is a complex set of technologies and getting initial deployment right is key to building a reliable migration fabric. Perftest is a CLI tool available on interconnect (IX) and network extension (NE) HCX appliances, which allows you to perform validation testing to ensure everything is functioning correctly, as well as provide you a performance baseline. To run this tool you will need to SSH into HCX Manager, enter CCLI and then go to one of your IX or NE appliances:

# ccli
# list
# go 0
# perftest all

Issue Description

There is one issue you can come across, when running perftest, where it partially completes with the following errors:

Message Error: map[string]interface {}{“grpc_code”:14, “http_code”:503, “http_status”:”Service Unavailable”, “message”:”rpc error: code = Unavailable desc = transport is closing”}

and

Internal failure happens. Err: http.Post(https://appliance_ip:9443/perftest/stoptest) return statusCode: 503

Solution

The reason for this error is blocked connectivity on port TCP/4500. HCX uses ports UDP/4500 and UDP/500 for establishing tunnels between IX and NE appliance pairs, but that’s not enough for perftest.

In the very beginning of the perftest it gives you a hint, but it’s easy to overlook. This requirement is not well documented (at least at the time of writing), so keep that in mind next time you deploy HCX.

Advertisement

Joining ESXi to AD in Disjoint Namespace

November 4, 2019

What is Disjoint Namespace?

Typically, when using Microsoft Active Directory you use AD-integrated DNS and your AD domain name matches you DNS domain name, but you don’t have to. This is quite rare, but I’ve seen cases where the two don’t match. For example, you might have a Linux-based DNS, where you register an esx01.example.com DNS record for your ESXi host and then you join it to an Active Directory domain called corp.local.

That’s called a disjoint namespace. You can read this Microsoft article if you want to know more details: Disjoint Namespace.

In my personal opinion, using a disjoint namespace is asking for trouble, but it will still work if you really want to use it.

Problem

If you end up going down that route, there’s one caveat you should be aware of. When you joining a machine to AD, among other things, it needs to populate DNS name field property of the AD computer object. This is an example of ESXi computer object in Active Directory Users and Computers snap-in:

If you configure example.com domain in your ESXi Default TCP/IP stack, like so:

And then you try to, for example, join your ESXi host to corp.local AD domain, it will attempt to use esx-01a.example.com for computer object DNS name field. If you’re using a domain account with privileges restricted only to domain join, this operation will fail.

This is how the problem manifested itself in my case in ESXi host logs:

Failed to run provider specific request (request code = 8, provider = ‘lsa-activedirectory-provider’) -> error = 40315, symbol = LW_ERROR_LDAP_CONSTRAINT_VIOLATION, client pid = 2099303

If you’re using host profiles to join ESXi host to the domain, remediation will fail and you will see the following in /var/log/syslog.log:

WARNING: Domain join failed; retry count 1.

WARNING: Domain join failed; retry count 2.

Likewise (ActiveDirectory) Domain Join operation failed while joining new domain via username and password..

Note: this problem is specific to joining domain using a restricted service account. If you use domain administrator account, it will force the controller to add the computer object with a DNS name, which doesn’t match the AD name.

Solution

Make sure ESXi domain name setting matches the Active Directory domain name, not DNS domain name. You can still use the esx-01a.example.com record to add the ESXi hosts to vCenter, but you have to specify corp.local domain in DNS settings (or leave it blank), because this is what is going to be used to add the host to AD, like so:

This way your domain controller will be happy and ESXi host will successfully join the domain.

Additional Notes

While troubleshooting this issue I saw a few errors in ESXi host logs, which were a distraction, ignore them, as they don’t constitute an error.

This just means that the ESXi host Active Directory service is running, but host is not joined to a domain yet:

lsass: Failed to run provider specific request (request code = 12, provider = ‘lsa-activedirectory-provider’) -> error = 2692, symbol = NERR_SetupNotJoined, client pid = 2111366

IPC is inter-process communication. Likewise consists of multiple services that talk to each other. They open and close connections, this is normal:

lsass-ipc: (assoc:0x8ed7e40) Dropping: Connection closed by peer

I also found this command to be useful for deeper packet inspection between an ESXi host and AD domain controllers:

tcpdump-uw -i vmk0 not port 22 and not arp

References

Host Profile Customization Issue

November 1, 2019

vSphere Host Profiles is a great feature for consistent ESXi host configuration and compliance checks, but can at times be flaky.

I’ve noticed an issue recently with Host Profiles in vSphere 6.7, where after providing host customization values the following error is shown in vSphere Web Client:

The “Update host customizations” operation failed for the entity with the following error message.

Host settings validation failed.

This is how the error message looks like in the client:

Even though it’s a bit annoying, I found it to be a furphy. Customizations are actually saved successfully and the error can be ignored. You can find the following messages in ESXi host’s /var/log/syslog.log file, which confirm that it works:


INFO: Execute completed
INFO: Validating AnswerFile Status1 = success
INFO: Cleaned up Host Configuration
INFO: GetAnswerFile completed

I’ve also found that this error doesn’t appear when you provide host customization values first time straight after attaching a profile to the host. Only when you update them. It also doesn’t show up in HTML5, only Web Client. I guess, one more reason to switch to HTML5.

Hope this blog post helps someone who searched in Google, but couldn’t find any information related to this error message.

Unable to Delete vCenter Endpoint in vRealize Automation

December 7, 2018

vRealize Automation Error

More than once in my experience I’ve had a need to delete an endpoint in vRealize Automation. Maybe configuration has changed or you simply made a typo in vCenter hostname or credentials. Once you’ve specified vCenter address and saved the endpoint you can no longer change it (only delete and re-add).

But even when you try to delete it, you will get an error something along the lines of:

You cannot delete this endpoint because 1 compute resources and 0 storage paths use it.

CloudClient Error

There is a KB article that walks you through the process of how to do that using a special tool called CloudClient: Error “This endpoint is being used by # compute resources and # storage paths and cannot be deleted” when you attempt to delete an endpoint in vRA 7.x (2150548)

But even that approach not always work. When you run this command from the KB article “vra computeresource inactive list” you may get the following error:

Error: Something went wrong while processing your request. Please check the application logs for details.

Solution

There is almost no mention of this second error on the Internet and I can see how someone can keep banging his head trying to solve it, so I thought I’d share a solution here. And it’s simple – open a GSS ticket. They can delete the endpoint for you. If you see this error, there’s no other way that I know of to solve this problem without involving GSS.

Clean-up

You can see an error similar to the following in vRA logs if you didn’t stop proxy agents before deleting the endpoint:

Error processing ping response
Error occurred while executing stored proc usp_InsertUpdateHost The INSERT statement conflicted with the FOREIGN KEY constraint “FK_ManagementEndpoint_Host”. The conflict occurred in database “vRa_IaaS”, table “dbo.ManagementEndpoints”, column ‘ManagementEndpointID’.
The statement has been terminated.
Inner Exception: The INSERT statement conflicted with the FOREIGN KEY constraint “FK_ManagementEndpoint_Host”. The conflict occurred in database “vRa_IaaS”, table “dbo.ManagementEndpoints”, column ‘ManagementEndpointID’.
The statement has been terminated.

All you need to do to get rid of it is restart your proxy agents.

Conclusion

Hope this post saves someone the hassle of hours searching for the answer in blogs and forums.

Error When Deploying VCSA or PSC

October 31, 2017

Recently when helping a customer to deploy a new greenfield VMware 6.5 environment I ran into an issue where brand new vCenter Server Appliance and Platform Service Controller 6.5 build 5973321 fail to deploy to an ESXi host build 5969303.

Stage 1 (install) of the deployment completes successfully. In Stage 2 (setup) VCSA installer both for vCenter and PSC first shows a prompt asking for credentials.

PSC Issue Description

After providing credentials, when installing an external PSC, installation fails with the following error:

Error:
Unable to connect to vCenter Single Sign-On: Failed to connect to SSO; uri:https://psc-hostname/sts/STSService/vsphere.local
Failed to register vAPI Endpoint Service with CM
Failed to configure vAPI Endpoint Service at the firstboot time

Resolution:
Please file a bug against VAPI

Installation wizard shows the following resulting error:

Failure:
A problem occurred during setup. Refresh this page and try again.

A problem occurred during setup. Services might not be working as expected.

A problem occurred while – Starting VMware vAPI Endpoint…

Appliance shows the following error in console:

Failed to start services. Firstboot Error.

Alternatively PSC can fail with the following error:

Error:
Unexpected failure: }
Failed to register vAPI Endpoint Service with CM
Failed to configure vAPI Endpoint Service at the firstboot time

Resolution:
Please file a bug against VAPI

VCSA Issue Description

After providing credentials, when installing vCenter with embedded PSC, installation fails with the following error:

Error:
Unable to start the Service Control Agent.

Resolution:
Search for these symptoms in the VMware knowledge base for any known issues and possible workarounds. If none can be found, collect a support bundle and open a support request.

Installation wizard shows the following resulting error:

Failure:
A problem occurred during setup. Refresh this page and try again.

A problem occurred during setup. Services might not be working as expected.

A problem occurred while – Starting VMware Service Control Agent…

Appliance shows the same error in console.

Alternatively VCSA can fail with the following error:

Error:
Encountered an internal error. Traceback (most recent call last): File “/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py”, line 1852, in main vmidentityFB.boot() File “/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py”, line 359, in boot self.checkSTS(self.__stsRetryCount, self.__stsRetryInterval) File “/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py”, line 1406, in checkSTS raise Exception(‘Failed to initialize Secure Token Server.’) Exception: Failed to initialize Secure Token Server.

Resolution:
This is an unrecoverable error, please retry install. If you run into this error again, please collect a support bundle and open a support request.

Issue Workaround

This issue happens when VCSA or PSC installation was cancelled and is attempted for the second time to the same ESXi host.

Identified workaround for this issue is to use another ESXi host, which has never been used to deploy PSC or VCSA to.

Issue Resolution

VMware is aware of the bug and working on the resolution.

Force10 and vSphere vDS Interoperability Issue

June 10, 2016

dell-force10Recently I had an opportunity to work with Dell FX2 platform from the design and delivery point of view. I was deploying a FX2s chassis with FC630 blades and FN410S 10Gb I/O aggregators.

I ran into an interesting interoperability glitch between Force10 and vSphere distributed switch when using LLDP. LLDP is an equivalent of Cisco CDP, but is an open standard. And it allows vSphere administrators to determine which physical switch port a given vSphere distributed switch uplink is connected to. If you enable both Listen and Advertise modes, network administrators can get similar visibility, but from the physical switch side.

In my scenario, when LLDP was enabled on a vSphere distributed switch, uplinks on all ESXi hosts started disconnecting and connecting back intermittently, with log errors similar to this:

Lost uplink redundancy on DVPorts: “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”. Physical NIC vmnic1 is down.

Network connectivity restored on DVPorts: “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”. Physical NIC vmnic1 is up

Uplink redundancy restored on DVPorts: “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”. Physical NIC vmnic1 is up

Issue Troubleshooting

FX2 I/O aggregator logs were reviewed for potential errors and the following log entries were found:

%STKUNIT0-M:CP %DIFFSERV-5-DSM_DCBX_PFC_PARAMETERS_MISMATCH: PFC Parameters MISMATCH on interface: Te 0/2

%STKUNIT0-M:CP %IFMGR-5-OSTATE_DN: Changed interface state to down: Te 0/2

%STKUNIT0-M:CP %IFMGR-5-OSTATE_UP: Changed interface state to up: Te 0/2

This clearly looks like some DCB negotiation issue between Force10 and the vSphere distributed switch.

Root Cause

Priority Flow Control (PFC) is one of the protocols from the Data Center Bridging (DCB) family. DCB was purposely built for converged network environments where you use 10Gb links for both Ethernet and FC traffic in the form of FCoE. In such scenario, PFC can pause Ethernet frames when FC is not having enough bandwidth and that way prioritise the latency sensitive storage traffic.

In my case NIC ports on Qlogic 57840 adaptors were used for 10Gb Ethernet and iSCSI and not FCoE (which is very uncommon unless you’re using Cisco UCS blade chassis). So the question is, why Force10 switches were trying to negotiate FCoE? And what did it have to do with enabling LLDP on the vDS?

The answer is simple. LLDP not only advertises the port numbers, but also the port capabilities. Data Center Bridging Exchange Protocol (DCBX) uses LLDP when conveying capabilities and configuration of FCoE features between neighbours. This is why enabling LLDP on the vDS triggered this. When Force10 switches determined that vDS uplinks were CNA adaptors (which was in fact true, I was just not using FCoE) it started to negotiate FCoE using DCBX. Which didn’t really go well.

Solution

The easiest solution to this problem is to disable DCB on the Force10 switches using the following command:

# conf t
# no dcb enable

Alternatively you can try and disable FCoE from the ESXi end by using the following commands from the host CLI:

# esxcli fcoe nic list
# esxcli fcoe nic disable -n vmnic0

Once FCoE has been disabled on all NICs, run the following command and you should get an empty list:

# esxcli fcoe adapter list

Conclusion

It is still not clear why PFC mismatch would cause vDS uplinks to start flapping. If switch cannot establish a FCoE connection it should just ignore it. Doesn’t seem to be the case on Force10. So if you run into a similar issue, simply disable DCB on the switches and it should fix it.

RecoverPoint VE: Common Deployment Issues

April 19, 2016

fixIn one of my previous posts I discussed iSCSI connectivity considerations when deploying RecoverPoint VE. In this post I want to describe common issues you may encounter when deploying RecoverPoint clusters, most of which are applicable to both physical appliance and virtual editions.

VNX MirrorView ports

I already touched on that briefly in my previous post. But it’s worth mentioning again that you can NOT use MirrorView ports for iSCSI connectivity between RPAs and VNX arrays. When you try to use a MirrorView iSCSI port for RecoverPoint, it gets upset and doesn’t communicate with the array.

If you make a mistake of connecting one port per SP and this port is a MirrorView port, you will have no communication with the array at all and get the following error in Unisphere for RecoverPoint:

Error Splitter ARRAYNAME-A is down
Error Splitter ARRAYNAME-B is down

splitter_error

If you connect two ports per SP, one of which is MirrorView port and use two iSCSI network subnets you may get the following error when running a SAN connectivity test from the RPA boxmgmt interface. In this case RPA can communicate with the array only over one subnet:

On array ABCD1234567890, all paths for device with UID=0x1234567890abcdef go through RPA Ethernet port eth2 …

multipathing_issue

The solution is as simple as moving the link from port 0 to port 1 on a 10Gb I/O module. And from port 0 to port 1,2 or 3 on a 1Gb I/O module.

If you don’t want to lose two iSCSI ports (1 per SP), especially if it’s 10Gb, and you’re not using MirrorView, you can uninstall MirrorView enabler from the array. Just keep in mind that it will require an array reboot. Service processors will be rebooted one at a time, so there is no downtime. But if it’s a heavily used storage array it’s recommended to schedule uninstallation out of hours to minimize the impact.

Error when redeploying a cluster

If you’ve made configuration mistakes while deploying a RecoverPoint cluster and want to blow the whole thing away and redeploy it from scratch you may encounter the following error when deploying for the second time:

VNX path set with IP 10.10.10.1 already exists in a different path set (RP_0x123abc456def789g_0_iSCSI1)

rpa_redeploy

The cause of the issue is iSCSI sessions which stayed on the VNX after you deleted RPA VMs. You need to connect to the VNX and delete them in Unisphere manually by right-clicking on the storage array name on the dashboard and selecting iSCSI > Connections Between Storage Systems. This is what duplicate sessions look like:

duplicate_rp

As you can see there’re three sets of RecoverPoint cluster iSCSI connections after three unsuccessful attempts.

You will need to delete old sessions before you are able to proceed with the deployment in RecoverPoint Deployment Manager.

Wrong initiator names

I’ve seen this on multiple occasions when RecoverPoint registers initiators on VNX with inconsistent hostnames.

As you’ve seen on the screenshots above, hostname field of every initiator consists of the cluster ID and RPA ID (not sure what the third field means), such as this:

RP_0x123abc456def789g_1_0

In this example you can see that RPA1 has two hostnames with suffixes _0_0 and _1_0.

wrong_initiators

This issue is purely cosmetic and doesn’t affect RecoverPoint operation, but if you want to fix it you will need to restart Management Servers on VNX service processors. It’s a non-disruptive procedure and can be performed by opening the following link http://SP_IP/setup and clicking on “Restart Management Server” button.

After a restart, array will update hostnames to reflect the actual configuration.

Joining two clusters with the licences already applied

This is just not going to work. Make sure to join production and DR clusters before applying RecoverPoint licences or Deployment Manager “Connect Cluster” wizard will fail.

It’s one of the prerequisites specified in RecoverPoint “Installation and Deployment Guide”:

If you plan to connect the new cluster immediately after preparing it for connection,
ensure:

  • You do not install a license in, or modify the settings of, the new cluster before
    connecting it to the existing system.

Conclusion

There’re always much more things that can potentially go wrong. But if any of the above helped you to solve your RecoverPoint deployment issues make sure to let me know in the comments below!

Issue Joining VNX1 and VNX2 Unisphere Domains

March 21, 2016

no_SSLThe main benefit of using Unisphere Domains is that they give you ability to manage all of your VNXs by connecting to just one array. If you have an old Clariion you’ll have to use a so called Multi-Domain. VNX1 and VNX2 arrays can join a single domain.

Recently I’ve encountered an issue where this didn’t work quite well. When joining VNX1 to a VNX2 I got the following error:

CIMOM Can’t get the VNX hardware class from – ip 172.10.10.10 – Error Connecting SSL. Error details: A system call error (errno=10057).

join_error

As it turned out EMC disabled SSL 3.0 support in recent Block OE versions. As a result it’s broken Unisphere Domain connectivity with arrays running Flare 32 Patch 209 or older, which still use SSL 3.0.

Solution is to upgrade Block OE to a version higher than Flare 32 Patch 209 where SSL 3.0 is disabled. Or as a workaround you can connect arrays in a Multi-Domain. To find out how, read one of my earlier blog posts: How to Configure VNX Unisphere Domains

vSphere 6 VM Tools Installation Fails

January 15, 2016

Today I have encountered an issue with vSphere 6 VMware Tools when installing on a Windows Server 2008 R2 VM. Installation fails with the following error:

Service “VMware Alias Manager and Ticket Service’ (VGAuthService) failed to start. Verify that you have sufficient privileges to start system services.

The following error appears in the Application logs:

Activation context generation failed for “C:\ Program Files\ VMware\ VMware Tools\ VMware VGAuth\ VGAuthService.exe”. Dependent Assembly Microsoft.VC90.CRT, processorArchitecture=”amd64″, publicKeyToken=”1fc8b3b9a1e18e3b”, type=”win32″, version=”9.0.30729.4148″ could not be found. Please use sxstrace.exe for detailed diagnosis.

Microsoft.VC90.CRT is a Microsoft Visual C++ 2008 Runtime, which VMware Tools depend on.

vmtools_error

To fix the issue, reinstall the  Microsoft Visual C++ 2008 SP1 Redistributable packages, both x86 and x64 versions and the problem should go away.

This issue is unlikely to be specific to vSphere 6, because there is a VMware KB which describes a similar problem for an older version of the runtime:

But just in case, here is the environment setup:

  • Windows Server 2008 R2
  • vSphere vCenter 6 Update 1, b3018524
  • vSphere ESXi 6 Update 1a, b3073146
  • VMware Tools 9.10.5, b2981885

vSphere Dump / Syslog Collector: PowerCLI Script

March 12, 2015

Overview

If you install ESXi hosts on say 2GB flash cards in your blades which are smaller than required 6GB, then you won’t have what’s called persistent storage on your hosts. Both your kernel dumps and logs will be kept on RAM drive and deleted after a reboot. Which is less than ideal.

You can use vSphere Dump Collector and Syslog Collector to redirect them to another host. Usually vCenter machine, if it’s not an appliance.

If you have a bunch of ESXi hosts you’ll have to manually go through each one of them to set the settings, which might be a tedious task. Syslog can be done via Host Profiles, but Enterprise Plus licence is not a very common things across the customers. The simplest way is to use PowerCLI.

Amendments to the scripts

These scripts originate from Mike Laverick’s blog. I didn’t write them. Original blog post is here: Back To Basics: Installing Other Optional vCenter 5.5 Services.

The purpose of my post is to make a few corrections to the original Syslog script, as it has a few mistakes:

First – typo in system.syslog.config.set() statement. It requires additional $null argument before the hostname. If you run it as is you will probably get an error which looks like this.

Message: A specified parameter was not correct.
argument[0];
InnerText: argument[0]

Second – you need to open outgoing syslog ports, otherwise traffic won’t flow. It seems that Dump Collector traffic is enabled by default even though there is no rule for it in the firewall (former netDump rule doesn’t exist anymore). Odd, but that’s how it is. Syslog on the other hand requires explicit rule, which is reflected in the script by network.firewall.ruleset.set() command.

Below are the correct versions of both scripts. If you copy and paste them everything should just work.

vSphere Dump Collector

Foreach ($vmhost in (get-vmhost))
{
$esxcli = Get-EsxCli -vmhost $vmhost
$esxcli.system.coredump.network.get()
}

Foreach ($vmhost in (get-vmhost))
{
$esxcli = Get-EsxCli -vmhost $vmhost
$esxcli.system.coredump.network.set($null, “vmk0”, “10.0.0.1”, “6500”)
$esxcli.system.coredump.network.set($true)
}

vSphere Syslog Collector

Foreach ($vmhost in (get-vmhost))
{
$esxcli = Get-EsxCli -vmhost $vmhost
$esxcli.system.syslog.config.get()
}

Foreach ($vmhost in (get-vmhost))
{
$esxcli = Get-EsxCli -vmhost $vmhost
$esxcli.system.syslog.config.set($null, $null, $null, $null, $null, “udp://10.0.0.1:514”)
$esxcli.network.firewall.ruleset.set($null, $true, “syslog”)
$esxcli.system.syslog.reload()
}