Posts Tagged ‘Problem’

HCX Perftest Issue

May 9, 2021

Introduction

VMware HCX is a great tool that simplifies VM migrations at scale, whether on-prem to on-prem or on-prem to cloud. I've worked with many different VM migration tools before, and what I particularly like about HCX is its ability to stretch network subnets between the source and destination environments. It reduces (or completely removes) the need to re-IP VMs, which simplifies the migration and reduces the risk of inadvertently introducing issues into migrated applications.

Perftest Tool

HCX is a complex set of technologies, and getting the initial deployment right is key to building a reliable migration fabric. Perftest is a CLI tool available on the interconnect (IX) and network extension (NE) HCX appliances, which lets you run validation tests to make sure everything is functioning correctly, as well as establish a performance baseline. To run this tool, SSH into HCX Manager, enter CCLI and then go to one of your IX or NE appliances:

# ccli
# list
# go 0
# perftest all

Issue Description

There is one issue you can come across when running perftest, where it partially completes with the following errors:

Message Error: map[string]interface {}{"grpc_code":14, "http_code":503, "http_status":"Service Unavailable", "message":"rpc error: code = Unavailable desc = transport is closing"}

and

Internal failure happens. Err: http.Post(https://appliance_ip:9443/perftest/stoptest) return statusCode: 503

Solution

The reason for this error is blocked connectivity on port TCP/4500. HCX uses ports UDP/500 and UDP/4500 to establish tunnels between IX and NE appliance pairs, but that is not enough for perftest; TCP/4500 needs to be open as well.

Perftest actually gives you a hint about this at the very beginning of its output, but it's easy to overlook. The requirement is not well documented (at least at the time of writing), so keep it in mind next time you deploy HCX.
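If you want to sanity-check connectivity before re-running perftest, a quick probe from a machine that can reach the remote appliance's uplink IP can help. This is only a sketch: the IP address is a placeholder, it assumes netcat is available, and UDP probes with nc are not conclusive, so treat the TCP check as the meaningful one.

# 203.0.113.10 is a placeholder for the remote IX/NE uplink IP
nc -vz 203.0.113.10 4500     # TCP/4500 - the port perftest needs
nc -vzu 203.0.113.10 500     # UDP/500 - tunnel establishment
nc -vzu 203.0.113.10 4500    # UDP/4500 - tunnel traffic; UDP results from nc are only indicative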

Load Balancing Ansible Tower Using NSX

February 1, 2020

Disclaimer: this configuration is not validated by either VMware or Red Hat. Make sure it is applicable to your use case and test it thoroughly before implementing it in production.

Overview

If you landed on this page I trust you already know what Ansible is. It’s a great configuration management tool centred around using YAML to describe the desired state configuration of your various infrastructure components. This desired state is captured in what Ansible calls playbooks, which once written, can then be used in a repeatable way to deploy brand new components or enforce configuration on already deployed ones.

Ansible can be installed and used from the CLI, which is usually a good starting point. If you have multiple people using Ansible in your organization, you can also deploy AWX. It's a free GUI add-on to Ansible, which makes managing concurrent user access to Ansible easier by adding projects, schedules and credentials management. On top of that there is Ansible Tower. Ansible Tower is the paid version of AWX and gives you additional enterprise features and services like clustering, product support, validated upgrade paths, etc. In this article we will be focusing on the Ansible Tower version of the product.

It's also worth mentioning that this configuration is based on the Ansible Tower cluster feature, which lets you run all nodes active/active. Prior to version 3.1 this capability was called redundancy and worked only in active/passive mode. The redundancy feature is deprecated and outside the scope of this blog post.

Topology

Deploying multiple Ansible Tower nodes in a cluster already gives you redundancy. If one of the nodes fails, you can connect to another node by simply changing the URL in your browser. The benefit of having a load balancer is that you have a single URL to hit, and if a node goes down, the failover is handled by the load balancer automatically.

In this example we will be deploying a VMware NSX load-balancer in the following topology:

Configuration

Deploying an NSX load-balancer for HTTPS port 443 is simple; you can find numerous examples of how to create application profiles, monitors, pools and VIPs in the official VMware documentation or out on the Internet. But with Ansible Tower there's one catch. If you try to use the default HTTPS monitor that the NSX load balancer comes with, you will find HTTP 400 responses in Ansible Tower's nginx logs:

10.20.30.40 - - [20/Jan/2020:04:50:19 +0000] "GET / HTTP/1.0" 400 3786 "-" "-" "-"
10.20.30.40 - - [20/Jan/2020:04:50:24 +0000] "GET / HTTP/1.0" 400 3786 "-" "-" "-"
10.20.30.40 - - [20/Jan/2020:04:50:29 +0000] "GET / HTTP/1.0" 400 3786 "-" "-" "-"

And an error in NSX load balancer health check:

As it turns out, when you make an HTTP request to Ansible Tower, specifying the HTTP "Host" header is a requirement. The Host header simply contains the hostname of the server you're making the request to. Browsers add this header automatically, which is why you're not going to see any errors when accessing Ansible Tower using Firefox or Chrome. But NSX doesn't add this header to its monitor checks by default, which makes Ansible Tower upset.
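You can reproduce both behaviours with curl from any Linux box. This is only a sketch: the node IP and the hostname are placeholders, and the exact status codes may vary slightly between Tower versions.

# 10.20.30.41 is a placeholder for one of the Tower nodes
curl -k -s -o /dev/null -w '%{http_code}\n' --http1.0 -H 'Host:' https://10.20.30.41/      # no Host header, like the default monitor: expect 400
curl -k -s -o /dev/null -w '%{http_code}\n' -H 'Host: any.host.com' https://10.20.30.41/   # Host header present: expect 200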

Here is the trick you need to do to make Tower happy:

Now nginx logs show success code 200:

10.20.30.40 - - [21/Jan/2020:22:54:42 +0000] "GET / HTTP/1.0" 200 11337 "-" "-" "-"
10.20.30.40 - - [21/Jan/2020:22:54:47 +0000] "GET / HTTP/1.0" 200 11337 "-" "-" "-"
10.20.30.40 - - [21/Jan/2020:22:54:52 +0000] "GET / HTTP/1.0" 200 11337 "-" "-" "-"

Load balancer health check is successful:

And pool members are up and reachable:

Note: technically the Host header should contain the hostname of the Tower node we're making the health check against. But since the NSX monitor is configured per pool and not per pool member, we have to use a fake hostname, "any.host.com", as a workaround. When I was testing it, Tower didn't complain.

Reference

Even though I said the rest of the load-balancer configuration is standard, I still think having screenshots for reference is helpful if you need to validate your configuration. You can find the full list of settings below.

Screenshot 1: Application Profile

Screenshot 2: Service Monitor

Screenshot 3: Pool

Screenshot 4: Virtual Server

Joining ESXi to AD in Disjoint Namespace

November 4, 2019

What is Disjoint Namespace?

Typically, when using Microsoft Active Directory, you use AD-integrated DNS and your AD domain name matches your DNS domain name, but you don't have to. It is quite rare, but I've seen cases where the two don't match. For example, you might have a Linux-based DNS where you register an esx01.example.com DNS record for your ESXi host and then join the host to an Active Directory domain called corp.local.

That’s called a disjoint namespace. You can read this Microsoft article if you want to know more details: Disjoint Namespace.

In my personal opinion, using a disjoint namespace is asking for trouble, but it will still work if you really want to use it.

Problem

If you end up going down that route, there's one caveat you should be aware of. When you join a machine to AD, among other things, it needs to populate the DNS Name field of the AD computer object. This is an example of an ESXi computer object in the Active Directory Users and Computers snap-in:

If you configure the example.com domain in your ESXi Default TCP/IP stack, like so:

and then try to join your ESXi host to, say, the corp.local AD domain, the host will attempt to use esx-01a.example.com for the computer object's DNS Name field. If you're using a domain account with privileges restricted to domain join only, this operation will fail.

This is how the problem manifested itself in my case in ESXi host logs:

Failed to run provider specific request (request code = 8, provider = 'lsa-activedirectory-provider') -> error = 40315, symbol = LW_ERROR_LDAP_CONSTRAINT_VIOLATION, client pid = 2099303

If you're using host profiles to join the ESXi host to the domain, remediation will fail and you will see the following in /var/log/syslog.log:

WARNING: Domain join failed; retry count 1.

WARNING: Domain join failed; retry count 2.

Likewise (ActiveDirectory) Domain Join operation failed while joining new domain via username and password..

Note: this problem is specific to joining the domain with a restricted service account. If you use a domain administrator account, it has enough privileges to force the domain controller to add the computer object with a DNS name that doesn't match the AD name.

Solution

Make sure the ESXi domain name setting matches the Active Directory domain name, not the DNS domain name. You can still use the esx-01a.example.com record to add the ESXi host to vCenter, but you have to specify the corp.local domain in the DNS settings (or leave it blank), because that is what will be used to add the host to AD, like so:

This way your domain controller will be happy and the ESXi host will successfully join the domain.
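If you prefer to check or change this from the ESXi shell rather than the UI, something along these lines should work. It's only a sketch; verify on your build that the domain shown here is the same field you see in the Default TCP/IP stack settings.

esxcli system hostname get                       # show the current host name, domain and FQDN
esxcli system hostname set --domain=corp.local   # make the domain match the AD domain before joining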

Additional Notes

While troubleshooting this issue I saw a few errors in the ESXi host logs which were a distraction. You can ignore them, as they don't constitute a problem.

This just means that the ESXi host Active Directory service is running, but the host is not joined to a domain yet:

lsass: Failed to run provider specific request (request code = 12, provider = 'lsa-activedirectory-provider') -> error = 2692, symbol = NERR_SetupNotJoined, client pid = 2111366

IPC stands for inter-process communication. Likewise consists of multiple services that talk to each other; they open and close connections all the time, and this is normal:

lsass-ipc: (assoc:0x8ed7e40) Dropping: Connection closed by peer

I also found this command to be useful for deeper packet inspection between an ESXi host and AD domain controllers:

tcpdump-uw -i vmk0 not port 22 and not arp


Host Profile Customization Issue

November 1, 2019

vSphere Host Profiles is a great feature for consistent ESXi host configuration and compliance checks, but it can at times be flaky.

I’ve noticed an issue recently with Host Profiles in vSphere 6.7, where after providing host customization values the following error is shown in vSphere Web Client:

The “Update host customizations” operation failed for the entity with the following error message.

Host settings validation failed.

This is how the error message looks in the client:

Even though it's a bit annoying, I found it to be a furphy. The customizations are actually saved successfully and the error can be ignored. You can find the following messages in the ESXi host's /var/log/syslog.log file, which confirm that it works:


INFO: Execute completed
INFO: Validating AnswerFile Status1 = success
INFO: Cleaned up Host Configuration
INFO: GetAnswerFile completed
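To check this on your own host, assuming SSH access is enabled, a quick filter over the same log should surface these lines (the search keyword is just a guess at a useful pattern):

grep -i answerfile /var/log/syslog.log    # look for the answer file validation messages shown above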

I've also found that this error doesn't appear when you provide host customization values for the first time, straight after attaching a profile to the host; it only shows up when you update them. It also doesn't appear in the HTML5 client, only in the Web Client. I guess that's one more reason to switch to HTML5.

Hope this blog post helps someone who searched Google but couldn't find any information related to this error message.

Unable to Delete vCenter Endpoint in vRealize Automation

December 7, 2018

vRealize Automation Error

More than once in my experience I've had a need to delete an endpoint in vRealize Automation. Maybe the configuration has changed, or you simply made a typo in the vCenter hostname or credentials. Once you've specified the vCenter address and saved the endpoint, you can no longer change it (only delete and re-add it).

But even when you try to delete it, you will get an error along the lines of:

You cannot delete this endpoint because 1 compute resources and 0 storage paths use it.

CloudClient Error

There is a KB article that walks you through the process of how to do that using a special tool called CloudClient: Error “This endpoint is being used by # compute resources and # storage paths and cannot be deleted” when you attempt to delete an endpoint in vRA 7.x (2150548)

But even that approach doesn't always work. When you run the command from the KB article, "vra computeresource inactive list", you may get the following error:

Error: Something went wrong while processing your request. Please check the application logs for details.

Solution

There is almost no mention of this second error on the Internet, and I can see how someone could keep banging their head trying to solve it, so I thought I'd share a solution here. And it's simple: open a GSS ticket. They can delete the endpoint for you. If you see this error, there's no other way that I know of to solve this problem without involving GSS.

Clean-up

You can see an error similar to the following in the vRA logs if you didn't stop the proxy agents before deleting the endpoint:

Error processing ping response
Error occurred while executing stored proc usp_InsertUpdateHost The INSERT statement conflicted with the FOREIGN KEY constraint "FK_ManagementEndpoint_Host". The conflict occurred in database "vRa_IaaS", table "dbo.ManagementEndpoints", column 'ManagementEndpointID'.
The statement has been terminated.
Inner Exception: The INSERT statement conflicted with the FOREIGN KEY constraint "FK_ManagementEndpoint_Host". The conflict occurred in database "vRa_IaaS", table "dbo.ManagementEndpoints", column 'ManagementEndpointID'.
The statement has been terminated.

All you need to do to get rid of it is restart your proxy agents.

Conclusion

Hope this post saves someone hours of searching for the answer in blogs and forums.

Error When Deploying VCSA or PSC

October 31, 2017

Recently, while helping a customer deploy a new greenfield VMware 6.5 environment, I ran into an issue where a brand new vCenter Server Appliance and Platform Services Controller 6.5 build 5973321 fail to deploy to an ESXi host build 5969303.

Stage 1 (install) of the deployment completes successfully. In Stage 2 (setup) the VCSA installer, both for vCenter and PSC, first shows a prompt asking for credentials.

PSC Issue Description

After providing credentials, when installing an external PSC, installation fails with the following error:

Error:
Unable to connect to vCenter Single Sign-On: Failed to connect to SSO; uri:https://psc-hostname/sts/STSService/vsphere.local
Failed to register vAPI Endpoint Service with CM
Failed to configure vAPI Endpoint Service at the firstboot time

Resolution:
Please file a bug against VAPI

Installation wizard shows the following resulting error:

Failure:
A problem occurred during setup. Refresh this page and try again.

A problem occurred during setup. Services might not be working as expected.

A problem occurred while – Starting VMware vAPI Endpoint…

Appliance shows the following error in console:

Failed to start services. Firstboot Error.

Alternatively PSC can fail with the following error:

Error:
Unexpected failure: }
Failed to register vAPI Endpoint Service with CM
Failed to configure vAPI Endpoint Service at the firstboot time

Resolution:
Please file a bug against VAPI

VCSA Issue Description

After providing credentials, when installing vCenter with embedded PSC, installation fails with the following error:

Error:
Unable to start the Service Control Agent.

Resolution:
Search for these symptoms in the VMware knowledge base for any known issues and possible workarounds. If none can be found, collect a support bundle and open a support request.

Installation wizard shows the following resulting error:

Failure:
A problem occurred during setup. Refresh this page and try again.

A problem occurred during setup. Services might not be working as expected.

A problem occurred while – Starting VMware Service Control Agent…

Appliance shows the same error in console.

Alternatively VCSA can fail with the following error:

Error:
Encountered an internal error.
Traceback (most recent call last):
  File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 1852, in main
    vmidentityFB.boot()
  File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 359, in boot
    self.checkSTS(self.__stsRetryCount, self.__stsRetryInterval)
  File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 1406, in checkSTS
    raise Exception('Failed to initialize Secure Token Server.')
Exception: Failed to initialize Secure Token Server.

Resolution:
This is an unrecoverable error, please retry install. If you run into this error again, please collect a support bundle and open a support request.

Issue Workaround

This issue happens when a VCSA or PSC installation was cancelled and is then attempted a second time to the same ESXi host.

The identified workaround for this issue is to use another ESXi host, one that has never had a PSC or VCSA deployed to it.

Issue Resolution

VMware is aware of the bug and is working on a resolution.

Dell Repository Manager: Bootable ISO Issues

May 23, 2016

In one of my previous posts I described the process of upgrading Dell FX2 chassis firmware using Dell Repository Manager (DRM).

In an ideal world you just follow the process and in an hour or two your chassis is upgraded. But you may sometimes run into issues. I want to go through some of them in this post, including possible remediation.

Issue Description

When exporting firmware to a bootable ISO, you may find that DRM is unable to download some of the bundle components, with the following error in the Job Result:

Processing failed:
Failed downloading files:
Diagnostics_Application_PWMC8_LN64_OSC_1.1_A00.BIN

And errors in the Log:


60. 24/03/2016 5:58:50 PM Export to Bootable ISO : Downloaded 34 / 56
61. 24/03/2016 5:59:44 PM Export to Bootable ISO : Error downloading some files
62. 24/03/2016 5:59:45 PM Export to Bootable ISO : Failed exporting to Bootable ISO.

Workaround #1: Skip the Component

You can try the option "Continue download irrespective of any error (in the selected components)" in the export dialog. It won't help get the component downloaded, but you will get a bootable ISO.

However, DRM will still keep the failed component in the bundle and try to install it during the upgrade, which will obviously fail (update 16/56):

failed_update

Once the upgrade is finished you will get the following error at the end:

Note: Some update requires machine reboot. Please reboot to CD/DVD to continue for the failed update because of the dependency…

upgrade_status

No matter how many times you reboot, you will obviously get the same errors. You can ignore them if you are 100% sure the skipped component is the only reason for the failure, or use Workaround #2.

Workaround #2: Create Custom ISO

When you create a repository in DRM it is populated with pre-built components and bundles, but you can also create custom repositories. The idea is to exclude the failed component from the repository by building it manually.

Assuming you already have the base repository configured, do the following:

  • Open the existing repository and click on the Components tab
  • Deselect the failed component in the component list (in my case it was Diagnostics_Application_PWMC8_LN64_OSC_1.1_A00.BIN)
  • Click on the “Copy To” button:

custom_components

  • In the opened dialogue select “Create NEW Repository and copy component(s) into it”
  • Follow the wizard and when you click finish, components will be copied to the newly created repository
  • Open the new repository and click on the Components tab
  • Select all components and click on the “Copy to” button once again
  • This time select “Create a NEW Bundle in the same repository and add component(s) into it”
  • On the next screen give the bundle a name and make sure to choose “Linux 32-bit and 64-bit” in the OS Type

custom_bundle

As a result you should get a new bundle, which you can export to a bootable ISO using the same process.

Workaround #3: Use Server Update Utility

If none of the above helps, you can fall back to a proven upgrade approach and use Server Update Utility (SUU). SUU is a huge 12GB ISO to download, but you can use Dell Download Manager, which supports resuming interrupted downloads. Make sure to disable your proxy! Dell Download Manager does not support resuming an interrupted download if you're using a proxy server.

SUU is not a bootable ISO. Previously you had to use the Dell Systems Build and Update Utility (SBUU) to boot from first and then mount the SUU ISO to proceed with the upgrade. Starting with Dell 11G servers you don't need it anymore and can upgrade firmware straight from the Dell Lifecycle Controller (LC).

You'll need to boot into the Lifecycle Controller and choose Firmware Update > Launch Firmware Update > Local Drive (CD or DVD or USB). Mount the SUU ISO and the rest is fairly straightforward. LC will upgrade the firmware and reboot the blade.

lc_upgrade

Conclusion

Dell Repository Manager is the recommended approach to upgrading firmware on Dell hardware. Unlike SUU, DRM downloads the latest updates and only the necessary components. It is also capable of producing a bootable ISO.

If you run into issues, fall back to the Server Update Utility; it's bulletproof and always works. But be prepared to download a 12GB ISO image, and make sure you have an option to bypass the proxy.

Requirements for Unmounting a VMware Datastore

December 30, 2015

I have come across issues unmounting VMware datastores multiple times myself. In recent vSphere versions vCenter shows you a warning if some of the requirements are not fulfilled. That is not the case in older vSphere versions, which makes the issue harder to identify.

Interestingly, there are some pre-requisites which even vCenter does not prompt you about. I will discuss all of the requirements in this post.

General Requirements

In this category I combine all requirements which vCenter checks against, such as:

Requirement: No virtual machine resides on the datastore.

Action: You have to make sure that the host you are unmounting the datastore from has no virtual machines (running or stopped) registered on this datastore. If you are unmounting just one datastore from just one host, you can simply vMotion all VMs residing on the datastore from this host to the remaining hosts. If you are unmounting the datastore from all hosts, you'll have to either Storage vMotion all VMs to the remaining datastores or shut down the VMs and unregister them from vCenter.

unmount_vmfs2

Requirement: The datastore is not part of a Datastore Cluster.

Requirement: The datastore is not managed by storage DRS.

Action: Drag and drop the datastore out of the Datastore Cluster in vCenter. The second requirement is redundant, because SDRS is only enabled on datastores that are configured within a Datastore Cluster; by removing a datastore from a Datastore Cluster you automatically disable storage DRS on it.

Requirement: Storage I/O control is disabled for this datastore.

Action: Go to the datastore properties and uncheck the Storage I/O Control option. On a SIOC-enabled datastore vSphere creates a folder named after the block device ID and keeps a file called "slotsfile" in it. Its size will change to 0.00 KB once SIOC is disabled.
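If you want to confirm from the ESXi shell that the slotsfile has been released, a simple search of the datastore should do it. This is only a sketch; the datastore name is a placeholder and the exact folder layout may differ between vSphere versions.

find /vmfs/volumes/iSCSI1/ -name slotsfile -exec ls -lh {} \;    # size should drop to 0.00 KB once SIOC is disabled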

Requirement: The datastore is not used for vSphere HA heartbeat.

Action: vSphere HA automatically selects two datastores, creates .vSphere-HA folders on them and uses them to keep HA heartbeats. If you have more than two datastores in your cluster, you can control which ones are selected. Go to cluster properties > Datastore Heartbeating (under the vSphere HA section) and select the preferred datastores from the list. This works if you are unmounting just one datastore. If you need to unmount all datastores, you will have to disable HA at the cluster level altogether.

datastore_heartbeat

Additional Requirements

Requirements in this category are not checked by vCenter up front, but they still have to be satisfied; otherwise vCenter will not let you unmount the datastore.

Requirement: The datastore is not used for swap.

Action: When a VM is powered on, by default it creates a swap file with a .vswp extension in the VM directory. You can change this default behaviour and, on a per-host basis, select a dedicated datastore where the host will create swap files for its virtual machines. This setting is enabled in the cluster properties under the Swapfile Location section. The datastore is then selected for each host in the Virtual Machine Swapfile Location settings on the host configuration tab.

When you enable this option, the host also creates a host-local swap file, named something like this: sysSwap-hls-55de2f14-6c5d-4d50-5cdf-000c296fc6a7.swp

There are scenarios where you need to unmount the swap datastore, for example when you need to reconnect all of your storage from FC to iSCSI. Even if you shut down all of your VMs, the datastore unmount will fail because the host swap files are still there, and you will see an error such as this:

The resource 'Datastore Name: iSCSI1 VMFS uuid: 55de473c-7f3ae2b5-f9f8-000c29ba113a' is in use.

See the error stack for details on the cause of the problem.

Error Stack:

Call "HostStorageSystem.UnmountVmfsVolume" for object "storageSystem-29" on vCenter Server "VC.lab.local" failed.

Cannot unmount volume 'Datastore Name: iSCSI1 VMFS uuid: 55de473c-7f3ae2b5-f9f8-000c29ba113a' because file system is busy. Correct the problem to retry the operation.

The workaround is to change the setting at the cluster level to store the VM swap file in the VM directory and reboot all hosts. After a reboot, the host .swp file will disappear.

If rebooting the hosts is not desirable, you can SSH to each host and type the following command:

# esxcli sched swap system set --hostlocalswap-enabled false

To confirm that the change has taken effect run:

# esxcli sched swap system get

Then check the datastore; the .swp files should no longer be there.
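A quick way to double-check from the ESXi shell, assuming the datastore is called iSCSI1 as in the example above:

find /vmfs/volumes/iSCSI1/ -name 'sysSwap-hls-*.swp'    # should return nothing once host-local swap is disabled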

Conclusion

If you satisfy all of the above requirements you should have no problems when unmounting VMware datastores. vSphere creates a few additional system folders on each of the datastores, such as .sdd.sf and .dvsData, but I personally have never had issues with them.

VNX LDAP Integration: AD Nested Groups

February 11, 2015

Have you ever stumbled upon AD authentication issues on a VNX, even though everything looked configured properly? LDAP integration has always been a PITA on storage arrays and blade chassis, as there is usually no way to troubleshoot what the actual error is.

auth_error

If the VNX cannot look up the user or group that you're trying to authenticate against in AD, this is all you'll see. Now go figure out why it's getting upset, even though you can clearly see the group configured in "Role Mapping" and there don't seem to be any typos.

A common problem is nested groups. By default the VNX only checks whether your account is directly in the specified AD group and doesn't traverse the hierarchy. So, for example, if your account is in an AD group called IT_Admins, IT_Admins is a member of Domain Admins, and Domain Admins is what's configured in "Role Mapping", it's not gonna work.
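If you want to confirm from a Linux box whether your account really is only a nested member of the mapped group, a recursive AD query can help. This is only a sketch: the domain controller, DNs and account names are placeholders, and it assumes OpenLDAP's ldapsearch is installed.

# Recursive (nested) membership check using AD's LDAP_MATCHING_RULE_IN_CHAIN
# matching rule (OID 1.2.840.113556.1.4.1941)
ldapsearch -H ldap://dc01.corp.local -D 'jsmith@corp.local' -W \
  -b 'DC=corp,DC=local' \
  '(&(sAMAccountName=jsmith)(memberOf:1.2.840.113556.1.4.1941:=CN=Domain Admins,CN=Users,DC=corp,DC=local))' dn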

nested_groups

To make it work, change the "Nested Group Level" to something appropriate for your environment; this should resolve the issue and make your life happier.