Posts Tagged ‘ESXi’

Quick Way to Migrate VMs Between Standalone ESXi Hosts

September 26, 2017

Introduction

Since vSphere 5.1, VMware offers an easy migration path for VMs running on hosts managed by a vCenter. Using Enhanced vMotion available in Web Client, VMs can be migrated between hosts, even if they don’t have shared datastores. In vSphere 6.0 cross vCenter vMotion(xVC-vMotion) was introduced, which no longer requires you to even have old and new hosts be managed by the same vCenter.

But what if you don’t have a vCenter and you need to move VMs between standalone ESXi hosts? There are many tools that can do that. You can use V2V conversion in VMware Converter or replication feature of the free version of Veeam Backup and Replication. But probably the easiest tool to use is OVF Tool.

Tool Overview

OVF Tool has been around since Open Virtualization Format (OVF) was originally published in 2008. It’s constantly being updated and the latest version 4.2.0 supports vSphere up to version 6.5. The only downside of the tool is it can export only shut down VMs. It’s may cause problems for big VMs that take long time to export, but for small VMs the tool is priceless.

Installation

OVF Tool is a CLI tool that is distributed as an MSI installer and can be downloaded from VMware web site. One important thing to remember is that when you’re migrating VMs, OVF Tool is in the data path. So make sure you install the tool as close to the workload as possible, to guarantee the best throughput possible.

Usage Examples

After the tool is installed, open Windows command line and change into the tool installation directory. Below are three examples of the most common use cases: export, import and migration.

Exporting VM as an OVF image:

> ovftool “vi://username:password@source_host/vm_name” “vm_name.ovf”

Importing VM from an OVF image:

> ovftool -ds=”destination_datastore” “vm_name.ovf” “vi://username:password@destination_host”

Migrating VM between ESXi hosts:

> ovftool -ds=”destination_datastore” “vi://username:password@source_host/vm_name” “vi://username:password@destination_host”

When you are migrating, machine the tool is running on is still used as a proxy between two hosts, the only difference is you are not saving the OVF image to disk and don’t need disk space available on the proxy.

This is what it looks like in vSphere and HTML5 clients’ task lists:

Observations

When planning migrations using OVF Tool, throughput is an important consideration, because migration requires downtime.

OVF Tool is quite efficient in how it does export/import. Even for thick provisioned disks it reads only the consumed portion of the .vmdk. On top of that, generated OVF package is compressed.

Due to compression, OVF Tool is typically bound by the speed of ESXi host’s CPU. In the screenshot below you can see how export process takes 1 out of 2 CPU cores (compression is singe-threaded).

While testing on a 2 core Intel i5, I was getting 25MB/s read rate from disk and an average export throughput of 15MB/s, which is roughly equal to 1.6:1 compression ratio.

For a VM with a 100GB disk, that has 20GB of space consumed, this will take 20*1024/25 = 819 seconds or about 14 minutes, which is not bad if you ask me. On a Xeon CPU I expect throughput to be even higher.

Caveats

There are a few issues that you can potentially run into that are well-known, but I think are still worth mentioning here.

Special characters in URIs (string starting with vi://) must be escaped. Use % followed by the character HEX code. You can find character HEX codes here: http://www.techdictionary.com/ascii.html.

For example use “vi://root:P%40ssword@10.0.1.10”, instead of “vi://root:P@ssword@10.0.1.10” or you can get confusing errors similar to this:

Error: Could not lookup host: root

Disconnect ISO images from VMs before migrating them or you will get the following error:

Error: A general system error occurred: vim.fault.FileNotFound

Conclusion

OVF Tool requires downtime when exporting, importing or migrating VMs, which can be a deal-breaker for large scale migrations. When downtime is not a concern or for VMs that are small enough for the outage to be minimal, from now on OVF Tool will be my migration tool of choice.

Advertisements

Force10 and vSphere vDS Interoperability Issue

June 10, 2016

dell-force10Recently I had an opportunity to work with Dell FX2 platform from the design and delivery point of view. I was deploying a FX2s chassis with FC630 blades and FN410S 10Gb I/O aggregators.

I ran into an interesting interoperability glitch between Force10 and vSphere distributed switch when using LLDP. LLDP is an equivalent of Cisco CDP, but is an open standard. And it allows vSphere administrators to determine which physical switch port a given vSphere distributed switch uplink is connected to. If you enable both Listen and Advertise modes, network administrators can get similar visibility, but from the physical switch side.

In my scenario, when LLDP was enabled on a vSphere distributed switch, uplinks on all ESXi hosts started disconnecting and connecting back intermittently, with log errors similar to this:

Lost uplink redundancy on DVPorts: “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”. Physical NIC vmnic1 is down.

Network connectivity restored on DVPorts: “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”. Physical NIC vmnic1 is up

Uplink redundancy restored on DVPorts: “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”. Physical NIC vmnic1 is up

Issue Troubleshooting

FX2 I/O aggregator logs were reviewed for potential errors and the following log entries were found:

%STKUNIT0-M:CP %DIFFSERV-5-DSM_DCBX_PFC_PARAMETERS_MISMATCH: PFC Parameters MISMATCH on interface: Te 0/2

%STKUNIT0-M:CP %IFMGR-5-OSTATE_DN: Changed interface state to down: Te 0/2

%STKUNIT0-M:CP %IFMGR-5-OSTATE_UP: Changed interface state to up: Te 0/2

This clearly looks like some DCB negotiation issue between Force10 and the vSphere distributed switch.

Root Cause

Priority Flow Control (PFC) is one of the protocols from the Data Center Bridging (DCB) family. DCB was purposely built for converged network environments where you use 10Gb links for both Ethernet and FC traffic in the form of FCoE. In such scenario, PFC can pause Ethernet frames when FC is not having enough bandwidth and that way prioritise the latency sensitive storage traffic.

In my case NIC ports on Qlogic 57840 adaptors were used for 10Gb Ethernet and iSCSI and not FCoE (which is very uncommon unless you’re using Cisco UCS blade chassis). So the question is, why Force10 switches were trying to negotiate FCoE? And what did it have to do with enabling LLDP on the vDS?

The answer is simple. LLDP not only advertises the port numbers, but also the port capabilities. Data Center Bridging Exchange Protocol (DCBX) uses LLDP when conveying capabilities and configuration of FCoE features between neighbours. This is why enabling LLDP on the vDS triggered this. When Force10 switches determined that vDS uplinks were CNA adaptors (which was in fact true, I was just not using FCoE) it started to negotiate FCoE using DCBX. Which didn’t really go well.

Solution

The easiest solution to this problem is to disable DCB on the Force10 switches using the following command:

# conf t
# no dcb enable

Alternatively you can try and disable FCoE from the ESXi end by using the following commands from the host CLI:

# esxcli fcoe nic list
# esxcli fcoe nic disable -n vmnic0

Once FCoE has been disabled on all NICs, run the following command and you should get an empty list:

# esxcli fcoe adapter list

Conclusion

It is still not clear why PFC mismatch would cause vDS uplinks to start flapping. If switch cannot establish a FCoE connection it should just ignore it. Doesn’t seem to be the case on Force10. So if you run into a similar issue, simply disable DCB on the switches and it should fix it.

Dell Compellent is not an ALUA Storage Array

May 16, 2016

dell_compellentDell Compellent is Dell’s flagship storage array which competes in the market with such rivals as EMC VNX and NetApp FAS. All these products have slightly different storage architectures. In this blog post I want to discuss what distinguishes Dell Compellent from the aforementioned arrays when it comes to multipathing and failover. This may help you make right decisions when designing and installing a solution based on Dell Compellent in your production environment.

Compellent Array Architecture

In one of my previous posts I showed how Compellent LUNs on vSphere ESXi hosts are claimed by VMW_SATP_DEFAULT_AA instead of VMW_SATP_ALUA SATP, which is the default for all ALUA arrays. This happens because Compellent is not actually an ALUA array and doesn’t have the tpgs_on option enabled. Let’s digress for a minute and talk about what the tpgs_on option actually is.

For a storage array to be claimed by VMW_SATP_ALUA it has to have the tpgs_on option enabled, as indicated by the corresponding SATP claim rule:

# esxcli storage nmp satp rule list

Name                 Transport  Claim Options Description
-------------------  ---------  ------------- -----------------------------------
VMW_SATP_ALUA                   tpgs_on       Any array with ALUA support

This is how Target Port Groups (TPG) are defined in section 5.8.2.1 Introduction to asymmetric logical unit access of SCSI Primary Commands – 3 (SPC-3) standard:

A target port group is defined as a set of target ports that are in the same target port asymmetric access state at all times. A target port group asymmetric access state is defined as the target port asymmetric access state common to the set of target ports in a target port group. The grouping of target ports is vendor specific.

This has to do with how ports on storage controllers are grouped. On an ALUA array even though a LUN can be accessed through either of the controllers, paths only to one of them (controller which owns the LUN) are Active Optimized (AO) and paths to the other controller (non-owner) are Active Non-Optimized (ANO).

Compellent does not present LUNs through the non-owning controller. You can easily see that if you go to the LUN properties. In this example we have four iSCSI ports connected (two per controller) on the Compellent side, but we can see only two paths, which are the paths from the owning controller.

compellent_psp

If Compellent presents each particular LUN only through one controller, then how does it implement failover? Compellent uses a concept of fault domains and control ports to handle LUN failover between controllers.

Compellent Fault Domains

This is Dell’s definition of a Fault Domain:

Fault domains group front-end ports that are connected to the same Fibre Channel fabric or Ethernet network. Ports that belong to the same fault domain can fail over to each other because they have the same connectivity.

So depending on how you decided to go about your iSCSI network configuration you can have one iSCSI subnet / one fault domain / one control port or two iSCSI subnets / two fault domains / two control ports. Either of the designs work fine, this is really is just a matter of preference.

You can think of a Control Port as a Virtual IP (VIP) for the particular iSCSI subnet. When you’re setting up iSCSI connectivity to a Compellent, you specify Control Ports IPs in Dynamic Discovery section of the iSCSI adapter properties. Which then redirects the traffic to the actual controller IPs.

If you go to the Storage Center GUI you will see that Compellent also creates one virtual port for every iSCSI physical port. This is what’s called a Virtual Port Mode and is recommended instead of a Physical Port Mode, which is the default setting during the array initialization.

Failover scenarios

Now that we now what fault domains are, let’s talk about the different failover scenarios. Failover can happen on either a port level when you have a transceiver / cable failure or a controller level, when the whole controller goes down or is rebooted. Let’s discuss all of these scenarios and their variations one by one.

1. One Port Failed / One Fault Domain

If you use one iSCSI subnet and hence one fault domain, when you have a port failure, Compellent will move the failed port to the other port on the same controller within the same fault domain.

port_failed

In this example, 5000D31000B48B0E and 5000D31000B48B0D are physical ports and 5000D31000B48B1D and 5000D31000B48B1C are the corresponding virtual ports on the first controller. Physical port 5000D31000B48B0E fails. Since both ports on the controller are in the same fault domain, controller moves virtual port 5000D31000B48B1D from its original physical port 5000D31000B48B0E to the physical port 5000D31000B48B0D, which still has connection to network. In the background Compellent uses iSCSI redirect command on the Control Port to move the traffic to the new virtual port location.

2. One Port Failed / Two Fault Domains

Two fault domains scenario is slightly different as now on each controller there’s only one port in each fault domain. If any of the ports were to fail, controller would not fail over the port. Port is failed over only within the same controller/domain. Since there’s no second port in the same fault domain, the virtual port stays down.

port_failed_2

A distinction needs to be made between the physical and virtual ports here. Because from the physical perspective you lose one physical link in both One Fault Domain and Two Fault Domains scenarios. The only difference is, since in the latter case the virtual port is not moved, you’ll see one path down when you go to LUN properties on an ESXi host.

3. Two Ports Failed

This is the scenario which you have to be careful with. Compellent does not initiate a controller failover when all front-end ports on a controller fail. The end result – all LUNs owned by this controller become unavailable.

two_ports_failed_2

luns_down

This is the price Compellent pays for not supporting ALUA. However, such scenario is very unlikely to happen in a properly designed solution. If you have two redundant network switches and controllers are cross-connected to both of them, if one switch fails you lose only one link per controller and all LUNs stay accessible through the remaining links/switch.

4. Controller Failed / Rebooted

If the whole controller fails the ports are failed over in a similar fashion. But now, instead of moving ports within the controller, ports are moved across controllers and LUNs come across with them. You can see how all virtual ports have been failed over from the second (failed) to the first (survived) controller:

controller_failed

Once the second controller gets back online, you will need to rebalance the ports or in other words move them back to the original controller. This doesn’t happen automatically. Compellent will either show you a pop up window or you can do that by going to System > Setup > Multi-Controller > Rebalance Local Ports.

Conclusion

Dell Compellent is not an ALUA storage array and falls into the category of Active/Passive arrays from the LUN access perspective. Under such architecture both controller can service I/O, but each particular LUN can be accessed only through one controller. This is different from the ALUA arrays, where LUN can be accessed from both controllers, but paths are active optimized on the owning controller and active non-optimized on the non-owning controller.

From the end user perspective it does not make much of a difference. As we’ve seen, Compellent can handle failover on both port and controller levels. The only exception is, Compellent doesn’t failover a controller if it loses all front-end connectivity, but this issue can be easily avoided by properly designing iSCSI network and making sure that both controllers are connected to two upstream switches in a redundant fashion.

How Admission Control Really Works

May 2, 2016

confusionThere is a moment in every vSphere admin’s life when he faces vSphere Admission Control. Quite often this moment is not the most pleasant one. In one of my previous posts I talked about some of the common issues that Admission Control may cause and how to avoid them. And quite frankly Admission Control seems to do more harm than good in most vSphere environments.

Admission Control is a vSphere feature that is built to make sure that VMs with reservations can be restarted in a cluster if one of the cluster hosts fails. “Reservations” is the key word here. There is a common belief that Admission Control protects all other VMs as well, but that’s not true.

Let me go through all three vSphere Admission Control policies and explain why you’re better of disabling Admission Control altogether, as all of these policies give you little to no benefit.

Host failures cluster tolerates

This policy is the default when you deploy a vSphere cluster and policy which causes the most issues. “Host failures cluster tolerates” uses slots to determine if a VM is allowed to be powered on in a cluster. Depending on whether VM has CPU and memory reservations configured it can use one or more slots.

Slot Size

To determine the total number of slots for a cluster, Admission Control uses slot size. Slot size is either the default 32MHz and 128MB of RAM (for vSphere 6) or if you have VMs in the cluster configured with reservations, then the slot size will be calculated based on the maximum CPU/memory reservation. So say if you have 100 VMs, 98 of which have no reservations, one VM has 2 vCPUs and 8GB of memory reserved and another VM has 4 vCPUs and 4GB of memory reserved, then the slot size will jump from 32MHz / 128MB to 4 vCPUs / 8GB of memory. If you have 2.0 GHz CPUs on your hosts, the 4 vCPU reservation will be an equivalent of 8.0 GHz.

Total Number of Slots

Now that we know the slot size, which happens to be 8.0 GHz and 8GB of memory, we can calculate the total number of slots in the cluster. If you have 2 x 8 core CPUs and 256GB of RAM in each of 4 ESXi hosts, then your total amount of resources is 16 cores x 2.0 GHz x 4 hosts = 128 GHz and 256GB x 4 hosts = 1TB of RAM. If your slot size is 4 vCPUs and 8GB of RAM, you get 64 vCPUs / 4 vCPUs = 16 slots (you’ll get more for memory, but the least common denominator has to be used).

total_slots

Practical Use

Now if you configure to tolerate one host failure, you have to subtract four slots from the total number. Every VM, even if it doesn’t have reservations takes up one slot. And as a result you can power on maximum 12 VMs on your cluster. How does that sound?

Such incredibly restrictive behaviour is the reason why almost no one uses it in production. Unless it’s left there by default. You can manually change the slot size, but I have no knowledge of an approach one would use to determine the slot size. That’s the policy number one.

Percentage of cluster resources reserved as failover spare capacity

This is the second policy, which is commonly recommended by most to use instead of the restrictive “Host failures cluster tolerates”. This policy uses percentage-based instead of the slot-based admission.

It’s much more straightforward, you simply specify the percentage of resources you want to reserve. For example if you have four hosts in a cluster the common belief is that if you specify 25% of CPU and memory, they’ll be reserved to restart VMs in case one of the hosts fail. But it won’t. Here’s the reason why.

When calculating amount of free resources in a cluster, Admission Control takes into account only VM reservations and memory overhead. If you have no VMs with reservations in your cluster then HA will be showing close to 99% of free resources even if you’re running 200 VMs.

failover_capacity

For instance, if all of your VMs have 4 vCPUs and 8GB of RAM, then memory overhead would be 60.67MB per VM. For 300 VMs it’s roughly 18GB. If you have two VMs with reservations, say one VM with 2 vCPUs / 4GB of RAM and another VM with 4 vCPUs / 2GB of RAM, then you’ll need to add up your reservations as well.

So if we consider memory, it’s 18GB + 4GB + 2GB = 24GB. If you have the total of 1TB of RAM in your cluster, Admission Control will consider 97% of your memory resources being free.

For such approach to work you’d need to configure reservations on 100% of your VMs. Which obviously no one would do. So that’s the policy number two.

Specify failover hosts

This is the third policy, which typically is the least recommended, because it dedicates a host (or multiple hosts) specifically just for failover. You cannot run VMs on such hosts. If you try to vMotion a VM to it, you’ll get an error.

failover_host

In my opinion, this policy would actually be the most useful for reserving cluster resources. You want to have N+1 redundancy, then reserve it. This policy does exactly that.

Conclusion

When it comes to vSphere Admission Control, everyone knows that “Host failures cluster tolerates” policy uses slot-based admission and is better to be avoided.

There’s a common misconception, though, that “Percentage of cluster resources reserved as failover spare capacity” is more useful and can reserve CPU and memory capacity for host failover. But in reality it’ll let you run as many VMs as you want and utilize all of your cluster resources, except for the tiny amount of CPU and memory for a handful of VMs with reservations you may have in your environment.

If you want to reserve failover capacity in your cluster, either use “Specify failover hosts” policy or simply disable Admission Control and keep an eye on your cluster resource utilization manually (or using vROps) to make sure you always have room for growth.

Changing the Default PSP for Dell Compellent

April 26, 2016

dell_compellentIf you’ve ever worked with Dell Compellent storage arrays you may have noticed that when you initially connect it to a VMware ESXi host, by default VMware Native Multipathing Plugin (NMP) uses Fixed Path Selection Policy (PSP) for all connected LUNs. And if you have two ports on each of the controllers connected to your storage area network (be it iSCSI or FC), then you’re wasting half of your bandwidth.

compellent_psp

Why does that happen? Let’s dig deep into VMware’s Pluggable Storage Architecture (PSA) and see how it treats Compellent.

How Compellent is claimed by VMware NMP

If you are familiar with vSphere’s Pluggable Storage Architecture (PSA) and NMP (which is the only PSA plug-in that every ESXi host has installed by default), then you may know that historically it’s always had specific rules for such Asymmetric Logical Unit Access (ALUA) arrays as NetApp FAS and EMC VNX.

Run the following command on an ESXi host and you will see claim rules for NetApp and DGC devices (DGC is Data General Corporation, which built Clariion array that has been later re-branded as VNX by EMC):

# esxcli storage nmp satp rule list

Name              Vendor  Default PSP Description
----------------  ------- ----------- -------------------------------
VMW_SATP_ALUA_CX  DGC                 CLARiiON array in ALUA mode
VMW_SATP_ALUA     NETAPP  VMW_PSP_RR  NetApp arrays with ALUA support

This tells NMP to use Round-Robin Path Selection Policy (PSP) for these arrays, which is always preferable if you want to utilize all available active-optimized paths. You may have noticed that there’s no default PSP in the VNX claim rule, but if you look at the default PSP for the VMW_SATP_ALUA_CX Storage Array Type Plug-In (SATP), you’ll see that it’s also Round-Robin:

# esxcli storage nmp satp list

Name              Default PSP  
----------------- -----------
VMW_SATP_ALUA_CX  VMW_PSP_RR

There is, however, no default claim rule for Dell Compellent storage arrays. There are a handful of the following non array-specific “catch all” rules:

Name                 Transport  Claim Options Description
-------------------  ---------  ------------- -----------------------------------
VMW_SATP_ALUA                   tpgs_on       Any array with ALUA support
VMW_SATP_DEFAULT_AA  fc                       Fibre Channel Devices
VMW_SATP_DEFAULT_AA  fcoe                     Fibre Channel over Ethernet Devices
VMW_SATP_DEFAULT_AA  iscsi                    iSCSI Devices

As you can see, the default PSP for VMW_SATP_ALUA is Most Recently Used (MRU) and for VMW_SATP_DEFAULT_AA it’s VMW_PSP_FIXED:

Name                Default PSP   Description
------------------- ------------- ------------------------------------------
VMW_SATP_ALUA       VMW_PSP_MRU
VMW_SATP_DEFAULT_AA VMW_PSP_FIXED Supports non-specific active/active arrays

Compellent is not an ALUA storage array and doesn’t have the tpgs_on option enabled. As a result it’s claimed by the VMW_SATP_DEFAULT_AA rule for the iSCSI transport, which is why you end up with the Fixed path selection policy for all LUNs by default.

Changing the default PSP

Now let’s see how we can change the PSP from Fixed to Round Robin. First thing you have to do before even attempting to change the PSP is to check VMware Compatibility List to make sure that the round robin PSP is supported for a particular array and vSphere combination.

vmware_hcl

As you can see, round robin path selection policy is supported for Dell Compellent storage arrays in vSphere 6.0u2. So let’s change it to get the benefit of being able to simultaneously use all paths to Compellent controllers.

For Compellent firmware versions 6.5 and earlier use the following command to change the default PSP:

# esxcli storage nmp satp set -P VMW_PSP_RR -s VMW_SATP_DEFAULT_AA

Note: technically here you’re changing PSP not specifically for the Compellent storage array, but for any array which is claimed by VMW_SATP_DEFAULT_AA and which also doesn’t have an individual SATP rule with PSP set. Make sure that this is not the case or you may accidentally change PSP for some other array you may have in your environment.

The above will change PSP for any newly provisioned and connected LUNs. For any existing LUNs you can change PSP either manually in each LUN’s properties or run the following command in PowerCLI:

# Get-Cluster ClusterNameHere | Get-VMHost | Get-ScsiLun | where {$_.Vendor -eq
“COMPELNT” –and $_.Multipathpolicy -eq “Fixed”} | Set-ScsiLun -Multipathpolicy
RoundRobin

This is what you should see in LUN properties as a result:

compellent_psp_2

Conclusion

By default any LUN connected from a Dell Compellent storage array is claimed by NMP using Fixed path selection policy. You can change it to Round Robin using the above two simple commands to make sure you utilize all storage paths available to ESXi hosts.

Implications of Ignoring vSphere Admission Control

April 5, 2016

no-admissionHA Admission Control has historically been on of the lesser understood vSphere topics. It’s not intuitive how it works and what it does. As a result it’s left configured with default values in most vSphere environments. But default Admission Control setting are very restrictive and can often cause issues.

In this blog post I want to share the two most common issues with vSphere Admission Control and solutions to these issues.

Issue #1: Not being able to start a VM

Description

Probably the most common issue everyone encounters with Admission Control is when you suddenly cannot power on VMs any more. There are multiple reasons why that might happen, but most likely you’ve just configured a reservation on one of your VMs or deployed a VM from an OVA template with a pre-configured reservation. This has triggered a change in Admission Control slot size and based on the new slot size you no longer have enough slots to satisfy failover requirements.

As a result you get the following alarm in vCenter: “Insufficient vSphere HA failover resources”. And when you try to create and boot a new VM you get: “Insufficient resources to satisfy configured failover level for vSphere HA”.

admission_error

Cause

So what exactly has happened here. In my example a new VM with 4GHz of CPU and 4GB of RAM was deployed. Admission Control was set to its default “Host Failures Cluster Tolerates” policy. This policy uses slot sizes. Total amount of resources in the cluster is divided by the slot size (4GHz and 4GB in the above case) and then each VM (even if it doesn’t have a reservation) uses at least 1 slot. Once you configure a VM reservation, depending on the number of VMs in your cluster more often than not you get all slots being used straight away. As you can see based on the calculations I have 91 slots in the cluster, which have instantly been used by 165 running VMs.

slot_calculations

Solution

You can control the slot size manually and make it much smaller, such as 1GHz and 1GB of RAM. That way you’d have much more slots. The VM from my previous example would use four slots. And all other VMs which have no reservations would use less slots in total, because of a smaller slot size. But this process is manual and prone to error.

The better solution is to use “Percentage of Cluster Resources” policy, which is recommended for most environments. We’ll go over the main differences between the three available Admission Control policies after we discuss the second issue.

Issue #2: Not being able to enter Maintenance Mode

Description

It might be a corner case, but I still see it quite often. It’s when you have two hosts in a cluster (such as ROBO, DR or just a small environment) and try to put one host into maintenance mode.

The first issue you will encounter is that VMs are not automatically vMotion’ed to other hosts using DRS. You have to evacuate VMs manually.

And then once you move all VMs to the other host and put it into maintenance mode, you again can no longer power on VMs and get the same error: “Insufficient resources to satisfy configured failover level for vSphere HA”.

poweron_fail

Cause

This happens because disconnected hosts and hosts in maintenance mode are not used in Admission Control calculations. And one host is obviously not enough for failover, because if it fails, there are no other hosts to fail over to.

Solution

If you got caught up in such situation you can temporarily disable Admission Control all together until you finish maintenance. This is the reason why it’s often recommended to have at least 3 hosts in a cluster, but it can not always be justified if you have just a handful of VMs.

Alternatives to Slot Size Admission Control

There are another two Admission Control policies. First is “Specify a Failover Host”, which dedicates a host (or hosts) for failover. Such host acts as a hot standby and can run VMs only in a failover situation. This policy is ideal if you want to reserve failover resources.

And the second is “Percentage of Cluster Resources”. Resources under this policy are reserved based on the percentage of total cluster resources. If you have five hosts in your cluster you can reserve 20% of resources (which is equal to one host) for failover.

This policy uses percentage of cluster resources, instead of slot sizes, and hence doesn’t have the issues of the “Host Failures Cluster Tolerates” policy. There is a gotcha, if you add another five hosts to your cluster, you will need to change reservation to 10%, which is often overlooked.

Conclusion

“Percentage of Cluster Resources” policy is recommended to use in most cases to avoid issues with slot sizes. What is important to understand is that the goal of this policy is just to guarantee that VMs with reservations can be restarted in a host failure scenario.

If a VM has no reservations, then “Percentage of Cluster Resources” policy will use only memory overhead of this VM in its calculations. Which is probably the most confusing part about Admission Control in general. But that’s a topic for the next blog post.

 

ESXi Host Maintenance with Zerto

February 1, 2016

zerto2Zerto replication is quite easy to configure. Once you have a Zerto Virtual Manager (ZVM) and Virtual Replication Adaptors (VRA) up and running at both sites, you can start adding your virtual machines to replication. There is, however, one question which comes up a lot from the operations point of view. What if you have replication going between the sites and you need to put one of your ESXi hosts into maintenance mode, would that break the replication? The answer is as always – it depends.

Source Site

In Zerto you typically have VRAs installed on each of the hosts at both sites and traffic going one way – from Production data centre to DR. Now, if you want to do maintenance on one of the hosts where VMs are being replicated FROM (Production site) then all you need to do is vMotion VMs to the remaining hosts. Zerto fully supports vMotion and the process is seamless. When VMs are moved to other hosts, VRAs on these hosts automatically pick them up and replication continues without user’s intervention.

Destination Site

If you want to do maintenance on one of the hosts where VMs are being replicated TO (DR site), then this is where you need to be more careful. VMs replicated by Zerto are not shown in vCenter inventory and obviously can’t be moved using conventional vMotion method. This is done from ZVM’s GUI.

zerto_vra

In ZVM find the host you want to put into maintenance mode on the Setup tab and in the More drop-down menu select Change VM Recovery VRA. Select the replacement host where you want to redirect VM replication to and click Save. What this option does in Zerto is somewhat similar to what vMotion does in vSphere – it migrates VMs between VRAs.

Once you hit the button, VMs’ RPO will start to grow until the migration is finished. In my case for 12 VMs the process took about 5 minutes to complete. If you have dozens of protected VMs on each of the VRAs, it may take significantly longer. If it’s a concern, you may want to allocate a maintenance windows for this activity.

zerto_rpo

You will also get a warning that the migration will result in a bitmap-sync. Bitmap Sync tracks the changed blocks on a VM when replication to the destination VRA is interrupted. The amount of changed data over a 5 minute period should be reasonably small. And in my experience VMs get back in sync after a migration very quickly.

When all replicated VMs are moved to another recovery host, you can vMotion out any VMs you may have running on the host, shut down the VRA and put the host into maintenance mode to carry out the maintenance activities.

Once that’s finished, just do the reverse. Disable maintenance mode on the host, boot up the VRA and move back the migrated VMs. In the Change VM Recovery VRA dialogue you can select a completely different set of VMs to move back. As long as you keep them balanced between all VRAs in your cluster you should be good.

Force10 MXL: Initial Configuration

March 14, 2015

Continuing a series of posts on how to deal with Force10 MXL switches. This one is about VLANs, port channels, tagging and all the basic stuff. It’s not much different from other vendors like Cisco or HP. At the end of the day it’s the same networking standards.

If you want to match the terminology with Cisco for instance, then what you used to as EtherChannels is Port Channels on Force10. And trunk/access ports from Cisco are called tagged/untagged ports on Force10.

Configure Port Channels

If you are after dynamic LACP port channels (as opposed to static), then they are configured in two steps. First step is to create a port channel itself:

# conf t
# interface port-channel 1
# switchport
# no shutdown

And then you enable LACP on the interfaces you want to add to the port channel. I have a four switch stack and use 0/.., 1/.. type of syntax:

# conf t
# int range te0/51-52 , te1/51-52 , te2/51-52 , te3/51-52
# port-channel-protocol lacp
# port-channel 1 mode active

To check if the port channel has come up use this command. Port channel obviously won’t init if it’s not set up on the other side of the port channel as well.

# show int po1 brief

port_channel

Configure VLANs

Then you create your VLANs and add ports. Typically if you have vSphere hosts connected to the switch, you tag traffic on ESXi host level. So both your host ports and port channel will need to be added to VLANs as tagged. If you have any standalone non-virtualized servers – you’ll use untagged.

# conf t
# interface vlan 120
# description Management
# tagged Te0/1-4
# tagged Te2/1-4
# tagged Po1
# no shutdown
# copy run start

I have four hosts. Each host has a dual-port NIC which connects to two fabrics – switches 0 and 2 in the stack (1 port per fabric). I allow VLAN 120 traffic from these ports through the port channel to the upstream core switch.

You’ll most likely have more than one VLAN. At least one for Management and one for Production if it’s vSphere. But process for the rest is exactly the same.

The other switch

Just to give you a whole picture I’ll include the configuration of the switch on the other side of the trunk. I had a modular HP switch with 10Gb modules. A config for it would look like the following:

# conf t
# trunk I1-I8 trk1 lacp
# vlan 120 tagged trk1
# write mem

I1 to I8 here are ports, where I – is the module and 1 to 8 are ports within that module.

vSphere Dump / Syslog Collector: PowerCLI Script

March 12, 2015

Overview

If you install ESXi hosts on say 2GB flash cards in your blades which are smaller than required 6GB, then you won’t have what’s called persistent storage on your hosts. Both your kernel dumps and logs will be kept on RAM drive and deleted after a reboot. Which is less than ideal.

You can use vSphere Dump Collector and Syslog Collector to redirect them to another host. Usually vCenter machine, if it’s not an appliance.

If you have a bunch of ESXi hosts you’ll have to manually go through each one of them to set the settings, which might be a tedious task. Syslog can be done via Host Profiles, but Enterprise Plus licence is not a very common things across the customers. The simplest way is to use PowerCLI.

Amendments to the scripts

These scripts originate from Mike Laverick’s blog. I didn’t write them. Original blog post is here: Back To Basics: Installing Other Optional vCenter 5.5 Services.

The purpose of my post is to make a few corrections to the original Syslog script, as it has a few mistakes:

First – typo in system.syslog.config.set() statement. It requires additional $null argument before the hostname. If you run it as is you will probably get an error which looks like this.

Message: A specified parameter was not correct.
argument[0];
InnerText: argument[0]

Second – you need to open outgoing syslog ports, otherwise traffic won’t flow. It seems that Dump Collector traffic is enabled by default even though there is no rule for it in the firewall (former netDump rule doesn’t exist anymore). Odd, but that’s how it is. Syslog on the other hand requires explicit rule, which is reflected in the script by network.firewall.ruleset.set() command.

Below are the correct versions of both scripts. If you copy and paste them everything should just work.

vSphere Dump Collector

Foreach ($vmhost in (get-vmhost))
{
$esxcli = Get-EsxCli -vmhost $vmhost
$esxcli.system.coredump.network.get()
}

Foreach ($vmhost in (get-vmhost))
{
$esxcli = Get-EsxCli -vmhost $vmhost
$esxcli.system.coredump.network.set($null, “vmk0”, “10.0.0.1”, “6500”)
$esxcli.system.coredump.network.set($true)
}

vSphere Syslog Collector

Foreach ($vmhost in (get-vmhost))
{
$esxcli = Get-EsxCli -vmhost $vmhost
$esxcli.system.syslog.config.get()
}

Foreach ($vmhost in (get-vmhost))
{
$esxcli = Get-EsxCli -vmhost $vmhost
$esxcli.system.syslog.config.set($null, $null, $null, $null, $null, “udp://10.0.0.1:514”)
$esxcli.network.firewall.ruleset.set($null, $true, “syslog”)
$esxcli.system.syslog.reload()
}

Zerto Overview

March 6, 2014

zerto-logoZerto is a VM replication product which works on a hypervisor level. In contrast to array level replication, which SRM has been using for a long time, it eliminates storage array from the equation and all the complexities which used to come along with it (SRAs, splitting the LUNs for replicated and non-replicated VMs, potential incompatibilities between the orchestrated components, etc).

Basic Operation

Zerto consists of two components: ZVM (Zerto Virtual Manger) and VRA (Virtual Replication Appliance). VRAs are VMs that need to be installed on each ESXi host within the vCenter environment (performed in automated fashion from within ZVM console). ZVM manages VRAs and all the replication settings and is installed one per vCenter. VRA mirrors protected VMs I/O operations to the recovery site. VMs are grouped in VPGs (Virtual Protection Groups), which can be used as a consistency group or just a container.

Protected VMs can be preseeded  to DR site. But what Zerto essentially does is it replicates VM disks to any datastore on recovery site where you point it to and then tracks changes in what is called a journal volume. Journal is created for each VM and is kept as a VMDK within the “ZeRTO volumes” folder on a target datastore. Every few seconds Zerto creates checkpoints on a journal, which serve as crash consistent recovery points. So you can recover to any point in time, with a few seconds granularity. You can set the journal length in hours, depending on how far you potentially would want to go back. It can be anywhere between 1 and 120 hours.Data-Replication-over-WAN

VMs are kept unregistered from vCenter on DR site and VM configuration data is kept in Zerto repository. Which essentially means that if an outage happens and something goes really wrong and Zerto fails to bring up VMs on DR site you will need to recreate VMs manually. But since VMDKs themselves are kept in original format you will still be able to attach them to VMs and power them on.

Failover Scenarios

There are four failover scenarios within Zerto:

  • Move Operation – VMs are shut down on production site, unregistered from inventory, powered on at DR site and protection is reversed if you decide to do so. If you choose not to reverse protection, VMs are completely removed from production site and VPG is marked as “Needs Configuration”. This scenario can be seen as a planned migration of VMs between the sites and needs both sites to be healthy and operational.
  • Failover Operation – is used in disaster scenario when production site might be unavailable. In this case Zerto brings up protected VMs on DR site, but it does not try to remove VMs from production site inventory and leave them as is. If production site is still accessible you can optionally select to shutdown VMs. You cannot automatically reverse protection in this scenario, VPG is marked as “Needs Configuration” and can be activated later. And when it is activated, Zerto does all the clean up operations on the former production site: shuts down VMs (if they haven’t been already), unregister them from inventory and move to VRA folder on the datastore.
  • Failover Test Operation – this is for failover testing and brings up VMs on DR site in a configured bubble network which is normally not uplinked to any physical network. VMs continue to run on both sites. Note that VMs disk files in this scenario are not moved to VMs folders (as in two previous scenarios) and are just connected from VRA VM folder. You would also notice that Zerto created second journal volume which is called “scratch” journal. Changes to the VM that is running on DR site are saved to this journal while it’s being tested.
  • Clone Operation – VMs are cloned on DR site and connected to network. VMs are not automatically powered on to prevent potential network conflicts. This can be used for instance in DR site testing, when you want to check actual networking connectivity, instead of connecting VMs to an isolated network. Or for implementing backups, cloned environment for applications testing, etc.

Zerto Journal Sizing

By default journal history is configured as 4 hours and journal size is unlimited. Depending on data change rate within the VM journal can be smaller or larger. 15GB is approximately enough storage to support a virtual machine with 1TB of storage, assuming a 10% change rate per day with four hours of journal history saved. Zerto has a Journal Sizing Tool which helps to size journals. You can create a separate journal datastore as well.

Zerto compared to VMware Replication and SRM

There are several replication products in the market from VMware. Standalone VMware replication, VMware replication + SRM orchestraion and SRM array-based replication. If you want to know more on how they compare to Zerto, you can read the articles mentioned in references below. One apparent Zerto advantage, which I want to mention here, is integration with vCloud Director, which is essential for cloud providers who offer DRaaS solutions. SRM has no vCloud Director support.

References