Posts Tagged ‘CPU’

How Admission Control Really Works

May 2, 2016

confusionThere is a moment in every vSphere admin’s life when he faces vSphere Admission Control. Quite often this moment is not the most pleasant one. In one of my previous posts I talked about some of the common issues that Admission Control may cause and how to avoid them. And quite frankly Admission Control seems to do more harm than good in most vSphere environments.

Admission Control is a vSphere feature that is built to make sure that VMs with reservations can be restarted in a cluster if one of the cluster hosts fails. “Reservations” is the key word here. There is a common belief that Admission Control protects all other VMs as well, but that’s not true.

Let me go through all three vSphere Admission Control policies and explain why you’re better of disabling Admission Control altogether, as all of these policies give you little to no benefit.

Host failures cluster tolerates

This policy is the default when you deploy a vSphere cluster and policy which causes the most issues. “Host failures cluster tolerates” uses slots to determine if a VM is allowed to be powered on in a cluster. Depending on whether VM has CPU and memory reservations configured it can use one or more slots.

Slot Size

To determine the total number of slots for a cluster, Admission Control uses slot size. Slot size is either the default 32MHz and 128MB of RAM (for vSphere 6) or if you have VMs in the cluster configured with reservations, then the slot size will be calculated based on the maximum CPU/memory reservation. So say if you have 100 VMs, 98 of which have no reservations, one VM has 2 vCPUs and 8GB of memory reserved and another VM has 4 vCPUs and 4GB of memory reserved, then the slot size will jump from 32MHz / 128MB to 4 vCPUs / 8GB of memory. If you have 2.0 GHz CPUs on your hosts, the 4 vCPU reservation will be an equivalent of 8.0 GHz.

Total Number of Slots

Now that we know the slot size, which happens to be 8.0 GHz and 8GB of memory, we can calculate the total number of slots in the cluster. If you have 2 x 8 core CPUs and 256GB of RAM in each of 4 ESXi hosts, then your total amount of resources is 16 cores x 2.0 GHz x 4 hosts = 128 GHz and 256GB x 4 hosts = 1TB of RAM. If your slot size is 4 vCPUs and 8GB of RAM, you get 64 vCPUs / 4 vCPUs = 16 slots (you’ll get more for memory, but the least common denominator has to be used).

total_slots

Practical Use

Now if you configure to tolerate one host failure, you have to subtract four slots from the total number. Every VM, even if it doesn’t have reservations takes up one slot. And as a result you can power on maximum 12 VMs on your cluster. How does that sound?

Such incredibly restrictive behaviour is the reason why almost no one uses it in production. Unless it’s left there by default. You can manually change the slot size, but I have no knowledge of an approach one would use to determine the slot size. That’s the policy number one.

Percentage of cluster resources reserved as failover spare capacity

This is the second policy, which is commonly recommended by most to use instead of the restrictive “Host failures cluster tolerates”. This policy uses percentage-based instead of the slot-based admission.

It’s much more straightforward, you simply specify the percentage of resources you want to reserve. For example if you have four hosts in a cluster the common belief is that if you specify 25% of CPU and memory, they’ll be reserved to restart VMs in case one of the hosts fail. But it won’t. Here’s the reason why.

When calculating amount of free resources in a cluster, Admission Control takes into account only VM reservations and memory overhead. If you have no VMs with reservations in your cluster then HA will be showing close to 99% of free resources even if you’re running 200 VMs.

failover_capacity

For instance, if all of your VMs have 4 vCPUs and 8GB of RAM, then memory overhead would be 60.67MB per VM. For 300 VMs it’s roughly 18GB. If you have two VMs with reservations, say one VM with 2 vCPUs / 4GB of RAM and another VM with 4 vCPUs / 2GB of RAM, then you’ll need to add up your reservations as well.

So if we consider memory, it’s 18GB + 4GB + 2GB = 24GB. If you have the total of 1TB of RAM in your cluster, Admission Control will consider 97% of your memory resources being free.

For such approach to work you’d need to configure reservations on 100% of your VMs. Which obviously no one would do. So that’s the policy number two.

Specify failover hosts

This is the third policy, which typically is the least recommended, because it dedicates a host (or multiple hosts) specifically just for failover. You cannot run VMs on such hosts. If you try to vMotion a VM to it, you’ll get an error.

failover_host

In my opinion, this policy would actually be the most useful for reserving cluster resources. You want to have N+1 redundancy, then reserve it. This policy does exactly that.

Conclusion

When it comes to vSphere Admission Control, everyone knows that “Host failures cluster tolerates” policy uses slot-based admission and is better to be avoided.

There’s a common misconception, though, that “Percentage of cluster resources reserved as failover spare capacity” is more useful and can reserve CPU and memory capacity for host failover. But in reality it’ll let you run as many VMs as you want and utilize all of your cluster resources, except for the tiny amount of CPU and memory for a handful of VMs with reservations you may have in your environment.

If you want to reserve failover capacity in your cluster, either use “Specify failover hosts” policy or simply disable Admission Control and keep an eye on your cluster resource utilization manually (or using vROps) to make sure you always have room for growth.

Advertisement

Implications of Ignoring vSphere Admission Control

April 5, 2016

no-admissionHA Admission Control has historically been on of the lesser understood vSphere topics. It’s not intuitive how it works and what it does. As a result it’s left configured with default values in most vSphere environments. But default Admission Control setting are very restrictive and can often cause issues.

In this blog post I want to share the two most common issues with vSphere Admission Control and solutions to these issues.

Issue #1: Not being able to start a VM

Description

Probably the most common issue everyone encounters with Admission Control is when you suddenly cannot power on VMs any more. There are multiple reasons why that might happen, but most likely you’ve just configured a reservation on one of your VMs or deployed a VM from an OVA template with a pre-configured reservation. This has triggered a change in Admission Control slot size and based on the new slot size you no longer have enough slots to satisfy failover requirements.

As a result you get the following alarm in vCenter: “Insufficient vSphere HA failover resources”. And when you try to create and boot a new VM you get: “Insufficient resources to satisfy configured failover level for vSphere HA”.

admission_error

Cause

So what exactly has happened here. In my example a new VM with 4GHz of CPU and 4GB of RAM was deployed. Admission Control was set to its default “Host Failures Cluster Tolerates” policy. This policy uses slot sizes. Total amount of resources in the cluster is divided by the slot size (4GHz and 4GB in the above case) and then each VM (even if it doesn’t have a reservation) uses at least 1 slot. Once you configure a VM reservation, depending on the number of VMs in your cluster more often than not you get all slots being used straight away. As you can see based on the calculations I have 91 slots in the cluster, which have instantly been used by 165 running VMs.

slot_calculations

Solution

You can control the slot size manually and make it much smaller, such as 1GHz and 1GB of RAM. That way you’d have much more slots. The VM from my previous example would use four slots. And all other VMs which have no reservations would use less slots in total, because of a smaller slot size. But this process is manual and prone to error.

The better solution is to use “Percentage of Cluster Resources” policy, which is recommended for most environments. We’ll go over the main differences between the three available Admission Control policies after we discuss the second issue.

Issue #2: Not being able to enter Maintenance Mode

Description

It might be a corner case, but I still see it quite often. It’s when you have two hosts in a cluster (such as ROBO, DR or just a small environment) and try to put one host into maintenance mode.

The first issue you will encounter is that VMs are not automatically vMotion’ed to other hosts using DRS. You have to evacuate VMs manually.

And then once you move all VMs to the other host and put it into maintenance mode, you again can no longer power on VMs and get the same error: “Insufficient resources to satisfy configured failover level for vSphere HA”.

poweron_fail

Cause

This happens because disconnected hosts and hosts in maintenance mode are not used in Admission Control calculations. And one host is obviously not enough for failover, because if it fails, there are no other hosts to fail over to.

Solution

If you got caught up in such situation you can temporarily disable Admission Control all together until you finish maintenance. This is the reason why it’s often recommended to have at least 3 hosts in a cluster, but it can not always be justified if you have just a handful of VMs.

Alternatives to Slot Size Admission Control

There are another two Admission Control policies. First is “Specify a Failover Host”, which dedicates a host (or hosts) for failover. Such host acts as a hot standby and can run VMs only in a failover situation. This policy is ideal if you want to reserve failover resources.

And the second is “Percentage of Cluster Resources”. Resources under this policy are reserved based on the percentage of total cluster resources. If you have five hosts in your cluster you can reserve 20% of resources (which is equal to one host) for failover.

This policy uses percentage of cluster resources, instead of slot sizes, and hence doesn’t have the issues of the “Host Failures Cluster Tolerates” policy. There is a gotcha, if you add another five hosts to your cluster, you will need to change reservation to 10%, which is often overlooked.

Conclusion

“Percentage of Cluster Resources” policy is recommended to use in most cases to avoid issues with slot sizes. What is important to understand is that the goal of this policy is just to guarantee that VMs with reservations can be restarted in a host failure scenario.

If a VM has no reservations, then “Percentage of Cluster Resources” policy will use only memory overhead of this VM in its calculations. Which is probably the most confusing part about Admission Control in general. But that’s a topic for the next blog post.

 

How to move aggregates between NetApp controllers

September 25, 2013

Stop Sign_91602

 

DISCLAMER: I ACCEPT NO RESPONSIBILITY FOR ANY DAMAGE OR CORRUPTION OF DATA THAT MAY OCCUR AS A RESULT OF CARRYING OUT STEPS DESCRIBED BELOW. YOU DO THIS AT YOUR OWN RISK.

 

We had an issue with high CPU usage on one of the NetApp controllers servicing a couple of NFS datastores to VMware ESX cluster. HA pair of FAS2050 had two shelves, both of them owned by the first controller. The obvious solution for us was to reassign disks from one of the shelves to the other controller to balance the load. But how do you do this non-disruptively? Here is the plan.

In our setup we had two controllers (filer1, filer2), two shelves (shelf1, shelf2) both assigned to filer1. And two aggregates, each on its own shelf (aggr0 on shelf0, aggr1 on shelf1). Say, we want to reassign disks from shelf2 to filer2.

First step is to migrate all of the VMs from the shelf2 to shelf1. Because operation is obviously disruptive to the hosts accessing data from the target shelf. Once all VMs are evacuated, offline all volumes and an aggregate, to prevent any data corruption (you can’t take aggregate offline from online state, so change it to restricted first).

If you prefer to reassign disks in two steps, as described in NetApp Professional Services Tech Note #021: Changing Disk Ownership, don’t forget to disable automatic ownership assignment on both controllers, otherwise disks will be assigned back to the same controller again, right after you unown them:

> options disk.auto_assign off

It’s not necessary if you change ownership in one step as shown below.

Next step is to actually reassign the disks. Since they are already part of an aggregate you will need to force the ownership change:

filer1> disk assign 1b.01.00 -o filer2 -f

filer1> disk assign 1b.01.01 -o filer2 -f

filer1> disk assign 1b.01.nn -o filer2 -f

If you do not force disk reassignment you will get an error:

Assign request failed for disk 1b.01.0. Reason:Disk is part of a failed or offline aggregate or volume. Changing its owner may prevent aggregate or volume from coming back online. Ownership may be changed only by using the appropriate force option.

When all disks are moved across to filer2, new aggregate will show up in the list of aggregates on filer2 and you’ll be able to bring it online. If you can’t see the aggregate, force filer to rescan the drives by running:

filer2> disk show

The old aggregate will still be seen in the list on filer1. You can safely remove it:

filer1> aggr destroy aggr1

NetApp SnapMirror Optimization

May 31, 2013

gzipSnapMirroring to disaster recovery site requires huge amount of data to be transferred over the WAN link. In some cases replication can significantly lag from the defined schedule. There are two ways to reduce the amount of traffic and speed up replication: deduplication and compression.

If you apply deduplication to the replicated volumes, you simply reduce the amount of data you need to be transferred. You can read how to enable deduplication in my previous post.

Compression is a less known feature of SnapMirror. What it does is compression of the data being transferred on the source and decompression on the destination. Data inside the volume is left intact.

To enable SnapMirror compression you first need to make sure, that all your connections in snapmirror.conf file have names, like:

connection_name=multi(src_system,dst_system)

Then use ‘compression=enable’ configuration option to enable it for particular SnapMirror:

connection_name:src_vol dst_system:dst_vol compression=enable 0 2 * *

To check the compression ration after the transfer has been finished run:

> snapmirror status -l

And look at ‘Compression Ratio’ line:

Source: fas1:src
Destination: fas2:dest
Status: Transferring
Progress: 24 KB
Compression Ratio: 3.5 : 1

The one drawback of compression is an increased CPU load. Monitor your CPU load and if it’s too high, use compression selectively.

NetApp Deduplication in a Nutshell

May 12, 2013

NetApp-Dedupe2NetApp uses Write Anywhere File Layout (WAFL) filesystem which is a key for NetApp’s efficient snapshot technology. If you’re already familiar with how snapshots are implemented in Data ONTAP, then understanding underlying mechanisms of deduplication is simple. Filer calculates hash for each data block it receives and preserves this in the form of metadata on the volume level. Then according to deduplication schedule (usually on weekends), filer runs metadata processing and for each duplicate hash changes data pointer to the original block of data.

Since for each data block filer needs to calculate hash on the fly, it has its penalty. On systems with CPU loaded by less than 50% performance impact is negligible. For heavily loaded systems, where CPU is nearly 100% busy, performance impact can be around 15%. For high-end 6000 systems penalty can jump up to 35% for random writes. Heavy sequential read operations can also suffer from deduplication, because read operations can be rerouted in random way across physical storage, depending on where the original data block actually is. In general, deduplication has low impact on system performance. But you can’t use it blindly and should keep in mind that in particular cases it can slow down your storage system.

Deduplication configuration is pretty simple. First of all, you need to activate deduplication on particular volume:

> sis on “targetvol”

If you already have data on the volume, you need to scan it. Otherwise, it won’t be deduped. It’s a common mistake. Deduplication is a low priority task, but keep in mind, however, that it can slightly impact your storage performance when done during business hours, especially if you run deduplication for several volumes simultaneously.

> sis start -s “targetvol”

To show the status of deduplication for particular volume:

> sis status “targetvol”

To see the deduplication schedule:

> sis config “targetvol”

And the most pleasant command to find out how much data you’ve saved:

> df -s “targetvol”

If you want to undo deduplication, first switch it off and then undo it using the following commands:

> sis off “targetvol”
> priv set advanced
*> sis undo “targetvol”
*> priv set

I, personally, was able to achieve 40% deduplication rate for VMware VMFS datastores, which is rather impressive, considering these were mixed OS + application data LUNs.

As a final note, I would like to point out, that deduplication is suitable only for environments with high percentage of similar data. VMware is a good example of it. You won’t get any significant deduplication ratio for swap file volumes, Exchnage mailboxes or Symantec Enterprise Vault which is already deduped.

Further reading:

TR-3958: NetApp Data Compression and Deduplication Deployment and Implementation Guide: Data ONTAP 8.1 Operating in 7-Mode

Jumbo Frames justified?

March 27, 2012

When it comes to VMware on NetApp, boosting  performance by implementing Jumbo Frames is always taken into consideration. However, it’s not clear if it really has any significant impact on latency and throughput.

Officially VMware doesn’t support Jumbo Frames for NAS and iSCSI. It means that using Jumbo Frames to transfer storage traffic from VMkernel interface to your storage system is the solution which is not tested by VMware, however, it actually works. To use Jumbo Frames you need to activate them throughout the whole communication path: OS, virtual NIC (change to Enchanced vmxnet from E1000), Virtual Switch and VMkernel, physical ethernet switch and storage. It’s a lot of work to do and it’s disruptive at some points, which is not a good idea for production infrastructure. So I decided to take a look at benchmarks, before deciding to spend a great amount of time and effort on it.

VMware and NetApp has a TR-3808-0110 technical report which is called “VMware vSphere and ESX 3.5 Multiprotocol Performance Comparison Using FC, iSCSI, and NFS”. Section 2.2 clearly states that:

  • Using NFS with jumbo frames enabled using both Gigabit and 10GbE generated overall performance that was comparable to that observed using NFS without jumbo frames and required approximately 6% to 20% fewer ESX CPU resources compared to using NFS without jumbo frames, depending on the test configuration.
  • Using iSCSI with jumbo frames enabled using both Gigabit and 10GbE generated overall performance that was comparable to slightly lower than that observed using iSCSI without jumbo and required approximately 12% to 20% fewer ESX CPU resources compared to using iSCSI without jumbo frames depending on the test configuration.
Another important statement here is:
  • Due to the smaller request sizes used in the workloads, it was not expected that enabling jumbo frames would improve overall performance.

I believe that 4K and 8K packet sizes are fair in case of virtual infrastructure. Maybe if you move large amounts of data through your virtual machines it will make sense for you, but I feel like it’s not reasonable to implement Jumbo Frames for virual infrastructure in general.

The another report finding is that Jumbo Frames decrease CPU load, but if you use TOE NICs, then no sense once again.

VMware supports jumbo frames with the following NICs: Intel (82546, 82571), Broadcom (5708, 5706, 5709), Netxen (NXB-10GXxR, NXB-10GCX4), and Neterion (Xframe, Xframe II, Xframe E). We use Broadcom NetXtreme II BCM5708 and Intel 82571EB, so Jumbo Frames implementation is not going to be a problem. Maybe I’ll try to test it by myself when I’ll have some free time.

Links I found useful: