Posts Tagged ‘virtual machine’

How Admission Control Really Works

May 2, 2016

There is a moment in every vSphere admin’s life when they face vSphere Admission Control. Quite often this moment is not the most pleasant one. In one of my previous posts I talked about some of the common issues that Admission Control may cause and how to avoid them. Quite frankly, Admission Control seems to do more harm than good in most vSphere environments.

Admission Control is a vSphere feature that is built to make sure that VMs with reservations can be restarted in a cluster if one of the cluster hosts fails. “Reservations” is the key word here. There is a common belief that Admission Control protects all other VMs as well, but that’s not true.

Let me go through all three vSphere Admission Control policies and explain why you’re better off disabling Admission Control altogether, as all of these policies give you little to no benefit.

Host failures cluster tolerates

This policy is the default when you deploy a vSphere cluster, and it is the policy that causes the most issues. “Host failures cluster tolerates” uses slots to determine if a VM is allowed to be powered on in a cluster. Depending on whether a VM has CPU and memory reservations configured, it can use one or more slots.

Slot Size

To determine the total number of slots for a cluster, Admission Control uses the slot size. The slot size is either the default 32MHz and 128MB of RAM (for vSphere 6) or, if you have VMs in the cluster configured with reservations, it is calculated from the largest CPU reservation and the largest memory reservation. Say you have 100 VMs, 98 of which have no reservations, one VM has 2 vCPUs and 8GB of memory reserved and another VM has 4 vCPUs and 4GB of memory reserved. Then the slot size will jump from 32MHz / 128MB to 4 vCPUs / 8GB of memory. If you have 2.0 GHz CPUs on your hosts, the 4 vCPU reservation will be the equivalent of 8.0 GHz.

Total Number of Slots

Now that we know the slot size, which happens to be 8.0 GHz and 8GB of memory, we can calculate the total number of slots in the cluster. If you have 2 x 8 core CPUs and 256GB of RAM in each of 4 ESXi hosts, then your total amount of resources is 16 cores x 2.0 GHz x 4 hosts = 128 GHz and 256GB x 4 hosts = 1TB of RAM. If your slot size is 4 vCPUs and 8GB of RAM, you get 64 vCPUs / 4 vCPUs = 16 slots (memory alone would give 1TB / 8GB = 128 slots, but the lower of the two numbers has to be used).

Practical Use

Now if you configure the cluster to tolerate one host failure, you have to subtract four slots (one host’s worth) from the total number. Every VM, even if it doesn’t have reservations, takes up at least one slot. As a result, you can power on a maximum of 12 VMs in your cluster. How does that sound?
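The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation as described in this post, assuming identical hosts; the function name and parameters are made up for illustration and it is not how vSphere HA is implemented internally.

```python
def slots_available(hosts, cores_per_host, core_ghz, ram_gb_per_host,
                    slot_cpu_ghz, slot_ram_gb, host_failures=1):
    """Return (total_slots, usable_slots) for a cluster of identical hosts.

    Hypothetical helper: assumes every host has the same CPU and RAM.
    """
    cpu_slots_per_host = int(cores_per_host * core_ghz // slot_cpu_ghz)
    ram_slots_per_host = int(ram_gb_per_host // slot_ram_gb)
    # The more constrained resource decides how many slots a host provides.
    slots_per_host = min(cpu_slots_per_host, ram_slots_per_host)
    total = slots_per_host * hosts
    # Slots of the hosts we want to be able to lose are held back for failover.
    usable = slots_per_host * (hosts - host_failures)
    return total, usable


total, usable = slots_available(hosts=4, cores_per_host=16, core_ghz=2.0,
                                ram_gb_per_host=256,
                                slot_cpu_ghz=8.0, slot_ram_gb=8)
print(total, usable)  # 16 slots in total, 12 left for powering on VMs
```

With the numbers from the example, CPU is the limiting resource: 4 slots per host against 32 for memory.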

Such incredibly restrictive behaviour is the reason why almost no one uses it in production. Unless it’s left there by default. You can manually change the slot size, but I have no knowledge of an approach one would use to determine the slot size. That’s the policy number one.

Percentage of cluster resources reserved as failover spare capacity

This is the second policy, which is commonly recommended as a replacement for the restrictive “Host failures cluster tolerates”. This policy uses percentage-based instead of slot-based admission.

It’s much more straightforward: you simply specify the percentage of resources you want to reserve. For example, if you have four hosts in a cluster, the common belief is that if you specify 25% of CPU and memory, they’ll be reserved to restart VMs in case one of the hosts fails. But they won’t. Here’s the reason why.

When calculating the amount of free resources in a cluster, Admission Control takes into account only VM reservations and memory overhead. If you have no VMs with reservations in your cluster, then HA will show close to 99% of resources as free even if you’re running 200 VMs.

For instance, if all of your VMs have 4 vCPUs and 8GB of RAM, then memory overhead would be 60.67MB per VM. For 300 VMs it’s roughly 18GB. If you have two VMs with reservations, say one VM with 2 vCPUs / 4GB of RAM and another VM with 4 vCPUs / 2GB of RAM, then you’ll need to add up your reservations as well.

So if we consider memory, it’s 18GB + 4GB + 2GB = 24GB. If you have a total of 1TB of RAM in your cluster, Admission Control will consider roughly 97% of your memory resources to be free.
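The same memory calculation as a small Python sketch. It only mirrors the arithmetic in this post (reservations plus per-VM memory overhead against total cluster memory); the function name is hypothetical and the overhead figure is the one quoted above.

```python
def free_memory_percent(total_gb, vm_count, overhead_mb_per_vm, reservations_gb):
    """Share of cluster memory that percentage-based admission sees as free.

    Hypothetical helper: counts only memory overhead and explicit reservations,
    which is all the policy looks at according to the text above.
    """
    overhead_gb = vm_count * overhead_mb_per_vm / 1024
    used_gb = overhead_gb + sum(reservations_gb)
    return 100 * (1 - used_gb / total_gb)


free = free_memory_percent(total_gb=1024, vm_count=300,
                           overhead_mb_per_vm=60.67,
                           reservations_gb=[4, 2])
print(f"{free:.1f}% of cluster memory is counted as free")
```

Even with 300 running VMs the result stays well above 97%, which is why a 25% failover setting never kicks in.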

For such an approach to work you’d need to configure reservations on 100% of your VMs, which obviously no one would do. So that’s policy number two.

Specify failover hosts

This is the third policy, which typically is the least recommended, because it dedicates a host (or multiple hosts) specifically just for failover. You cannot run VMs on such hosts. If you try to vMotion a VM to it, you’ll get an error.

In my opinion, this policy would actually be the most useful for reserving cluster resources. You want to have N+1 redundancy, then reserve it. This policy does exactly that.

Conclusion

When it comes to vSphere Admission Control, everyone knows that the “Host failures cluster tolerates” policy uses slot-based admission and is best avoided.

There’s a common misconception, though, that “Percentage of cluster resources reserved as failover spare capacity” is more useful and can reserve CPU and memory capacity for host failover. But in reality it’ll let you run as many VMs as you want and utilize all of your cluster resources, except for the tiny amount of CPU and memory for a handful of VMs with reservations you may have in your environment.

If you want to reserve failover capacity in your cluster, either use “Specify failover hosts” policy or simply disable Admission Control and keep an eye on your cluster resource utilization manually (or using vROps) to make sure you always have room for growth.

Implications of Ignoring vSphere Admission Control

April 5, 2016

HA Admission Control has historically been one of the lesser understood vSphere topics. It’s not intuitive how it works and what it does. As a result it’s left configured with default values in most vSphere environments. But the default Admission Control settings are very restrictive and can often cause issues.

In this blog post I want to share the two most common issues with vSphere Admission Control and solutions to these issues.

Issue #1: Not being able to start a VM

Description

Probably the most common issue everyone encounters with Admission Control is when you suddenly cannot power on VMs any more. There are multiple reasons why that might happen, but most likely you’ve just configured a reservation on one of your VMs or deployed a VM from an OVA template with a pre-configured reservation. This has triggered a change in Admission Control slot size and based on the new slot size you no longer have enough slots to satisfy failover requirements.

As a result you get the following alarm in vCenter: “Insufficient vSphere HA failover resources”. And when you try to create and boot a new VM you get: “Insufficient resources to satisfy configured failover level for vSphere HA”.

Cause

So what exactly happened here? In my example a new VM with a reservation of 4GHz of CPU and 4GB of RAM was deployed. Admission Control was set to its default “Host Failures Cluster Tolerates” policy. This policy uses slot sizes. The total amount of resources in the cluster is divided by the slot size (4GHz and 4GB in the above case) and then each VM (even if it doesn’t have a reservation) uses at least 1 slot. Once you configure a VM reservation, depending on the number of VMs in your cluster, more often than not all slots are used up straight away. Based on the calculations, I have 91 slots in the cluster, all of which were instantly used up by the 165 running VMs.

Solution

You can control the slot size manually and make it much smaller, such as 1GHz and 1GB of RAM. That way you’d have many more slots. The VM from my previous example would use four slots. And all other VMs, which have no reservations, would use fewer resources in total because of the smaller slot size. But this process is manual and prone to error.
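To see how a manual slot size changes the number of slots a single VM consumes, here is a minimal sketch. The helper name is made up; the logic simply follows the description above: a VM needs enough slots to cover its larger reservation, and never fewer than one.

```python
import math


def slots_for_vm(cpu_res_ghz, ram_res_gb, slot_cpu_ghz=1.0, slot_ram_gb=1.0):
    """Number of slots a VM occupies for a given (manually set) slot size."""
    cpu_slots = math.ceil(cpu_res_ghz / slot_cpu_ghz)
    ram_slots = math.ceil(ram_res_gb / slot_ram_gb)
    # A VM without reservations still takes one slot.
    return max(1, cpu_slots, ram_slots)


print(slots_for_vm(4, 4))  # VM with 4GHz / 4GB reserved, 1GHz / 1GB slots
print(slots_for_vm(0, 0))  # VM without reservations
```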

The better solution is to use “Percentage of Cluster Resources” policy, which is recommended for most environments. We’ll go over the main differences between the three available Admission Control policies after we discuss the second issue.

Issue #2: Not being able to enter Maintenance Mode

Description

It might be a corner case, but I still see it quite often. It’s when you have two hosts in a cluster (such as ROBO, DR or just a small environment) and try to put one host into maintenance mode.

The first issue you will encounter is that VMs are not automatically vMotion’ed to other hosts using DRS. You have to evacuate VMs manually.

And then once you move all VMs to the other host and put it into maintenance mode, you again can no longer power on VMs and get the same error: “Insufficient resources to satisfy configured failover level for vSphere HA”.

Cause

This happens because disconnected hosts and hosts in maintenance mode are not used in Admission Control calculations. And one host is obviously not enough for failover, because if it fails, there are no other hosts to fail over to.

Solution

If you get caught in such a situation, you can temporarily disable Admission Control altogether until you finish maintenance. This is the reason why it’s often recommended to have at least 3 hosts in a cluster, but that cannot always be justified if you have just a handful of VMs.

Alternatives to Slot Size Admission Control

There are two other Admission Control policies. The first is “Specify a Failover Host”, which dedicates a host (or hosts) for failover. Such a host acts as a hot standby and can run VMs only in a failover situation. This policy is ideal if you want to reserve failover resources.

And the second is “Percentage of Cluster Resources”. Resources under this policy are reserved based on the percentage of total cluster resources. If you have five hosts in your cluster you can reserve 20% of resources (which is equal to one host) for failover.

This policy uses a percentage of cluster resources instead of slot sizes, and hence doesn’t have the issues of the “Host Failures Cluster Tolerates” policy. There is a gotcha: if you add another five hosts to your cluster, you will need to change the reservation to 10%, which is often overlooked.
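The percentage that corresponds to a given number of host failures is easy to recompute whenever the cluster grows. A tiny sketch, assuming equally sized hosts (the function name is hypothetical):

```python
def failover_percent(hosts, host_failures=1):
    """Percentage of cluster resources equal to `host_failures` hosts."""
    if hosts <= host_failures:
        raise ValueError("need more hosts than tolerated failures")
    return 100 * host_failures / hosts


print(failover_percent(5))   # five hosts, one failure
print(failover_percent(10))  # after adding another five hosts
```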

Conclusion

The “Percentage of Cluster Resources” policy is the recommended choice in most cases because it avoids the issues with slot sizes. What is important to understand is that the goal of this policy is just to guarantee that VMs with reservations can be restarted in a host failure scenario.

If a VM has no reservations, then “Percentage of Cluster Resources” policy will use only memory overhead of this VM in its calculations. Which is probably the most confusing part about Admission Control in general. But that’s a topic for the next blog post.

 

NetApp VSC Single File Restore Explained

August 5, 2013

In one of my previous posts I spoke about three basic types of NetApp Virtual Storage Console restores: datastore restore, VM restore and backup mount. The last and least used feature, but a very underrated one, is Single File Restore (SFR), which lets you restore single files from VM backups. You can do the same thing by mounting the backup, connecting the vmdk to a VM and restoring the files. But SFR is a more convenient way to do this.

Workflow

SFR is pretty much an out-of-the-box feature and is installed with VSC. When you create an SFR session, you specify an email address, to which VSC sends an .sfr file and a link to Restore Agent. Restore Agent is a separate application which you install in the VM you want to restore files to (the destination VM). You load the .sfr file into Restore Agent and from there you are able to mount the source VM’s .vmdks and map them to the OS.

VSC uses the same LUN cloning feature here. When you click “Mount” in Restore Agent – LUN is cloned, mapped to an ESX host and disk is connected to VM on the fly. You copy all the data you want, then click “Dismount” and LUN clone is destroyed.

Restore Types

There are two types of SFR restores: Self-Service and Limited Self-Service. The only difference between them is that when you create a Self-Service session, the user can choose the backup. With Limited Self-Service, the backup is chosen by the admin during creation of the SFR session. The latter is used when the destination VM doesn’t have a connection to the SMVI server, which means that Restore Agent cannot communicate with SMVI and control the mount process. For the same reason, the LUN clone is deleted only when you delete the SFR session and not when you dismount all .vmdks.

There is another restore type mentioned in NetApp documentation, which is called Administrator Assisted restore. It’s hard to say what NetApp means by that. I think its workflow is the same as for Self-Service, but the administrator sends the .sfr link to himself and does the whole job. It brings a bit of confusion, because there is an “Admin Assisted” column on the SFR setup tab. What it actually does, I believe, is this: when a Port Group is configured as Admin Assisted, it forces SFR to create a Limited Self-Service session every time you create an SFR job. You won’t have an option to choose Self-Service at all. So if you have port groups that don’t have connectivity to VSC, check the Admin Assisted option next to them.

Notes

Keep in mind that SFR doesn’t support VMs with IDE drives. If you try to create an SFR session for VMs which have IDE virtual hard drives connected, you will see all sorts of errors.

Monitoring ESX Storage Queues

July 30, 2013

Queue Limits

I/O data goes through several storage queues on its way to disk drives. VMware is responsible for the VM queue, LUN queue and HBA queue. VM and LUN queues are usually 32 operations deep. It means that each ESX host can have no more than 32 active operations to a LUN at any moment. The same is true for VMs: each VM can have as many as 32 active operations to a datastore. And if multiple VMs share the same datastore, their combined I/O flow can’t go over the 32 operations limit (the per-LUN queue for QLogic HBAs was increased from 32 to 64 operations in vSphere 5). The HBA queue is much bigger and can hold several thousand operations (4096 for QLogic, although I can see in my config that the driver is configured with 1014 operations).

Queue Monitoring

You can monitor storage queues of ESX host from the console. Run “esxtop”, press “d” to view disk adapter stats, then press “f” to open fields selection and add Queue Stats by pressing “d”.

AQLEN column will show the queue depth of the storage adapter. CMDS/s is the real-time number of IOPS. DAVG is the latency which comes from the frame traversing through the “driver – HBA – fabric – array SP” path and should be less than 20ms. Otherwise it means that storage is not coping. KAVG shows the time which operation spent in hypervisor kernel queue and should be less than 2ms.

Press “u” to see disk device statistics. Press “f” to open the add or remove fields dialog and select Queue Stats “f”. Here you’ll see the number of active (ACTV) and queued (QUED) operations per LUN. %USD is the queue load. If you’re hitting 100 in %USD and see operations under the QUED column, then again it means that your storage cannot manage the load and you need to redistribute your workload between spindles.
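The rules of thumb above can be captured in a small checker, for example to evaluate values you have copied out of esxtop by hand. This is just a sketch: the function and its arguments are hypothetical, and the thresholds are the ones given in this post rather than official limits.

```python
def storage_warnings(davg_ms, kavg_ms, used_pct, queued):
    """Return a list of warnings for one set of esxtop readings."""
    warnings = []
    if davg_ms >= 20:
        # Device latency: driver, HBA, fabric and array SP.
        warnings.append("DAVG is 20ms or more: storage is not coping")
    if kavg_ms >= 2:
        # Time spent in the hypervisor kernel queue.
        warnings.append("KAVG is 2ms or more: commands wait in the kernel queue")
    if used_pct >= 100 and queued > 0:
        warnings.append("LUN queue is full and commands are queued: "
                        "redistribute the workload")
    return warnings


for line in storage_warnings(davg_ms=25, kavg_ms=3, used_pct=100, queued=4):
    print(line)
```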

Mounting VMware Virtual Disks

June 11, 2013

There are millions of posts on this topic all over the Internet. This is just another repetition, mostly for myself.

VMware has Virtual Disk Development Kit (VDDK) which is more of an API for backup software vendors. But it includes a handy tool called vmware-mount, which gives you an ability to mount VMware virtual disks (.vmdk) from wherever you want.

Download VDDK from VMware site. It’s free. And then run vmware-mount with the following keys:

> vmware-mount driveletter: "[vmfs_datastore] vmname/diskname.vmdk" /i:"datacentername/vm/vmname" /h:vcname /u:username /s:password

Choose drive letter, specify vmdk path, inventory path to VM (put ‘vm’ in lowercase between datacenter and vm name, upper case will give you an error) and vCenter or ESXi host name.

Note however, that you can mount only vmdks from powered off VMs. But there is a workaround. You can mount vmdk from online VMs in read-only mode if you make a VM snapshot. Then the original vmdk won’t be locked by ESXi server and you will be able to mount it.

To unmount a vmdk run:

> vmware-mount driveletter: /d

There are also several GUI tools to mount vmdks. But vmware-mount is enough for me.

Unexpected Deduplication Impact on VMware I/O Latency

May 28, 2013

NetApp deduplication is a postponed process. During normal operation Data ONTAP only calculates hashes for the data blocks. Actual deduplication is carried out off-hours as per configured schedule. Hash calculation doesn’t affect performance in most cases. I talked about that in my previous post. NetApp states in its documentation that deduplication is a low-priority process:

When one deduplication process is running, there is 0% to 15% performance degradation on other applications.

Once I faced a situation where deduplication was configured to run during business hours on one of the volumes. No one noticed that at some point the volume ran out of space and Data ONTAP was not able to perform deduplication from that time on. The situation became worse when Data ONTAP was upgraded from version 7.3.2 to 8.1.0. Now, during deduplication at 15:00 every day, the filer tried to upgrade the fingerprint metadata to a new version with the message “Fingerprint is being upgraded” and failed. It seems that the metadata upgrade is a very resource-intensive process and heavily affects I/O latency.

This volume was not a VMware datastore, but it sat on the same aggregate as several VMFS LUNs. Here is what happened to the VMware I/O latency every day at 15:00:

I deleted the host name and the datastore names from the graph. You can see the large latency spike, which won’t send your VMs into a kernel panic, but it’s not something you would want your production environment to experience every day.

The solution was simple. After space was increased on this volume, deduplication metadata upgrade performed successfully and problem went away. Additionally, deduplication was shifted to off-hours.

The simple lesson to learn: don’t schedule deduplication during the day, you never know what could possibly go wrong.

Storwize V7000 with vSphere 5 storage configuration

December 1, 2012

Information on how to configure Storwize for optimal performance is very scarce. I’ll try to build some understanding of it from bits and pieces gathered throughout the Internet and redbooks.

Barry Whyte gave many insights on Storwize internals in his blog. Particularly his “Configuring IBM Storwize V7000 and SVC for Optimal Performance” series of posts. I’ll quote him here. The main Storwize redbook “Implementing the IBM Storwize V7000 V6.3” is mostly an administration guide and gives no useful information on the topic. I find “SAN Volume Controller Best Practices and Performance Guidelines” way more helpful (Storwize firmware is built on SVC code).

Total Number of MDisks

That’s what Barry says:

… At the heart of each V7000 controller canister is an Intel Jasper Forrest (Sandy Bridge) based quad core CPU. … When we added the tried and trusted (SSA) DS8000 RAID functionality in 2010 (6.1.0) we therefore assigned RAID processing on a per mdisk basis to a single core. That means you need at least 4 arrays per V7000 to get maximal CPU core performance. …

Number of MDisks per Storage Pool

SVC Redbook:

The capability to stripe across disk arrays is the single most important performance advantage of the SVC; however, striping across more arrays is not necessarily better. The objective here is to only add as many arrays to a single Storage Pool as required to meet the performance objectives.

If the Storage Pool is already meeting its performance objectives, we recommend that, in most cases, you add the new MDisks to new Storage Pools rather than add the new MDisks to existing Storage Pools.

Table 5-1 shows the recommended number of arrays per Storage Pool that is appropriate for general cases.

Controller type       Arrays per Storage Pool
DS4000/DS5000         4 - 24
DS6000/DS8000         4 - 12
IBM Storwize V7000    4 - 12

The development recommendations for Storwize V7000 are summarized below:

  • One MDisk group per storage subsystem
  • One MDisk group per RAID array type (RAID 5 versus RAID 10)
  • One MDisk and MDisk group per disk type (10K versus 15K RPM, or 146 GB versus 300 GB)

There are situations where multiple MDisk groups are desirable:

  • Workload isolation
  • Short-stroking a production MDisk group
  • Managing different workloads in different groups

We recommend that you have at least two MDisk groups, one for key applications, another for everything else.

Number of LUNs per Storage Pool

SVC Redbook:

We generally recommend that you configure LUNs to use the entire array, which is especially true for midrange storage subsystems where multiple LUNs configured to an array have shown to result in a significant performance degradation. The performance degradation is attributed mainly to smaller cache sizes and the inefficient use of available cache, defeating the subsystem’s ability to perform “full stride writes” for Redundant Array of Independent Disks 5 (RAID 5) arrays. Additionally, I/O queues for multiple LUNs directed at the same array can have a tendency to overdrive the array.

Table 5-2 provides our recommended guidelines for array provisioning on IBM storage subsystems.

Controller type                     LUNs per array
IBM System Storage DS4000/DS5000    1
IBM System Storage DS6000/DS8000    1 - 2
IBM Storwize V7000                  1

General considerations

Let’s take a look at a vSphere use case on top of Storwize with 16 x 600GB SAS drives in the control enclosure and 10 x 2TB NL-SAS drives in the expansion enclosure (our own case).

First of all we need to decide how many arrays we need. Do we have different workloads? No. All storage will be assigned to virtual machines, which in general have the same random read/write access pattern. Do we need to isolate workloads? Probably yes, it’s generally a good idea to separate highly critical production VMs from everything else. Do we have different drive types? Yes. Obviously we don’t want to mix drive types in one RAID. Are we going to make different RAID types? Again, yes. RAID 10 is appropriate on SAS and RAID 5 on NL-SAS. So two MDisks, one RAID 10 on SAS and one RAID 5 on NL-SAS, would be enough. Storwize nodes have 4 cores each. It may seem that you would benefit from 4 MDisks, but in fact you won’t. Here is what Barry says:

In the case where you only have 1 or 2 HDD arrays, then the core stuff doesn’t really come into play. Its only when you get to larger systems, where you are driving more I/O than a single RAID core can handle that you need to spread them.

This is also true if you are running all SSD arrays, so 24x SSD would be best split into 4 arrays to get maximum IOPs, whereas 24x HDD are not going to saturate a single core, so (if you could create a 23+P! [ you can’t 15+P is largest we support ] then it would perform as well as 2x 11+P etc

Now to storage pools. In our example we have two MDisks, so you simply make two storage pools. If you hit a performance limit in the future, you create additional MDisks and then you have two options. If each MDisk on its own is able to sustain your performance requirements, you make additional storage pools and redistribute the workload between them. If you have a huge load on storage and even redistribution of VMs between two arrays doesn’t help, then you are better off combining the two MDisks of each type in one storage pool to get striping between MDisks.

It is the same story for the number of LUNs. IBM recommends a one-to-one LUN to MDisk relationship. But read carefully. The recommendation comes from the fact that different workloads can clash and degrade array performance. If we have generally the same I/O patterns coming to the array, it’s safe to make several LUNs on it, as long as latency stays in the acceptable range. Moreover, when it comes to vSphere and VMFS, it’s beneficial to have at least two volumes in terms of manageability. With several LUNs you will at least have the ability to move VMs between LUNs for reconfiguration purposes. Also keep in mind that the ESXi 5 hypervisor limits each host to a storage queue depth of 32 per LUN. It means that if you have one big LUN and many VMs running on the host, you can quickly reach the queue limit. On the other hand, do not create too many LUNs or you will oversubscribe the storage processors (SPs).

Sample configuration

IBM recommends constructing both RAID 10 and RAID 5 arrays from 8 drives + 1 spare drive. But since we have 16 SAS and 10 NL-SAS drives, I would launch the CLI and create two arrays: one RAID 10 of 14 drives + 2 spares and one RAID 5 of 8 drives + 2 spares (or 9 drives + 1 spare, but it’s not a good idea to create a RAID with an odd number of drives). Each RAID goes in its own pool, with several LUNs in each pool. I would go for 2TB LUNs.
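For a rough idea of how much space these two arrays give you, here is a back-of-the-envelope sketch. It uses the generic rules (RAID 10 keeps half of the member drives, RAID 5 loses one drive to parity) and raw drive sizes in GB; it ignores formatting overhead and is not derived from Storwize documentation.

```python
def usable_gb(raid_level, drives, drive_gb):
    """Raw usable capacity of one array, spares not included in `drives`."""
    if raid_level == 10:
        if drives % 2:
            raise ValueError("RAID 10 needs an even number of drives")
        return drives // 2 * drive_gb
    if raid_level == 5:
        return (drives - 1) * drive_gb
    raise ValueError("only RAID 5 and RAID 10 are handled in this sketch")


sas_pool = usable_gb(10, drives=14, drive_gb=600)     # 14 + 2 spares
nl_sas_pool = usable_gb(5, drives=8, drive_gb=2000)   # 8 + 2 spares
print(sas_pool, nl_sas_pool)
```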

Disk to Disk to Tape backup in Backup Exec

July 14, 2012

Notice: It seems that the D2D2T feature in Backup Exec 11d is buggy. D2D2T duplicate jobs (which transfer data from disk to tape) are insanely slow and nobody has solved this problem yet. You can try to back up the raw Backup to Disk Folder instead, but that brings a number of difficulties when restoring. Files from Backup to Disk Folders are media themselves, and they conflict with the media currently used for backup.

Typical backup solution in most organizations consists of backup server and tape drive/autoloader/tape library connected directly to backup server. Every night backups are pushed to tape through backup server. But sometimes it is more complicated. We have NetApp filer with StorageTek tape library connected to the filer. Backup server sends NDMP commands to the filer and filer in its turn performs actual data transfer to tapes from disk shelves. Most of our hosts are VMware virtual machines. We backup whole .vmdk files, but we also want to perform file-level backups from some of virtual machines. To accomplish that we set up backup agents on all virtual machines, but we can’t backup files directly to tapes, because they do not originate from NetApp filer volumes. So we decided to implement D2D2T multistage backup. The idea here is to create a CIFS share on the filer, backup data there and then transfer data from CIFS share to tapes.

The first step here is to configure disk to disk backup. Backup Exec stores disk to disk backups in binary files. The folder where the files are stored is listed on the Devices tab and the files are listed on the Media tab. Initially, you need to create a Backup to Disk Folder in the Devices tab. There you choose the size of backup-to-disk files and the maximum number of backups per backup-to-disk file. If a backup is larger than the file size, it is split into several files. If a backup is smaller than the file size, several backups will be written to one file. I use the defaults with a 16GB file size. Then you create backup jobs as usual (by configuring a selection list and policy) using the Backup to Disk Folder as the target device.

As a second step you need to instruct Backup Exec to transfer backed up files to tape, upon disk to disk job completion. Backup Exec has “duplicate jobs” to implement that. Go to your backup policy properties, click “New Template”, choose “Duplicate Backup Sets Template”, pick template for which you want to create duplicate, in “Devices and Media” choose your tape library, in “Schedule” choose “Run only according to rules for this template”. This will create duplicate template and rule which will start duplicate job after main job completes. As a result you will have duplicate data on disk and on tape.

Consistent VMware snapshots on NetApp

March 16, 2012

If you use NetApp as storage for your VMware hard drives, it’s wise to utilize NetApp’s powerful snapshot capabilities as an instant backup tool. I briefly mentioned in my previous post that you should disable the default snapshot schedule. A snapshot is taken very quickly on NetApp, but it’s still not instantaneous. If a VM is running, you can get .vmdks which have inconsistent data. Here I’d like to describe how you can perform consistent snapshots of VM hard drives which sit on NetApp volumes exported via NFS. Obviously it won’t work for iSCSI LUNs, since you will have LUN snapshots which are almost useless for backups.

What makes the VMware virtualization platform far superior to other well-known solutions in the market is the VI API. The VI API is a set of Web services hosted on Virtual Center and ESX hosts that provides interfaces for all components and operations. In particular, there is a Perl interface for the VI API which is called the VMware Infrastructure Perl Toolkit. You can download and install it for free. Using the VI Perl Toolkit you can write a script which will put your VMs in a so-called hot backup mode every day and make NetApp snapshots as well. Practically, hot backup mode is also a snapshot. When you create a VM snapshot, the original VM hard drive is left intact and VMware starts to write the delta to another file. It means that the VM hard drive won’t change while the NetApp snapshot is being made and you will get consistent .vmdk files. Now let’s move to the implementation.

I will write excerpts from the actual script here, because lines in the script are quite long and everything will be messed up on the blog page. I uploaded full script on FileDen. Here is the link. I apologize if you read this blog entry far later than it was published and my account or the FileDen service itself no longer exist.

VI Perl Toolkit is effectively a set of Perl scripts which you run as ready to use utilities. We will use snapshotmanager.pl which lets you create VMware VM snapshots. In the first step you make snapshots of all VMs:

\"$perl_path\perl\" -w \"$perl_toolkit_path\snapshotmanager.pl\" --server vc_ip --url https://vc_ip/sdk/vimService --username snapuser --password 123456 --operation create --snapshotname \"Daily Backup Shapshot\"

For the sake of security I created Snapshot Manager role and respective user account in Virtual Center with only two allowed operations: Create Snapshot and Remove Snapshot. Run line is self explanatory. I execute it using system($run_line) command.

After VM snapshots are created you make a NetApp snapshot:

\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap create vm_sata snap_name

To connect to NetApp terminal I use PuTTY ssh client. putty.exe itself has a GUI and plink.exe is for batch scripting. Using this command you create snapshot of particular NetApp volume. Those which hold .vmdks in our case.

To bring all VMs out of hot backup mode, run:

\"$perl_path\perl\" -w \"$perl_toolkit_path\snapshotmanager.pl\" --server vc_ip --url https://vc_ip/sdk/vimService --username snapuser --password 123456 --operation remove --snapshotname \"Daily Backup Shapshot\" --children 0

With --children 0 we tell the script not to remove child snapshots.

Now that we have familiarized ourselves with the main commands, let’s move on to the script logic. You will probably want to keep several snapshots, for example seven of them, one for each day of the week. It means that each day, before making a new snapshot, you need to remove the oldest one and rename the others. Renaming is just for clarity. You can name your snapshots vmsnap.1, vmsnap.2, … , vmsnap.7, where vmsnap.7 is the oldest. Each night you put your VMs in hot backup mode and delete the oldest snapshot:

\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap delete vm_sata vmsnap.7

Then you rename other snapshots:

\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.6 vmsnap.7
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.5 vmsnap.6
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.4 vmsnap.5
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.3 vmsnap.4
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.2 vmsnap.3
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.1 vmsnap.2

And create the new one:

\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap create vm_sata vmsnap.1

As a last step you bring your VMs out of hot backup mode.
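The rotation order is easy to get wrong, so here is a small sketch of it. It is written in Python rather than Perl and is not the original script; it only builds the list of `snap` commands in the order they have to run, including the rename of vmsnap.1 to vmsnap.2, which has to happen before a new vmsnap.1 can be created. The function name and defaults are made up for illustration.

```python
def rotation_commands(volume, prefix="vmsnap", keep=7):
    """Build the NetApp snap commands for one nightly rotation, in order."""
    # 1. Drop the oldest snapshot to free up its name.
    cmds = [f"snap delete {volume} {prefix}.{keep}"]
    # 2. Shift the remaining snapshots up by one, oldest first,
    #    so that no rename ever targets a name that is still in use.
    for n in range(keep - 1, 0, -1):
        cmds.append(f"snap rename {volume} {prefix}.{n} {prefix}.{n + 1}")
    # 3. Take the new snapshot under the now unused first name.
    cmds.append(f"snap create {volume} {prefix}.1")
    return cmds


for cmd in rotation_commands("vm_sata"):
    print(cmd)
```

Each printed line is what you would pass to plink after the host name; creating and removing the VM snapshots still wraps around this sequence.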

Using this technique you can create short term backups of your virtual infrastructure and use them for long term retention with help of standalone backup solutions. Like backing up data from snapshots to tape library using Symantec BackupExec. I’m gonna talk about this in my later posts.

SCVMM doesn’t add Hyper-V host

November 1, 2011

When adding a Hyper-V host to Microsoft System Center Virtual Machine Manager I ran into an error from the Refresh host job:

Error (2912)
An internal error has occurred trying to contact an agent on the host.corp.contoso.com server.
(The remote procedure call failed (0x800706BE))

Recommended Action
Ensure the agent is installed and running. Ensure the WS-Management service is installed and running, then restart the agent

I was able to see the host under the Hosts panel but couldn’t see VMs. At first I thought that running SCVMM along with Hyper-V on the same server, even for testing purposes, is not a supported scenario. But it turns out it’s a bug. The solution in my case was to install SP1 for Windows Server 2008 R2.