Posts Tagged ‘ESX’
September 26, 2017

Introduction
Since vSphere 5.1, VMware has offered an easy migration path for VMs running on hosts managed by vCenter. Using Enhanced vMotion, available in the Web Client, VMs can be migrated between hosts even if they don't have shared datastores. vSphere 6.0 introduced Cross vCenter vMotion (xVC-vMotion), which no longer even requires the old and new hosts to be managed by the same vCenter.
But what if you don't have a vCenter and need to move VMs between standalone ESXi hosts? There are many tools that can do that: you can use V2V conversion in VMware Converter or the replication feature of the free version of Veeam Backup and Replication. But probably the easiest tool to use is OVF Tool.
Tool Overview
OVF Tool has been around since the Open Virtualization Format (OVF) was originally published in 2008. It's constantly being updated, and the latest version, 4.2.0, supports vSphere up to version 6.5. The only downside of the tool is that it can only export VMs that are shut down. That may cause problems for big VMs that take a long time to export, but for small VMs the tool is priceless.
Installation
OVF Tool is a CLI tool distributed as an MSI installer that can be downloaded from the VMware web site. One important thing to remember is that when you're migrating VMs, OVF Tool is in the data path, so make sure you install it as close to the workload as possible to get the best possible throughput.
Usage Examples
After the tool is installed, open a Windows command prompt and change to the tool's installation directory. Below are three examples of the most common use cases: export, import and migration.
Exporting VM as an OVF image:
> ovftool "vi://username:password@source_host/vm_name" "vm_name.ovf"
Importing VM from an OVF image:
> ovftool -ds="destination_datastore" "vm_name.ovf" "vi://username:password@destination_host"
Migrating VM between ESXi hosts:
> ovftool -ds="destination_datastore" "vi://username:password@source_host/vm_name" "vi://username:password@destination_host"
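If you need more control over the result, OVF Tool has additional flags for things like the target disk format and the VM name. As a rough example (flag names from memory, so check ovftool --help on your version), a migration that thin-provisions the destination disks and renames the VM could look like this:
> ovftool -ds="destination_datastore" --diskMode=thin --name="new_vm_name" "vi://username:password@source_host/vm_name" "vi://username:password@destination_host"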

When you are migrating, the machine the tool is running on is still used as a proxy between the two hosts; the only difference is that you are not saving the OVF image to disk and don't need disk space available on the proxy.
This is what it looks like in vSphere and HTML5 clients’ task lists:


Observations
When planning migrations using OVF Tool, throughput is an important consideration, because migration requires downtime.
OVF Tool is quite efficient in how it does export/import. Even for thick-provisioned disks it reads only the consumed portion of the .vmdk. On top of that, the generated OVF package is compressed.
Due to compression, OVF Tool is typically bound by the speed of the ESXi host's CPU. In the screenshot below you can see how the export process consumes 1 out of 2 CPU cores (compression is single-threaded).

While testing on a 2-core Intel i5, I was getting a 25MB/s read rate from disk and an average export throughput of 15MB/s, which corresponds to roughly a 1.6:1 compression ratio.
For a VM with a 100GB disk that has 20GB of space consumed, the export will take 20*1024/25 = 819 seconds, or about 14 minutes, which is not bad if you ask me. On a Xeon CPU I'd expect throughput to be even higher.
Caveats
There are a few issues that you can potentially run into that are well-known, but I think are still worth mentioning here.
Special characters in URIs (strings starting with vi://) must be escaped. Use % followed by the character's hex code. You can find ASCII hex codes here: http://www.techdictionary.com/ascii.html.
For example, use "vi://root:P%40ssword@10.0.1.10" instead of "vi://root:P@ssword@10.0.1.10", or you can get confusing errors similar to this:
Error: Could not lookup host: root
Disconnect ISO images from VMs before migrating them or you will get the following error:
Error: A general system error occurred: vim.fault.FileNotFound
Conclusion
OVF Tool requires downtime when exporting, importing or migrating VMs, which can be a deal-breaker for large-scale migrations. But when downtime is not a concern, or for VMs small enough that the outage is minimal, OVF Tool will from now on be my migration tool of choice.
Tags:CLI, ESX, ESXi, export, import, migration, OVF, OVF Tool, performance, speed, throughput, VM, vMotion, vmware, xVC-vMotion
Posted in Virtualization | 2 Comments »
March 14, 2015
Continuing a series of posts on how to deal with Force10 MXL switches. This one is about VLANs, port channels, tagging and all the basic stuff. It’s not much different from other vendors like Cisco or HP. At the end of the day it’s the same networking standards.
If you want to match the terminology with Cisco, for instance, then what you're used to calling EtherChannels are Port Channels on Force10, and Cisco's trunk/access ports are called tagged/untagged ports on Force10.
Configure Port Channels
If you are after dynamic LACP port channels (as opposed to static ones), they are configured in two steps. The first step is to create the port channel itself:
# conf t
# interface port-channel 1
# switchport
# no shutdown
Then you enable LACP on the interfaces you want to add to the port channel. I have a four-switch stack, hence the 0/.., 1/.. type of interface syntax:
# conf t
# int range te0/51-52 , te1/51-52 , te2/51-52 , te3/51-52
# port-channel-protocol lacp
# port-channel 1 mode active
To check whether the port channel has come up, use the command below. The port channel obviously won't come up unless it's also configured on the other side of the link.
# show int po1 brief

Configure VLANs
Then you create your VLANs and add ports. Typically, if you have vSphere hosts connected to the switch, you tag traffic at the ESXi host level, so both your host ports and the port channel need to be added to the VLANs as tagged. For any standalone non-virtualized servers you'll use untagged ports.
# conf t
# interface vlan 120
# description Management
# tagged Te0/1-4
# tagged Te2/1-4
# tagged Po1
# no shutdown
# copy run start
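To confirm the VLAN membership afterwards, something like this should do the trick (syntax from memory, so double-check it against your FTOS version):
# show vlan id 120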
I have four hosts. Each host has a dual-port NIC which connects to two fabrics – switches 0 and 2 in the stack (1 port per fabric). I allow VLAN 120 traffic from these ports through the port channel to the upstream core switch.
You'll most likely have more than one VLAN: at least one for Management and one for Production if it's vSphere. But the process for the rest is exactly the same.
The other switch
Just to give you the whole picture, I'll include the configuration of the switch on the other side of the trunk. I had a modular HP switch with 10Gb modules, and its config would look like the following:
# conf t
# trunk I1-I8 trk1 lacp
# vlan 120 tagged trk1
# write mem
I1 to I8 here are ports, where I is the module and 1 to 8 are the ports within that module.
Tags:access, blade, dynamic, ESX, ESXi, etherchannel, Force10, LACP, MXL, port, port channel, static, switch, tagged, trunk, untagged, VLAN, vmware, vSphere
Posted in Networking | Leave a Comment »
March 6, 2014
Zerto is a VM replication product which works at the hypervisor level. In contrast to array-level replication, which SRM has been using for a long time, it takes the storage array out of the equation along with all the complexities that used to come with it (SRAs, splitting LUNs into replicated and non-replicated VMs, potential incompatibilities between the orchestrated components, etc.).
Basic Operation
Zerto consists of two components: ZVM (Zerto Virtual Manager) and VRA (Virtual Replication Appliance). VRAs are VMs that need to be installed on each ESXi host within the vCenter environment (this is performed in an automated fashion from the ZVM console). ZVM manages the VRAs and all the replication settings and is installed once per vCenter. The VRA mirrors protected VMs' I/O operations to the recovery site. VMs are grouped into VPGs (Virtual Protection Groups), which can be used as consistency groups or just as containers.
Protected VMs can be preseeded to the DR site. What Zerto essentially does is replicate the VM disks to whichever datastore on the recovery site you point it to and then track changes in what is called a journal volume. A journal is created for each VM and is kept as a VMDK within the “ZeRTO volumes” folder on the target datastore. Every few seconds Zerto creates checkpoints in the journal, which serve as crash-consistent recovery points, so you can recover to any point in time with a few seconds' granularity. You can set the journal length in hours, depending on how far back you might want to go; it can be anywhere between 1 and 120 hours.
Replica VMs are kept unregistered from vCenter on the DR site, and their configuration data is kept in the Zerto repository. This essentially means that if an outage happens, something goes really wrong and Zerto fails to bring up the VMs on the DR site, you will need to recreate the VMs manually. But since the VMDKs themselves are kept in their original format, you will still be able to attach them to VMs and power them on.
Failover Scenarios
There are four failover scenarios within Zerto:
- Move Operation – VMs are shut down on the production site, unregistered from inventory, powered on at the DR site, and protection is reversed if you decide to do so. If you choose not to reverse protection, the VMs are completely removed from the production site and the VPG is marked as “Needs Configuration”. This scenario can be seen as a planned migration of VMs between the sites and needs both sites to be healthy and operational.
- Failover Operation – used in a disaster scenario when the production site might be unavailable. In this case Zerto brings up the protected VMs on the DR site but does not try to remove the VMs from the production site inventory; it leaves them as they are. If the production site is still accessible, you can optionally choose to shut the VMs down. You cannot automatically reverse protection in this scenario; the VPG is marked as “Needs Configuration” and can be activated later. When it is activated, Zerto does all the cleanup on the former production site: it shuts down the VMs (if they haven't been already), unregisters them from inventory and moves them to the VRA folder on the datastore.
- Failover Test Operation – this is for failover testing and brings up VMs on the DR site in a preconfigured bubble network which is normally not uplinked to any physical network. VMs continue to run on both sites. Note that in this scenario the VMs' disk files are not moved to the VM folders (as in the two previous scenarios) and are just connected from the VRA VM folder. You will also notice that Zerto creates a second journal volume, called the “scratch” journal; changes to the VM running on the DR site are saved to this journal while it is being tested.
- Clone Operation – VMs are cloned on the DR site and connected to the network. They are not automatically powered on, to prevent potential network conflicts. This can be used, for instance, for DR site testing when you want to check actual network connectivity instead of connecting VMs to an isolated network, or for implementing backups, building a cloned environment for application testing, and so on.
Zerto Journal Sizing
By default, journal history is configured as 4 hours and the journal size is unlimited. Depending on the data change rate within the VM, the journal can be smaller or larger. 15GB is approximately enough storage to support a virtual machine with 1TB of storage, assuming a 10% change rate per day with four hours of journal history saved. Zerto has a Journal Sizing Tool which helps size journals, and you can create a separate journal datastore as well.
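To sanity-check that guideline with some back-of-the-envelope math: 1TB at a 10% daily change rate is about 100GB of changed data per day, and a 4-hour journal window is 1/6 of a day, so roughly 100GB / 6 ≈ 17GB of journal, which is in line with the 15GB figure (the exact number depends on how evenly the changes are spread across the day).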
Zerto compared to VMware Replication and SRM
There are several replication products on the market from VMware: standalone VMware Replication, VMware Replication + SRM orchestration, and SRM array-based replication. If you want to know more about how they compare to Zerto, you can read the articles mentioned in the references below. One apparent Zerto advantage I want to mention here is its integration with vCloud Director, which is essential for cloud providers who offer DRaaS solutions. SRM has no vCloud Director support.
References
Tags:array, Automation, bubble, checkpoint, clone, datastore, disaster recovery, DR, DRaaS, ESX, ESXi, failover, hypervisor, journal, LUN, orchestration, replication, sizing, SRA, SRM, vCenter, vCloud Director, Virtual Protection Group, Virtual Replication Appliance, VM, VPG, VRA, Zerto, Zerto Virtual Manager, ZVM
Posted in Virtualization | Leave a Comment »
September 25, 2013

DISCLAIMER: I ACCEPT NO RESPONSIBILITY FOR ANY DAMAGE OR CORRUPTION OF DATA THAT MAY OCCUR AS A RESULT OF CARRYING OUT THE STEPS DESCRIBED BELOW. YOU DO THIS AT YOUR OWN RISK.
We had an issue with high CPU usage on one of the NetApp controllers serving a couple of NFS datastores to a VMware ESX cluster. The HA pair of FAS2050s had two shelves, both owned by the first controller. The obvious solution for us was to reassign the disks of one shelf to the other controller to balance the load. But how do you do this non-disruptively? Here is the plan.
In our setup we had two controllers (filer1, filer2) and two shelves (shelf1, shelf2), both assigned to filer1, with two aggregates, each on its own shelf (aggr0 on shelf1, aggr1 on shelf2). Say we want to reassign the disks of shelf2 to filer2.
The first step is to migrate all of the VMs from shelf2 to shelf1, because the operation is obviously disruptive to the hosts accessing data on the target shelf. Once all VMs are evacuated, offline all volumes and the aggregate to prevent any data corruption (you can't take an aggregate offline from the online state, so change it to restricted first).
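For each volume on the aggregate and then the aggregate itself, that boils down to something like this (a rough sketch; vol_name and aggr1 are placeholders for your actual volume and aggregate names):
filer1> vol offline vol_name
filer1> aggr restrict aggr1
filer1> aggr offline aggr1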
If you prefer to reassign the disks in two steps, as described in NetApp Professional Services Tech Note #021: Changing Disk Ownership, don't forget to disable automatic ownership assignment on both controllers, otherwise the disks will be assigned back to the same controller right after you unown them:
> options disk.auto_assign off
This is not necessary if you change ownership in one step, as shown below.
Next step is to actually reassign the disks. Since they are already part of an aggregate you will need to force the ownership change:
filer1> disk assign 1b.01.00 -o filer2 -f
filer1> disk assign 1b.01.01 -o filer2 -f
…
filer1> disk assign 1b.01.nn -o filer2 -f
If you do not force disk reassignment you will get an error:
Assign request failed for disk 1b.01.0. Reason:Disk is part of a failed or offline aggregate or volume. Changing its owner may prevent aggregate or volume from coming back online. Ownership may be changed only by using the appropriate force option.
When all the disks have been moved across to filer2, the new aggregate will show up in the list of aggregates on filer2 and you'll be able to bring it online. If you can't see the aggregate, force the filer to rescan the drives by running:
filer2> disk show
The old aggregate will still be seen in the list on filer1. You can safely remove it:
filer1> aggr destroy aggr1
Tags:aggregate, assignment, controller, corruption, CPU, datastore, disk, ESX, FAS, Filer, force, load balancing, migrate, NetApp, NFS, non-disruptively, offline, online, own, ownership, reassign, restricted, shelf, unown, VM, vmware, volume
Posted in NetApp, VMware | Leave a Comment »
August 30, 2013
ESX enforces complexity requirements on passwords, and if the one you want to set doesn't meet them, the password change will fail with something like this:
Weak password: not enough different characters or classes for this length. Try again.
You can obviously play with the PAM settings to lower the requirements, but here is a tip on how to work around it really quickly.
Simply generate a hash for your password using the following command:
# openssl passwd -1
And then replace the root password hash in /etc/shadow with the new one.
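As an illustration, say openssl printed the (entirely made-up) hash $1$Xt3mpl4t$abcdefghijklmnopqrstu0. The root entry in /etc/shadow would then look roughly like this, with the hash sitting in the second field:
root:$1$Xt3mpl4t$abcdefghijklmnopqrstu0:15948:0:99999:7:::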
In my experience on ESX 4.1, you won't even need to reconnect the host to vCenter; it will continue working just fine.
Tags:change, complexity, ESX, hash, passwd, password, shadow, vmware
Posted in VMware | 1 Comment »
August 5, 2013
In one of my previous posts I spoke about three basic types of NetApp Virtual Storage Console restores: datastore restore, VM restore and backup mount. The last and least used, but very underrated, feature is Single File Restore (SFR), which lets you restore individual files from VM backups. You can do the same thing by mounting the backup, connecting the .vmdk to a VM and restoring the files, but SFR is a more convenient way to do it.
Workflow
SFR is pretty much an out-of-the-box feature and is installed with VSC. When you create an SFR session, you specify an email address to which VSC sends an .sfr file and a link to the Restore Agent. Restore Agent is a separate application which you install into the VM you want to restore files to (the destination VM). You load the .sfr file into Restore Agent, and from there you are able to mount the source VM's .vmdks and map them to the OS.
VSC uses the same LUN cloning feature here. When you click “Mount” in Restore Agent, the LUN is cloned, mapped to an ESX host and the disk is connected to the VM on the fly. You copy all the data you want, then click “Dismount” and the LUN clone is destroyed.
Restore Types
There are two types of SFR restores: Self-Service and Limited Self-Service. The only difference between them is that with a Self-Service session the user can choose the backup, while with Limited Self-Service the backup is chosen by the admin when creating the SFR session. The latter is used when the destination VM has no connection to the SMVI server, which means Restore Agent cannot communicate with SMVI and control the mount process. For the same reason, the LUN clone is deleted only when you delete the SFR session, not when you dismount all the .vmdks.
There is another restore type mentioned in the NetApp documentation, called Administrator Assisted restore. It's hard to say what NetApp means by that. I think its workflow is the same as for Self-Service, but the administrator sends the .sfr link to himself and does all the work. It brings a bit of confusion, because there is an “Admin Assisted” column on the SFR setup tab. What it actually does, I believe, is that when a port group is configured as Admin Assisted, it forces SFR to create a Limited Self-Service session every time you create an SFR job; you won't have the option to choose Self-Service at all. So if you have port groups that don't have connectivity to VSC, check the Admin Assisted option next to them.
Notes
Keep in mind that SFR doesn't support VMs with IDE drives. If you try to create an SFR session for VMs which have IDE virtual hard drives connected, you will see all sorts of errors.
Tags:assisted, backup, clone, datastore, dismount, ESX, ESXi, IDE, limited, link, LUN, map, mount, NetApp, port group, restore, Restore Agent, self-service, session, SFR, Single File Restore, SMVI, virtual machine, Virtual Storage Console, VM, vmdk, VSC
Posted in NetApp, VMware | Leave a Comment »
July 30, 2013
Queue Limits
I/O goes through several storage queues on its way to the disk drives. VMware is responsible for the VM queue, the LUN queue and the HBA queue. VM and LUN queues are usually 32 operations deep. That means each ESX host at any moment can have no more than 32 active operations to a LUN, and the same is true for VMs: each VM can have at most 32 active operations to a datastore. If multiple VMs share the same datastore, their combined I/O still can't exceed the 32-operation limit (the per-LUN queue for QLogic HBAs has been increased from 32 to 64 operations in vSphere 5). The HBA queue is much bigger and can hold several thousand operations (4096 for QLogic, although I can see in my config that the driver is configured with 1014 operations).
Queue Monitoring
You can monitor an ESX host's storage queues from the console. Run “esxtop”, press “d” to view disk adapter stats, then press “f” to open the field selection and add Queue Stats by pressing “d”.
The AQLEN column shows the queue depth of the storage adapter. CMDS/s is the real-time number of IOPS. DAVG is the latency that comes from the frame traversing the “driver – HBA – fabric – array SP” path and should be less than 20ms; otherwise it means the storage is not coping. KAVG shows the time an operation spends in the hypervisor kernel queue and should be less than 2ms.
Press “u” to see disk device statistics, press “f” to open the add/remove fields dialog and select Queue Stats by pressing “f”. Here you'll see the number of active (ACTV) and queued (QUED) operations per LUN. %USD is the queue utilization. If you're hitting 100 in %USD and see operations under the QUED column, it again means that your storage cannot handle the load and you need to redistribute the workload between spindles.
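If you'd rather capture these stats over time than watch the live view, esxtop also has a batch mode. From memory, something like the line below dumps samples to a CSV you can analyze later (5-second interval, 720 iterations, i.e. one hour); double-check the flags against your ESX version:
# esxtop -b -d 5 -n 720 > esxtop_stats.csv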
Some useful documents:
Tags:ACTV, AQLEN, CMDS, datastore, DAVG, device, driver, ESX, ESXi, esxtop, fabric, FC, HBA, I/O, IOPS, KAVG, latency, LUN, monitoring, operation, QLogic, QUED, queue, spindles, statistics, stats, storage, USD, virtual machine, VM, workload
Posted in VMware | Leave a Comment »
June 12, 2013
NetApp Virtual Storage Console is a plug-in for VMware vCenter which provides capabilities to perform instant backup/restore using NetApp snapshots. It uses several underlying NetApp features to accomplish its tasks, which I want to describe here.
Backup Process
When you configure a backup job in VSC, what VSC does is simply create a NetApp snapshot of the target volume on the NetApp filer. Interestingly, if you have two VMFS datastores inside one volume, both LUNs will be snapshotted, since snapshots are done at the volume level. But during a datastore restore, the second datastore will be left intact. You would think that if VSC reverted the volume to the previously made snapshot, both datastores should be affected, but that's not the case, because VSC uses Single File SnapRestore to restore the LUN (this is explained below). Creating several VMFS LUNs inside one volume is not a best practice, but it's good to know that VSC handles this case correctly.
The same goes for VMs: there is no sense in backing up a single VM in a datastore, because VSC will make a volume snapshot anyway. Back up the whole datastore in that case.
Datastore Restore
After a backup is done, you have three restore options. The first and least useful kind is a datastore restore. The only use case for such a restore that I can think of is disaster recovery, but disaster recovery procedures are usually separate from backups and are based on replication to a disaster recovery site.
VSC uses NetApp's Single File SnapRestore (SFSR) feature to restore a datastore. In the case of a SAN implementation, SFSR reverts only the required LUN from the snapshot to its previous state instead of the whole volume. My guess is that SnapRestore uses LUN clone/split functionality in the background to create a new LUN from the snapshot, swap the old one with the new and then delete the old one, but I haven't found a clear answer to that question.
For that functionality to work, you need a SnapRestore license. In fact, you can do the same trick manually by issuing a SnapRestore command:
> snap restore -t file -s nightly.0 /vol/vol_name/vmfs_lun_name
If you have only one LUN in the volume (as you should), then you can simply restore the whole volume with the same effect:
> snap restore -t vol -s nightly.0 /vol/vol_name
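Before reverting, it doesn't hurt to list which snapshots are actually available on the volume (vol_name is the same placeholder as above):
> snap list vol_name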
VM Restore
VM restore is also a somewhat controversial way of restoring data, because it completely removes the old VM. There is no way to keep the old .vmdks: you can restore particular virtual hard drives to another datastore, but even in that case the old .vmdks are not kept.
VSC uses another mechanism to perform a VM restore. It creates a LUN clone (not to be confused with FlexClone, which is a volume cloning feature) from a snapshot. A LUN clone doesn't use any additional space on the filer, because its data is mapped to the blocks that sit inside the snapshot. VSC then maps the new LUN to the ESXi host you specify in the restore job wizard. Once the datastore is accessible to the ESXi host, VSC removes the old VMDKs and performs a Storage vMotion from the clone to the active datastore (or the one you specify in the job). The clone is then removed as part of the cleanup process.
The equivalent CLI command for that is:
> lun clone create /vol/clone_vol_name -o noreserve -b /vol/vol_name nightly.0
Backup Mount
This is probably the most useful recovery method. VSC allows you to mount the backup on a particular ESXi host and do whatever you want with the .vmdks. After the mount you can connect a virtual disk to the same or another virtual machine and recover the data you need.
If you want to connect the disk to the original VM, make sure you change the disk UUID first, otherwise the VM won't boot. Connect to the ESXi console and run:
# vmkfstools -J setuuid /vmfs/volumes/datastore/VM/vm.vmdk
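If I remember correctly, vmkfstools can also print the current UUID, which is handy for verifying the change (same placeholder path as above):
# vmkfstools -J getuuid /vmfs/volumes/datastore/VM/vm.vmdk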
Backup mount uses the same LUN cloning feature: the LUN is cloned from a snapshot and connected as a datastore. After an unmount the LUN clone is destroyed.
Some Notes
VSC doesn't do a good job of cleaning up after a restore. As part of mapping the LUN to the ESXi hosts, VSC creates new igroups on the NetApp filer, which it doesn't delete after the restore is completed.
What's more interesting, when you restore a VM, VSC deletes the old VM's .vmdks but leaves all the other files (.vmx, .log, .nvram, etc.) in place. Instead of completely replacing the VM's folder, it creates a new folder, vmname_1, and copies everything into it. So if you use VSC now and then, you will have these old folders left behind.
Tags:backup, clone, datastore, disaster recovery, disk, ESX, ESXi, Filer, FlexClone, igroup, job, license, LUN, mount, NetApp, restore, SFSR, Single File SnapRestore, snap restore, SnapRestore, snapshot, split, storage vMotion, unmount, UUID, vCenter, Virtual Storage Console, VMFS, vmkfstools, vMotion, vmware, volume, VSC
Posted in NetApp, VMware | 1 Comment »
March 27, 2012
When it comes to VMware on NetApp, boosting performance by implementing Jumbo Frames is always taken into consideration. However, it’s not clear if it really has any significant impact on latency and throughput.
Officially, VMware doesn't support Jumbo Frames for NAS and iSCSI. That means using Jumbo Frames to carry storage traffic from the VMkernel interface to your storage system is a solution that hasn't been tested by VMware; however, it actually works. To use Jumbo Frames you need to activate them along the whole communication path: the guest OS, the virtual NIC (change from E1000 to Enhanced vmxnet), the virtual switch and VMkernel, the physical Ethernet switch and the storage. It's a lot of work and it's disruptive at several points, which is not a good idea for a production infrastructure. So I decided to take a look at benchmarks before spending a great amount of time and effort on it.
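For reference, the vSwitch and VMkernel part of that path on an ESX 4.x host boils down to something like the commands below. This is a sketch from memory, so double-check the syntax; the vSwitch name, port group name, IP and netmask are placeholders, and you still need to configure matching MTUs on the physical switch, the storage and the guest OS:
# esxcfg-vswitch -m 9000 vSwitch1
# esxcfg-vmknic -a -i 10.0.0.10 -n 255.255.255.0 -m 9000 "NFS_VMkernel"
# esxcfg-vmknic -l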
VMware and NetApp have a technical report, TR-3808-0110, called “VMware vSphere and ESX 3.5 Multiprotocol Performance Comparison Using FC, iSCSI, and NFS”. Section 2.2 clearly states that:
- Using NFS with jumbo frames enabled using both Gigabit and 10GbE generated overall performance that was comparable to that observed using NFS without jumbo frames and required approximately 6% to 20% fewer ESX CPU resources compared to using NFS without jumbo frames, depending on the test configuration.
- Using iSCSI with jumbo frames enabled using both Gigabit and 10GbE generated overall performance that was comparable to slightly lower than that observed using iSCSI without jumbo and required approximately 12% to 20% fewer ESX CPU resources compared to using iSCSI without jumbo frames depending on the test configuration.
Another important statement here is:
- Due to the smaller request sizes used in the workloads, it was not expected that enabling jumbo frames would improve overall performance.
I believe 4K and 8K request sizes are representative of a virtual infrastructure. Maybe if you move large amounts of data through your virtual machines it will make sense for you, but I feel it's not reasonable to implement Jumbo Frames for a virtual infrastructure in general.
Another report finding is that Jumbo Frames decrease CPU load, but if you use TOE NICs there is again little point.
VMware supports jumbo frames with the following NICs: Intel (82546, 82571), Broadcom (5708, 5706, 5709), Netxen (NXB-10GXxR, NXB-10GCX4), and Neterion (Xframe, Xframe II, Xframe E). We use Broadcom NetXtreme II BCM5708 and Intel 82571EB, so implementing Jumbo Frames is not going to be a problem. Maybe I'll test it myself when I have some free time.
Links I found useful:
Tags:10GbE, Broadcom, CPU, E1000, ESX, FC, Intel, iSCSI, jumbo frames, latency, NAS, NetApp, Neterion, Netxen, NFS, nic, OS, performance, storage, support, TCP offload engine, throughput, TOE, virtual switch, VMkernel, vmware, vmxnet
Posted in NetApp, Virtualization | 1 Comment »
March 16, 2012
If you use NetApp as storage for your VMware hard drives, it's wise to use NetApp's powerful snapshot capabilities as an instant backup tool. I briefly mentioned in my previous post that you should disable the default snapshot schedule. A snapshot is taken very quickly on NetApp, but it's still not instantaneous, and if a VM is running you can end up with .vmdks containing inconsistent data. Here I'd like to describe how you can take consistent snapshots of VM hard drives which sit on NetApp volumes exported via NFS. Obviously it won't work for iSCSI LUNs, since there you will get snapshots of LUNs, which are almost useless for backups.
What makes the VMware virtualization platform far superior to other well-known solutions on the market is the VI API. The VI API is a set of Web services hosted on Virtual Center and ESX hosts that provides interfaces to all components and operations. In particular, there is a Perl interface to the VI API called the VMware Infrastructure Perl Toolkit, which you can download and install for free. Using the VI Perl Toolkit you can write a script which puts your VMs into so-called hot backup mode every day and makes NetApp snapshots as well. In practice, hot backup mode is also a snapshot: when you create a VM snapshot, the original VM hard drive is left intact and VMware starts writing the delta to another file. It means the VM hard drive won't change while the NetApp snapshot is being made, and you will get consistent .vmdk files. Now let's move on to the implementation.
I will quote excerpts from the actual script here, because the lines in the script are quite long and everything would get messed up on the blog page. I uploaded the full script to FileDen. Here is the link. I apologize if you are reading this blog entry long after it was published and my account or the FileDen service itself no longer exists.
The VI Perl Toolkit is effectively a set of Perl scripts which you run as ready-to-use utilities. We will use snapshotmanager.pl, which lets you create VMware VM snapshots. In the first step you make snapshots of all VMs:
\"$perl_path\perl\" -w \"$perl_toolkit_path\snapshotmanager.pl\" --server vc_ip --url https://vc_ip/sdk/vimService --username snapuser --password 123456 --operation create --snapshotname \"Daily Backup Snapshot\"
For the sake of security I created a Snapshot Manager role and a corresponding user account in Virtual Center with only two allowed operations: Create Snapshot and Remove Snapshot. The run line is self-explanatory; I execute it with the system($run_line) call.
After VM snapshots are created you make a NetApp snapshot:
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap create vm_sata snap_name
To connect to the NetApp console I use the PuTTY SSH client: putty.exe itself has a GUI, while plink.exe is meant for batch scripting. Using this command you create a snapshot of a particular NetApp volume (the ones holding the .vmdks, in our case).
To get all VMs from hot backup mode run:
\"$perl_path\perl\" -w \"$perl_toolkit_path\snapshotmanager.pl\" --server vc_ip --url https://vc_ip/sdk/vimService --username snapuser --password 123456 --operation remove --snapshotname \"Daily Backup Snapshot\" --children 0
The --children 0 flag tells it not to remove child snapshots as well.
Now that we've familiarized ourselves with the main commands, let's move on to the script logic. Most likely you will want to keep several snapshots, for example 7 of them, one for each day of the week. That means each day, before making a new snapshot, you need to remove the oldest one and rename the others. Renaming is just for clarity: you can name your snapshots vmsnap.1, vmsnap.2, … , vmsnap.7, where vmsnap.7 is the oldest. Each night you put your VMs into hot backup mode and delete the oldest snapshot:
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap delete vm_sata vmsnap.7
Then you rename other snapshots:
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.6 vmsnap.7
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.5 vmsnap.6
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.4 vmsnap.5
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.3 vmsnap.4
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.2 vmsnap.3
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.1 vmsnap.2
And create the new one:
\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap create vm_sata vmsnap.1
As a last step you bring your VMs out of hot backup mode.
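To avoid copy-pasting the rename line six times, the rotation can also be written as a small loop. Here is a minimal sketch of that idea in Perl (my own illustration rather than an excerpt from the uploaded script; the plink path, key path, filer address and volume name are placeholders you would adjust):
# Placeholders -- adjust to your environment
my $plink_path = 'C:\Program Files\PuTTY\plink.exe';
my $netapp_ip  = 'netapp_ip';
my $volume     = 'vm_sata';

# Base plink command: key-based SSH login to the filer as root
my $plink = "\"$plink_path\" -ssh -2 -batch -i \"private_key_path\" -l root $netapp_ip";

# Delete the oldest snapshot (vmsnap.7)
system("$plink snap delete $volume vmsnap.7");

# Shift the remaining snapshots up by one: vmsnap.6 -> vmsnap.7, ..., vmsnap.1 -> vmsnap.2
for my $i (reverse 1 .. 6) {
    system("$plink snap rename $volume vmsnap.$i vmsnap." . ($i + 1));
}

# Create the new snapshot as vmsnap.1
system("$plink snap create $volume vmsnap.1");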
Using this technique you can create short-term backups of your virtual infrastructure and use them for long-term retention with the help of standalone backup solutions, like backing up the data from snapshots to a tape library using Symantec BackupExec. I'm gonna talk about this in later posts.
Tags:API, backup, consistent, ESX, hot backup mode, inconsistent, iSCSI, LUN, NetApp, NFS, Perl, plink, PuTTY, script, snapshot, snapshotmanager.pl, storage, VI Perl Toolkit, Virtual Center, virtual machine, VM, vmdk, VMware Infrastructure Perl Toolkit, volume, Web services
Posted in NetApp, Virtualization | Leave a Comment »