Posts Tagged ‘LUN’

Resize limits for SAN LUNs

July 1, 2013

If you run the lun resize command on a NetApp filer, you might run into the following error:

lun resize: New size exceeds this LUN’s initial geometry

The reason behind it is that each SAN LUN has head/cylinder/sector geometry. It’s not an actual physical mapping to the underlying disks and has no meaning these days; it’s simply a SCSI protocol artifact. But it imposes a limit on the maximum LUN resize. Geometry is chosen at initial LUN creation and cannot be changed. Roughly, you can resize a LUN to about 10 times its size at creation. For example, a 50GB LUN can be extended to a maximum of 502GB. See the table below for the maximum sizes:

Initial Size   Maximum Size
< 50g          502g
51-100g        1004g
101-150g       1506g
151-200g       2008g
201-251g       2510g
252-301g       3012g
302-351g       3514g
352-401g       4016g

To check the maximum size for a particular LUN, use the following commands:

> priv set diag
> lun geometry lun_path
> priv set
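
If the new size you need fits within the geometry limit, a plain resize is all it takes. A minimal sketch, with an illustrative path and size:

> lun resize /vol/vol_name/lun_name 500g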

If you run into this issue, unfortunately you will need to create a new LUN, copy all the data (using robocopy, for example) and make a cutover, because features such as volume-level SnapMirror or ndmpcopy recreate the LUN’s geometry together with the data.
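
For the data copy itself, something along these lines usually does the job. The drive letters, switches and log path below are illustrative, not taken from any particular migration:

C:\> robocopy E:\ F:\ /MIR /COPYALL /R:1 /W:1 /LOG:C:\lun_migration.log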

Magic behind NetApp VSC Backup/Restore

June 12, 2013

NetApp Virtual Storage Console (VSC) is a plug-in for VMware vCenter which provides capabilities to perform instant backup/restore using NetApp snapshots. It uses several underlying NetApp features to accomplish its tasks, which I want to describe here.

Backup Process

When you configure a backup job in VSC, what it actually does is simply create a NetApp snapshot of the target volume on the NetApp filer. Interestingly, if you have two VMFS datastores inside one volume, then both LUNs will be snapshotted, since snapshots are done at the volume level. But during a datastore restore, the second datastore will be left intact. You would think that if VSC reverted the volume to the previously made snapshot, then both datastores should be affected, but that’s not the case, because VSC uses Single File SnapRestore to restore the LUN (this is explained below). Creating several VMFS LUNs inside one volume is not a best practice, but it’s good to know that VSC handles this case correctly.

Same thing for VMs. There is no sense in backing up a single VM in a datastore, because VSC will make a volume snapshot anyway. Back up the whole datastore in that case.

Datastore Restore

After a backup is done, you have three restore options. The first and least useful is a datastore restore. The only use case for such a restore that I can think of is disaster recovery. But usually disaster recovery procedures are separate from backups and are based on replication to a disaster recovery site.

VSC uses NetApp’s Single File SnapRestore (SFSR) feature to restore a datastore. In the case of a SAN implementation, SFSR reverts only the required LUN from the snapshot to its previous state instead of the whole volume. My guess is that SnapRestore uses LUN clone/split functionality in the background to create a new LUN from the snapshot, then swaps the old one with the new and deletes the old. But I haven’t found a clear answer to that question.

For this functionality to work, you need a SnapRestore license. In fact, you can do the same trick manually by issuing a SnapRestore command:

> snap restore -t file -s nightly.0 /vol/vol_name/vmfs_lun_name

If you have only one LUN in the volume (as you should), then you can simply restore the whole volume with the same effect:

> snap restore -t vol -s nightly.0 /vol/vol_name
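
Before reverting anything, it’s worth listing the available snapshots to make sure you pick the right one:

> snap list vol_name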

VM Restore

VM restore is also a somewhat controversial way of restoring data, because it completely removes the old VM. There is no way to keep the old .vmdks. You can restore particular virtual hard drives to another datastore, but even in that case the old .vmdks are not kept.

VSC uses another mechanism to perform a VM restore. It creates a LUN clone (not to be confused with FlexClone, which is a volume cloning feature) from a snapshot. A LUN clone doesn’t use any additional space on the filer, because its data is mapped to the blocks which sit inside the snapshot. Then VSC maps the new LUN to the ESXi host which you specify in the restore job wizard. When the datastore is accessible to the ESXi host, VSC simply removes the old VMDKs and performs a Storage vMotion from the clone to the active datastore (or the one you specify in the job). Then the clone is removed as part of the cleanup process.

The equivalent CLI command for that is:

> lun clone create /vol/vol_name/clone_lun_name -o noreserve -b /vol/vol_name/lun_name nightly.0
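
The clone then has to be mapped to the ESXi host’s initiator group before the host can rescan and see it. VSC does this for you, but the manual equivalent would look roughly like this (the igroup name is illustrative):

> lun map /vol/vol_name/clone_lun_name esxi_igroup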

Backup Mount

Probably the most useful recovery method. VSC allows you to mount a backup to a particular ESXi host and do whatever you want with the .vmdks. After the mount you can connect a virtual disk to the same or another virtual machine and recover the data you need.

If you want to connect the disk to the original VM, make sure you change the disk UUID first; otherwise, the VM won’t boot. Connect to the ESXi console and run:

# vmkfstools -J setuuid /vmfs/volumes/datastore/VM/vm.vmdk

Backup mount uses the same LUN cloning feature. The LUN is cloned from a snapshot and connected as a datastore. After an unmount, the LUN clone is destroyed.

Some Notes

VSC doesn’t do a good job of cleaning up after a restore. As part of mapping the LUN to the ESXi hosts, VSC creates new igroups on the NetApp filer, which it doesn’t delete after the restore is completed.

What’s more interesting, when you restore a VM, VSC deletes the .vmdks of the old VM, but leaves all the other files (.vmx, .log, .nvram, etc.) in place. Instead of completely replacing the VM’s folder, it creates a new folder vmname_1 and copies everything into it. So if you use VSC now and then, you will have these old folders left behind.

Unexpected Deduplication Impact on VMware I/O Latency

May 28, 2013

NetApp deduplication is a post-process operation. During normal operation Data ONTAP only calculates hashes for the data blocks; the actual deduplication is carried out off-hours as per the configured schedule. Hash calculation doesn’t affect performance in most cases. I talked about that in my previous post. NetApp states in its documentation that deduplication is a low-priority process:

When one deduplication process is running, there is 0% to 15% performance degradation on other applications.

Once I faced a situation where deduplication was configured to run during business hours on one of the volumes. No one noticed that at some point the volume ran out of space and Data ONTAP was no longer able to perform deduplication. The situation became worse when Data ONTAP was upgraded from version 7.3.2 to 8.1.0. Now, during deduplication, the filer tried to upgrade the fingerprint metadata to the new version at 15:00 every day with the message “Fingerprint is being upgraded” and failed. It seems that the metadata upgrade is a very resource-intensive process and heavily affects I/O latency.

This volume was not a VMware datastore, but it sat on the same aggregate as several VMFS LUNs. Here is what happened to the VMware I/O latency every day at 15:00:

[Graph: VMware datastore I/O latency spiking every day at 15:00]

I removed the host name and the datastore names from the graph. You can see the large latency spike; it won’t send your VMs into a kernel panic, but it’s not something you want your production environment to experience every day.

The solution was simple. After space was increased on the volume, the deduplication metadata upgrade completed successfully and the problem went away. Additionally, deduplication was shifted to off-hours.

The simple lesson to learn: don’t schedule deduplication during the day; you never know what could possibly go wrong.

Connecting VMware ESXi Hosts to NetApp: MPIO Configuration

May 23, 2013

Overview

NetApp filers are active/active ALUA arrays. It means that you can access LUNs configured on one controller via the second one. But access to the partner’s LUNs goes through the internal interconnect and is always slower. That’s why the paths to a controller through its partner are called “unoptimized”. Their primary purpose is to provide backup paths in case of a failover.

Fixed path selection

By default, VMware hosts use the “VMW_SATP_DEFAULT_AA” Storage Array Type Policy (SATP) and “Fixed” Path Selection Policy (PSP) for active/active arrays. If an ESXi host is configured with these SATP and PSP, it will access each LUN via one particular path, even if you have two FC ports on each of the controllers.

A VMware host can’t automatically identify the optimized path. So you can either set it manually or use the functionality of the NetApp Virtual Storage Console (VSC) plug-in for VMware. Just go to the Monitoring and Host Configuration -> Overview section of VSC, right-click the ESXi host and click “Set Recommended Values”. If you don’t do that, ESXi hosts will run I/O traffic through a randomly chosen path, which could turn out to be unoptimized. That means you will push heaps of I/O through the partner node and experience higher latencies.

You can check whether you’re using non-optimized paths by looking for warnings like these on the NetApp:

filer_01> Mon May 6 10:30:45 EST [filer_01: ems.engine.inputSuppress:error]: Event 'scsitarget.partnerPath.misconfigured' suppressed 327 times since Mon May 6 09:30:48 EST 2013.
Mon May 6 10:30:45 EST [filer_01: scsitarget.partnerPath.misconfigured:error]: FCP Partner Path Misconfigured - Host I/O access through a non-primary and non-optimal path was detected.

Or run “lun stats -o” and look for huge numbers under “Partner Ops” and “Partner KBytes”.
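
You can also cross-check from the ESXi side which SATP and PSP each device uses and which path it currently runs over. A quick sketch with ESXi 5 esxcli (the device ID is illustrative):

# esxcli storage nmp device list
# esxcli storage nmp path list -d naa.60a98000xxxxxxxx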

ALUA configuration

If you want to utilize both links to the controller in a round-robin fashion, you need some additional configuration. You should enable ALUA for your VMware ESXi hosts’ initiator group on the NetApp:

igroup set <group> alua yes

Now you need to reboot the ESXi host. After a reboot it will see that the storage is ALUA-capable and change the SATP to VMW_SATP_ALUA, with the PSP defaulting to “Most Recently Used”. To utilize load balancing between the two controller paths you need to change the PSP to “Round Robin”. Again, you can do that either manually or via VSC.
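
If you prefer to do it from the ESXi 5 command line, you can set Round Robin per device, or make it the default PSP for the ALUA SATP so that newly discovered LUNs pick it up automatically. A sketch (the naa ID is illustrative):

# esxcli storage nmp device set --device naa.60a98000xxxxxxxx --psp VMW_PSP_RR
# esxcli storage nmp satp set --satp VMW_SATP_ALUA --default-psp VMW_PSP_RR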

Note: Don’t ever use ALUA and VMW_SATP_ALUA if you have Windows Server 2003 MSCS or Windows Server 2008 Failover Cluster with shared RDM LUNs. It’s an unsupported configuration and you can run into a cluster failure situation. This is described in many places.

In this case leave the SATP as “VMW_SATP_DEFAULT_AA” and the PSP as “Fixed”, and make sure that you use the optimized paths.

NetApp thin-provisioning for VMware LUNs

May 22, 2013


LUN and Volume Thin Provisioning

I already described thin provisioning of VMware NFS volumes some time ago here. Now I want to discuss thin provisioning of LUNs.

LUNs are different from the VMFS-on-NFS implementation, because a LUN is an additional container inside a NetApp FlexVol. So if you’re using FC, you need to thin provision both the LUN and the volume:

> lun set reservation /vol/targetvol/targetlun disable
> vol options targetvol guarantee none
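
You can then double-check what you’ve ended up with; the paths are the same illustrative ones as above:

> vol options targetvol
> lun show -v /vol/targetvol/targetlun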

In fact, you can make the LUN thin and the volume thick. Then storage space that’s not used by the LUN is returned to the volume level. But in this case it cannot be used by other volumes as a shared pool of space.

As a best practice, NetApp now recommends setting Fractional Reserve and Snap Reserve for your volumes to 0%. Don’t forget about that if you want to save more storage space:

> vol options targetvol fractional_reserve 0
> snap reserve targetvol 0

Disable snapshots if you don’t use them:

> snap sched targetvol 0

It’s as easy as that. Now you don’t waste space by reserving it ahead of time, but use it as a shared pool of resources. But make sure to monitor aggregate free space. If you start running out of storage, plan the purchase of new disks in advance or redistribute data between other aggregates.

Safety Features

Disabling volume, LUN and snapshot reservations helps you save storage space. The drawback of this approach is that you don’t have any mechanism in place to prevent volume out-of-space situations. If you enable snapshots on the volume and they consume all the volume space, the volume goes offline, which is a very undesirable consequence. NetApp has two features that can serve as a safety net in thin-provisioned environments: autosize and snap autodelete.

Snap autodelete automatically removes old snapshots if there is no space left inside the volume. Autosize, on the other hand, allows the volume to automatically grow to a specified limit (+20% of the volume size by default) in specified increments (5% of the volume size by default). You can also specify which to try first, autosize or snapshot autodelete, using the ‘try_first’ option.

> snap autodelete targetvol on
> vol autosize targetvol on
> vol options targetvol try_first volume_grow
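
If the default +20% ceiling and 5% increment are not enough, you can set your own maximum and increment explicitly; the sizes below are just an example:

> vol autosize targetvol -m 1200g -i 50g on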

SnapMirror Considerations

If you use SnapMirror and turn on autosize on the source volume, the destination volume won’t grow automatically, and SnapMirror will break the relationship if it runs out of space on the smaller destination volume. The trick here is to make the destination volume as big as the autosize limit of the source volume and thin provision the destination volume. By doing that you won’t run out of space on the destination even if the source volume grows to its maximum.

Further reading

TR-3965: NetApp Thin Provisioning Deployment and Implementation Guide Data ONTAP 8.1 7-Mode

NetApp Deduplication in a Nutshell

May 12, 2013

NetApp uses the Write Anywhere File Layout (WAFL) filesystem, which is the key to NetApp’s efficient snapshot technology. If you’re already familiar with how snapshots are implemented in Data ONTAP, then understanding the underlying mechanics of deduplication is simple. The filer calculates a hash for each data block it receives and preserves it as metadata at the volume level. Then, according to the deduplication schedule (usually on weekends), the filer processes this metadata and, for each duplicate hash, changes the data pointer to the original block of data.

Since the filer needs to calculate a hash for each data block on the fly, there is a penalty. On systems with the CPU loaded by less than 50%, the performance impact is negligible. For heavily loaded systems, where the CPU is nearly 100% busy, the performance impact can be around 15%. For high-end 6000-series systems the penalty can jump up to 35% for random writes. Heavy sequential read operations can also suffer from deduplication, because reads can be redirected randomly across physical storage, depending on where the original data block actually is. In general, deduplication has a low impact on system performance. But you can’t use it blindly and should keep in mind that in particular cases it can slow down your storage system.

Deduplication configuration is pretty simple. First of all, you need to activate deduplication on a particular volume:

> sis on /vol/targetvol

If you already have data on the volume, you need to scan it; otherwise, it won’t be deduped. This is a common mistake. Deduplication is a low-priority task, but keep in mind that it can slightly impact your storage performance when done during business hours, especially if you run deduplication for several volumes simultaneously.

> sis start -s /vol/targetvol

To show the status of deduplication for a particular volume:

> sis status /vol/targetvol

To see the deduplication schedule:

> sis config /vol/targetvol
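
If the schedule falls into business hours, you can move it to, say, weekend nights. The schedule string below is an example in the day_list@hour_list format:

> sis config -s sat,sun@23 /vol/targetvol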

And the most pleasant command to find out how much data you’ve saved:

> df -s targetvol

If you want to undo deduplication, first switch it off and then undo it using the following commands:

> sis off /vol/targetvol
> priv set advanced
*> sis undo /vol/targetvol
*> priv set

Personally, I was able to achieve a 40% deduplication rate for VMware VMFS datastores, which is rather impressive considering these were mixed OS + application data LUNs.

As a final note, I would like to point out that deduplication is suitable only for environments with a high percentage of similar data. VMware is a good example of that. You won’t get any significant deduplication ratio for swap file volumes, Exchange mailboxes or Symantec Enterprise Vault, which is already deduplicated.

Further reading:

TR-3958: NetApp Data Compression and Deduplication Deployment and Implementation Guide: Data ONTAP 8.1 Operating in 7-Mode

Storwize V7000 with vSphere 5 storage configuration

December 1, 2012

Information on how to configure Storwize for optimal performance is very scarce. I’ll try to build some understanding of it from bits and pieces gathered throughout the Internet and redbooks.

Barry Whyte gave many insights into Storwize internals in his blog, particularly in his “Configuring IBM Storwize V7000 and SVC for Optimal Performance” series of posts. I’ll quote him here. The main Storwize redbook, “Implementing the IBM Storwize V7000 V6.3”, is mostly an administration guide and gives no useful information on the topic. I find “SAN Volume Controller Best Practices and Performance Guidelines” way more helpful (Storwize firmware is built on SVC code).

Total Number of MDisks

That’s what Barry says:

… At the heart of each V7000 controller canister is an Intel Jasper Forrest (Sandy Bridge) based quad core CPU. … When we added the tried and trusted (SSA) DS8000 RAID functionality in 2010 (6.1.0) we therefore assigned RAID processing on a per mdisk basis to a single core. That means you need at least 4 arrays per V7000 to get maximal CPU core performance. …

Number of MDisks per Storage Pool

SVC Redbook:

The capability to stripe across disk arrays is the single most important performance advantage of the SVC; however, striping across more arrays is not necessarily better. The objective here is to only add as many arrays to a single Storage Pool as required to meet the performance objectives.

If the Storage Pool is already meeting its performance objectives, we recommend that, in most cases, you add the new MDisks to new Storage Pools rather than add the new MDisks to existing Storage Pools.

Table 5-1 shows the recommended number of arrays per Storage Pool that is appropriate for general cases.

Controller type       Arrays per Storage Pool
DS4000/DS5000         4 - 24
DS6000/DS8000         4 - 12
IBM Storwize V7000    4 - 12

The development recommendations for Storwize V7000 are summarized below:

  • One MDisk group per storage subsystem
  • One MDisk group per RAID array type (RAID 5 versus RAID 10)
  • One MDisk and MDisk group per disk type (10K versus 15K RPM, or 146 GB versus 300 GB)

There are situations where multiple MDisk groups are desirable:

  • Workload isolation
  • Short-stroking a production MDisk group
  • Managing different workloads in different groups

We recommend that you have at least two MDisk groups, one for key applications, another for everything else.

Number of LUNs per Storage Pool

SVC Redbook:

We generally recommend that you configure LUNs to use the entire array, which is especially true for midrange storage subsystems where multiple LUNs configured to an array have shown to result in a significant performance degradation. The performance degradation is attributed mainly to smaller cache sizes and the inefficient use of available cache, defeating the subsystem’s ability to perform “full stride writes” for Redundant Array of Independent Disks 5 (RAID 5) arrays. Additionally, I/O queues for multiple LUNs directed at the same array can have a tendency to overdrive the array.

Table 5-2 provides our recommended guidelines for array provisioning on IBM storage subsystems.

Controller type                     LUNs per array
IBM System Storage DS4000/DS5000    1
IBM System Storage DS6000/DS8000    1 - 2
IBM Storwize V7000                  1

General considerations

Let’s take a look at a vSphere use case on top of Storwize with 16 x 600GB SAS drives in the control enclosure and 10 x 2TB NL-SAS in the expansion enclosure (our own case).

First of all we need to decide how many arrays we need. Do we have different workloads? No. All storage will be assigned to virtual machines which have, in general, the same random read/write access pattern. Do we need to isolate workloads? Probably yes; it’s generally a good idea to separate highly critical production VMs from everything else. Do we have different drive types? Yes. Obviously we don’t want to mix drive types in one RAID. Are we going to make different RAID types? Again, yes. RAID 10 is appropriate on SAS and RAID 5 on NL-SAS. So two MDisks, one RAID 10 on SAS and one RAID 5 on NL-SAS, would be enough. Storwize nodes have 4 cores each. It may seem that you would benefit from 4 MDisks, but in fact you won’t. Here’s what Barry says:

In the case where you only have 1 or 2 HDD arrays, then the core stuff doesn’t really come into play. Its only when you get to larger systems, where you are driving more I/O than a single RAID core can handle that you need to spread them.

This is also true if you are running all SSD arrays, so 24x SSD would be best split into 4 arrays to get maximum IOPs, whereas 24x HDD are not going to saturate a single core, so (if you could create a 23+P! [ you can’t 15+P is largest we support ] then it would perform as well as 2x 11+P etc

Now to storage pools. In our example we have two MDisks, so you simply make two storage pools. In the future, if you hit a performance limit, you create additional MDisks, and then you have two options. If each MDisk on its own can sustain your performance requirements, you make additional storage pools and redistribute the workload between them. If the load on storage is huge and even redistributing VMs between two arrays doesn’t help, then you’re better off combining the two MDisks of each type into a single storage pool to stripe between the MDisks.

Same story for the number of LUNs. IBM recommends a one-to-one LUN to MDisk relationship. But read carefully: the recommendation comes from the fact that different workloads can clash and degrade array performance. If we have generally the same I/O pattern coming to the array, it’s safe to make several LUNs on it, as long as latency stays in the acceptable range. Moreover, when it comes to vSphere and VMFS, it’s beneficial to have at least two volumes in terms of manageability. With several LUNs you will at least have the ability to move VMs between LUNs for reconfiguration purposes. Also keep in mind that the ESXi 5 hypervisor limits each host to a storage queue depth of 32 per LUN. It means that if you have one big LUN and many VMs running on the host, you can quickly hit the queue limit. On the other hand, do not create too many LUNs or you will oversubscribe the storage processors (SPs).
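
If you want to see what a particular device reports on an ESXi 5 host, esxcli shows it together with the rest of the device details; the output should include the device’s maximum queue depth (the naa ID below is illustrative):

# esxcli storage core device list -d naa.60050768xxxxxxxx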

Sample configuration

IBM recommends constructing both RAID 10 and RAID 5 arrays from 8 drives + 1 spare. But since we have 16 SAS and 10 NL-SAS drives, I would launch the CLI and create two arrays: one RAID 10 of 14 drives + 2 spares and one RAID 5 of 8 drives + 2 spares (or 9 drives + 1 spare, but it’s not a good idea to create a RAID with an uneven number of drives). Each RAID goes in its own pool, with several LUNs in each pool. I would go for 2TB LUNs.
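
A rough CLI sketch of that layout, assuming the SAS drives got IDs 0-13 and the NL-SAS drives 16-23; pool names, extent size, drive IDs and LUN sizes are illustrative, and the exact syntax should be checked against your firmware level:

svctask mkmdiskgrp -name sas_pool -ext 256
svctask mkarray -level raid10 -drive 0:1:2:3:4:5:6:7:8:9:10:11:12:13 -sparegoal 2 sas_pool
svctask mkmdiskgrp -name nlsas_pool -ext 256
svctask mkarray -level raid5 -drive 16:17:18:19:20:21:22:23 -sparegoal 2 nlsas_pool
svctask mkvdisk -mdiskgrp sas_pool -size 2048 -unit gb -name vmfs_sas_01 -vtype striped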

Zoning vs. LUN masking explained

September 28, 2012

Zoning and masking are often confused by those who have just started working with SAN. But it takes 5 minutes of googling to understand that the main difference is that zoning is configured on a SAN switch on a port (or WWN) basis, while masking is a storage feature with LUN granularity. All modern hardware supports both zoning and masking. Given that, the much more interesting question is what the point of zoning is if there is masking with finer granularity.

Both security features do the same thing: restrict access to particular storage targets. So it may seem that there is no point in configuring both of them. But that’s not true. One, not that convincing, argument is that if one of the features is accidentally misconfigured, you still maintain security. But the much bigger issue in a no-zoning configuration is RSCNs. RSCNs are Registered State Change Notification messages which are issued by the SAN Name Server service when the fabric changes its configuration (a new device has been added to the fabric, a zone has changed, a switch name or IP address has changed, etc.). RSCNs can be disruptive to fabric operation. And if you don’t have zones, RSCNs are flooded to everyone each time something changes in the fabric, even if it has nothing to do with the majority of devices. So zoning is a SAN best practice and its configuration is highly recommended.

In fact, Brocade recommends adopting a so-called Single Initiator Zoning (SIZ) practice, where one host pWWN (initiator) is zoned to one or more storage pWWNs. It reduces the RSCN issue to a minimum.
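
On a Brocade switch a single-initiator zone is created with the usual zonecreate/cfgadd/cfgenable sequence. A sketch with made-up zone and config names and placeholder WWPNs, assuming the zoning config already exists:

zonecreate "z_esx01_hba0_fas01", "21:00:00:24:ff:xx:xx:xx; 50:0a:09:8x:xx:xx:xx:xx"
cfgadd "prod_cfg_a", "z_esx01_hba0_fas01"
cfgenable "prod_cfg_a"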

For the best reference on the topic, read Brocade’s Secure SAN Zoning – Best Practices.

Windows MPIO with IBM storage

September 17, 2012

IBM mid-range storage systems (like the DS3950) work in active/passive mode. It means that access to each LUN is given through one controller, in contrast to active/active storage where data between the host and the two controllers can flow in a round-robin fashion. So the redundant path here is used only for failover. The software which provides this failover functionality is called Multipath I/O (MPIO) and has implementations for all operating systems. I’ll describe how to configure the MPIO version for Windows.

Installation

Prior to Windows Server 2008, Microsoft didn’t have its own MPIO implementation and MPIO was distributed with the IBM DS Storage Manager product. Now you can install MPIO from the “Features” sub-menu of Windows Server 2008 Server Manager. After installation is complete you will find MPIO configuration options under Control Panel and in Administrative Tools.

IBM storage works well with the default Windows MPIO implementation, however it’s recommended to install the IBM DSM (device-specific module) from the Storage Manager installation bundle. In my case the MPIO installation file was called SMIA-WSX64-01.03.0305.0608.

Enable multipathing

Initially you will see two hard drives for each LUN in Device Manager. You can enable MPIO for a particular hardware ID (in other words, a storage system) on the Discover Multi-Paths tab of the MPIO control panel. You can’t do that with LUN granularity. After you add the selected devices and reboot, you will see them on the “MPIO Devices” tab. Now each LUN will be seen as a single hard drive in Device Manager.

Configure preferred path

MPIO supports several load-balancing policies, which are configured on a per-LUN basis from the MPIO tab of a hard drive in Device Manager. As the Load Balance Policy, select Fail Over Only. Then for each path select which one is Active/Optimized and which is a Standby path. Also make the active path Preferred, so that after a failover it fails back to it.
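
On Windows Server 2008 R2 and later, the built-in mpclaim utility can show and set the same policies from the command line. A sketch (the disk number is illustrative; policy 1 corresponds to Fail Over Only):

C:\> mpclaim -s -d
C:\> mpclaim -l -d 1 1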

Don’t be confused by iSCSI in the figure; it’s the same for pure FC. It’s just for reference.

Check configuration

When you configure active and passive paths, you assume that the first path listed goes to controller A and the second path to controller B. But, in fact, there is no indication of that on the configuration page and you can neither confirm nor deny it. The only IDs you see are adapter ports, but they don’t even map to the actual ports on the HBAs.

To be able to check your configuration you need to install the IBM SMdevices utility which comes with IBM DS Storage Manager. Run the DS Storage Manager installation and choose Custom Installation; there you need to check only the Utilities part. In the SMdevices output you can see which path is preferred for each LUN and whether it’s configured as active (In Use):

C:\Program Files\IBM_DS\util>SMdevices
IBM System Storage DS Storage Manager Devices
. . .
\\.\PHYSICALDRIVE1 [Storage Subsystem ITSO5300, Logical
Drive 1, LUN 0, Logical Drive ID
<600a0b80002904de000005a246d82e17>, Preferred Path
(Controller-A): In Use]

References

The best reference I found on the topic is the IBM Midrange System Storage Hardware Guide (SG24-7676-01), from p. 453: DS5000 logical drive representation in Windows Server 2008. As well as the Installing and Configuring MPIO guide from Microsoft.

Consistent VMware snapshots on NetApp

March 16, 2012

If you use NetApp as storage for your VMware hard drives, it’s wise to utilize NetApp’s powerful snapshot capabilities as an instant backup tool. I briefly mentioned in my previous post that you should disable the default snapshot schedule. A snapshot is taken very quickly on NetApp, but it’s still not instantaneous. If a VM is running, you can get .vmdks with inconsistent data. Here I’d like to describe how you can perform consistent snapshots of VM hard drives which sit on NetApp volumes exported via NFS. Obviously it won’t work for iSCSI LUNs, since you will get snapshots of the LUNs, which are almost useless for backups.

What makes the VMware virtualization platform far superior to other well-known solutions on the market is the VI API. The VI API is a set of Web services hosted on Virtual Center and ESX hosts that provides interfaces to all components and operations. In particular, there is a Perl interface for the VI API, called the VMware Infrastructure Perl Toolkit, which you can download and install for free. Using the VI Perl Toolkit you can write a script which will, every day, put your VMs into so-called hot backup mode and make NetApp snapshots as well. Practically speaking, hot backup mode is also a snapshot: when you create a VM snapshot, the original VM hard drive is left intact and VMware starts writing the delta to another file. It means that the VM hard drive won’t change while the NetApp snapshot is being made, and you will get consistent .vmdk files. Now let’s move to the implementation.

I will only quote excerpts from the actual script here, because the lines in the script are quite long and everything would get messed up on the blog page. I uploaded the full script to FileDen; here is the link. I apologize if you are reading this blog entry long after it was published and my account or the FileDen service itself no longer exists.

The VI Perl Toolkit is effectively a set of Perl scripts which you run as ready-to-use utilities. We will use snapshotmanager.pl, which lets you create VMware VM snapshots. In the first step you make snapshots of all VMs:

\"$perl_path\perl\" -w \"$perl_toolkit_path\snapshotmanager.pl\" --server vc_ip --url https://vc_ip/sdk/vimService --username snapuser --password 123456 --operation create --snapshotname \"Daily Backup Snapshot\"

For the sake of security I created a Snapshot Manager role and a respective user account in Virtual Center with only two allowed operations: Create Snapshot and Remove Snapshot. The run line is self-explanatory. I execute it using the system($run_line) command.

After the VM snapshots are created, you make a NetApp snapshot:

"\$plink_path" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap create vm_sata snap_name

To connect to the NetApp terminal I use the PuTTY ssh client. putty.exe itself has a GUI, while plink.exe is meant for batch scripting. Using this command you create a snapshot of a particular NetApp volume, the ones which hold the .vmdks in our case.

To bring all VMs out of hot backup mode, run:

\"$perl_path\perl\" -w \"$perl_toolkit_path\snapshotmanager.pl\" --server vc_ip --url https://vc_ip/sdk/vimService --username snapuser --password 123456 --operation remove --snapshotname \"Daily Backup Snapshot\" --children 0

The --children 0 option here tells it not to remove all the child snapshots.

Now that we are familiar with the main commands, let’s move on to the script logic. Most likely you will want to keep several snapshots, for example 7 of them, one for each day of the week. It means that each day, before making a new snapshot, you will need to remove the oldest one and rename the others. Renaming is just for clarity. You can name your snapshots vmsnap.1, vmsnap.2, … , vmsnap.7, where vmsnap.7 is the oldest. Each night you put your VMs into hot backup mode and delete the oldest snapshot:

"\$plink_path" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap delete vm_sata vmsnap.7

Then you rename other snapshots:

"\$plink_path" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.6 vmsnap.7
"\$plink_path" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.5 vmsnap.6
"\$plink_path" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.4 vmsnap.5
"\$plink_path" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.3 vmsnap.4
"\$plink_path" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap rename vm_sata vmsnap.2 vmsnap.3

And create the new one:

"\$plink_path" -ssh -2 -batch -i \"private_key_path\" -l root netapp_ip snap create vm_sata vmsnap.1

As a last step you bring your VMs out of hot backup mode.

Using this technique you can create short-term backups of your virtual infrastructure and use them for long-term retention with the help of standalone backup solutions, like backing up data from snapshots to a tape library using Symantec Backup Exec. I’m going to talk about this in later posts.