Posts Tagged ‘volume’

AWS Cloud Protection Manager Part 3: Backup and Restore

August 21, 2017

Backup

Backups are created according to the schedule specified in the backup policy. We discussed how to configure backup policies in the previous blog post of the series. The list of backups you see on the Backup Monitor tab are your restore points. Backups that are older then the specified retention policy will be purged from the list and you will not see them there, unless you move them to “Freezer”.

It is important to understand that apart from volume snapshots, for each backed up instance CPM also creates an AMI. Those who has hands-on experience with AWS may already know, that AMIs is the only way to create clones of Windows EC2 instances in AWS. If you go to AWS console and try to find a clone action under the instance Action menu, you won’t find any. You will have “Create Image” instead. It creates an AMI, from which you can then spin up a clone of an instance the image was created from.

CPM does exactly that. For each backup policy the instance is under, it creates one AMI. In our example we have four backup policies, that will result in four AMIs for each of the instances. Every AMI has to have at least one storage volume. So CPM will include the root volume of each instance into AMI, just because it has to. But AMIs are required only to restore EC2 instance configuration. Data is restored from volume snapshots, that can be used to create new volumes from them and then attach them to the instance. You can click on the View button under Snapshots to find the corresponding snapshot and AMI IDs.

There is a backup log for each job run as well that is helpful for issue troubleshooting.

Restore

To perform a restore click on the Recover button next to the backup job and you will get the list of the instances you can recover. CPM offers you three options: instance recovery, volume recovery and file recovery. Let’s go back to front.

File recovery is probably the most used recovery option. As it lets you restore individual files. When you click on the “Explore” button, CPM creates new volumes from the snapshots you are restoring from and mount them to the CPM instance. You are then presented with a simple file system browser where you can find the file and click on the green down arrow icon in Download column to save the file to your computer.

If you click on “Volume Only”, you can restore particular volumes. Restored volumes are not attached to any instance, unless you specify it under “Attach to Instance” column. You can then select under “Attach Behaviour” what CPM should do if such volume is already attached to the instance or if you want to automatically detach the original volume, but the instance is running (you can do it only if instance is stopped).

And the last option is “Instance”. It will create a clone of the original instance using the pre-generated AMI and volume snapshots, as we discussed in the Backup section of this blog post. You can specify many options under Advanced Options section, including recovery to another VPC or different availability zone. If anything, make sure you specify a new IP address for the instance, otherwise you’ll have a conflict and your restore will fail. Ideally you should also shut down the original EC2 instance before spinning up a restore clone.

Advanced Features

There are quite a few worth mentioning. So far we have looked at simple EC2 instance restore. But you don’t have to backup whole instances, you can also backup individual volumes. On top of that, CPM supports RDS database, Aurora and Redshift cluster backups.

If you run MS Exchange, Sharepoint or SQL on your EC2 instances, you can install CPM backup agent on them to ensure you have application-consistent backups via VSS, as opposed to crash-consistent backups you get if agent is not used. If you install the agent, you can also run a script on the instance before and after the backup is taken.

Last but not least is DR. Restoring to another availability zone within the region is already supported on instance recovery level. You can choose availability zone you want to restore to. It is not possible to recover to another region, though. Because AWS snapshots and AMIs are local to the region they are created in. If you want to be able to recover to another region, you can configure DR in CPM, which will utilise AWS AMI and snapshot copy functionality to copy backups to another region at configured frequency.

Conclusion

Overall, I found Cloud Protection Manager very easy to install, configure and use. If you come from infrastructure background, at first glance CPM may look to you like a very basic tool, compared to such feature-rich solutions like Veeam or Commvault. But that feeling is misleading. CPM is simple, because AWS simple. All infrastructure complexity is hidden under the covers. As a result, all AWS backup tools need to do is create snapshots and CPM does it well.

How to move aggregates between NetApp controllers

September 25, 2013

Stop Sign_91602

 

DISCLAMER: I ACCEPT NO RESPONSIBILITY FOR ANY DAMAGE OR CORRUPTION OF DATA THAT MAY OCCUR AS A RESULT OF CARRYING OUT STEPS DESCRIBED BELOW. YOU DO THIS AT YOUR OWN RISK.

 

We had an issue with high CPU usage on one of the NetApp controllers servicing a couple of NFS datastores to VMware ESX cluster. HA pair of FAS2050 had two shelves, both of them owned by the first controller. The obvious solution for us was to reassign disks from one of the shelves to the other controller to balance the load. But how do you do this non-disruptively? Here is the plan.

In our setup we had two controllers (filer1, filer2), two shelves (shelf1, shelf2) both assigned to filer1. And two aggregates, each on its own shelf (aggr0 on shelf0, aggr1 on shelf1). Say, we want to reassign disks from shelf2 to filer2.

First step is to migrate all of the VMs from the shelf2 to shelf1. Because operation is obviously disruptive to the hosts accessing data from the target shelf. Once all VMs are evacuated, offline all volumes and an aggregate, to prevent any data corruption (you can’t take aggregate offline from online state, so change it to restricted first).

If you prefer to reassign disks in two steps, as described in NetApp Professional Services Tech Note #021: Changing Disk Ownership, don’t forget to disable automatic ownership assignment on both controllers, otherwise disks will be assigned back to the same controller again, right after you unown them:

> options disk.auto_assign off

It’s not necessary if you change ownership in one step as shown below.

Next step is to actually reassign the disks. Since they are already part of an aggregate you will need to force the ownership change:

filer1> disk assign 1b.01.00 -o filer2 -f

filer1> disk assign 1b.01.01 -o filer2 -f

filer1> disk assign 1b.01.nn -o filer2 -f

If you do not force disk reassignment you will get an error:

Assign request failed for disk 1b.01.0. Reason:Disk is part of a failed or offline aggregate or volume. Changing its owner may prevent aggregate or volume from coming back online. Ownership may be changed only by using the appropriate force option.

When all disks are moved across to filer2, new aggregate will show up in the list of aggregates on filer2 and you’ll be able to bring it online. If you can’t see the aggregate, force filer to rescan the drives by running:

filer2> disk show

The old aggregate will still be seen in the list on filer1. You can safely remove it:

filer1> aggr destroy aggr1

Disabling NetApp Growth Rate Abnormal Alert

July 1, 2013

NetApp DataFabric Manager has one annoying alert, which notifies that space utilization of the volume increases more quickly than expected. I used to receive dozens of these alerts each morning until I had done the following:

> dfm eventtype modify -v information volume-growth-rate:abnormal

This command changes the default severity for this event from warning to informational. Since my notifications configured to send everything higher or equal to warning, I no longer receive this alert.

There is a “volume full” event which triggers at 90%, I believe, which is enough for me.

Resize limits for SAN LUNs

July 1, 2013

If you run lun resize command on NetApp you might run into the following error:

lun resize: New size exceeds this LUN’s initial geometry

The reason behind it is that each SAN LUN has head/cylinder/sector geometry. It’s not an actual physical mapping to the underlying disks and has no meaning these days. It’s simply a SCSI protocol artifact. But it imposes limitation on maximum LUN resize. Geometry is chosen at initial LUN creation and cannot be changed. Roughly you can resize the LUN to the size, which is 10 times bigger than the size at the time of creation. For example, the 50GB LUN can be extended to the maximum of 502GB. See the table below for the maximum sizes:

Initial Size   Maximum Sizedata-storage
< 50g          502g
51-100g        1004g
101-150g       1506g
151-200g       2008g
201-251g       2510g
252-301g       3012g
302-351g       3514g
352-401g       4016g

To check the maximum size for particular LUN use the following commands:

> priv set diag
> lun gemetry lun_path
> priv set

If you run into this issue, unfortunately you will need to create a new LUN, copy all the data using robocopy for example and make a cutover. Because such features as volume level SnapMirror or ndmpcopy will recreate the LUN’s geometry together with the data.

Magic behind NetApp VSC Backup/Restore

June 12, 2013

netapp_dpNetApp Virtual Storage Console is a plug-in for VMware vCenter which provides capabilities to perform instant backup/restore using NetApp snapshots. It uses several underlying NetApp features to accomplish its tasks, which I want to describe here.

Backup Process

When you configure a backup job in VSC, what VSC does, is it simply creates a NetApp snapshot for a target volume on a NetApp filer. Interestingly, if you have two VMFS datastores inside one volume, then both LUNs will be snapshotted, since snapshots are done on the volume level. But during the datastore restore, the second volume will be left intact. You would think that if VSC reverts the volume to the previously made snapshot, then both datastores should be affected, but that’s not the case, because VSC uses Single File SnapRestore to restore the LUN (this will be explained below). Creating several VMFS LUNs inside one volume is not a best practice. But it’s good to know that VSC works correctly in this case.

Same thing for VMs. There is no sense of backing up one VM in a datastore, because VSC will make a volume snapshot anyway. Backup the whole datastore in that case.

Datastore Restore

After a backup is done, you have three restore options. The first and least useful kind is a datastore restore. The only use case for such restore that I can think of is disaster recovery. But usually disaster recovery procedures are separate from backups and are based on replication to a disaster recovery site.

VSC uses NetApp’s Single File SnapRestore (SFSR) feature to restore a datastore. In case of a SAN implementation, SFSR reverts only the required LUN from snapshot to its previous state instead of the whole volume. My guess is that SnapRestore uses LUN clone/split functionality in background, to create new LUN from the snapshot, then swap the old with the new and then delete the old. But I haven’t found a clear answer to that question.

For that functionality to work, you need a SnapRestore license. In fact, you can do the same trick manually by issuing a SnapRestore command:

> snap restore -t file -s nightly.0 /vol/vol_name/vmfs_lun_name

If you have only one LUN in the volume (and you have to), then you can simply restore the whole volume with the same effect:

> snap restore -t vol -s nightly.0 /vol/vol_name

VM Restore

VM restore is also a bit controversial way of restoring data. Because it completely removes the old VM. There is no way to keep the old .vmdks. You can use another datastore for particular virtual hard drives to restore, but it doesn’t keep the old .vmdks even in this case.

VSC uses another mechanism to perform VM restore. It creates a LUN clone (don’t confuse with FlexClone,which is a volume cloning feature) from a snapshot. LUN clone doesn’t use any additional space on the filer, because its data is mapped to the blocks which sit inside the snapshot. Then VSC maps the new LUN to the ESXi host, which you specify in the restore job wizard. When datastore is accessible to the ESXi host, VSC simply removes the old VMDKs and performs a storage vMotion from the clone to the active datastore (or the one you specify in the job). Then clone is removed as part of a clean up process.

The equivalent cli command for that is:

> lun clone create /vol/clone_vol_name -o noreserve -b /vol/vol_name nightly.0

Backup Mount

Probably the most useful way of recovery. VSC allows you to mount the backup to a particular ESXi host and do whatever you want with the .vmdks. After the mount you can connect a virtual disk to the same or another virtual machine and recover the data you need.

If you want to connect the disk to the original VM, make sure you changed the disk UUID, otherwise VM won’t boot. Connect to the ESXi console and run:

# vmkfstools -J setuuid /vmfs/volumes/datastore/VM/vm.vmdk

Backup mount uses the same LUN cloning feature. LUN is cloned from a snapshot and is connected as a datastore. After an unmount LUN clone is destroyed.

Some Notes

VSC doesn’t do a good cleanup after a restore. As part of the LUN mapping to the ESXi hosts, VSC creates new igroups on the NetApp filer, which it doesn’t delete after the restore is completed.

What’s more interesting, when you restore a VM, VSC deletes .vmdks of the old VM, but leaves all the other files: .vmx, .log, .nvram, etc. in place. Instead of completely substituting VM’s folder, it creates a new folder vmname_1 and copies everything into it. So if you use VSC now and then, you will have these old folders left behind.

Unexpected Deduplication Impact on VMware I/O Latency

May 28, 2013

NetApp deduplication is a postponed process. During normal operation Data ONTAP only calculates hashes for the data blocks. Actual deduplication is carried out off-hours as per configured schedule. Hash calculation doesn’t affect performance in most cases. I talked about that in my previous post. NetApp states in its documentation that deduplication is a low-priority process:

When one deduplication process is running, there is 0% to 15% performance degradation on other applications.

Once I faced a situation when deduplication was configured to be carried out during business hours on one of the volumes. No one noticed that at some point volume run out of space and Data ONTAP wasn’t able to perform deduplication from that time. Situation became worse when Data ONTAP was upgraded from version 7.3.2 to 8.1.0. Now during deduplication filer tried to upgrade the fingerprint metadata to a new version at 15:00 every day with the message: “Fingerprint is being upgraded” and failed. It seems that the metadata upgrade is a very resource-intensive process and heavily affects I/O latency.

This volume was not a VMware datastore, but it sit on the same aggregate together with the several VMFS LUNs. Here what happened to the VMware I/O latency every day at 15:00 (click to enlarge):

dedup_issue_ed

I deleted the host name and the datastores names from the graph. You can see the large latency spike, which won’t turn yourVMs into kernel panic, but it’s not the thing you would want your production environment to experience every day.

The solution was simple. After space was increased on this volume, deduplication metadata upgrade performed successfully and problem went away. Additionally, deduplication was shifted to off-hours.

The simple lesson to learn: don’t schedule deduplication during the day, you never know what could possibly go wrong.

NetApp Reallocate

May 24, 2013

top-defragment-2The smallest addressable block of data in Data ONTAP is 4k. However, all data is written to volumes in 256k chunks. When data block which is bigger than 256k comes in, filer searches for contiguous 256k of free space in the file system. If it’s found, data block is written into it, if not then filer splits the data block and puts it in several places. It’s called fragmentation and is familiar to everyone from the times, when FAT files ystems were in use. It’s not a big issue in modern file systems, like NTFS or WAFL, but defragmentation can help to solve performance problems in some situations.

In mostly random read/write environments (which is quite common these days) fragmentation has no impact on performance. If you write or read data from random places of the hard drive it doesn’t matter if this data is random or sequential on the physical media. NetApp recommends to consider defragmentation for the applications with sequential read type of workload:

  • Online transaction processing databases that perform large table scans
  • E-mail systems that use database storage with verification processes
  • Host-side backup of LUNs

Reallocation process uses thresholds values to represent the file system layout optimization level, where 4 is normal and everything bigger than 10 is not optimal.

To check the level of optimization for particular volume use:

> reallocate measure –o /vol/vol_name

If you decide to run reallocate on the volume, run:

> reallocate start –f /vol/vol_name

There are certain considerations if you’re using snapshots or deduplication on volumes. There is a “-p” option, to prevent inflating snapshots during reallocate. And from version 8.1 Data ONTAP also supports reallocation of deduplicated volumes. Consult official documentation for additional information.

Further reading:

TR-3929: Reallocate Best Practices

NetApp Deduplication in a Nutshell

May 12, 2013

NetApp-Dedupe2NetApp uses Write Anywhere File Layout (WAFL) filesystem which is a key for NetApp’s efficient snapshot technology. If you’re already familiar with how snapshots are implemented in Data ONTAP, then understanding underlying mechanisms of deduplication is simple. Filer calculates hash for each data block it receives and preserves this in the form of metadata on the volume level. Then according to deduplication schedule (usually on weekends), filer runs metadata processing and for each duplicate hash changes data pointer to the original block of data.

Since for each data block filer needs to calculate hash on the fly, it has its penalty. On systems with CPU loaded by less than 50% performance impact is negligible. For heavily loaded systems, where CPU is nearly 100% busy, performance impact can be around 15%. For high-end 6000 systems penalty can jump up to 35% for random writes. Heavy sequential read operations can also suffer from deduplication, because read operations can be rerouted in random way across physical storage, depending on where the original data block actually is. In general, deduplication has low impact on system performance. But you can’t use it blindly and should keep in mind that in particular cases it can slow down your storage system.

Deduplication configuration is pretty simple. First of all, you need to activate deduplication on particular volume:

> sis on “targetvol”

If you already have data on the volume, you need to scan it. Otherwise, it won’t be deduped. It’s a common mistake. Deduplication is a low priority task, but keep in mind, however, that it can slightly impact your storage performance when done during business hours, especially if you run deduplication for several volumes simultaneously.

> sis start -s “targetvol”

To show the status of deduplication for particular volume:

> sis status “targetvol”

To see the deduplication schedule:

> sis config “targetvol”

And the most pleasant command to find out how much data you’ve saved:

> df -s “targetvol”

If you want to undo deduplication, first switch it off and then undo it using the following commands:

> sis off “targetvol”
> priv set advanced
*> sis undo “targetvol”
*> priv set

I, personally, was able to achieve 40% deduplication rate for VMware VMFS datastores, which is rather impressive, considering these were mixed OS + application data LUNs.

As a final note, I would like to point out, that deduplication is suitable only for environments with high percentage of similar data. VMware is a good example of it. You won’t get any significant deduplication ratio for swap file volumes, Exchnage mailboxes or Symantec Enterprise Vault which is already deduped.

Further reading:

TR-3958: NetApp Data Compression and Deduplication Deployment and Implementation Guide: Data ONTAP 8.1 Operating in 7-Mode

Consistent VMware snapshots on NetApp

March 16, 2012

If you use NetApp as a storage for you VMware hard drives, it’s wise to utilize NetApp’s powerful snapshot capabilities as an instant backup tool. I shortly mentioned in my previous post that you should disable default snapshot schedule. Snapshot is done very quickly on NetApp, but still it’s not instantaneous. If VM is running you can get .vmdks which have inconsistent data. Here I’d like to describe how you can perform consistent snapshots of VM hard drives which sit on NetApp volumes exported via NFS. Obviously it won’t work for iSCSI LUNs since you will have LUNs snapshots which are almost useless for backups.

What makes VMware virtualization platform far superior to other well-known solutions in the market is VI APIs. VI API is a set of Web services hosted on Virtual Center and ESX hosts that provides interfaces for all components and operations. Particularly, there is a Perl interface for VI API which is called VMware Infrastructure Perl Toolkit. You can download and install it for free. Using VI Perl Toolkit you can write a script which will every day put your VMs in a so called hot backup mode and make NetApp snapshots as well. Practically, hot backup mode is also a snapshot. When you create a VM snapshot, original VM hard drive is left intact and VMware starts to write delta in another file. It means that VM hard drive won’t change when making NetApp snapshot and you will get consistent .vmdk files. Now lets move to implementation.

I will write excerpts from the actual script here, because lines in the script are quite long and everything will be messed up on the blog page. I uploaded full script on FileDen. Here is the link. I apologize if you read this blog entry far later than it was published and my account or the FileDen service itself no longer exist.

VI Perl Toolkit is effectively a set of Perl scripts which you run as ready to use utilities. We will use snapshotmanager.pl which lets you create VMware VM snapshots. In the first step you make snapshots of all VMs:

\”$perl_path\perl\” -w \”$perl_toolkit_path\snapshotmanager.pl\” –server vc_ip –url https://vc_ip/sdk/vimService –username snapuser –password 123456  –operation create –snapshotname \”Daily Backup Shapshot\”

For the sake of security I created Snapshot Manager role and respective user account in Virtual Center with only two allowed operations: Create Snapshot and Remove Snapshot. Run line is self explanatory. I execute it using system($run_line) command.

After VM snapshots are created you make a NetApp snapshot:

“\$plink_path” -ssh -2 -batch -i \”private_key_path\” -l root netapp_ip snap create vm_sata snap_name

To connect to NetApp terminal I use PuTTY ssh client. putty.exe itself has a GUI and plink.exe is for batch scripting. Using this command you create snapshot of particular NetApp volume. Those which hold .vmdks in our case.

To get all VMs from hot backup mode run:

\”$perl_path\perl\” -w \”$perl_toolkit_path\snapshotmanager.pl\” –server vc_ip –url https://vc_ip/sdk/vimService –username snapuser –password 123456  –operation remove –snapshotname \”Daily Backup Shapshot\”  –children 0

By –children 0 here we tell not to remove all children snapshots.

After we familiarized ourselves with main commands, lets move on to the script logic. Apparently you will want to have several snapshots. For example 7 of them for each day of the week. It means each day, before making new snapshot you will need to remove oldest and rename others. Renaming is just for clarity. You can name your snapshots vmsnap.1, vmsnap.2, … , vmsnap.7. Where vmsnap.7 is the oldest. Each night you put your VMs in hot backup mode and delete the oldest snapshot:

“\$plink_path” -ssh -2 -batch -i \”private_key_path\” -l root netapp_ip snap delete vm_sata vmsnap.7

Then you rename other snapshots:

“\$plink_path” -ssh -2 -batch -i \”private_key_path\” -l root netapp_ip snap rename vm_sata vmsnap.6 vmsnap.7
“\$plink_path” -ssh -2 -batch -i \”private_key_path\” -l root netapp_ip snap rename vm_sata vmsnap.5 vmsnap.6
“\$plink_path” -ssh -2 -batch -i \”private_key_path\” -l root netapp_ip snap rename vm_sata vmsnap.4 vmsnap.5
“\$plink_path” -ssh -2 -batch -i \”private_key_path\” -l root netapp_ip snap rename vm_sata vmsnap.3 vmsnap.4
“\$plink_path” -ssh -2 -batch -i \”private_key_path\” -l root netapp_ip snap rename vm_sata vmsnap.2 vmsnap.3

And create the new one:

“\$plink_path” -ssh -2 -batch -i \”private_key_path\” -l root netapp_ip snap create vm_sata vmsnap.1

As a last step you bring your VMs out of hot backup mode.

Using this technique you can create short term backups of your virtual infrastructure and use them for long term retention with help of standalone backup solutions. Like backing up data from snapshots to tape library using Symantec BackupExec. I’m gonna talk about this in my later posts.