Posts Tagged ‘LUN’

History of vSphere Storage Size Limitations

June 5, 2016

data-storageThis seems like a straightforward topic. If you are on vSphere 6 you can create VMFS datastores and VM disks as big as 64TB (62TB for VM disks to be precise). The reality is, not all customers are running the latest and greatest for various reasons. The most common one is concerns about reliability. vSphere 6 is still at version 6.0. Once vSphere 6.1 comes out we will see wider adoption. At this stage I see various versions of vSphere 5 in the field. And even vSphere 4 at times, which is officially not supported by VMware since May 2015. So it’s not surprising I still get this question, what are the datastore and disk size limits for various vSphere versions?

Datastore size limit

The biggest datastore size for both VMFS3 and VMFS5 is 64TB. What you need to know is, VMFS3 file system uses MBR partition style, which is limited to 2TB. The way VMFS3 overcomes this limitation is by using extents. To extend VMFS3 partition to 64TB you would need 32 x 2TB LUNs on the storage array. VMFS5 file system has GPT partition style and can be extended to 64TB by expanding one underlying LUN without using extents, which is a big plus.

VMFS3 datastores are rare these days, unless you’re still on vSphere 4. The only consideration here is whether your vSphere 5 environment was a greenfield build. If the answer is yes, then all your datastores are VMFS5 already. If environment was upgraded from vSphere 4, you need to make sure all datastores have been upgraded (or better recreated) to VMFS5 as well. If the upgrade wasn’t done properly you may still have some VMFS3 datastores in your environment.

Disk size limit

For .vmdk disks the limitation had been 2TB for a long time, until VMware increased the limit to 62TB in vSphere 5.5. So if all of your datastores are VMFS5, this means you still have 2TB  .vmdk limitation if you’re on 5.0 or 5.1.

For VMFS3 file system you also had an option to choose block size – 1MB, 2MB, 4MB or 8MB. 2TB .vmdk disks were supported only with the 8MB block size. The default was 1MB. So if you chose the default block size during datastore creation you were limited to 256GB .vmdk disks.

The above limits led to proliferation of Raw Device Mapping disks in many pre 5.5 environments. Those customers who needed VM disks bigger than 2TB had to use RDMs, as physical RDMs starting from VMFS5 supported 64TB (pRDMs on VMFS3 were still limited to 2TB).

This table summarises storage configuration maximums for vSphere version 4.0 to 6.0:

vSphere Datastore Size VMDK Size pRDM Size
4.0 64TB 2TB 2TB
4.1 64TB 2TB 2TB
5.0 64TB 2TB 64TB
5.1 64TB 2TB 64TB
5.5 64TB 62TB 64TB
6.0 64TB 64TB 64TB

Summary

The bottom line is, if you’re on vSphere 5.5 or 6.0 and all your datastores are VMFS5 you can forget about the legacy .vmdk disk size limitations. And if you’re not on vSphere 5.5 you should consider upgrading as soon as possible, as vSphere 5.0 and 5.1 are coming to an end of support on 24 of August 2016.

Advertisements

Changing the Default PSP for Dell Compellent

April 26, 2016

dell_compellentIf you’ve ever worked with Dell Compellent storage arrays you may have noticed that when you initially connect it to a VMware ESXi host, by default VMware Native Multipathing Plugin (NMP) uses Fixed Path Selection Policy (PSP) for all connected LUNs. And if you have two ports on each of the controllers connected to your storage area network (be it iSCSI or FC), then you’re wasting half of your bandwidth.

compellent_psp

Why does that happen? Let’s dig deep into VMware’s Pluggable Storage Architecture (PSA) and see how it treats Compellent.

How Compellent is claimed by VMware NMP

If you are familiar with vSphere’s Pluggable Storage Architecture (PSA) and NMP (which is the only PSA plug-in that every ESXi host has installed by default), then you may know that historically it’s always had specific rules for such Asymmetric Logical Unit Access (ALUA) arrays as NetApp FAS and EMC VNX.

Run the following command on an ESXi host and you will see claim rules for NetApp and DGC devices (DGC is Data General Corporation, which built Clariion array that has been later re-branded as VNX by EMC):

# esxcli storage nmp satp rule list

Name              Vendor  Default PSP Description
----------------  ------- ----------- -------------------------------
VMW_SATP_ALUA_CX  DGC                 CLARiiON array in ALUA mode
VMW_SATP_ALUA     NETAPP  VMW_PSP_RR  NetApp arrays with ALUA support

This tells NMP to use Round-Robin Path Selection Policy (PSP) for these arrays, which is always preferable if you want to utilize all available active-optimized paths. You may have noticed that there’s no default PSP in the VNX claim rule, but if you look at the default PSP for the VMW_SATP_ALUA_CX Storage Array Type Plug-In (SATP), you’ll see that it’s also Round-Robin:

# esxcli storage nmp satp list

Name              Default PSP  
----------------- -----------
VMW_SATP_ALUA_CX  VMW_PSP_RR

There is, however, no default claim rule for Dell Compellent storage arrays. There are a handful of the following non array-specific “catch all” rules:

Name                 Transport  Claim Options Description
-------------------  ---------  ------------- -----------------------------------
VMW_SATP_ALUA                   tpgs_on       Any array with ALUA support
VMW_SATP_DEFAULT_AA  fc                       Fibre Channel Devices
VMW_SATP_DEFAULT_AA  fcoe                     Fibre Channel over Ethernet Devices
VMW_SATP_DEFAULT_AA  iscsi                    iSCSI Devices

As you can see, the default PSP for VMW_SATP_ALUA is Most Recently Used (MRU) and for VMW_SATP_DEFAULT_AA it’s VMW_PSP_FIXED:

Name                Default PSP   Description
------------------- ------------- ------------------------------------------
VMW_SATP_ALUA       VMW_PSP_MRU
VMW_SATP_DEFAULT_AA VMW_PSP_FIXED Supports non-specific active/active arrays

Compellent is not an ALUA storage array and doesn’t have the tpgs_on option enabled. As a result it’s claimed by the VMW_SATP_DEFAULT_AA rule for the iSCSI transport, which is why you end up with the Fixed path selection policy for all LUNs by default.

Changing the default PSP

Now let’s see how we can change the PSP from Fixed to Round Robin. First thing you have to do before even attempting to change the PSP is to check VMware Compatibility List to make sure that the round robin PSP is supported for a particular array and vSphere combination.

vmware_hcl

As you can see, round robin path selection policy is supported for Dell Compellent storage arrays in vSphere 6.0u2. So let’s change it to get the benefit of being able to simultaneously use all paths to Compellent controllers.

For Compellent firmware versions 6.5 and earlier use the following command to change the default PSP:

# esxcli storage nmp satp set -P VMW_PSP_RR -s VMW_SATP_DEFAULT_AA

Note: technically here you’re changing PSP not specifically for the Compellent storage array, but for any array which is claimed by VMW_SATP_DEFAULT_AA and which also doesn’t have an individual SATP rule with PSP set. Make sure that this is not the case or you may accidentally change PSP for some other array you may have in your environment.

The above will change PSP for any newly provisioned and connected LUNs. For any existing LUNs you can change PSP either manually in each LUN’s properties or run the following command in PowerCLI:

# Get-Cluster ClusterNameHere | Get-VMHost | Get-ScsiLun | where {$_.Vendor -eq
“COMPELNT” –and $_.Multipathpolicy -eq “Fixed”} | Set-ScsiLun -Multipathpolicy
RoundRobin

This is what you should see in LUN properties as a result:

compellent_psp_2

Conclusion

By default any LUN connected from a Dell Compellent storage array is claimed by NMP using Fixed path selection policy. You can change it to Round Robin using the above two simple commands to make sure you utilize all storage paths available to ESXi hosts.

Masking a VMware LUN

February 7, 2016

maskingA month ago I passed my VCAP-DCA exam, which I blogged about in this post. And one of the DCA exam topics in the blueprint was LUN masking using PSA-related commands.

Being honest, I can hardly imagine a use case for this as LUN masking is always done on the storage array side. I’ve never seen LUN masking done on the hypervisor side before.

If you have a use case for host LUN masking leave me a comment below. I’d be curious to know. But regardless of its usefulness it’s in the exam, so we have to study it, right? So let’s get to it.

Overview

There are many blog posts on the Internet on how to do VMware LUN Masking, but only a few explain what is the exact behaviour after you type each of the commands and how to fix the issues, which you can potentially run into.

VMware uses Pluggable Storage Architecture (PSA) to claim devices on ESXi hosts. All hosts have one plug-in installed by default called Native Multipathing Plug-in (NMP) which claims all devices. Masking of a LUN is done by unclaiming it from NMP and claiming using a special plug-in called MASK_PATH.

Namespace “esxcli storage core claimrule add” is used to add new claim rules. The namespace accepts multiple ways of addressing a device. Most widely used are:

  • By device ID:
    • -t device -d naa.600601604550250018ea2d38073cdf11
  • By location:
    • -t location -A vmhba33 -C 0 -T 0 -L 2
  • By target:
    • -t target  -R iscsi -i iqn.2011-03.example.org.istgt:iscsi1 -L 0
    • -t target -R fc –wwnn 50:06:01:60:ba:60:11:53 –wwpn 50:06:01:60:3a:60:11:53 (use double dash for wwnn and wwpn flags, WordPress strips them off)

To determine device names use the following command:

# esxcli storage core device list

To determine iSCSI device targets:

# esxcli iscsi session list

To determine FC paths, WWNNs and WWPNs:

# esxcli storage core path list

Mask an iSCSI LUN

Let’s take iSCSI as an example. To mask an iSCSI LUN add a new claim rule using MASK_PATH plug-in and addressing by target (for FC use an FC target instead):

# esxcli storage core claimrule add -r 102 -t target -R iscsi -i iqn.2011-03.example.org.istgt:iscsi1 -L 0 -P MASK_PATH

Once the rule is added you MUST load it otherwise the rule will not work:

# esxcli storage core claimrule load

Now list the rules and make sure there is a “runtime” and a “file” rule. Without the file rule masking will not take effect:

claimrule

The last step is to unclaim the device from the NMP plug-in which currently owns it and apply the new set of rules:

# esxcli storage core claiming unclaim -t location -A vmhba33 -C 0 -T 0 -L 0
# esxcli storage core claiming unclaim -t location -A vmhba33 -C 1 -T 0 -L 0
# esxcli storage core claimrule run

You can list devices connected to the host to confirm that the masked device is no longer in the list:

# esxcli storage core device list

Remove maskig

To remove masking, unclaim the device from MASK_PATH plug-in, delete the masking rule and reload/re-run the rule set:

# esxcli storage core claiming unclaim -t location -A vmhba33 -C 0 -T 0 -L 0
# esxcli storage core claiming unclaim -t location -A vmhba33 -C 1 -T 0 -L 0
# esxcli storage core claimrule remove -r 102
# esxcli storage core claimrule load
# esxcli storage core claimrule run

Sometimes you need to reboot the host for the device to reappear.

Conclusion

Make sure to always mask all targets/paths to the LUN, which is true for iSCSI as well as FC, as both support multipathing. You have a choice of masking by location, target and path (masking by device is not supported).

For a FC LUN, for instance, you may choose to mask the LUN by location. If you have two single port FC adapters in each host, you will typically be masking four paths per LUN.  To accomplish that specify adapters using flag -A and LUN ID using flags -C, -T and -L.

Hope that helps you to tick off this exam topic from the blueprint.

Dell Compellent iSCSI Configuration

November 20, 2015

I haven’t seen too many blog posts on how to configure Compellent for iSCSI. And there seem to be some confusion on what the best practices for iSCSI are. I hope I can shed some light on it by sharing my experience.

In this post I want to talk specifically about the Windows scenario, such as when you want to use it for Hyper-V. I used Windows Server 2012 R2, but the process is similar for other Windows Server versions.

Design Considerations

All iSCSI design considerations revolve around networking configuration. And two questions you need to ask yourself are, what your switch topology is going to look like and how you are going to configure your subnets. And it all typically boils down to two most common scenarios: two stacked switches and one subnet or two standalone switches and two subnets. I could not find a specific recommendation from Dell on whether it should be one or two subnets, so I assume that both scenarios are supported.

Worth mentioning that Compellent uses a concept of Fault Domains to group front-end ports that are connected to the same Ethernet network. Which means that you will have one fault domain in the one subnet scenario and two fault domains in the two subnets scenario.

For iSCSI target ports discovery from the hosts, you need to configure a Control Port on the Compellent. Control Port has its own IP address and one Control Port is configured per Fault Domain. When server targets iSCSI port IP address, it automatically discovers all ports in the fault domain. In other words, instead of using IPs configured on the Compellent iSCSI ports, you’ll need to use Control Port IP for iSCSI target discovery.

Compellent iSCSI Configuration

In my case I had two stacked switches, so I chose to use one iSCSI subnet. This translates into one Fault Domain and one Control Port on the Compellent.

IP settings for iSCSI ports can be configured at Storage Management > System > Setup > Configure iSCSI IO Cards.

iscsi_ports

To create and assign Fault Domains go to Storage Management > System > Setup > Configure Local Ports > Edit Fault Domains. From there select your fault domain and click Edit Fault Domain. On IP Settings tab you will find iSCSI Control Port IP address settings.

local_ports

control_port

Host MPIO Configuration

On the Windows Server start by installing Multipath I/O feature. Then go to MPIO Control Panel and add support for iSCSI devices. After a reboot you will see MSFT2005iSCSIBusType_0x9 in the list of supported devices. This step is important. If you don’t do that, then when you map a Compellent disk to the hosts, instead of one disk you will see multiple copies of the same disk device in Device Manager (one per path).

add_iscsi

iscsi_added

Host iSCSI Configuration

To connect hosts to the storage array, open iSCSI Initiator Properties and add your Control Port to iSCSI targets. On the list of discovered targets you should see four Compellent iSCSI ports.

Next step is to connect initiators to the targets. This is where it is easy to make a mistake. In my scenario I have one iSCSI subnet, which means that each of the two host NICs can talk to all four array iSCSI ports. As a result I should have 2 host ports x 4 array ports = 8 paths. To accomplish that, on the Targets tab I have to connect each initiator IP to each target port, by clicking Connect button twice for each target and selecting one initiator IP and then the other.

iscsi_targets

discovered_targets

connect_targets

Compellent Volume Mapping

Once all hosts are logged in to the array, go back to Storage Manager and add servers to the inventory by clicking on Servers > Create Server. You should see hosts iSCSI adapters in the list already. Make sure to assign correct host type. I chose Windows 2012 Hyper-V.

 

add_servers

It is also a best practice to create a Server Cluster container and add all hosts into it if you are deploying a Hyper-V or a vSphere cluster. This guarantees consistent LUN IDs across all hosts when LUN is mapped to a Server Cluster object.

From here you can create your volumes and map them to the Server Cluster.

Check iSCSI Paths

To make sure that multipathing is configured correctly, use “mpclaim” to show I/O paths. As you can see, even though we have 8 paths to the storage array, we can see only 4 paths to each LUN.

io_paths

Arrays such as EMC VNX and NetApp FAS use Asymmetric Logical Unit Access (ALUA), where LUN is owned by only one controller, but presented through both. Then paths to the owning controller are marked as Active/Optimized and paths to the non-owning controller are marked as Active/Non-Optimized and are used only if owning controller fails.

Compellent is different. Instead of ALUA it uses iSCSI Redirection to move traffic to a surviving controller in a failover situation and does not need to present the LUN through both controllers. This is why you see 4 paths instead of 8, which would be the case if we used an ALUA array.

References

Zerto Overview

March 6, 2014

zerto-logoZerto is a VM replication product which works on a hypervisor level. In contrast to array level replication, which SRM has been using for a long time, it eliminates storage array from the equation and all the complexities which used to come along with it (SRAs, splitting the LUNs for replicated and non-replicated VMs, potential incompatibilities between the orchestrated components, etc).

Basic Operation

Zerto consists of two components: ZVM (Zerto Virtual Manger) and VRA (Virtual Replication Appliance). VRAs are VMs that need to be installed on each ESXi host within the vCenter environment (performed in automated fashion from within ZVM console). ZVM manages VRAs and all the replication settings and is installed one per vCenter. VRA mirrors protected VMs I/O operations to the recovery site. VMs are grouped in VPGs (Virtual Protection Groups), which can be used as a consistency group or just a container.

Protected VMs can be preseeded  to DR site. But what Zerto essentially does is it replicates VM disks to any datastore on recovery site where you point it to and then tracks changes in what is called a journal volume. Journal is created for each VM and is kept as a VMDK within the “ZeRTO volumes” folder on a target datastore. Every few seconds Zerto creates checkpoints on a journal, which serve as crash consistent recovery points. So you can recover to any point in time, with a few seconds granularity. You can set the journal length in hours, depending on how far you potentially would want to go back. It can be anywhere between 1 and 120 hours.Data-Replication-over-WAN

VMs are kept unregistered from vCenter on DR site and VM configuration data is kept in Zerto repository. Which essentially means that if an outage happens and something goes really wrong and Zerto fails to bring up VMs on DR site you will need to recreate VMs manually. But since VMDKs themselves are kept in original format you will still be able to attach them to VMs and power them on.

Failover Scenarios

There are four failover scenarios within Zerto:

  • Move Operation – VMs are shut down on production site, unregistered from inventory, powered on at DR site and protection is reversed if you decide to do so. If you choose not to reverse protection, VMs are completely removed from production site and VPG is marked as “Needs Configuration”. This scenario can be seen as a planned migration of VMs between the sites and needs both sites to be healthy and operational.
  • Failover Operation – is used in disaster scenario when production site might be unavailable. In this case Zerto brings up protected VMs on DR site, but it does not try to remove VMs from production site inventory and leave them as is. If production site is still accessible you can optionally select to shutdown VMs. You cannot automatically reverse protection in this scenario, VPG is marked as “Needs Configuration” and can be activated later. And when it is activated, Zerto does all the clean up operations on the former production site: shuts down VMs (if they haven’t been already), unregister them from inventory and move to VRA folder on the datastore.
  • Failover Test Operation – this is for failover testing and brings up VMs on DR site in a configured bubble network which is normally not uplinked to any physical network. VMs continue to run on both sites. Note that VMs disk files in this scenario are not moved to VMs folders (as in two previous scenarios) and are just connected from VRA VM folder. You would also notice that Zerto created second journal volume which is called “scratch” journal. Changes to the VM that is running on DR site are saved to this journal while it’s being tested.
  • Clone Operation – VMs are cloned on DR site and connected to network. VMs are not automatically powered on to prevent potential network conflicts. This can be used for instance in DR site testing, when you want to check actual networking connectivity, instead of connecting VMs to an isolated network. Or for implementing backups, cloned environment for applications testing, etc.

Zerto Journal Sizing

By default journal history is configured as 4 hours and journal size is unlimited. Depending on data change rate within the VM journal can be smaller or larger. 15GB is approximately enough storage to support a virtual machine with 1TB of storage, assuming a 10% change rate per day with four hours of journal history saved. Zerto has a Journal Sizing Tool which helps to size journals. You can create a separate journal datastore as well.

Zerto compared to VMware Replication and SRM

There are several replication products in the market from VMware. Standalone VMware replication, VMware replication + SRM orchestraion and SRM array-based replication. If you want to know more on how they compare to Zerto, you can read the articles mentioned in references below. One apparent Zerto advantage, which I want to mention here, is integration with vCloud Director, which is essential for cloud providers who offer DRaaS solutions. SRM has no vCloud Director support.

References

NetApp VSC Single File Restore Explained

August 5, 2013

netapp_dpIn one of my previous posts I spoke about three basic types of NetApp Virtual Storage Console restores: datastore restore, VM restore and backup mount. The last and the least used feature, but very underrated, is the Single File Restore (SFR), which lets you restore single files from VM backups. You can do the same thing by mounting the backup, connecting vmdk to VM and restore files. But SFR is a more convenient way to do this.

Workflow

SFR is pretty much an out-of-the-box feature and is installed with VSC. When you create an SFR session, you specify an email address, where VSC sends an .sfr file and a link to Restore Agent. Restore Agent is a separate application which you install into VM, where you want restore files to (destination VM). You load the .sfr file into Restore Agent and from there you are able to mount source VM .vmdks and map them to OS.

VSC uses the same LUN cloning feature here. When you click “Mount” in Restore Agent – LUN is cloned, mapped to an ESX host and disk is connected to VM on the fly. You copy all the data you want, then click “Dismount” and LUN clone is destroyed.

Restore Types

There are two types of SFR restores: Self-Service and Limited Self-Service. The only difference between them is that when you create a Self-Service session, user can choose the backup. With Limited Self-Service, backup is chosen by admin during creation of SFR session. The latter one is used when destination VM doesn’t have connection to SMVI server, which means that Remote Agent cannot communicate with SMVI and control the mount process. Similarly, LUN clone is deleted only when you delete the SFR session and not when you dismount all .vmdks.

There is another restore type, mentioned in NetApp documentation, which is called Administartor Assisted restore. It’s hard to say what NetApp means by that. I think its workflow is same as for Self-Service, but administrator sends the .sfr link to himself and do all the job. And it brings a bit of confusion, because there is an “Admin Assisted” column on SFR setup tab. And what it actually does, I believe, is when Port Group is configured as Admin Assisted, it forces SFR to create a Limited Self-Service session every time you create an SFR job. You won’t have an option to choose Self-Assisted at all. So if you have port groups that don’t have connectivity to VSC, check the Admin Assisted option next to them.

Notes

Keep in mind that SFR doesn’t support VM’s with IDE drives. If you try to create SFR session for VMs which have IDE virtual hard drives connected, you will see all sorts of errors.

Monitoring ESX Storage Queues

July 30, 2013

6a00d8341c328153ef01774354e2fd970d-500wiQueue Limits

I/O data goes through several storage queues on its way to disk drives. VMware is responsible for VM queue, LUN queue and HBA queue. VM and LUN queues are usually equal to 32 operations. It means that each ESX host at any moment can have no more than 32 active operations to a LUN. Same is true for VMs. Each VM can have as many as 32 active operations to a datastore. And if multiple VMs share the same datastore, their combined I/O flow can’t go over the 32 operations limit (per LUN queue for QLogic HBAs has been increased from 32 to 64 operations in vSphere 5). HBA queue size is much bigger and can hold several thousand operations (4096 for QLogic, however I can see in my config that driver is configured with 1014 operations).

Queue Monitoring

You can monitor storage queues of ESX host from the console. Run “esxtop”, press “d” to view disk adapter stats, then press “f” to open fields selection and add Queue Stats by pressing “d”.

AQLEN column will show the queue depth of the storage adapter. CMDS/s is the real-time number of IOPS. DAVG is the latency which comes from the frame traversing through the “driver – HBA – fabric – array SP” path and should be less than 20ms. Otherwise it means that storage is not coping. KAVG shows the time which operation spent in hypervisor kernel queue and should be less than 2ms.

Press “u” to see disk device statistics. Press “f” to open the add or remove fields dialog and select Queue Stats “f”. Here you’ll see a number of active (ACTV) and queue (QUED) operations per LUN.  %USD is the queue load. If you’re hitting 100 in %USD and see operations under QUED column, then again it means that your storage cannot manage the load an you need to redistribute your workload between spindles.

Some useful documents:

VMware ESXi Core Processes

July 12, 2013

vmware_esxiThere are not much information on VMware ESXi architecture out there. There is an old whitepaper “The Architecture of VMware ESXi” which dates back to 2007. In fact, from the management perspective there are only two OS processes critical two ESXi management. These are: hostd (vmware-hostd in ESXi 4) and vpxa (vmware-vpxa in ESXi 4) which are called “Management Agents”.

hostd process is a communication layer between vmkernel and the outside world. It invokes all management operations on VMs, storage, network, etc by directly talking to the OS kernel. If hostd dies you won’t be able to connect to the host neither from vCenter nor VI client. But you will still have an option to connect directly to the host console via SSH.

vpxa is an intermediate layer between vCenter and hostd and is called “vCenter Server Agent”. If you have communication failure between vCenter and ESXi host, it’s the first thing to blame.

Say you have a storage LUN failure and hostd can’t release the datastore. You can restart hostd or both of the processes using these scripts:

# vmware-hostd restart
# vmware-vpxa restart