Posts Tagged ‘interconnect’

Dell Force10 Part 3: VLT Domain Configuration

July 31, 2016

dell-force10In my previous post here I went through VLT basics and how it helps to establish a loop-free network topology in a modern datacenter. Now lets dive deeper and see how VLT is configured from FTOS CLI.

VLT Configuration

The first step is to configure the backup links and VLT interconnect. Dell S4048-ON switches have six 40Gb QSFP+ ports, two of which 1/49 and 1/50 we will use for VLTi. Repeat the same configuration on both switches.

# int range fo 1/49-1/50
# no shutdown

# interface port-channel 127
# description “VLT interconnect”
# channel-member fo 1/49
# channel-member fo 1/50
# no shutdown

Now that we have a VLT interconnect set up, let’s join the first switch to a VLT domain:

# vlt domain 1
# back-up destination
# peer-link port-channel 127
# primary-priority 1

First switch points to the second switch management IP for a backup destination, uses port channel 127 as a VLT interconnect and becomes a primary peer, because it’s given the lowest priority of 1.

Do the same on the second switch, but now point to the first switch management IP for backup and use the highest priority to make this switch a secondary peer:

# vlt domain 1
# back-up destination
# peer-link port-channel 127
# primary-priority 8192

To confirm the VLT state use the following command:

# sh vlt brief


As you can see, the VLTi and backup links are up and the switch can see its peer. For some additional VLT specific information use these commands:

# sh vlt statistics
# sh vlt backup-link

I would also recommend to use the following command to see the port channel state and confirm that both VLTi links are in up state:

# sh int po127



In this part of the Dell Force10 switch configuration series we quickly went through the initial VLT setup. We haven’t touched on VLT LAG configuration yet. We will take a closer look at it in the next blog post.

Dell Force10 Part 2: VLT Basics

July 10, 2016

dell-force10Last time I made a blog post on initial configuration of Force10 switches, which you can find here. There I talked about firmware upgrade and basic features, such as STP and Flow Control. In this blog post I would like to touch on such a key feature of Force10 switches as Virtual Link Trunking (VLT).

VLT is Force10’s implementation of Multi-Chassis Link Aggregation Group (MLAG), which is similar to Virtual Port Channels (vPC) on Cisco Nexus switches. The goal of VLT is to let you establish one aggregated link to two physical network switches in a loop-free topology. As opposed to two standalone switches, where this is not possible.

You could say that switch stacking gives you similar capabilities and you would  be right. The issue with stacked switches, though, is that they act as a single switch not only from the data plane point of view, but also from the control plane point of view. The implication of this is that if you need to upgrade a switch stack, you have to reboot both switches at the same time, which brings down your network. If you have an iSCSI or NFS storage array connected to the stack, this may cause trouble, especially in enterprise environments.

With VLT you also have one data plane, but individual control planes. As a result, each switch can be managed and upgraded separately without full network downtime.

VLT Terminology

Virtual Link Trunking uses the following set of terms:

  • VLT peer – one of the two switches participating in VLT (you can have a maximum of two switches in a VLT domain)
  • VLT interconnect (VLTi) – interconnect link between the two switches to synchronize the MAC address tables and other VLT-related data
  • VLT backup link – heartbeat link to send keep alive messages between the two switches, it’s also used to identify switch state if VLTi link fails
  • VLT – this is the name of the feature – Virtual Link Trunking, as well as a VLT link aggregation group – Virtual Link Trunk. We will call aggregated link a VLT LAG to avoid ambiguity.
  • VLT domain – grouping of all of the above

VLT Topology

This’s what a sample VLT domain looks like. S4048-ON switches have six 40Gb QSFP+ ports, two of which we use for a VLT interconnect. It’s recommended to use a static LAG for VLTi.


Two 1Gb links are used for VLT backup. You can use switch out-of-band management ports for this. Four 10Gb links form a VLT LAG to the upstream core switch.

Use Cases

So where is this actually helpful? Vast majority of today’s environments are virtualized and do not require LAGs. vSphere already uses teaming on vSwitch uplinks for traffic distribution across all network ports by default. There are some use cases in VMware environments, where you can create a LAG to a vSphere Distributed Switch for faster link failure convergence or improved packet switching. Unless you have a really large vSphere environment this is generally not required, but you may use this option later on if required. Read Chris Wahl’s blog post here for more info.

Where VLT is really helpful is in building a loop-free network topology in your datacenter. See, all your vSphere hosts are connected to both Force10 switches for redundancy. Since traffic comes to either of the switches depending on which uplink is being picked on a ESXi host, you have to make sure that VMs on switch 1 are able to communicate to VMs on switch 2. If all you had in your environment were two Force10 switches, you would establish a LAG between the two and be done with it. But if your network topology is a bit larger than this and you have at least a single additional core switch/router in your environment you’d be faced with the following dilemma. How can you ensure efficient traffic switching in your network without creating loops?


You can no longer create a LAG between the two Force10 switches, as it will create a loop. Your only option is to keep switches connected only to the core and not to each other. And by doing that you will cause all traffic from VMs on switch 1 destined to VMs on switch 2 and vise versa to traverse the core.


And that’s where VLT comes into play. All east-west traffic between servers is contained within the VLT domain and doesn’t need to traverse the core. As shown above, if we didn’t use VLT, traffic from one switch to another would have to go from switch 1 to core and then back from core to switch 2. In a VLT domain traffic between the switches goes directly form switch 1 to switch 2 using VLTi.


That’s a brief introduction to VLT theory. In the next few posts we will look at how exactly VLT is configured and map theory to practice.

Overview of NetApp Replication and HA features

August 9, 2013

NetApp has quite a bit of features related to replication and clustering:

  • HA pairs (including mirrored HA pairs)
  • Aggregate mirroring with SyncMirror
  • MetroCluster (Fabric and Stretched)
  • SnapMirror (Sync, Semi-Sync, Async)

It’s easy to get lost here. So lets try to understand what goes where.



SnapMirror is a volume level replication, which normally works over IP network (SnapMirror can work over FC but only with FC-VI cards and it is not widely used).

Asynchronous version of SnapMirror replicates data according to schedule. SnapMiror Sync uses NVLOGM shipping (described briefly in my previous post) to synchronously replicate data between two storage systems. SnapMirror Semi-Sync is in between and synchronizes writes on Consistency Point (CP) level.

SnapMirror provides protection from data corruption inside a volume. But with SnapMirror you don’t have automatic failover of any sort. You need to break SnapMirror relationship and present data to clients manually. Then resynchronize volumes when problem is fixed.


SyncMirror mirror aggregates and work on a RAID level. You can configure mirroring between two shelves of the same system and prevent an outage in case of a shelf failure.

SyncMirror uses a concept of plexes to describe mirrored copies of data. You have two plexes: plex0 and plex1. Each plex consists of disks from a separate pool: pool0 or pool1. Disks are assigned to pools depending on cabling. Disks in each of the pools must be in separate shelves to ensure high availability. Once shelves are cabled, you enable SyncMiror and create a mirrored aggregate using the following syntax:

> aggr create aggr_name -m -d disk-list -d disk-list

HA Pair

HA Pair is basically two controllers which both have connection to their own and partner shelves. When one of the controllers fails, the other one takes over. It’s called Cluster Failover (CFO). Controller NVRAMs are mirrored over NVRAM interconnect link. So even the data which hasn’t been committed to disks isn’t lost.


MetroCluster provides failover on a storage system level. It uses the same SyncMirror feature beneath it to mirror data between two storage systems (instead of two shelves of the same system as in pure SyncMirror implementation). Now even if a storage controller fails together with all of its storage, you are safe. The other system takes over and continues to service requests.

HA Pair can’t failover when disk shelf fails, because partner doesn’t have a copy to service requests from.

Mirrored HA Pair

You can think of a Mirrored HA Pair as HA Pair with SyncMirror between the systems. You can implement almost the same configuration on HA pair with SyncMirror inside (not between) the system. Because the odds of the whole storage system (controller + shelves) going down is highly unlike. But it can give you more peace of mind if it’s mirrored between two system.

It cannot failover like MetroCluster, when one of the storage systems goes down. The whole process is manual. The reasonable question here is why it cannot failover if it has a copy of all the data? Because MetroCluster is a separate functionality, which performs all the checks and carry out a cutover to a mirror. It’s called Cluster Failover on Disaster (CFOD). SyncMirror is only a mirroring facility and doesn’t even know that cluster exists.

Further Reading

HP BladeSystem c3000

October 29, 2011

We have High Performace Computing (HPC) cluster I’d like to show. It has 72 cores and 152GB of RAM in total. We use ROCKS as cluster middleware. Interconnect is DDR InfiniBand.

We have two groups of servers. First group is two BL2x220c  blades. Since they are double-sided it’s actually four servers. Each with two 4-core CPUs and 16GB of RAM. Second group consists of five BL280c. Each of them also has two 4-core CPUs but 24 GB of RAM. Eighth server is BL260c. This blade serves as master server.

Click pictures to enlarge.

BL280c blade server. This dude has 8 Xeon cores and 24GB of RAM.

Every component of HP BladeSystem c3000 is hot-swappable. Here I show how I disconnect Onboard Adminstrator on fully operational system.

Fans, power supplies and all interconnects are on the back.

Here is the 16-port DDR InfiniBand switch. Each port’s throughput is 80GB/s FDX.

Uplink ports for Onboard Administrator.

16 ports of Ethernet pass-through for blade servers.

Six power supplies in N+1 redundant configuration. Each is capable of 1200 Watts. 7200 Watts in total.

Inside blade server.

InfiniBand mezzanine. One such module is capable of 80Gb/s FDX.

If you are interested in benchmarking results find them here for pure IB and here for IBoIP.