Recently I had an opportunity to work with the Dell FX2 platform from the design and delivery point of view. I was deploying an FX2s chassis with FC630 blades and FN410S 10Gb I/O aggregators.
I ran into an interesting interoperability glitch between Force10 and the vSphere distributed switch when using LLDP. LLDP is an open-standard equivalent of Cisco CDP. It allows vSphere administrators to determine which physical switch port a given vSphere distributed switch uplink is connected to. And if you enable both Listen and Advertise modes, network administrators get similar visibility from the physical switch side.
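For example, once LLDP is enabled on both ends, a quick check from the I/O aggregator CLI should list the ESXi hosts and vmnic names behind each Te port (a rough sketch; the exact output format depends on the Dell Networking OS version):
# show lldp neighbors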
In my scenario, as soon as LLDP was enabled on the vSphere distributed switch, uplinks on all ESXi hosts started intermittently disconnecting and reconnecting, with log errors similar to these:
Lost uplink redundancy on DVPorts: “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”. Physical NIC vmnic1 is down.
Network connectivity restored on DVPorts: “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”. Physical NIC vmnic1 is up
Uplink redundancy restored on DVPorts: “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”, “1549/03 4b 0b 50 22 3f d7 8f-28 3c ff dd a4 76 26 15”. Physical NIC vmnic1 is up
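You can also confirm the flapping from the host side. The following ESXi command shows the current link state of every vmnic; in my case vmnic1 would periodically report its link as down:
# esxcli network nic list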
Issue Troubleshooting
I reviewed the FX2 I/O aggregator logs for potential errors and found the following entries:
%STKUNIT0-M:CP %DIFFSERV-5-DSM_DCBX_PFC_PARAMETERS_MISMATCH: PFC Parameters MISMATCH on interface: Te 0/2
%STKUNIT0-M:CP %IFMGR-5-OSTATE_DN: Changed interface state to down: Te 0/2
%STKUNIT0-M:CP %IFMGR-5-OSTATE_UP: Changed interface state to up: Te 0/2
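These entries came straight from the switch log, which on the FN410S can be viewed with something like the following (filter for the DCBX mismatch messages if the log is busy):
# show logging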
This clearly looks like a DCB negotiation issue between Force10 and the vSphere distributed switch.
Root Cause
Priority Flow Control (PFC) is one of the protocols in the Data Center Bridging (DCB) family. DCB was purpose-built for converged network environments where 10Gb links carry both Ethernet and FC traffic in the form of FCoE. In such a scenario, PFC can pause Ethernet frames when FC does not have enough bandwidth and that way prioritise the latency-sensitive storage traffic.
In my case the NIC ports on the QLogic 57840 adaptors were used for 10Gb Ethernet and iSCSI, not FCoE (which is very uncommon anyway, unless you're using a Cisco UCS blade chassis). So the question is, why were the Force10 switches trying to negotiate FCoE? And what did it have to do with enabling LLDP on the vDS?
The answer is simple. LLDP advertises not only port numbers, but also port capabilities. The Data Center Bridging Exchange protocol (DCBX) uses LLDP to convey the capabilities and configuration of FCoE features between neighbours, which is why enabling LLDP on the vDS triggered this. When the Force10 switches determined that the vDS uplinks were CNA adaptors (which was in fact true, I was just not using FCoE), they started to negotiate FCoE using DCBX, which didn't really go well.
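If you want to see what the host itself thinks is being negotiated, ESXi exposes the per-NIC DCB state through esxcli. The command below is from memory, so treat the exact syntax as an assumption and check the esxcli network nic dcb namespace on your build:
# esxcli network nic dcb status get -n vmnic1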
Solution
The easiest solution to this problem is to disable DCB on the Force10 switches using the following commands:
# conf t
# no dcb enable
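Then exit configuration mode and save the running config, so the change survives a switch reload:
# end
# copy running-config startup-config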
Alternatively, you can try to disable FCoE from the ESXi end using the following commands from the host CLI:
# esxcli fcoe nic list
# esxcli fcoe nic disable -n vmnic0
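You will need to repeat the disable command for every vmnic reported by the list command. A simple loop from the ESXi shell works as well (vmnic0 and vmnic1 here are just placeholders, substitute your own NIC names):
# for NIC in vmnic0 vmnic1; do esxcli fcoe nic disable -n $NIC; done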
Once FCoE has been disabled on all NICs, run the following command and you should get an empty list:
# esxcli fcoe adapter list
Conclusion
It is still not clear to me why a PFC mismatch would cause vDS uplinks to start flapping. If the switch cannot establish an FCoE connection, it should just ignore it, but that doesn't seem to be the case on Force10. So if you run into a similar issue, simply disable DCB on the switches and it should fix it.
Tags: aggregator, alarm, CDP, Cisco Discovery Protocol, CNA, compatibility, Converged Network Adaptor, Data Center Bridging, Data Center Bridging Exchange Protocol, DCB, DCBX, dell, error, ESXi, FC630, FCoE, FN410S, Force10, FX2, FX2s, glitch, gotcha, interoperability, Link Layer Discovery Protocol, LLDP, mismatch, PFC, Priority Flow Control, QLogic, vDS, virtual distributed switch, vmware, vSphere
July 11, 2018 at 10:46 pm |
Hey thanks man – just ran into this problem and your solution (and accompanying explanation) worked a treat. I appreciate you taking the time to share your findings.
November 17, 2018 at 5:03 am |
Oh, wow! I’m surprised someone else ran into a similar issue. I thought this was very specific to my use case. I hope you got it sorted.
December 16, 2018 at 10:21 am |
Hey Thanks, ran into the same issue, this worked a treat.
December 20, 2018 at 10:34 am |
Hi, Ashton. Great to hear it helped!
July 22, 2020 at 10:08 am |
Thanks man. I have just noticed that I had the same issue. Very good and comprehensive explanation of the issue and an easy-to-do solution.
Thanks for taking the time to share this. Keep it up.
July 24, 2020 at 9:40 pm |
Hi, Osama. Thanks for your feedback. This proves I was not the only one having this problem.