Posts Tagged ‘debug’

Reminder: Disable Firewall on NSX ECMP Edge

October 15, 2019

ECMP and Stateful Services

This topic isn’t new and has already been discussed many times elsewhere. When NSX Edges are configured in ECMP mode, none of the stateful services like VPN, NAT or Load Balancing are supported.

From NSX Design Guide:

In ECMP mode, only routing service is available. Stateful services cannot be supported due to asymmetric routing inherent in ECMP-based forwarding.

Even if you haven’t read the documentation, basic networking knowledge tells you that services like NAT need to track network session state. Configuring the same NAT rule on all of your ECMP-enabled edges won’t help: with ECMP, traffic can flow through one ESG on ingress and another ESG on egress. Since NAT tables are not synchronized between ESGs, the second ESG won’t find the corresponding network flow in its translation table and will drop the traffic.

ECMP and Firewall

But there’s another issue that is less obvious or simply gets forgotten. You can deploy ESGs in ECMP mode, not configure any of the stateful services like VPN, NAT or LB, and still get network communication issues. Why? Because a newly deployed ESG always has its firewall enabled, and the firewall is also considered a stateful service.

From VVD 5.1 documentation:

SDDC-VISDN-032: For all ESGs deployed as ECMP North-South routers, disable the firewall. Use of ECMP on the ESGs is a requirement. Leaving the firewall enabled, even in allow all traffic mode, results in sporadic network connectivity. Services such as NAT and load balancing cannot be used when the firewall is disabled.

In fact, firewall is what actually tracks sessions and drops packets that don’t match existing network flows, not NAT itself. That’s also the reason why services like NAT and LB don’t work without firewall being enabled.

It often throws people off, because even having no rules in the firewall and setting default policy to accept will not prevent this issue from happening.
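To make this more tangible, here is a toy model in Python. It is only a sketch of the idea, not how the ESG firewall is actually implemented: each edge accepts every new connection (default policy "accept") but keeps its own private session table, and the IP addresses and ports are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    sport: int
    dport: int
    syn: bool = False
    ack: bool = False

    def flow_key(self):
        # Direction-independent key: both directions of a session map to one entry.
        return frozenset({(self.src, self.sport), (self.dst, self.dport)})


@dataclass
class Edge:
    name: str
    sessions: set = field(default_factory=set)  # not shared with other edges

    def process(self, pkt: Packet) -> str:
        key = pkt.flow_key()
        if pkt.syn and not pkt.ack:
            # A new connection is allowed by the accept-all policy and tracked.
            self.sessions.add(key)
            return "forward"
        if key in self.sessions:
            return "forward"
        # No matching session on this edge: the packet is treated as invalid.
        return "drop"


e1, e2 = Edge("E1"), Edge("E2")
# Hypothetical addresses: an external client opening SSH to a VM behind the DLR.
syn = Packet("10.0.0.5", "172.16.1.10", 50000, 22, syn=True)
syn_ack = Packet("172.16.1.10", "10.0.0.5", 22, 50000, syn=True, ack=True)

print(e1.process(syn))      # client SYN enters through E1: forward
print(e2.process(syn_ack))  # server reply leaves through E2: drop
```

The accept-all policy only decides whether a new session may be created. The reply never gets that far on E2, because E2 has no session to match it against.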

Demo

Here is a quick demonstration. I’m trying to establish an SSH session to a VM connected to a DLR behind two ESGs in ECMP mode.

I’m showing packet debug on both ESGs using the following command:

> debug packet display follow interface vNic_1 port_22

As you can see ingress traffic goes through E1 and egress traffic goes through E2:

E1: Packet Capture

E2: Packet Capture

Since the session originated on E1, E2 treats the returning packets as invalid and immediately drops them.

From NSX Troubleshooting Guide:

Check for an incrementing value of a DROP invalid rule in the POST_ROUTING section of the show firewall command. Typical reasons include:

  • Asymmetric routing issues

Conclusion

It’s easy to end up in this situation, because the firewall is enabled by default on a newly deployed ESG. And it’s hard to troubleshoot, since it’s not obvious what’s going on unless you’ve worked with ECMP before. So the best advice is simply to remember: if you want to use ECMP in NSX, disable the firewall on ECMP-enabled ESGs and use the distributed firewall (DFW) instead.

Troubleshooting vSphere Guest Operations API

October 4, 2019

What is vSphere Guest Operations

Recently I’ve been heavily using the vSphere Guest Operations API to automate vCenter patching. vSphere Guest Operations (GuestOps) is an API that allows you to run commands in a virtual machine without connecting to it over the network. All you need are credentials for the vCenter managing the virtual machine and for the virtual machine itself.

GuestOps can be called by using an Invoke-VMScript PowerCLI cmdlet in the following format:

> Invoke-VMScript -ScriptText "uname -a" -VM vc01 -GuestUser root -GuestPassword VMware1!

Cmdlet will talk to the vCenter, vCenter will talk to ESXi host, ESXi host will talk to VMware Tools and, eventually, VMware Tools will run the command on the Guest OS.

It worked well for me when I was running commands on VCSA 6.0 VM (managed by another vCenter), but after patching and upgrading this VM to VCSA 6.7 I encountered the following error:

Error occured while executing script on guest OS in VM ‘vc01’. Could not locate “Powershell” script interpreter in any of the expected locations. Probably you do not have enough permissions to execute command within guest.

It’s obvious from the error message that cmdlet is doing something wrong, since it’s supposed to use bash in Linux, not PowerShell.

Enable Debugging in VMware Tools

To better understand what was going on, I logged in to VCSA via SSH and enabled VMware Tools debugging (see KB1007873 for instructions on how to do that) and restarted Open VM Tools:

# systemctl restart vmtoolsd.service

After running the Invoke-VMScript cmdlet again, this is what I noticed in vmsvc.log debug log:

[vix] VixTools_StartProgram: User: root args: progamPath: 'cmd.exe', arguments: '/C powershell -NonInteractive -EncodedCommand cABvAHcAZQByAHMAaABl…

So it wasn’t just a misleading PowerCLI error message: Invoke-VMScript was actually trying to run a PowerShell command through the Windows command interpreter on a Linux VM.
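If you want to peek at what was being passed, the value after -EncodedCommand is Base64 over UTF-16LE text. Below is a small Python sketch that decodes whatever part of the value you have. The helper name is my own, and the input is only the prefix visible in the log line above; the rest of the value is cut off in the log, so only that prefix can be decoded.

```python
import base64


def decode_encoded_command(b64_text: str) -> str:
    """Decode a (possibly truncated) PowerShell -EncodedCommand value."""
    # Keep only whole 4-character Base64 groups so a cut-off value still decodes.
    usable = b64_text[: len(b64_text) - len(b64_text) % 4]
    raw = base64.b64decode(usable)
    # The text is UTF-16LE, two bytes per character; drop a trailing odd byte
    # that may be left over from the truncation.
    raw = raw[: len(raw) - len(raw) % 2]
    return raw.decode("utf-16-le")


# The visible prefix from the log line; the remainder is elided in the log.
print(decode_encoded_command("cABvAHcAZQByAHMAaABl"))  # powersh
```

The last character of the prefix is lost because the truncation falls in the middle of a two-byte character.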

Solution

My guess is that since VMware has changed underlying operating system on VCSA from SUSE Linux to Photon OS, Invoke-VMScript can no longer properly identify the underlying OS and defaults to Windows.

Simple solution to this problem is to give a helping hand to Invoke-VMScript cmdlet and specify interpreter using -ScriptType Bash parameter. This is what a proper resulting debug log message will look like:

[vix] VixToolsStartProgramImpl: started '"/bin/bash" -c "bash > /tmp/vmware-root/powerclivmware159 2>&1 -c \"uname -a\""', pid 7456

DFS Replication Troubleshooting

June 25, 2013

DFS Replication service doesn’t give you much information on how it’s replicating. It’s good to know some general commands to troubleshoot communication and data transfer issues.

Useful Commands

In Windows Server 2008 a new command was introduced to check what DFSR is doing at the moment. You won’t find it in Windows Server 2003:

> dfsrdiag replicationstate

If replication link isn’t feeling well you get lots of files in the backlog. To check if you have a backlog, run:

> dfsrdiag backlog /rgname:rgroup_name /rfname:folder_name /sendingmember:sending_server /receivingmember:receiving_server

If there are heaps of files in the backlog the best way to find the reason for it is to simply check the logs. DFSR logs are located in C:\Windows\debug. To get the most verbose information change the log severity level:

> wmic /namespace:\\root\microsoftdfs path dfsrmachineconfig set debuglogseverity=5

DFSR uses GUIDs to identify the replicated files, which look like: AC759213-00AF-4578-9C6E-EA0764FDC9AC. To get the meaningful data from the GUID use:

> dfsrdiag guid2name /guid:guid_identifier /rgname:group_name

There is one more command which allows you to find the exact path to the file in question. You should feed the uid field from the DFSR debug log to this command, which looks like {9EBE0A27-8AA9-4263-B942-DA9A92F30671}-v240880:

> wmic.exe /namespace:\\root\microsoftdfs path dfsridrecordinfo.Uid="uid_identifier" call getfullfilepath
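When there are many uids to look up, pulling them out of the log by hand gets tedious. Here is a rough Python sketch that finds uid values by their shape ({GUID}-v followed by a number, as in the example above) and prints the wmic call for each one. The function names and the sample log excerpt are made up for illustration; real DFSR debug log lines will look different around the uid.

```python
import re

# Shape taken from the example above: {GUID}-v<number>.
UID_PATTERN = re.compile(
    r"\{[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}-v\d+"
)


def extract_uids(log_text: str) -> list[str]:
    """Return unique uid values in order of first appearance."""
    return list(dict.fromkeys(UID_PATTERN.findall(log_text)))


def full_path_command(uid: str) -> str:
    """Build the wmic call shown above for one uid."""
    return (
        r"wmic.exe /namespace:\\root\microsoftdfs "
        f'path dfsridrecordinfo.Uid="{uid}" call getfullfilepath'
    )


# Hypothetical log excerpt, made up for illustration only.
sample = """
uid {9EBE0A27-8AA9-4263-B942-DA9A92F30671}-v240880
uid {9EBE0A27-8AA9-4263-B942-DA9A92F30671}-v240880
"""

for uid in extract_uids(sample):
    print(full_path_command(uid))
```

The script only prints the commands, so you can review them before running anything on the server.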

Sample Errors

1. When replicating between Windows Server 2008 R2 and Windows Server 2003 R2, the source reports “Ghosting is not enabled” and the destination reports “A failure was reported by the remote partner”.

I solved this error by applying the patch from KB2462352. The cause is an incompatibility between the protocol implementations.

2. The following error pops up in logs: “The system cannot find the file specified”.

The solution is described in KB951010. In Windows Server 2003 the ConflictAndDeleted folder sometimes grows beyond its 660 MB quota and the ConflictAndDeletedManifest.xml file may get corrupted. To solve the problem you need to clean up the folder and delete the file by issuing:

> wmic /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo where "replicatedfolderguid='<GUID>'" call cleanupconflictdirectory

To get the GUIDs of replicated folders run:

> wmic /namespace:\\root\microsoftdfs path dfsrreplicatedfolderconfig get replicatedfolderguid,replicatedfoldername

3. Near 100% CPU usage and the same error is written millions of times in the log files: “Failed to create stage file for GVSN gvsn_identifier”.

I solved this issue by looking for the file specified by gvsn_identifier, which looks like {2ED37126-12C7-4617-AE6B-34509F467FEB}-v20748 and deleting it. These are files that are located in the staging folder.
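With millions of identical lines it helps to know which GVSNs are actually involved before you go hunting in the staging folder. A minimal Python sketch for that is below. The message text is the one quoted above, but the layout of the rest of each log line is an assumption, and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Message text from the error above; the line layout around it is assumed.
STAGE_ERROR = re.compile(
    r"Failed to create stage file for GVSN\s+(\{[0-9A-Fa-f-]{36}\}-v\d+)"
)


def count_stage_failures(lines):
    """Count how often each GVSN appears in stage-file failure messages."""
    counts = Counter()
    for line in lines:
        match = STAGE_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines, made up for illustration only.
sample_lines = [
    "ERROR: Failed to create stage file for GVSN {2ED37126-12C7-4617-AE6B-34509F467FEB}-v20748",
    "ERROR: Failed to create stage file for GVSN {2ED37126-12C7-4617-AE6B-34509F467FEB}-v20748",
    "INFO: unrelated message",
]

for gvsn, hits in count_stage_failures(sample_lines).most_common():
    print(hits, gvsn)
```

The function accepts any iterable of lines, so an open log file can be passed to it directly and is read line by line rather than loaded into memory at once.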

Other Helpful Tools

You can create a Health Report from the DFS Management Console to see how many files have been transferred between replication members since the DFS service started, and whether there are any DFS errors in the members’ event logs.

You can also use the DFSRMon tool, but I personally don’t find it very useful.