NSX Troubleshooting – VMs out of Network on VNI 5XXX

I am currently working for a customer running network virtualization (NSX) in their SDDC environment. A few weeks ago we faced an issue where multiple VMs in one of the compute clusters lost network connectivity. I wanted to share the troubleshooting steps in the hope that they are useful to others working with NSX. The customer runs NSX 6.1.1 with multiple VNIs carrying networks for several environments (e.g. Prod, DR, Dev, QA, Test).

Here are the steps:

  1. After the issue was reported, we pinged a random selection of VMs from the list; none of them were reachable.
  2. The next step was to find the VNI number for those VMs and check whether they all belonged to the same VNI. They did (e.g. 5XXX).
  3. Once we knew the VNI number, we checked whether all VMs connected to VNI 5XXX were impacted or only some of them.
  4. Step 3 showed that only some VMs were impacted. Drilling down, we found that the impacted VMs were all running on one ESXi host in the cluster, and that the VNI was working fine on the other hosts in the cluster.
  5. To bring the VMs back online we migrated them to another host. After the migration the VMs were reachable and users were able to connect to the applications.
  6. Next we needed to find the root cause (RCA) of why VMs connected to VNI 5XXX on ESXi host XXXXXXXXXX lost network connectivity.
  7. Connect to the ESXi host over SSH (e.g. with PuTTY) and run the following command to check the VNI status on the host: net-vdl2 -l. The output showed that VXLAN network 5002 was DOWN, and all impacted VMs were part of it.

  9. To fix the issue we needed to restart the netcpa daemon on the host. These are the commands to stop, start and check the status of the netcpa daemon:

     1) Stop the netcpa daemon: /etc/init.d/netcpad stop
     2) Start the netcpa daemon: /etc/init.d/netcpad start
     3) Check the status of the service: /etc/init.d/netcpad status
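
The three commands above can be wrapped in a small shell function so that the stop, start and status check always run in the same order. This is only a sketch: the function name restart_netcpad and the NETCPAD_INIT variable are made up for illustration and are not part of NSX. Only the /etc/init.d/netcpad script and its stop, start and status arguments come from the steps above.

```shell
#!/bin/sh
# Sketch only: restart_netcpad and NETCPAD_INIT are illustrative names,
# not NSX tooling. NETCPAD_INIT defaults to the init script used above.
NETCPAD_INIT="${NETCPAD_INIT:-/etc/init.d/netcpad}"

restart_netcpad() {
    # Stop the daemon; carry on even if it was already stopped.
    "$NETCPAD_INIT" stop || echo "warning: stop returned non-zero" >&2
    # Start it again and give up if the start itself fails.
    "$NETCPAD_INIT" start || return 1
    # Print the status so the result is visible in the session.
    "$NETCPAD_INIT" status
}
```

After pasting this into the ESXi shell, running restart_netcpad should perform the same sequence as the manual commands.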

  10. After starting the netcpa daemon, check the VNI status again by running net-vdl2 -l. VXLAN network 5002 now showed as UP.

  11. The next step was to move a few VMs from VNI 5002 back onto this host and check the connectivity of the VMs and applications. Everything worked fine on this host after the move.
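
The status check from steps 7 and 10 could also be scripted so that down VNIs are listed without reading the whole output. A rough sketch follows. It assumes that net-vdl2 -l prints a "VXLAN network:" line with the VNI number for each network, followed by a "Controller:" line ending in (up) or (down). That layout is an assumption and may differ between NSX builds, so check it against the real output on your host first. The function name list_down_vnis is made up for this example.

```shell
#!/bin/sh
# Sketch only: list_down_vnis is an illustrative name, and the field layout
# parsed below is an assumption about the net-vdl2 -l output.
list_down_vnis() {
    # Reads net-vdl2 -l output on stdin and prints one VNI per line
    # for every network whose Controller line is marked (down).
    awk '
        /VXLAN network:/ { vni = $NF }    # remember the current VNI
        /Controller:/ && /\(down\)/ && vni != "" { print vni }
    '
}

# Usage on the ESXi host:
#   net-vdl2 -l | list_down_vnis
```

An empty result would mean no network matched the assumed (down) marker, which is not proof that every VNI is healthy if the output format differs.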

Note: This issue has been addressed in NSX version 6.1.4e. If you are running NSX 6.1.4e you will probably not hit it, because the Controller monitors the netcpad daemon and starts it if it has failed on any of the hosts.

