Malfunction is worse than failure
Redundancy is key in virtual environments. If one component fails, another jumps in and takes over. But what happens if a component does not really fail but no longer works properly? In that case the failure is not easy to detect.
I recently got a call from a friend who had suddenly lost all file shares on his (virtual) file server. I opened a connection to a service machine and started troubleshooting. These were the first diagnostic results:
- Fileserver did not respond to ping.
- Ping to gateway was successful.
- Name resolution against virtual DC was successful.
- A browser session to vCenter failed and vCenter did not respond to ping.
It is a small two-node cluster running vSphere 6.5 U2. Maybe one ESX host had failed? But then HA should have restarted all affected VMs, and that was not the case. So I pinged both hosts and got an instant reply. No, it did not look like a host crash.
Next I opened the host client to have a look at the VMs. All VMs were running.
I opened a console session to the file server and could not log in with domain credentials, but I could with a local account. The file server looked healthy from the inside.
Now it became obvious that there was a problem with networking. But all vmnics were active and link status was “up”. The virtual standard switch on which the VM-Network portgroup resided had 3 redundant uplinks with status “up”. So where’s the problem?
I found another VM, on the same host as vCenter and the file server, that responded to ping and had internet connectivity.
I opened an RDP session to it, and from there I was able to ping every VM on the same host. Even vCenter could be reached by browser. Now the picture became clearer. One of the uplinks must have a problem, although it hadn’t failed. But which one?
esxtop is your friend
Uplink vmnic0 was the active uplink for Management-Network (PG0), with vmnic1 and vmnic2 as standby. VM-Network (PG1) used vmnic1 and vmnic2 for traffic, with vmnic0 as standby.
To find out which VM is using which vmnic, you can use esxtop. Open an SSH shell on your ESX host and type esxtop. Once it has started, press “n” for the networking view. The TEAM-PNIC column shows the uplink each port is currently using.
About half of the VMs used vmnic1 and the other half vmnic2. I found that all VMs connected to vmnic2 had no problems with connectivity, while all VMs connected to vmnic1 were cut off from the world outside the host.
As long as vmnic1 reports link state “up”, VMs will not fail over to a different vmnic. So I shut down the switch port to which vmnic1 was connected. The VMs instantly failed over to vmnic2 and everything worked normally again.
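The correlation above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a tool I used on site: the function name find_suspect_uplinks, the VM names and the sample data are all hypothetical. The uplink per VM would be read off esxtop's network view by hand, and reachability would come from ping tests made from outside the host.

```python
from collections import defaultdict


def find_suspect_uplinks(vm_uplink, vm_reachable):
    """Return the uplinks for which none of the attached VMs is reachable."""
    states_by_uplink = defaultdict(list)
    for vm, uplink in vm_uplink.items():
        states_by_uplink[uplink].append(vm_reachable[vm])
    # An uplink is suspect only if every VM behind it is cut off.
    return sorted(u for u, states in states_by_uplink.items() if not any(states))


# Hypothetical sample data, not taken from the real environment.
vm_uplink = {
    "fileserver": "vmnic1",
    "vcenter": "vmnic1",
    "dc": "vmnic2",
    "jumphost": "vmnic2",
}
vm_reachable = {
    "fileserver": False,
    "vcenter": False,
    "dc": True,
    "jumphost": True,
}

print(find_suspect_uplinks(vm_uplink, vm_reachable))
```

If unreachable VMs were spread across several uplinks, the function would return nothing, which would point away from a single bad uplink and towards a different cause.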
Countermeasures
The failover detection method on the vSwitch was set to “link state only”. That’s an easy and uncomplicated way to monitor the physical connection of an uplink. But it says nothing about the logical connection, or about connection breakdowns somewhere downstream in the LAN, such as a broken link between two switches or a failed switch unit.
Monitoring link state alone is sometimes not sufficient. Recently I have had trouble more than once with malfunctioning NIC ports while link state was “up”.
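To illustrate the difference, here is a deliberately simplified Python model of how a failover decision could look under link-state-only detection compared with beacon probing. It is not how ESXi implements it. The function uplink_usable and its parameters are made up for this sketch, and it assumes that the fault also keeps beacon frames from coming back, which I could not verify in this incident.

```python
def uplink_usable(link_up, beacons_seen, detection):
    """Decide whether an uplink is treated as healthy (simplified model)."""
    if detection == "link":
        # Only the physical link state counts.
        return link_up
    if detection == "beacon":
        # Link state plus evidence that probe frames actually get through.
        return link_up and beacons_seen
    raise ValueError(f"unknown detection method: {detection}")


# An uplink like vmnic1 in this incident: link is up, traffic does not pass.
print(uplink_usable(link_up=True, beacons_seen=False, detection="link"))
print(uplink_usable(link_up=True, beacons_seen=False, detection="beacon"))
```

Under link-state-only detection the uplink stays in service, which matches what happened here. In the beacon model the same uplink would be taken out of service and the VMs would fail over.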
In my recent blog articles “ESX physical uplink resiliency” and “ESX physical uplink resiliency (part 2)” I’ve discussed countermeasures to harden vSphere traffic against downstream physical failures.