After a failed firmware update on my Intel X722 NICs, one host came up without its 10 Gbit kernel ports (vSAN network). Every recovery attempt failed and I had to send the “bricked” host in to Supermicro. Normally this shouldn’t be a big issue in a 4-node cluster. But the fact that the management interfaces were up while the vSAN interfaces were not must have caused some “disturbance” in the cluster, and all my VM objects were marked as “invalid” on the 3 remaining hosts.
I was busy with projects and didn’t have much lab time anyway, so I waited for the repair of the 4th host. Last week it finally arrived, and I immediately reinstalled boot media, cache and capacity disks. I checked MAC addresses and settings on the repaired host and everything looked good. But after booting the reunited cluster, all objects were still marked invalid.
Time for troubleshooting
First I opened an SSH shell to each host. There’s a quick PowerCLI one-liner to enable SSH throughout the cluster, but since I didn’t have a functional vCenter at that time, I had to activate SSH on each host with the host client.
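For reference, here’s a sketch of such a one-liner, assuming a working vCenter connection (Connect-VIServer) and a cluster named "vSAN-Cluster" (the cluster name is a placeholder):
Get-Cluster "vSAN-Cluster" | Get-VMHost | Get-VMHostService | Where-Object {$_.Key -eq "TSM-SSH"} | Start-VMHostService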
From the shell of the repaired host, I checked the vSAN network connection to all other vSAN kernel ports. The command below pings from interface vmk1 (vSAN) to IP 10.0.100.11 (the vSAN kernel port of esx01, for example).
vmkping -I vmk1 10.0.100.11
I received ping responses from all hosts on all vSAN kernel ports, so I could conclude there was no connectivity issue in the vSAN network.
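To save some typing, a small loop in the ESXi shell can test all peers in one go (a sketch from my lab, pinging the other three vSAN kernel ports from the repaired host):
for ip in 10.0.100.11 10.0.100.12 10.0.100.13; do vmkping -I vmk1 -c 3 $ip; done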
Next we need to find out whether the vSAN cluster is intact and all members know each other. Maybe there’s a network partition and the unicast agents on each host cannot communicate with all hosts in the cluster. To find out, we need two pieces of information from each host:
- IP address and name of the vSAN kernel port
- Host UUID
Kernel port details
I know my vSAN kernel port is vmk1, but in case you don’t know yours, here’s a command to find out.
# esxcli vsan network list
Look for the line VmkNic Name: in the output. That’s your VMkernel adapter (in my case vmk1). With that information we can get the IP address of the kernel adapter. Adjust the grep command below according to your kernel port.
[root@esx01:~] esxcli network ip interface ipv4 get | grep vmk1
vmk1 10.0.100.11 255.255.255.0 10.0.100.255 STATIC 0.0.0.0 false
Host UUID
We’ve got the kernel port name and its IP address. Now we need to know the host UUID.
[root@esx01:~] cmmds-tool whoami
5eaf1b92-7d12-9f3e-556f-002590bc4d64
Copy and paste this information into a notepad or a spreadsheet and repeat for all hosts. Then check the unicast agent settings on every host.
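For my lab, the collected notes ended up looking like this (compiled from the outputs shown throughout this post):
esx01  vmk1  10.0.100.11  5eaf1b92-7d12-9f3e-556f-002590bc4d64
esx02  vmk1  10.0.100.12  5eaf1c43-e997-774a-9bb4-002590bc4cdc
esx03  vmk1  10.0.100.13  5eaf1c30-73d8-2368-ad23-002590bb2ed0
esx04  vmk1  10.0.100.14  5eaf1c66-12d7-3de4-1310-002590bb3008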
[root@esx01:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- ----------- ----- ---------- ----------------------------------------------------------- --------------
5eaf1c30-73d8-2368-ad23-002590bb2ed0 0 true 10.0.100.13 12321 0F:0D:99:30:76:C7:02:87:09:18:EB:C3:12:F7:75:23:99:88:8E:96
5eaf1c66-12d7-3de4-1310-002590bb3008 0 true 10.0.100.14 12321 01:89:11:C0:5A:FF:C4:41:E8:E9:E1:25:2C:E2:93:46:FC:32:3F:11
5eaf1c43-e997-774a-9bb4-002590bc4cdc 0 true 10.0.100.12 12321 4B:5D:A1:88:C9:56:32:9E:03:F5:DD:5E:4D:82:0D:FC:F6:44:6D:42
The first column shows the UUIDs of all hosts in the cluster except the one you issued the command from. In the example above I ran the command on esx01 and received the UUIDs and IP addresses of esx03 (line 1), esx04 (line 2) and esx02 (line 3). That’s good: the unicast agent of esx01 has registered all other hosts.
Repeat this step on all hosts.
If member hosts are missing from the list, you can adjust the settings on the CLI. Follow the guidelines for manually setting unicast agent settings as described in VMware KB 2150303.
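As a sketch of what such a manual entry looks like (syntax as described in that KB; the UUID, IP address and port below are taken from my lab, so adjust them for your own hosts and double-check against the KB for your build):
esxcli vsan cluster unicastagent add -t node -u 5eaf1c43-e997-774a-9bb4-002590bc4cdc -U true -a 10.0.100.12 -p 12321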
I was able to confirm that all unicast agent settings throughout the cluster were correct and no host was missing. Yet all VM objects were still marked invalid.
Get vSAN cluster settings
So far we can conclude that there’s nothing wrong with the unicast agent settings. Let’s get some more cluster information.
[root@esx01:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-09-17T10:36:03Z
Local Node UUID: 5eaf1b92-7d12-9f3e-556f-002590bc4d64
Local Node Type: NORMAL
Local Node State: AGENT
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 5eaf1c30-73d8-2368-ad23-002590bb2ed0
Sub-Cluster Backup UUID: 5eaf1c66-12d7-3de4-1310-002590bb3008
Sub-Cluster UUID: 527a6824-b9bf-a7ad-36f4-8a2cd78b9685
Sub-Cluster Membership Entry Revision: 3
Sub-Cluster Member Count: 4
Sub-Cluster Member UUIDs: 5eaf1c30-73d8-2368-ad23-002590bb2ed0, 5eaf1c66-12d7-3de4-1310-002590bb3008, 5eaf1b92-7d12-9f3e-556f-002590bc4d64, 5eaf1c43-e997-774a-9bb4-002590bc4cdc
Sub-Cluster Member HostNames: esx03.lab.local, esx04.lab.local, esx01.lab.local, esx02.lab.local
Sub-Cluster Membership UUID: 2534635f-f8c8-760f-a3b1-002590bb2ed0
Unicast Mode Enabled: true
Maintenance Mode State: ON
Config Generation: d89f0896-2f64-4ebb-8232-de45a28b6392 37 2020-08-14T19:22:44.617
This summary shows more detailed information about the cluster. There are 3 types of nodes in a vSAN cluster: master, backup and agent. There’s always exactly one master host in the cluster; it receives clustering service (CMMDS) updates from all other hosts. Then there’s one backup node, which takes over the master role if the master is no longer present. All other hosts are agents; they can become the backup node if the current backup node takes over the master role.
As you can see in the result above, esx01 is an agent node, there are 4 members in the cluster, and the local node is healthy. All 4 member UUIDs and hostnames are correct. I got similar results on esx02, esx03 and esx04.
[root@esx02:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-09-17T10:35:46Z
Local Node UUID: 5eaf1c43-e997-774a-9bb4-002590bc4cdc
Local Node Type: NORMAL
Local Node State: AGENT
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 5eaf1c30-73d8-2368-ad23-002590bb2ed0
Sub-Cluster Backup UUID: 5eaf1c66-12d7-3de4-1310-002590bb3008
Sub-Cluster UUID: 527a6824-b9bf-a7ad-36f4-8a2cd78b9685
Sub-Cluster Membership Entry Revision: 3
Sub-Cluster Member Count: 4
Sub-Cluster Member UUIDs: 5eaf1c30-73d8-2368-ad23-002590bb2ed0, 5eaf1c66-12d7-3de4-1310-002590bb3008, 5eaf1b92-7d12-9f3e-556f-002590bc4d64, 5eaf1c43-e997-774a-9bb4-002590bc4cdc
Sub-Cluster Member HostNames: esx03.lab.local, esx04.lab.local, esx01.lab.local, esx02.lab.local
Sub-Cluster Membership UUID: 2534635f-f8c8-760f-a3b1-002590bb2ed0
Unicast Mode Enabled: true
Maintenance Mode State: ON
Config Generation: d89f0896-2f64-4ebb-8232-de45a28b6392 37 2020-08-14T19:22:44.634
Esx02 is an agent node too. All other hosts are registered.
[root@esx03:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-09-17T10:35:16Z
Local Node UUID: 5eaf1c30-73d8-2368-ad23-002590bb2ed0
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 5eaf1c30-73d8-2368-ad23-002590bb2ed0
Sub-Cluster Backup UUID: 5eaf1c66-12d7-3de4-1310-002590bb3008
Sub-Cluster UUID: 527a6824-b9bf-a7ad-36f4-8a2cd78b9685
Sub-Cluster Membership Entry Revision: 3
Sub-Cluster Member Count: 4
Sub-Cluster Member UUIDs: 5eaf1c30-73d8-2368-ad23-002590bb2ed0, 5eaf1c66-12d7-3de4-1310-002590bb3008, 5eaf1b92-7d12-9f3e-556f-002590bc4d64, 5eaf1c43-e997-774a-9bb4-002590bc4cdc
Sub-Cluster Member HostNames: esx03.lab.local, esx04.lab.local, esx01.lab.local, esx02.lab.local
Sub-Cluster Membership UUID: 2534635f-f8c8-760f-a3b1-002590bb2ed0
Unicast Mode Enabled: true
Maintenance Mode State: ON
Config Generation: d89f0896-2f64-4ebb-8232-de45a28b6392 37 2020-08-14T19:22:44.597
Esx03 has the master node role. All other hosts are registered.
[root@esx04:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-09-17T10:34:36Z
Local Node UUID: 5eaf1c66-12d7-3de4-1310-002590bb3008
Local Node Type: NORMAL
Local Node State: BACKUP
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 5eaf1c30-73d8-2368-ad23-002590bb2ed0
Sub-Cluster Backup UUID: 5eaf1c66-12d7-3de4-1310-002590bb3008
Sub-Cluster UUID: 527a6824-b9bf-a7ad-36f4-8a2cd78b9685
Sub-Cluster Membership Entry Revision: 3
Sub-Cluster Member Count: 4
Sub-Cluster Member UUIDs: 5eaf1c30-73d8-2368-ad23-002590bb2ed0, 5eaf1c66-12d7-3de4-1310-002590bb3008, 5eaf1b92-7d12-9f3e-556f-002590bc4d64, 5eaf1c43-e997-774a-9bb4-002590bc4cdc
Sub-Cluster Member HostNames: esx03.lab.local, esx04.lab.local, esx01.lab.local, esx02.lab.local
Sub-Cluster Membership UUID: 2534635f-f8c8-760f-a3b1-002590bb2ed0
Unicast Mode Enabled: true
Maintenance Mode State: ON
Config Generation: d89f0896-2f64-4ebb-8232-de45a28b6392 37 2020-08-14T19:22:44.677
Esx04 is the backup node. All other hosts are registered.
Finally, a simple solution
My cluster looked pretty healthy, yet all VM objects were still invalid. But there was one little fact I had missed the whole time: right before the failed firmware update I had put the whole cluster into maintenance mode, and even after reuniting esx04 with the cluster I never changed that. You can see it in the second-to-last line of all the results above: Maintenance Mode State: ON.
You can exit maintenance mode either in the host client or from the SSH shell.
esxcli system maintenanceMode set --enable false
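To verify the change on each host:
esxcli system maintenanceMode get
It should report Disabled once the host has left maintenance mode.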
I repeated that on each host, and within seconds all VMs were healthy again. It could have been easier, but it was a good exercise, and that’s what lab exercises are all about.
Not because they are easy, but because they are hard.
John F. Kennedy, 1962