Unable to re-join EVC cluster after restore of ESXi system
Changing the boot media of ESXi hosts has (unfortunately) become a routine job, because many flash media have a limited lifespan. To be fair, I need to point out that many customers use (cheap and dirty) USB flash sticks as boot media. But what works well in a homelab turns out to be a bad idea in enterprise environments.
The usual procedure for media replacement is fairly simple:
- export host configuration
- evacuate and shut down host
- prepare fresh boot medium with installation ISO that has the same or lower patchlevel as the old installation
- boot freshly installed host
- apply (intermediate) IP address if no DHCP available
- restore host configuration
- re-connect to cluster
- apply patches if necessary
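The export and restore steps above can be sketched from the ESXi shell with vim-cmd. This is a minimal sketch, not the exact procedure used here: the helper names (run, backup_host_config, restore_host_config), the DRY_RUN switch and the default bundle path /tmp/configBundle.tgz are my own assumptions for illustration. By default the script only prints the commands, so it is safe to try anywhere.

```shell
#!/bin/sh
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 on the host to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Helper (hypothetical): print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# On the old installation: flush pending changes to disk, then create the bundle.
# backup_config prints a URL from which configBundle.tgz can be downloaded.
backup_host_config() {
    run vim-cmd hostsvc/firmware/sync_config
    run vim-cmd hostsvc/firmware/backup_config
}

# On the freshly installed host: enter maintenance mode, then restore the bundle.
# $1 = path of the uploaded bundle (the default path is an assumption).
restore_host_config() {
    bundle="${1:-/tmp/configBundle.tgz}"
    run vim-cmd hostsvc/maintenance_mode_enter
    run vim-cmd hostsvc/firmware/restore_config "$bundle"
}
```

Source the file and call backup_host_config before shutting the host down, then restore_host_config on the new installation. Expect the host to reboot after a successful restore.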
So far so good. But last week I had a nasty experience with a recovered ESXi host.
EVC mismatch
The customer was running ESXi 6.0 U3 at patch level (build) 9313334, which translates to 6.0 EP15 (August 2018). All hosts used customized images by Fujitsu. So to recover a host, I had to use a Fujitsu custom image ISO with a patch level equal to or lower than EP15. The only one available was ESXi 6.0 U3 build 5050593 (February 2017). Usually that is not a serious problem: just re-join vCenter and patch the host to the common level with Update Manager. But this time vCenter gave me the blues. When I tried to reconnect the host, I was faced with this error message:
Reconnect host: The host’s CPU hardware should support the cluster’s current Enhanced vMotion Compatibility mode, but some of the necessary CPU features are missing from the host. Check the host’s BIOS configuration to ensure that no necessary features are disabled (such as XD, VT, AES, or PCLMULQDQ for Intel, or NX for AMD). For more information, see KB articles 1003212 and 1034926.
What?!
This host had been a member of the cluster before, and in the meantime there had been no changes to the EVC settings or to the host’s BIOS.
After some research, and with the help of VMware support, we came to the conclusion that it was probably related to the VMware Spectre / Meltdown mitigations that were introduced in 2018. These include microcode updates for Intel CPUs (ESXi600-201803402-BG, ESXi600-201806402-BG, ESXi600-201808402-BG).
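One way to check this suspicion is to compare the microcode package on the restored host with a healthy cluster member. The sketch below assumes that the microcode ships in a VIB named cpu-microcode and that esxcli software vib list prints the name and version as the first two columns. The helper name microcode_vib is hypothetical.

```shell
#!/bin/sh
# Read "esxcli software vib list" output on stdin and print
# the name and version of the cpu-microcode VIB (assumed VIB name).
microcode_vib() {
    awk '$1 == "cpu-microcode" { print $1, $2 }'
}

# On each host (not executed here):
#   esxcli software vib list | microcode_vib
```

If the version on the restored host is older than on the other cluster members, missing microcode updates are a plausible cause of the EVC error.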
Our customized ESXi 6.0 Update 3 image did not include these microcode updates, but the cluster’s Enhanced vMotion Compatibility (EVC) mode required them. We patched the host on the CLI to the common cluster patch level (9313334). The easiest way to do so is to copy the patch ZIP files to a shared datastore. In my example it is Datastore1, and the patches reside in a folder named “patch”. Open an SSH connection and update with the command below.
esxcli software vib update -d <path_to_patch.zip>
Example
esxcli software vib update -d /vmfs/volumes/Datastore1/patch/ESXi600-201703001.zip
Use the “update” command instead of “install” if you’re using a customized ESXi image. “install” might render your host unbootable by overwriting or deleting third-party drivers. That happened to me once and I lost my FC HBA, so I had to start over from scratch. 🙁
The shell will report what packages have been installed, deleted or skipped. Watch out for the message:
"The update completed successfully, but the system needs to be rebooted for changes to be effective."
Install all packages in chronological order of their release date. There’s no need to do a reboot after each package (unless you’re bored and happy to waste some time). 😉
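Installing several bundles in release order can be scripted. The following is a minimal sketch under two assumptions: the bundles follow the ESXi600-YYYYMMnnn.zip naming scheme, so sorting by name gives chronological order, and they sit in the example folder on Datastore1. The function names, PATCH_DIR and the DRY_RUN switch are hypothetical. By default the script only prints the commands.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints the esxcli commands; DRY_RUN=0 executes them on the host.
DRY_RUN="${DRY_RUN:-1}"
PATCH_DIR="${PATCH_DIR:-/vmfs/volumes/Datastore1/patch}"

# List the patch bundles in directory $1, sorted by file name.
# Assumption: names like ESXi600-201803001.zip sort in release order.
list_patches() {
    ls "$1"/ESXi600-*.zip 2>/dev/null | sort
}

# Apply every bundle in directory $1 with "vib update" (not "install"),
# oldest first. Stops at the first failing bundle and returns non-zero.
apply_patches() {
    list_patches "$1" | while read -r zip; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "esxcli software vib update -d $zip"
        else
            esxcli software vib update -d "$zip" || exit 1
        fi
    done
}
```

Calling apply_patches "$PATCH_DIR" in dry-run mode first shows the order in which the bundles would be applied, which is worth checking before running it for real.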
After successful installation of the last package, type:
reboot
Welcome home
After all microcode updates have been applied to the host, we could re-join the cluster without any problems.
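Before reconnecting, it can be worth confirming that the host really reports the cluster’s build. The sketch below parses the output of esxcli system version get. It assumes the build appears on a line of the form "Build: Releasebuild-9313334", and the helper names are hypothetical.

```shell
#!/bin/sh
# Read "esxcli system version get" output on stdin and print the numeric build.
# Assumption: the output has a line like "   Build: Releasebuild-9313334".
build_number() {
    sed -n 's/^ *Build: *Releasebuild-\([0-9][0-9]*\).*/\1/p'
}

# Succeed only if the build read from stdin equals the expected build in $1.
build_matches() {
    [ "$(build_number)" = "$1" ]
}

# On the host (not executed here):
#   esxcli system version get | build_matches 9313334 && echo "build OK"
```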