vMotion fails at 21% with error 195887371

How to troubleshoot vMotion issues

Troubleshooting vMotion failures is in most cases a matter of networking. In this case I will demonstrate how to trace down the problem and how to find possible culprits.

What’s the problem?

Initiating a host vMotion between esx1 and esx2 passes all pre-checks, but then fails at 21% progress.

Migrate virtual machine:Failed waiting for data. Error 195887371. The ESX hosts failed to connect over the VMotion network.

See the error stack for details on the cause of this problem.
Time: 07.01.2018 19:08:08
Target: WSUS
vCenter Server: vc
Error Stack
Migration [167797862:1515348488969364] failed to connect to remote host <192.168.45.246> from host <10.0.100.102>: Timeout.
vMotion migration [167797862:1515348488969364] vMotion migration [167797862:1515348488969364] stream thread failed to connect to the remote host <192.168.45.246>: The ESX hosts failed to connect over the VMotion network
The vMotion migrations failed because the ESX hosts were not able to connect over the vMotion network. Check the vMotion network settings and physical network configuration. 
Migration [167797862:1515348488969364] failed to connect to remote host <10.0.100.102> from host <192.168.45.246>: Timeout.
vMotion migration [167797862:1515348488969364] failed to create a connection with remote host <10.0.100.102>: The ESX hosts failed to connect over the VMotion network
Failed waiting for data. Error 195887371. The ESX hosts failed to connect over the VMotion network.

Check VLAN

My first step was to check the VLAN settings of the vmkernel adapters.

The vMotion vmkernel port on each host was set to VLAN 100.

All physical switch ports were tagged members of VLAN 20 (iSCSI) and VLAN 100 (vMotion).
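
If you prefer the command line over clicking through the UI, the VLAN ID can also be read via PowerCLI (a minimal sketch, assuming standard vSwitches and the port group name vMotion from the script output further below):

# VLAN ID of the vMotion port group on both hosts (standard vSwitch assumed)
Get-VMHost esx1*, esx2* | Get-VirtualPortGroup -Name vMotion | Select-Object Name, VLanId

Both rows should report VLanId 100.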

vMotion enabled?

One of the simplest causes is a vmkernel port that is not enabled for vMotion. It may sound stupid, but it happens quite often. 🙂

But in this case, vMotion was enabled on the vmkernel port (and on the other host too).
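
This can also be verified with PowerCLI (a sketch; vmk3 is the vMotion adapter in this environment):

# Check the vMotion flag on the vMotion vmkernel port of both hosts
Get-VMHost esx1*, esx2* | Get-VMHostNetworkAdapter -VMKernel -Name vmk3 | Select-Object VMHost, Name, VMotionEnabled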

Check connectivity

The next step is usually to send a vmkping over the adapter in question. We’ll send a ping over vmk3 (vMotion) to its corresponding partner on esx2 with IP 10.0.100.102. Parameter -4 forces the ping over IPv4, and -I makes sure that interface vmk3 (vMotion) is used.

[root@esx1:~] vmkping -4 -v -I vmk3 10.0.100.102
PING 10.0.100.102 (10.0.100.102): 56 data bytes
64 bytes from 10.0.100.102: icmp_seq=0 ttl=64 time=0.353 ms
64 bytes from 10.0.100.102: icmp_seq=1 ttl=64 time=0.502 ms
64 bytes from 10.0.100.102: icmp_seq=2 ttl=64 time=0.456 ms

--- 10.0.100.102 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.353/0.437/0.502 ms

This is a very important test. It shows us that there is basic connectivity between the two vMotion vmkernel adapters.

What about MTU?

Jumbo frames have a Maximum Transmission Unit (MTU) of 9000 bytes and are a common source of connectivity issues. You need to make sure that all interfaces along the path have an MTU of 9000. We can check this with a customized ping command. Option -s sets the payload size to 8972 bytes (9000 bytes minus 20 bytes of IP header and 8 bytes of ICMP header), and option -d stands for “do not fragment”. So we will send an 8972-byte frame to the vmkernel port of esx2. If a standard MTU of 1500 is set somewhere on the path, the ping will fail.

[root@esx1:~] vmkping -4 -d -s 8972 -I vmk3 10.0.100.102
PING 10.0.100.102 (10.0.100.102): 8972 data bytes
8980 bytes from 10.0.100.102: icmp_seq=0 ttl=64 time=0.601 ms
8980 bytes from 10.0.100.102: icmp_seq=1 ttl=64 time=0.449 ms
8980 bytes from 10.0.100.102: icmp_seq=2 ttl=64 time=0.447 ms

--- 10.0.100.102 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.447/0.499/0.601 ms

We can see from the result that there is no problem with the MTU. All packets were properly transmitted.
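
Had the jumbo ping failed, comparing the configured MTU of all vmkernel adapters and virtual switches helps to find the odd one out (a PowerCLI sketch, standard vSwitches assumed; the physical switches along the path still have to be checked on the switches themselves):

# MTU of all vmkernel adapters and standard vSwitches in the inventory
Get-VMHostNetworkAdapter -VMKernel | Select-Object VMHost, Name, Mtu
Get-VirtualSwitch -Standard | Select-Object VMHost, Name, Mtu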

Other vMotion Networks?

At this point, everything should be fine – but obviously it isn’t.

What if there is another vMotion-enabled vmkernel port in a different network? We could check every vmkernel port manually, but it is smarter to use a script.

Open PowerCLI and connect to your vCenter:

Connect-VIServer vc

Enter your credentials and execute this one-liner:

Get-VMHostNetworkAdapter -VMKernel | Where-Object {$_.VMotionEnabled} | Select-Object VMHost, Name, IP, SubnetMask, PortGroupName, VMotionEnabled, Mtu

It lists all vmkernel adapters on all hosts that have vMotion enabled.

VMHost : esx2.xxxx.xxxx.de
Name : vmk3
IP : 10.0.100.102
SubnetMask : 255.255.255.0
PortGroupName : vMotion
VMotionEnabled : True
Mtu : 9000

VMHost : esx1.xxxx.xxxx.de
Name : vmk0
IP : 192.168.45.246
SubnetMask : 255.255.255.0
PortGroupName : Management Network
VMotionEnabled : True
Mtu : 1500

VMHost : esx1.xxxx.xxxx.de
Name : vmk3
IP : 10.0.100.101
SubnetMask : 255.255.255.0
PortGroupName : vMotion
VMotionEnabled : True
Mtu : 9000

Now here’s the problem: there is still a vMotion-enabled vmkernel adapter (vmk0, Management Network) in subnet 192.168.45.0 on esx1.

After migrating vMotion to a new network, someone forgot to uncheck vMotion on the old Management Network port group. esx1 therefore also offered vmk0 for vMotion, but packets sent over it could not reach a corresponding vmkernel port on esx2, and the whole process failed. This matches the error stack above: the hosts tried to connect between 192.168.45.246 and 10.0.100.102 and timed out.
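
The fix is to uncheck the vMotion service on vmk0 in the vSphere Client, or to do the same with one line of PowerCLI (a sketch using the adapter and host names from the output above):

# Disable the stale vMotion flag on vmk0 of esx1
Get-VMHost esx1* | Get-VMHostNetworkAdapter -VMKernel -Name vmk0 | Set-VMHostNetworkAdapter -VMotionEnabled:$false -Confirm:$false

After that, each host has exactly one vMotion-enabled vmkernel adapter in the same subnet, and the migration completes.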
