IBM has released a flash alert regarding a new behavior introduced in vSphere 5.5 Update 2 and vSphere 6.0, where VAAI ATS (Atomic Test and Set, in other words hardware-accelerated locking) is used for heartbeat I/O.
According to IBM:
Due to the low timeout value for heartbeat I/O using ATS, this can lead to host disconnects and application outages if delays of 8 seconds or longer are experienced in completing individual heartbeat I/Os on backend storage systems or the SAN infrastructure.
All versions of the Storwize family (and SVC) since 6.4 are affected.
In vCenter’s events you may observe:
Lost access to volume <uuid><volume name> due to connectivity issues. Recovery attempt is in progress and the outcome will be reported shortly
and in ESXi’s vmkernel.log:
ATS Miscompare detected beween test and set HB images at offset XXX on vol YYY
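To check quickly whether a host has logged this symptom, you can count the matching lines in the log. The following is only a minimal sketch: the function name count_ats_miscompare is hypothetical, and /var/log/vmkernel.log is assumed to be the log location on your host.

```shell
#!/bin/sh
# Hypothetical helper: count ATS miscompare heartbeat messages in a vmkernel log.
# Assumption: the log lives at /var/log/vmkernel.log unless another file is passed.
count_ats_miscompare() {
    log_file="${1:-/var/log/vmkernel.log}"
    # grep -c prints the number of matching lines (0 when there are none)
    grep -c "ATS Miscompare detected" "$log_file"
}
```

Calling count_ats_miscompare with no argument reads the assumed default log. A non-zero count suggests the host has hit the heartbeat issue described above.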
The workaround for now is to disable ATS for heartbeats only. You don’t need to disable ATS globally, since you would lose the benefits of ATS. The change is non-disruptive, so it can be performed online without a host reboot (IBM states otherwise in their KB, but we have successfully tested it without an outage).
VMware has also released KB2113956, where you can find more details.
Disable ATS for heartbeats:
esxcli system settings advanced set -i 0 -o /VMFS3/useATSForHBOnVMFS5
and verify it with:
esxcli system settings advanced list -o /VMFS3/useATSForHBOnVMFS5
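If you want to script the verification, you can pull the current value out of the list output. This is a rough sketch under one assumption: that the output contains a line of the form "Int Value: N" next to a separate "Default Int Value: N" line. The function name ats_hb_value is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the output of
#   esxcli system settings advanced list -o /VMFS3/useATSForHBOnVMFS5
# from stdin and print the current integer value.
# Assumption: the output has an "Int Value: N" line; the "Default Int Value"
# line is skipped because the pattern only allows whitespace before "Int Value".
ats_hb_value() {
    awk -F': *' '/^[[:space:]]*Int Value:/ { print $2 }'
}
```

Piping the verify command above into ats_hb_value should print 0 once ATS heartbeat has been disabled, provided the output format matches the assumption.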
You can read more about disabling VAAI and ATS in VMware KB1033665.
When this article was first published there was no VMware KB on this issue yet, and I have no information on whether other storage vendors are affected too. If you are not sure, I suggest you verify with your storage vendor or VMware.
I will update this article once I have more information. Please leave a comment if your storage type is affected as well.
Update 16th April 2015: Added symptoms, VMware KB, changed headline
Update 18th May 2015: It looks like EMC VMAX is affected by this as well. You can read more about it at http://timsvirtualworld.com/2015/04/ats-miscompare-issue-with-emc-vmax/