Updated: Storage disconnects when using VAAI ATS on vSphere 5.5 Update2 and vSphere 6.0

IBM has released a flash alert regarding a new behavior introduced in vSphere 5.5 Update2 and vSphere 6.0, where VAAI ATS (Atomic Test and Set, in other words hardware-accelerated locking) is used for heartbeat I/O.

http://www-01.ibm.com/support/docview.wss?uid=ssg1S1005201

According to IBM:

Due to the low timeout value for heartbeat I/O using ATS, this can lead to host disconnects and application outages if delays of 8 seconds or longer are experienced in completing individual heartbeat I/Os on backend storage systems or the SAN infrastructure.

All versions of the Storwize family (and SVC) since 6.4 are affected.

SYMPTOMS:

In vCenter’s events you may observe:

Lost access to volume <uuid><volume name> due to connectivity issues. Recovery attempt is in progress and the outcome will be reported shortly

and in ESXi’s vmkernel.log:

ATS Miscompare detected between test and set HB images at offset XXX on vol YYY
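
To check whether a host has logged this message, something like the following could be run in the ESXi shell. This is only a minimal sketch: it assumes the default log location /var/log/vmkernel.log, the helper name count_ats_miscompare is made up for this example, and it matches only the phrase "ATS Miscompare" (case-insensitive).

```shell
# Count "ATS Miscompare" lines in a vmkernel log file.
# count_ats_miscompare is a hypothetical helper name, not an ESXi tool.
count_ats_miscompare() {
    # $1: path to the log file (assumed default: /var/log/vmkernel.log)
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -ci "ATS Miscompare" "${1:-/var/log/vmkernel.log}" || true
}

# Only touch the live log when it is readable (i.e. on an ESXi host).
if [ -r /var/log/vmkernel.log ]; then
    echo "ATS miscompare lines: $(count_ats_miscompare /var/log/vmkernel.log)"
fi
```

A count of zero does not rule out other storage problems; it only means this particular message was not found in the current log file.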

FIX:

The workaround for now is to disable ATS for heartbeats only. You don’t need to disable ATS globally, as you would lose the benefits of ATS. The operation is non-disruptive, so it can be performed online without a host reboot (IBM states otherwise in their KB, but we have tested it successfully without an outage).

VMware has also released KB2113956, where you can find more details.

Disable ATS for heartbeats:

esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

and verify with:

esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5
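
The two commands above can be combined into a small check-and-set sketch for the ESXi shell. It assumes that the list output contains a line of the form "Int Value: N" for the current value; the function names are made up for this example, and the option path is passed in as an argument rather than hard-coded.

```shell
# Extract the current integer value from `esxcli ... advanced list` output
# read on stdin. Assumes a line of the form "   Int Value: 0"; the
# "Default Int Value" line is deliberately not matched.
ats_hb_value() {
    sed -n 's/^[[:space:]]*Int Value:[[:space:]]*//p'
}

# Disable ATS for heartbeats only if it is not already disabled.
# $1: advanced option path, as used in the commands above.
# ensure_ats_hb_disabled is a hypothetical helper name.
ensure_ats_hb_disabled() {
    opt=$1
    current=$(esxcli system settings advanced list -o "$opt" | ats_hb_value)
    if [ "$current" = "0" ]; then
        echo "ATS heartbeat already disabled"
    else
        esxcli system settings advanced set -i 0 -o "$opt"
        echo "ATS heartbeat disabled (was: $current)"
    fi
}
```

On a host you would call ensure_ats_hb_disabled with the option path from the commands above. Checking first keeps the script safe to re-run, for example when it is pushed to every host in a cluster.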

You can read more about disabling VAAI and ATS in VMware KB1033665.

When this article was first published, there was no KB from VMware on this issue, and I had no information on whether other storage vendors are affected too. If you are not sure, I suggest you verify with your storage vendor or VMware.

I will update this article once I have more information. Please leave a comment if your storage type is affected as well.

 

Update 16th April 2015:  Added symptoms, VMware KB, changed headline

Update 18th May 2015: It looks like EMC VMAX is affected by this as well. You can read more about it at http://timsvirtualworld.com/2015/04/ats-miscompare-issue-with-emc-vmax/

About Dusan Tekeljak

Dusan has over 6 years of experience in the virtualization field. He currently works as a Senior VMware Platform Architect at one of the biggest retail banks in Slovakia. He has a background in closely related technologies including server operating systems, networking and storage. He used to be a member of the VMware Center of Excellence at IBM and is a co-author of several Redpapers. His main scope of work consists of designing and performance-optimizing business-critical virtualized solutions on vSphere, including, but not limited to, Oracle WebLogic, MSSQL and others. He holds several industry-leading IT certifications such as VCAP-DCD, VCAP-DCA and MCITP. Honored with #vExpert2015 and 2016 awards by VMware for his contribution to the community. Opinions are my own!

14 Comments

  1. This looks like an issue I have at one of my customers…. Thanks a lot for sharing! If this resolves my problem, I owe you a beer!

  2. let us know!

  3. Had a major downtime in a fresh 6.0 cluster with IBM SVC and V7000, >350 VMs were down because of this. Thanks VMware.

    • Shit happens, but believe me Ralf you are not the only one in this :/
      Happened to me – luckily in a very small greenfield preprod environment, as a PoC at the beginning of the migration. But I also heard about the big ones. And I suppose there will still be some in the upcoming months as companies continue with vSphere 6 adoption.

  4. Pingback: IBM FlashSystem V9000 and VMware vSphere ESXi Guidelines - The Virtualist

  5. Hi, is this issue also relevant to the IBM XIV storage system?
    We have a big issue with the same error message, but without the messages in the vmkernel.log.

    • I have been chatting with multiple storage experts, and they say that an 8s timeout is too low in the SAN world and can impact other storage vendors as well, though mostly virtualized types like SVC, VMAX…. I cannot say for XIV, but if you don’t have it in your vmkernel.log, it is most probably something else.

  6. EMC has now also published an article for the VPLEX: https://support.emc.com/kb/207382

    We ran into this issue on the VPLEX, but it was made infinitely worse by the ESX QLogic HBA driver not handling the ATS Miscompare SCSI sense code. The end result is that the driver would occasionally flip out, and an entire HBA would lose -all- its paths. When this happened to 2 HBAs simultaneously, we lost all storage on that host, and thus all the VMs.

    Thankfully, this issue is fixed in driver version 1.1.58.0.
    Here are the details of what was wrong and what was fixed (from the release notes):
    ——————————————————
    Between versions 1.1.54.0 and 1.1.55.0:

    * Problem Description: ATS miscompare check conditions were not being reported to upper layer. <- Here is where we see messages not being reported to NMP. This is the first attempt to fix the issue.

    * Solution: Driver was interpreting the error condition as a "dropped frame" scenario and the issue was never rectified. Fix was to check for this miscompare condition before determining if it was indeed a "dropped frame".

    Between versions 1.1.56.0 and 1.1.57.0:

    * Problem Description: Omit SCSI opcode check in ATS miscompare check condition. <- Here we see the incorrect response to the ATS miscompare. This is the correct fix for the issue.

    * Solution: Omit SCSI opcode check and rely on sense key and ASC to determine if this scenario is encountered.
    ——————————————————

    We have now fixed this issue by both updating the HBA driver in ESX and turning off ATS heartbeating as described in the workaround.

  7. Savvy discussion. I was fascinated by the points. Does anyone know if my assistant might be able to acquire a blank form document to type on?

  8. This post is genius. Have been losing and gaining connectivity on my whitebox ESXi 6.5 with just local SSD storage every 15 seconds for days. Been causing no end of issues. The workaround here has solved it and my box is now flying like Concorde.
    Thanks OP
