Just Another ESXi 6.0 Storage APD Handling Bug

It’s been a while since we experienced a partial outage on our storage infrastructure running on IBM SVC. A few LUNs (vdisks in the IBM SVC world) presented from the storage went offline. I’m not going into detail about why it happened, as hardware problems just happen sometimes. The important thing is how well you are prepared to mitigate the impact of such failures.

To me, the surprising side effect of this storage outage was that all hosts running ESXi 6.0 U2 got disconnected from vCenter, while the ESXi 5.5 hosts survived. I was curious, as I knew this behavior was supposed to have been fixed a long time ago by the APD (All-Paths-Down) Handling feature.

As you may remember, ESXi versions prior to 5.0 didn’t have the APD Handling feature, which effectively meant that your host went into a not responding (disconnected) state in vCenter every time you experienced a storage outage. APD Handling was introduced in ESXi 5.0 and can be controlled via the Misc.APDHandlingEnable setting.
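If you want to confirm how that setting is configured on a host, a minimal Python 3 sketch like the one below may help. It only builds the `esxcli system settings advanced list` command for the `/Misc/APDHandlingEnable` option and runs it when `esxcli` happens to be on the PATH. The helper names `apd_setting_command` and `read_apd_setting` are made up for this illustration, and the output is returned raw rather than parsed.

```python
import shutil
import subprocess


def apd_setting_command(option="/Misc/APDHandlingEnable"):
    """Build the esxcli argv that lists a single advanced setting."""
    return ["esxcli", "system", "settings", "advanced", "list", "-o", option]


def read_apd_setting():
    """Return the raw esxcli output, or None when esxcli is not available.

    Hypothetical helper: it does not interpret the output, it only
    hands back whatever text the command printed.
    """
    if shutil.which("esxcli") is None:
        return None
    return subprocess.check_output(apd_setting_command(), universal_newlines=True)


if __name__ == "__main__":
    print(" ".join(apd_setting_command()))
    output = read_apd_setting()
    print(output if output is not None else "esxcli not found on this machine")
```

On a machine without `esxcli` the script just prints the command line, so you can copy it into an SSH session on the host instead.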

So I started digging a bit.

The vmkernel.log was flooded with the following messages:

As you can see, the paths were in a NOT READY state. This was reported by the storage using SCSI sense codes. The reason is that the storage controllers (in our case the IBM SVC nodes) were still online but had lost their underlying storage. This is the standard response, as the controllers cannot tell whether the condition is permanent or temporary. You can also see that ESXi correctly detected the APD situation: “failed to issue command due to Not found (APD), try again…”
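To make the sense-code part more concrete, here is a small Python sketch that pulls the sense key, ASC and ASCQ out of a log line and names the sense key. It assumes the line contains a fragment of the form `Valid sense data: 0x2 0x4 0x3`. That format and the sample values are assumptions for illustration, not a quote from our outage logs. The sense key names are the standard SCSI ones (0x2 is NOT READY), and `decode_sense` is a hypothetical helper name.

```python
import re

# Standard SCSI sense key names (sense key is the first of the three values).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# Assumed log fragment: "Valid sense data: <key> <asc> <ascq>" in hex.
SENSE_RE = re.compile(
    r"Valid sense data:\s*(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)"
)


def decode_sense(line):
    """Return (sense key name, ASC, ASCQ) or None if no sense data is found."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(group, 16) for group in match.groups())
    return SENSE_KEYS.get(key, "UNKNOWN"), asc, ascq


if __name__ == "__main__":
    sample = "illustrative line ... Valid sense data: 0x2 0x4 0x3"
    print(decode_sense(sample))
```

A sense key of NOT READY only tells the host that the device cannot serve I/O right now, which is why ESXi has to treat it as a possibly temporary condition.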

However, the biggest issue here was that the APD Handling feature wasn’t triggered; there is no log event about it anywhere. We were therefore in the same situation as in pre-ESXi 5.0 times.

If you are not sure whether APD Handling was triggered, look for log events containing:

esx.problem.storage.apd.start
esx.problem.storage.apd.stop
apdcorrelator
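If you have a pile of collected host logs, a short Python sketch like this can do the searching for you. It scans lines for the three markers above, ignoring case, and reports where they occur. The function name `find_apd_events` and the sample lines are made up for illustration. Which log files to feed it is up to you, as the script takes the paths as arguments.

```python
import sys

# The three markers listed above; matching is case-insensitive.
APD_MARKERS = (
    "esx.problem.storage.apd.start",
    "esx.problem.storage.apd.stop",
    "apdcorrelator",
)


def find_apd_events(lines, markers=APD_MARKERS):
    """Return (line number, marker, line) for every line containing a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for marker in markers:
            if marker in lowered:
                hits.append((number, marker, line.rstrip("\n")))
                break  # report each line once, under the first marker found
    return hits


if __name__ == "__main__":
    # Usage: python find_apd.py <logfile> [<logfile> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, marker, line in find_apd_events(handle):
                print("{}:{}: [{}] {}".format(path, number, marker, line))
```

If the script prints nothing for the time window of the outage, APD Handling most likely never kicked in, which is exactly what we saw.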

All hosts came back online right after we unmapped those volumes from the ESXi hosts on the storage side. This means the storage sent a SCSI sense code announcing PDL (Permanent Device Loss), which was (luckily) handled correctly on the ESXi side.

VM Component Protection (VMCP)

This issue effectively means that VM Component Protection for APD will not work in this case either, as it is directly tied to APD handling.

VMware actions

I contacted VMware support asking for an explanation. After a couple of weeks of investigation on their side, they accepted it as a valid bug and opened a PR with engineering to fix it. Let’s hope a fix will be available soon, as this is the second issue affecting APD scenarios in vSphere 6.0; the first one is described here.

Please note this is not limited to IBM storage. Other vendors, for example NetApp and EMC, use NOT READY codes as well (although I’m not sure whether this happens only in NOT READY scenarios).

To get more information about APD handling and VM Component Protection, I also recommend the following blog posts:

https://blogs.vmware.com/vsphere/2011/08/all-path-down-apd-handling-in-50.html

https://blogs.vmware.com/vsphere/2015/06/vm-component-protection-vmcp.html

About Dusan Tekeljak

Dusan has over 6 years of experience in the virtualization field. He currently works as a Senior VMware Platform Architect at one of the biggest retail banks in Slovakia. He has a background in closely related technologies, including server operating systems, networking and storage. He used to be a member of the VMware Center of Excellence at IBM and is a co-author of several Redpapers. His main scope of work consists of designing and performance-tuning business-critical virtualized solutions on vSphere, including, but not limited to, Oracle WebLogic and MSSQL. He holds several industry-leading IT certifications such as VCAP-DCD, VCAP-DCA and MCITP. He was honored with the #vExpert2015 and 2016 awards by VMware for his contribution to the community.
Opinions are my own!


4 Comments

  1. Have they supplied any SR number or KB article describing this new problem? I had this the other day! I would like to know so I can track the outcome with engineering. Please comment back.
