Solving PSOD 'Panic Requested by another PCPU'

I've been hunting down this issue for one of my clients. They have brand new Dell T440 servers (I've also seen this issue on brand new Dell T640 servers). Great gear. However, after running for a couple of weeks we got this PSOD on one of the servers.

PSOD: NMI IPI: Panic requested by another PCPU. RIPOFF(base)
This PSOD is not good (in general, PSODs never are, haha). As this is a brand new server, I did the following:

  • Reset the system

Yes, I know, a bit lazy, I was thinking: this might be a one-time event, just reset that server... However, after a couple of weeks: BOOM. Same PSOD, this time another PCPU.

Okay, let's search for known issues: found this on the VMware forum: https://communities.vmware.com/message/2819124

No solution, let's do basic troubleshooting:

  • Intensive memory tests: successful (again, since I did this when I just got the servers)
  • New firmware was available: so I installed that
  • New ESXi build was available: so I installed that

Everything looked fine. Until a couple of days later: BOOM. Same PSOD, slightly different, this time another PCPU. I was using a SanDisk Ultrafit USB3 memory key inside the server (the mainboard has a USB3 port, which is excellent for running ESXi from). So I decided on getting another USB key:

  • Inserted a new USB3 key, different brand, reinstalled ESXi, and hey, there is another ESXi build and firmware... Let's patch again.

Hoping it would finally resolve this issue. However, after a couple of weeks: BOOM. The same PSOD - DAMN YOU STUPID SERVER. Next, I did:

  • Inserted some HDDs, installed ESXi on those (RAID1)

Guess what; it's stable. Whoohoo - uh oh - BOOM. Same PSOD... After a couple of weeks again.

I was planning to sort this out by the end of the week, so I reset the server. Now the PSOD was after a few days, reset, now after a day, reset, now twice a day, reset, now one hour... Not good! Also, the other server was displaying the same behavior.

Cause of this issue

Use 'ls -e' to display the file timestamps down to the second

I checked the logs. One strange thing about the vmkernel logs is that their timestamps are only seconds apart, which tells me a lot of logging was done in a very short time. The logs are also small (I know, they are compressed, but they are small even for compressed files).


Lots of the same message from the ahci driver

Inside the vmkernel logs there were a lot of messages like this. Check the driver causing these issues: ahci!!! I'm not using AHCI (which is a controller: the 'new' IDE), neither for my SSDs nor for the HDDs (they are on the PERC RAID controller). There is a DVD-ROM player though. I decided to disable the driver, since I'm not using it anyway.

My best guess is that the driver / AHCI controller / DVD-ROM player has an issue, causing massive amounts of errors to be written. The PCPU isn't capable of writing the errors to disk, panics, gets caught by another PCPU and triggers a PSOD.
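To put a number on "a lot of messages", a small helper like the one below can count the lines mentioning a driver in each log file. This is only a sketch: count_driver_lines is a made-up name, it assumes grep and gzip are available in the shell you run it from, and the log paths in the usage comment are assumptions, so adjust them to wherever your vmkernel logs live.

```shell
# count_driver_lines DRIVER FILE...
# Prints "<matching line count> <file>" for every file given.
# Files ending in .gz are decompressed on the fly; others are read as-is.
count_driver_lines() {
    driver=$1
    shift
    for f in "$@"; do
        case $f in
            *.gz) n=$(gzip -dc "$f" | grep -c -i -- "$driver") ;;
            *)    n=$(grep -c -i -- "$driver" "$f") ;;
        esac
        printf '%s %s\n' "$n" "$f"
    done
}

# Usage (paths are an assumption, not taken from my servers):
#   count_driver_lines ahci /var/run/log/vmkernel.log /var/run/log/vmkernel.*.gz
```

A count that dwarfs everything else in the file is a good hint about which driver is flooding the log.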

Resolving this issue 

Disabling the driver can be done on the command line. The instructions are in this knowledge base article: https://kb.vmware.com/s/article/2147565

First check which driver is loaded:

esxcli system module list | grep ahci 

I have 'vmw_ahci'. But from my experience, if vmw_ahci isn't loaded, the host still detects an AHCI controller and loads 'ahci' instead. I haven't tested that driver, but I didn't want to take any chance that the server would crash again. So I decided to disable both:

esxcli system module set --enabled=false --module=vmw_ahci
esxcli system module set --enabled=false --module=ahci 
vmw_ahci is still loaded, server requires a reboot

When running 'esxcli system module list | grep ahci' again, I see the driver's 'Is Enabled' set to false, but 'Is Loaded' is still true. The change only takes effect after rebooting the server.

vmw_ahci and ahci are not loaded

 Now the server is displaying the correct state:
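If you'd rather check this from a script than by eye, a helper along these lines can pull the state of one module out of the list. It is a sketch under assumptions: module_state is a hypothetical name, awk must be available in your shell, and it assumes the output columns are Name, Is Loaded, Is Enabled, in that order.

```shell
# module_state NAME
# Reads 'esxcli system module list' output on stdin and prints the
# loaded/enabled state of the module whose name matches NAME exactly.
# Assumes column 1 is the name, column 2 'Is Loaded', column 3 'Is Enabled'.
module_state() {
    awk -v mod="$1" '
        $1 == mod { printf "%s loaded=%s enabled=%s\n", $1, $2, $3; found = 1 }
        END       { if (!found) printf "%s not listed\n", mod }
    '
}

# Usage:
#   esxcli system module list | module_state vmw_ahci
#   esxcli system module list | module_state ahci
```

Matching on the exact name avoids the small trap with 'grep ahci', which matches both vmw_ahci and ahci at once.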

 
 

Comments 1

Guest - Bryan on Thursday, 20 June 2019 11:24

same problem - thank you!
