I've been hunting down this issue for one of my clients. They have brand new Dell T440 servers (I've also seen this issue on brand new Dell T640 servers). Great gear. However, after running for a couple of weeks, we got this PSOD on one of the servers.
- Reset the system
Okay, let's search for known issues: found this on the VMware forum: https://communities.vmware.com/message/2819124
No solution, let's do basic troubleshooting:
- Intensive memory tests: successful (again, since I did this when I just got the servers)
- New firmware was available: so I installed that
- New ESXi build was available: so I installed that
Everything looked fine. Until a couple of days later: BOOM. Same PSOD, slightly different, this time on another PCPU. I was using a SanDisk Ultrafit USB3 memory key inside the server (the mainboard has a USB3 port, which is excellent for running ESXi from). So I decided to get another USB key:
- Inserted a new USB3 key, different brand, reinstalled ESXi, and hey, there is another ESXi build and firmware... Let's patch again.
- Inserted some HDDs, installed ESXi on those (RAID1)
I was planning to sort this out by the end of the week, so I reset the server. The next PSOD came after a few days. After another reset it came after a day, then twice a day, then after one hour... Not good! Also, the other server was displaying the same behavior.
Cause of this issue
I checked the logs. One thing that was strange is the
My best guess is that the driver / AHCI controller / DVD-ROM drive has an issue, causing massive amounts of errors to be written. The PCPU isn't capable of writing the errors to disk and panics, which gets caught by another PCPU and triggers a PSOD.
Resolving this issue
Disabling the driver can be done on the command line. You can find the instructions here in this knowledge base article: https://kb.vmware.com/s/article/2147565
First check which driver is loaded:
esxcli system module list | grep ahci
I have 'vmw_ahci'. But in my experience, if vmw_ahci isn't loaded, ESXi still detects an AHCI controller and loads 'ahci' instead. I haven't tested that driver, but I didn't want to take any chance that the server would crash again. So I decided to disable both:
esxcli system module set --enabled=false --module=vmw_ahci
esxcli system module set --enabled=false --module=ahci
When running 'esxcli system module list | grep ahci' again, I see the driver's 'Is Enabled' set to false, but 'Is Loaded' is still true. The module stays loaded until the next boot, so reboot the server for the change to take effect.
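To make that last check less error-prone, here is a minimal sketch of a helper that lists modules which are disabled but still loaded, meaning they are waiting for a reboot. The function name pending_reboot_modules is my own, not part of esxcli. It assumes the list output has three columns in the order Name, Is Loaded, Is Enabled, and that awk is available in the ESXi shell, so verify both on your build before relying on it.

```shell
# Hypothetical helper (not an esxcli feature): print the names of modules
# that are disabled but still loaded, i.e. still waiting for a reboot.
# Reads 'esxcli system module list' output on stdin.
# Assumed column order: Name, Is Loaded, Is Enabled.
pending_reboot_modules() {
    # $1 = module name, $2 = Is Loaded, $3 = Is Enabled.
    # Header and separator lines never match, so they are skipped implicitly.
    awk '$2 == "true" && $3 == "false" { print $1 }'
}

# On the host (not executed here):
#   esxcli system module list | pending_reboot_modules
```

If this prints vmw_ahci or ahci, the disable has been recorded but the driver is still active. After the reboot, neither name should be printed.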