Troubleshooting disk latency
I was asked to troubleshoot a VM. SCOM was reporting the following message: "Logical disk transfer (reads and writes) latency is too high - The threshold for the Logical Disk\Avg.". On top of that, jobs (this is a SQL server) took almost twice as long to run. At first glance this looks like a disk I/O issue: the path to the storage endpoint (SAN), an incorrect HBA configuration, or perhaps congestion on the storage array. None of these turned out to be the cause. Actually, it doesn't have anything to do with disk latency!
So I opened up VROPS and looked at CPU | Contention. The numbers were high. Contention like this would suggest the VM is fighting for CPU resources, cannot get them, and would perform better if it could. So I checked the host and the
Next, I took a look at CPU | Demand. Demand was high, but while demand was high, contention was low! That rules out a conflict with other VMs: there is none. Something else is going on here! I remembered an issue with C-State throttling on Intel CPUs. If a VM has low CPU demand, the physical CPU doesn't need to run at full clock speed. However, once the VM demands more CPU resources, the CPU needs to switch to a faster state, and the time it takes to make that switch introduces latency. It's called %C_Lat and can be viewed in
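The demand-versus-contention reasoning above can be sketched as a small check. This is only an illustration, assuming you have exported paired CPU | Demand and CPU | Contention samples as percentages. The names `CpuSample` and `triage` and both thresholds are hypothetical, not something VROPS provides.

```python
from dataclasses import dataclass

# Hypothetical labels for the three outcomes of the check.
CO_TENANT = "contention under load: likely competing with other VMs"
NOT_CO_TENANT = "no contention under load: look elsewhere (e.g. CPU power management)"
INCONCLUSIVE = "no high-demand samples: cannot tell"


@dataclass
class CpuSample:
    demand_pct: float      # CPU | Demand for one interval, in percent
    contention_pct: float  # CPU | Contention for the same interval, in percent


def triage(samples, demand_high=70.0, contention_high=5.0):
    """Classify contention by looking only at intervals where demand is high.

    Both thresholds are assumptions chosen for illustration; tune them
    to your own environment.
    """
    busy = [s for s in samples if s.demand_pct >= demand_high]
    if not busy:
        return INCONCLUSIVE
    contended = sum(1 for s in busy if s.contention_pct >= contention_high)
    # If most busy intervals are also contended, other VMs are a plausible cause.
    if contended / len(busy) >= 0.5:
        return CO_TENANT
    return NOT_CO_TENANT


if __name__ == "__main__":
    # Made-up numbers mimicking the pattern described in this post:
    # contention is high overall, but low while demand is high.
    samples = [
        CpuSample(90, 1), CpuSample(85, 2),
        CpuSample(20, 12), CpuSample(15, 14),
    ]
    print(triage(samples))
```

With the made-up samples above, the check reports no contention under load, which is the same conclusion I reached by eyeballing the graphs.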
So I opened up OneView, checked the profile and by default it's configured to 'Balanced Power and Performance'. I've switched it to 'Maximum Performance'. Have a look at the screenshot below, and see what changes:
Awesome... I've moved the VM back to its original server. And we should see improvements...
In response to the comment below, you can find the results here: https://www.jume.nl/troubleshooting-disk-latency-cont
Thanks for reading. Actually I did, but to make it more clear I added a link to that post: https://www.jume.nl/troubleshooting-disk-latency-cont.