Setting up the core dump partition
Sunday, 01 April 2007 10:47

When an operating system crashes, it tries to write a dump containing the latest state of memory, the devices being accessed, and the last commands executed; the VMkernel does the same. This document describes the different options for setting up a core dump partition, also known as the ‘diagnostic’ partition.

Intro

In case of a PSOD (Purple Screen of Death), the VMkernel dumps its active memory to disk. It can only do this to a vmkcore dump partition, that is, a partition of type FC. This partition can be created during installation or afterwards, as this document describes.
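To make the "type FC" requirement concrete, here is a minimal sketch that scans a classic MBR sector for primary partition entries whose type byte is 0xFC. It assumes an MBR-style partition table and only looks at the four primary entries, so a vmkcore created as a logical partition inside an extended partition would not be found. The function names and the demo sector are hypothetical and are not part of any VMware tool.

```python
import struct

VMKCORE_TYPE = 0xFC  # partition type the article gives for vmkcore


def find_vmkcore_partitions(mbr: bytes):
    """Return (entry_number, start_lba, sector_count) for primary entries of type 0xFC."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    found = []
    for i in range(4):
        # The partition table starts at byte 446; each entry is 16 bytes.
        entry = mbr[446 + 16 * i: 446 + 16 * (i + 1)]
        ptype = entry[4]  # the type byte sits at offset 4 of the entry
        start_lba, sectors = struct.unpack_from("<II", entry, 8)
        if ptype == VMKCORE_TYPE:
            found.append((i + 1, start_lba, sectors))
    return found


def build_demo_mbr() -> bytes:
    """Build a synthetic MBR (assumption: demo data only) with a 102MB type-FC entry in slot 2."""
    mbr = bytearray(512)
    sectors_102mb = 102 * 1024 * 1024 // 512
    entry = struct.pack("<B3sB3sII", 0, b"\0\0\0", VMKCORE_TYPE, b"\0\0\0", 2048, sectors_102mb)
    mbr[446 + 16: 446 + 32] = entry
    mbr[510:512] = b"\x55\xaa"
    return bytes(mbr)


if __name__ == "__main__":
    print(find_vmkcore_partitions(build_demo_mbr()))
```

On a real system the 512 bytes would come from the first sector of the disk; which device path to read is left open here because it depends on the installation.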

Here is an overview:


  • Private local diagnostic

The first option is for an installation on local storage (an internal disk). During installation you need to create a vmkcore partition of 102MB on the local disk.

  • Private diagnostic on SAN/iSCSI

The second option is for a boot-from-SAN configuration. Every Service Console needs its own private LUN. On that same LUN we can create a 102MB vmkcore partition during installation. Every Service Console/VMkernel then has access to its own disk and dumps to it in case of a PSOD.

  • Shared diagnostic on SAN/iSCSI

The last option can be used with either a local Service Console installation or a boot-from-SAN installation. It requires a LUN with at least 1.6GB of free space (a separate, dedicated diagnostic LUN is recommended). This partition cannot be created during installation, but it can be created on a running system using the VI Client.

If you are installing VMware ESX on a local disk, the first option is the best choice. Local disks tend to be the most reliable (given that you are using some kind of RAID1/RAID5 solution). You could opt for the centralized, shared vmkcore while still booting locally, but if a SAN failure coincides with a PSOD, the dump cannot be written. If you do not have local storage (diskless servers or blades), the second and last options are the way to go.

The 1.6GB shared diagnostic

The shared vmkcore partition is quite interesting, since there is almost no documentation about it. The ‘SAN Configuration Guide’ in the online library states:

“Sharing Diagnostic Partitions

If your ESX Server host has a local disk, that disk is most appropriately used for the diagnostic partition. One reason is that if there is an issue with remote storage that causes a core dump, the core dump is lost and resolving the issue becomes more difficult.

However, for diskless servers that boot from SAN, multiple ESX Server systems can share one diagnostic partition on a SAN LUN. If more than one ESX Server system is using a LUN as a diagnostic partition, that LUN must be zoned so that all the servers can access it.

Each server needs 100MB of space, so the size of the LUN determines how many servers can share it. Each ESX Server system is mapped to a diagnostic slot. VMware recommends at least 16 slots (1600MB) of disk space if servers share a diagnostic partition.

If there is only one diagnostic slot on the device, all ESX Server systems sharing that device map to the same slot. This can easily create problems. If two ESX Server systems perform a core dump at the same time, the core dumps are overwritten on the last slot on the diagnostic partition.

If you allocate enough memory for 16 slots, it is unlikely that core dumps are mapped to the same location on the diagnostic partition, even if two ESX Server systems perform a core dump at the same time.”

In other words: when a server does a dump, it uses one of the 16 ‘slots’ available on the 1.6GB diagnostic partition. In fact, the slot a server uses depends on the machine's UUID. If we check the manual page for vmkdump, we find:

“On dump partitions that reside on shared storage, multiple machines can share the dump partition. Because of this, the dump partition is split up into multiple slots. Generally, when retrieving a core dump, the physical machine's UUID is hashed into a slot number based the total number of slots, and the compressed dump is automatically read from that slot.”
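The slot arithmetic and the UUID-to-slot mapping described in the two quotes can be sketched as follows. This is an illustration under stated assumptions: the 100MB slot size comes from the quoted guide, but the hash that vmkdump actually uses is not given in the sources quoted here, so SHA-1 stands in purely as a placeholder. The function names are hypothetical, and the collision estimate assumes the hash spreads UUIDs uniformly over the slots.

```python
import hashlib
import uuid

SLOT_SIZE_MB = 100  # space per server, as stated in the quoted guide


def slot_count(partition_mb: int) -> int:
    """Number of whole diagnostic slots that fit on the partition."""
    return partition_mb // SLOT_SIZE_MB


def slot_for(machine_uuid: str, slots: int) -> int:
    """Map a machine UUID to a slot index in [0, slots).

    Assumption: SHA-1 is a stand-in; the real vmkdump hash is not documented here.
    """
    digest = hashlib.sha1(uuid.UUID(machine_uuid).bytes).digest()
    return int.from_bytes(digest[:4], "big") % slots


def collision_probability(servers: int, slots: int) -> float:
    """Chance that at least two servers share a slot, assuming a uniform hash."""
    p_unique = 1.0
    for k in range(servers):
        p_unique *= max(slots - k, 0) / slots
    return 1.0 - p_unique


if __name__ == "__main__":
    slots = slot_count(1600)
    print(slots)
    print(slot_for("12345678-1234-5678-1234-567812345678", slots))
    print(collision_probability(2, slots))
```

With a single slot every UUID maps to slot 0, which matches the guide's warning that all servers then dump to the same place. With 16 slots and a uniform hash, two given servers share a slot with probability 1/16 (0.0625), so an overwrite additionally requires both of them to crash before the first dump has been retrieved.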

 




 