VMs Frozen After a Mass Reboot

I manage a virtual call center of 400+ VMware workstations hosted on a cluster of ESX 3.0 servers. To make sure all users are logged out, any hung processes are killed, and locally cached profiles are deleted, every workstation has a scheduled task to reboot once per week. In a physical call center (which I also help manage), this taxes the DHCP server a bit, but otherwise doesn’t hurt anything.

The ESX server concept assumes that most virtual machines will not be running full tilt all the time. That’s usually true. Most computers, most of the time, run at well below 50% capacity. The value of the ESX server lies in sharing those normally fallow resources out among multiple virtual machines. If the average CPU and memory utilization on 100 Windows XP desktops is around 50%, then the ESX server only needs 50% of the actual physical resources to service the same number of virtual machines. When one machine is running at 95%, another is probably only running at 5%. ESX dynamically allocates CPU cycles and memory to whichever VM needs them. This works great until you have a scheduled job that launches on all 400 virtual machines at the same time. Rebooting takes a lot of resources.
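To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It is not a model of how ESX actually schedules resources; the function name is made up for illustration, and the assumption that a rebooting VM wants close to 100% of its allotted resources is mine.

```python
def required_capacity(vm_count, avg_utilization):
    """Physical capacity needed, measured in 'fully busy desktops'."""
    return vm_count * avg_utilization

vm_count = 100

# Host sized for the normal case: 100 desktops averaging 50% utilization.
provisioned = required_capacity(vm_count, 0.50)

# Assumption: during a reboot, each VM wants roughly 100% of its share.
reboot_demand = required_capacity(vm_count, 1.00)

# How far demand overshoots what the host was sized for.
oversubscription = reboot_demand / provisioned

print(f"Provisioned for {provisioned:.0f} desktops' worth of load")
print(f"Simultaneous reboot asks for {reboot_demand:.0f}")
print(f"Demand is {oversubscription:.1f}x the provisioned capacity")
```

Under those assumptions, a simultaneous reboot asks for about twice what the host was sized to deliver.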

Needless to say, once every week all the ESX servers went Red. By itself, that’s still not such a big deal, just annoying. Unfortunately, like every other battery caught in the Matrix, the virtual machines don’t know they’re virtual and really don’t like being denied access to what they believe are their exclusive resources. Sometimes they are so unhappy that they go on strike.

The lesson? Stagger scheduled tasks so that every virtual machine is not running at full capacity all at the same time. It’s a little harder to configure, but it will save you time in end user support.
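One way to work out a staggered schedule is to split the machines into small batches and give each batch its own start time. The Python sketch below only computes the times; it does not touch the Windows task scheduler. The VM names, batch size, gap, and start date are all hypothetical, so pick values that suit your own cluster.

```python
from datetime import datetime, timedelta

def stagger_schedule(vm_names, start, batch_size, gap_minutes):
    """Map each VM name to a reboot time, one batch every gap_minutes."""
    schedule = {}
    for index, name in enumerate(sorted(vm_names)):
        batch = index // batch_size  # 0 for the first batch, 1 for the next, ...
        schedule[name] = start + timedelta(minutes=batch * gap_minutes)
    return schedule

# Hypothetical inventory: 400 workstations with made-up names.
vms = [f"callcenter-vm-{n:03d}" for n in range(1, 401)]

# Hypothetical maintenance window starting at 2:00 AM.
schedule = stagger_schedule(
    vms,
    start=datetime(2024, 1, 7, 2, 0),
    batch_size=20,
    gap_minutes=10,
)

for name in ("callcenter-vm-001", "callcenter-vm-021", "callcenter-vm-400"):
    print(name, schedule[name].strftime("%H:%M"))
```

With these example values, no more than 20 machines reboot at once and the whole run is spread over a little more than three hours instead of hitting the ESX servers all at the same moment.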
