A rescue mission!
What a day yesterday!
One of our customers has a primary/secondary DRBD/VMware system very similar to ours, set up for their London HQ operations. This means one server has the VMs loaded, whilst another sits in warm fail-over mode, courtesy of DRBD mirroring of the VMs.
It’s a bit of a resource drain (power mostly, as space wasn’t much of an issue) but it does mean that the VMs can be resumed on another machine really quite quickly, minimising any downtime in the event of a catastrophic host failure.
Well, two days ago, the system experienced one such failure (we’re not sure what it was; perhaps a kernel panic...) and refused to reboot when power-cycled. A few attempts were made by one of my colleagues to diagnose the issue via IPMI, but this just wasn’t as useful as we’d have hoped (difficulties displaying the AMI BIOS and the GRUB boot-loader, to name the two main woes) and eventually we resorted to bringing the VMs up on the spare machine.
The problem here is that, at some point, the primary machine (now inactive) had been upgraded from 4GiB to 8GiB of RAM. That is a costly upgrade at the best of times (all memory fitted to these server boards must be either Registered ECC or FB-DIMM, depending on the board’s age), which meant that the decision, made before my time, to upgrade one host without upgrading the backup was now quite a problem.
VMs can consume a monumental amount of memory, be it through allocation of the host’s memory to the guest, or through the host OS’s caching of frequently-accessed VM disk data (which the kernel sees as ‘just another large file’, for want of a better explanation). This is particularly noticeable when one attempts to run a Windows 2003 Server VM with both MSSQL Server 2005 and Idiom’s [hideous|inefficient|bloated|pick-one] WorldServer software. Not a great idea…
As a result, that VM in particular hogs the host machine’s resources, and when we attempted to start a virtual machine with 3.6GiB of RAM allocated to it, on a host with only 4GiB in total... we had a few performance problems. Uh-oh!
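For a rough feel of why that went so badly, here is a back-of-the-envelope sketch in Python. The 3.6GiB guest and the 4GiB/8GiB hosts are the real figures; the 0.5GiB set aside for the host OS is purely an assumed number for illustration, not something that was measured, and the function name is made up for this example.

```python
def headroom_gib(host_ram_gib, guest_ram_gib, host_overhead_gib=0.5):
    """Memory left for the host's page cache once the guest's allocation
    and an ASSUMED host OS overhead have been subtracted."""
    return host_ram_gib - guest_ram_gib - host_overhead_gib


# Compare the spare host (4GiB) with the upgraded primary (8GiB).
for host in (4.0, 8.0):
    left = headroom_gib(host, 3.6)
    print(f"{host:.0f}GiB host: {left:+.1f}GiB left over")
```

With any plausible overhead figure, the 4GiB host has next to nothing left for caching the VM’s disk image (or ends up swapping), whereas the 8GiB host has a few GiB to spare.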
So yesterday, I had another 4GiB of memory delivered to the customer’s office, and set off from Birmingham to London at around 7:45am. I was on-site by about 9am and, much to my surprise, the C.E.O. had already fitted the memory for me! Well, that was nice of him?
However, I did spend the entire day (and quite a bit longer) diagnosing the issues with the primary host (which, incidentally, runs Gentoo Linux). On top of this, I’ve also now catalogued the hardware specifications of the machines (our records of which were somewhat lacking), and even set up an old APC PDU that was lying around, which should give us the ability to power-cycle the machines in future without the prerequisite telephone call to a random member of staff, in which I ask them to ‘hold the power button for 4 seconds’ (whilst praying that they’ve found the right machine).
A good, if tiring, day, given that I didn’t get home until 11pm! At least the taxis and food are all on expenses.