Small solution to an Enterprise problem

Given a set of VMs representing possible client configurations, a set of Selenium test scripts, and a collection of multiprocessor servers, how can the process of running all tests against all VMs be automated and scaled to available processing capacity?

On-demand tests against a developer’s copy of the product would also be useful, as would a means of testing computers for compatibility at the client’s site.

Back in May I wrote about the script-based VM management system I put together in preparation for automated UI testing. It is sufficient for building our test server and managing our production builds (which it has been doing very well for over a month now), but it’s become apparent that maintaining the scripts necessary to run a testing farm will quickly become impossible. Powershell’s great for small stuff, but something this big needs some serious software behind it.

There are existing solutions out there which do what we need, e.g. Surgient’s Virtual Lab Management Applications or VMware’s own Lab Manager, but they tend to be aimed primarily at larger businesses that have the infrastructure or the budget (usually both) for Enterprise-class solutions. We don’t have the space or staff to run a datacenter, and a managed solution doing the amount of processing we need does not come cheap.

So the plans were laid for building a smaller scale solution to the problem of automating tests across multiple VMs. We stuck with VMware, because our existing infrastructure is based upon their free Server product and the competing products weren’t competitive enough for converting to be worthwhile. The target machines for this system were to be three multi-CPU rack servers, which will later also take on the burdens of the three glorified workstations that currently occupy the server cupboard.

We looked for a way to abstract the VM capacity of the physical servers and treat them as a computing cloud. The original plan was to write some daemons in .NET that would use VMware’s VIX API to talk to the VMware Server instances, but VIX functions often take callbacks and reverse P/Invoke didn’t want to play nicely. While it might’ve been possible to write the daemons in C/C++ instead, it was far cheaper just to fork out for VirtualCenter 1.4 and learn to use the Virtual Infrastructure Management (VIM) SDK from C#. This basically reduced our complex distributed solution to a single service that could use VirtualCenter to do its dirty work.

After a little research it was discovered that switching to ESX Server would cost too much after factoring in VirtualCenter 2 and the necessary hardware (SANs do not come cheap), although it would have let us use version 2 of the VIM SDK, which has proper C# bindings. It was decided that our computing cloud would run a lightweight Linux distro to minimise the overhead of the native OS on our cloud’s nodes. Gentoo was suggested, but I wasn’t inclined to spend eight hours installing it when Ubuntu Server is so much quicker and simpler. We have an evaluation copy of ESX somewhere; at some point I should probably determine exactly how much more efficient a hypervisor would make our computing cloud.

My initial reaction to the version 1.4 VIM SDK was not entirely favourable. I wasn’t happy about the use of HTTP webservices for everything, since this means that the client has to pull updates down from the server via a message loop and there was no C# proxy API provided. Implementing that message loop gave me a headache to begin with; it’s uncharted territory for a SOAP newbie like me :P. But after a little hard work and some background reading I began to understand the protocol and the sense behind it, and now we have a nice extensible C# proxy for it which handles object synchronisation entirely in the background.

(I’m still confused about the use of XML diffs. Surely if minimisation of network traffic is a priority, XML isn’t the best format to use anyway?)
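The pull-based synchronisation described above can be sketched roughly as follows. This is not the Managed.VIM code and not the VIM wire protocol: the `FakeServer` class, its `get_updates` method, and the shape of the change list are all hypothetical stand-ins, and it is written in Python purely for brevity. It only shows the pattern: the client remembers a version token, asks for whatever changed since then, and merges that diff into its local proxy objects.

```python
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """Hypothetical stand-in for the web service: a log of property changes."""
    log: list = field(default_factory=list)  # entries: (object_id, property, value)

    def change(self, object_id, prop, value):
        self.log.append((object_id, prop, value))

    def get_updates(self, since):
        """Return (new_version, changes recorded after version `since`)."""
        return len(self.log), self.log[since:]


class ProxyCache:
    """Client-side mirror of server objects, kept current by polling."""

    def __init__(self, server):
        self.server = server
        self.version = 0
        self.objects = {}  # object_id -> {property: value}

    def poll_once(self):
        """One turn of the message loop: fetch the diff and merge it."""
        self.version, changes = self.server.get_updates(self.version)
        for object_id, prop, value in changes:
            self.objects.setdefault(object_id, {})[prop] = value
        return len(changes)


server = FakeServer()
cache = ProxyCache(server)
server.change("vm-1", "power_state", "off")
server.change("vm-1", "power_state", "on")
cache.poll_once()
print(cache.objects["vm-1"]["power_state"])
```

In a real proxy, `poll_once` would run on a background thread so that client code just reads the cached objects.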

Once we’d determined which parts of our original design worked and what could replace the bits that didn’t, a new design was drawn up. It mutated slightly over the following week or so, but has stayed broadly the same.

VirtualTest design

  • The TestBuilder generates ISO images containing the necessary programs and data to run the tests. These ISOs will autoplay if attached to an active VM. VIM doesn’t seem to support attaching devices while a VM is running, however, so each VM includes a service that autoplays an attached ISO at system start time.
  • The Resource Manager handles scheduling of jobs. It knows nothing about these jobs except the resources they consume. The Resource Controller (the Managed.VIM namespace, used for talking to VirtualCenter) eventually got split out of this module and became a separate entity.
  • VirtualTest runs as a service and links everything together.
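To make the Resource Manager's role concrete, here is a minimal sketch of a scheduler that knows nothing about its jobs except the capacity they consume. It is an illustration in Python, not the VirtualTest implementation; the class name, the FIFO policy, and the job names are assumptions.

```python
from collections import deque


class ResourceManager:
    """Starts queued jobs whenever enough capacity is free.

    It never looks inside a job; it only sees the cost the job declares.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0
        self.queue = deque()   # pending (name, cost) pairs
        self.running = {}      # job name -> cost

    def submit(self, name, cost):
        if cost > self.capacity:
            raise ValueError(f"{name} can never fit: needs {cost}, have {self.capacity}")
        self.queue.append((name, cost))
        return self.dispatch()

    def finish(self, name):
        """Give a finished job's capacity back, then try to start more jobs."""
        self.in_use -= self.running.pop(name)
        return self.dispatch()

    def dispatch(self):
        """Start jobs in FIFO order until the head of the queue no longer fits."""
        started = []
        while self.queue and self.in_use + self.queue[0][1] <= self.capacity:
            name, cost = self.queue.popleft()
            self.running[name] = cost
            self.in_use += cost
            started.append(name)
        return started


manager = ResourceManager(capacity=2)     # e.g. room for two VMs at once
first = manager.submit("job-a", 1)
second = manager.submit("job-b", 1)
third = manager.submit("job-c", 1)        # has to wait for a free slot
after_finish = manager.finish("job-a")    # frees a slot, so job-c starts
print(first, second, third, after_finish)
```

Strict FIFO is the simplest policy to reason about, at the cost of letting a large job at the head of the queue hold up smaller ones behind it.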

The use of an ISO to inject tasks into a VM is rather similar to the ‘parameter disk’ concept used in our Powershell-based system. It’s also the technique VMware use to install their Tools package on a VM, which incidentally is where I got the idea in the first place…

The design above is actually split into two parts. The testing system consists of the dashboard, the VM, and the contents of the ISO. VirtualTest (the green part) is really just a job control system dealing with VM processing resource; TestBuilder would be better named IsoBuilder, and we’ll have to rename VirtualTest too. Strangely, the potential flexibility of this system only really became clear once the design had been adjusted to make it fully testable. Yet another victory for TDD, I feel.

This split also made dividing the workload simple. I worked on the VirtualTest system, and Ben has been developing the testing framework using Ruby and Selenium. Integrating the two systems consists of implementing a Job object that will build and deploy the appropriate ISO to the VM; apart from that, neither system need know anything about the other, since test results are uploaded to the dashboard by the testing software running from the ISO. We also get for free a way to do on-site testing of client configurations, because the ISOs can be burned to CD and run on any machine.
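The integration point described above might look something like this. It is a sketch under assumptions, in Python rather than the C# of the real system: `Job`, `FakeVM`, `IsoTestJob`, and the `build_iso` callable are hypothetical names, and the VM handle is a fake rather than anything backed by VirtualCenter. The ISO is attached before power-on, matching the constraint that devices can't be attached to a running VM.

```python
from abc import ABC, abstractmethod


class Job(ABC):
    """All the job control system needs to know about a unit of work."""

    @abstractmethod
    def run(self, vm):
        """Do the work against the given VM."""


class FakeVM:
    """Hypothetical stand-in for a VM handle."""

    def __init__(self, name):
        self.name = name
        self.attached_iso = None
        self.powered_on = False

    def attach_iso(self, path):
        if self.powered_on:
            raise RuntimeError("cannot attach an ISO to a running VM")
        self.attached_iso = path

    def power_on(self):
        self.powered_on = True


class IsoTestJob(Job):
    """Builds an ISO for a test suite and boots a VM with it attached."""

    def __init__(self, build_iso, suite):
        self.build_iso = build_iso  # callable: suite name -> path of the built ISO
        self.suite = suite

    def run(self, vm):
        iso_path = self.build_iso(self.suite)
        vm.attach_iso(iso_path)  # must happen while the VM is powered off
        vm.power_on()            # the in-guest service autoplays the ISO at startup
        return iso_path


vm = FakeVM("client-config-1")
job = IsoTestJob(build_iso=lambda suite: f"{suite}.iso", suite="smoke")
print(job.run(vm), vm.powered_on)
```

Nothing here reports results back, which is deliberate: in the design above, the software on the ISO uploads results to the dashboard itself.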

At present, VirtualTest is not quite ready for deployment. The CLI tools do not yet work across TCP (I may well solve this by dropping Remoting and just using WCF instead) and all they do is add a simple demo job to the queue. The ISO builder for tests has not yet been implemented. Support for on-demand tests is on hold until we actually have something working.

The core of the VirtualTest system demonstrably works, however, and will soon take over from the Powershell scripts that currently schedule our nightly builds. It’s almost complete enough for our needs, but the following improvements could also be made:

  • Fix resource management so resources can be assigned properly. Resources requested by a job and resources viewed as allocated to that job should be the same thing, but presently the resource manager isn’t smart enough to remember how much is allocated, and instead just looks at how much is being used.
  • Fix the whole thing so multiple resource types can be managed. The only resource it understands is ‘VM capacity’. It’d be nice to be able to track registration key usage, so we could make optimal use of Windows product keys and ensure that only one active VM is using a given key at a time; currently every VM has to have a separate key whether it’s running or not, which is somewhat wasteful.
  • It’d also be nice to allocate CPU and memory intelligently instead of abstracting these as ‘VM capacity’, but that’s probably out of our reach.
  • Fix the whole thing so jobs can be scheduled dependent on multiple resource types. This ties in with the registration key thing.
  • Make the Managed.VIM proxy safe for multithreaded use without requiring lots of locking in client code.
  • Ensure that Managed.VIM works seamlessly with VIM version 2, so migration to ESX Server is easy.
  • Guarantee that all Job objects are serialisable, allowing them to be instantiated by a client app and sent across the wire to the VirtualTest service. At present, the Remoting interface demands a Type and an array of constructor arguments, which is horrible.
  • Improve unit test coverage. The Managed.VIM assembly is mostly untested. We need integration tests for this as well.
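The first few items on that list amount to keeping a ledger of what each job has been granted, per resource type, instead of inferring it from current usage. A minimal Python sketch of the idea follows; the class, the resource names, and the all-or-nothing policy are assumptions for illustration, not a description of how VirtualTest will do it.

```python
class AllocationLedger:
    """Remembers what each job was granted, for any number of resource types."""

    def __init__(self, totals):
        self.totals = dict(totals)  # resource type -> total units available
        self.allocated = {}         # job id -> {resource type: units granted}

    def free(self, resource):
        """Units of `resource` not currently granted to any job."""
        granted = sum(a.get(resource, 0) for a in self.allocated.values())
        return self.totals.get(resource, 0) - granted

    def try_allocate(self, job_id, request):
        """Grant the whole request or nothing, so a job never holds half of it."""
        if job_id in self.allocated:
            raise ValueError(f"{job_id} already holds an allocation")
        if any(units > self.free(res) for res, units in request.items()):
            return False
        self.allocated[job_id] = dict(request)
        return True

    def release(self, job_id):
        """Return everything the job was granted."""
        return self.allocated.pop(job_id)


# One product key modelled as a resource with a single unit (hypothetical names).
ledger = AllocationLedger({"vm_capacity": 4, "product_key": 1})
got_a = ledger.try_allocate("job-a", {"vm_capacity": 1, "product_key": 1})
got_b = ledger.try_allocate("job-b", {"vm_capacity": 1, "product_key": 1})
ledger.release("job-a")
got_b_retry = ledger.try_allocate("job-b", {"vm_capacity": 1, "product_key": 1})
print(got_a, got_b, got_b_retry)
```

Treating a registration key as a resource with one unit gives the "only one active VM per key" rule without any special casing.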

I’m going to move the codebase over to Sourceforge as soon as possible, at which time I’ll blog again. The ‘testing’ component of this system is Ben’s domain, so I’ll let him explain it.

UPDATE: We now have a Sourceforge project, called VirtualCloud. The codebase has been migrated and the above improvements added to the Feature Request tracker.