Scaling Selenium Test Execution

Selenium is a fantastic, but challenging, tool for end-to-end, browser-based testing of web applications. Most people with a background in unit testing who try Selenium bemoan how slow it is: tests are slow to execute, slow to develop, and slow to debug. However, fully automating user-centric black-box testing through the browser is one of the holy grails of testing. It dramatically increases your confidence that code will run correctly in production.

To deal with the time it takes to run Selenium tests, it is necessary to run them in a distributed and concurrent way. Although Selenium can run browsers in parallel using Selenium Grid (built into Selenium 2), it does not provide an infrastructure for distributing the test execution itself.

In my current job, we run over 580 tests with roughly 20 hours of test time in a time window of about 40 minutes (plus some setup time). This includes a significant number of tests that must run in isolated environments because they cannot be run concurrently with other tests (they change the server time or configuration). We've achieved this by fully distributing test execution across many servers.
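As a rough back-of-the-envelope check on those figures, the short calculation below works out how much parallelism the build has to sustain. It assumes exactly 580 tests, 20 hours of test time, and a 40-minute window, ignores setup time, and uses variable names made up for illustration.

```python
# Figures taken from the paragraph above (rounded; setup time is ignored).
test_count = 580
total_test_minutes = 20 * 60   # roughly 20 hours of test time
wall_clock_minutes = 40        # the build window

# Average number of tests that must be in flight at any moment.
effective_parallelism = total_test_minutes / wall_clock_minutes

# Average duration of a single test, in minutes.
average_test_minutes = total_test_minutes / test_count

print(f"effective parallelism: {effective_parallelism:.0f}x")
print(f"average test length: {average_test_minutes:.1f} minutes")
```

In other words, about 30 tests have to be running at once on average, and each test takes a little over two minutes.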

The central server in this distributed system is our main integration server. Its responsibility is to start builds and manage Selenium Grid, the collection of remote browsers that execute tests. The main integration server runs Cruise Control, although it could easily run another distributed continuous integration system such as Jenkins or Hudson.

When a build request comes in, the main integration server starts a build on one of our designated 'king' servers. A king server is configured as a distributed Cruise Control build server and contains an environment that corresponds either to the trunk development line in our code repository or to a development branch. The development branch servers are assigned to individual development teams, so that each team can run a build at any time.

The trunk king server plays a special role in maintaining an up-to-date version of the database. It receives a copy of production data each week, which is used in the build system. Any pending database change scripts are executed against this king server on each deploy, and the server runs the nightly cron jobs.

The other king servers, which are allocated one per development team, clone a snapshot of the trunk king server database on deploy. After they clone and mount the database, any additional deploy scripts that originated on the code branch are applied. Then the full build is executed.

The full build consists of both unit and Selenium tests. The unit tests are executed locally, whereas the Selenium tests are executed on a third type of server called a 'serf' (i.e., subservient to the king). The king starts one or more serfs (currently we run 12), and the serfs in turn clone a snapshot of their king, deploy code, and query the king for available Selenium tests. The serfs continue running Selenium tests until none remain.
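The pull model described above can be sketched in a few lines. This is only an illustration, not our actual build code: an in-process queue stands in for the king, threads stand in for the serfs, and the names run_serf, run_build and run_test are hypothetical. Real serfs would query the king over the network.

```python
import queue
import threading


def run_serf(name, tests, results, lock, run_test):
    """Pull tests from the shared queue until none remain."""
    while True:
        try:
            test = tests.get_nowait()
        except queue.Empty:
            return  # no tests left, so this serf is done
        outcome = run_test(test)
        with lock:
            results.append((name, test, outcome))


def run_build(test_names, serf_count, run_test):
    """Stand-in for the king: queue the tests, start the serfs, wait."""
    tests = queue.Queue()
    for test in test_names:
        tests.put(test)

    results = []
    lock = threading.Lock()
    serfs = [
        threading.Thread(
            target=run_serf,
            args=(f"serf-{i}", tests, results, lock, run_test),
        )
        for i in range(serf_count)
    ]
    for serf in serfs:
        serf.start()
    for serf in serfs:
        serf.join()
    return results


if __name__ == "__main__":
    names = [f"test_{i}" for i in range(50)]
    outcomes = run_build(names, serf_count=12, run_test=lambda t: "passed")
    print(len(outcomes), "tests executed")
```

Because each serf asks for its next test only when it finishes the previous one, a slow test ties up just one serf while the others keep draining the queue.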

The serfs are capable of running either concurrent or non-concurrent tests. Non-concurrent tests are those that either need to change the server time (to simulate behavior in production) or modify configuration. Given the complexity of our current software, there are quite a few non-concurrent tests. It is easy to run concurrent tests in parallel on a single server, but non-concurrent tests each need an environment to themselves. Parallelizing the non-concurrent tests across separate environments was the single most important step in speeding up the builds.

To recap, our infrastructure looks like this:

1 Integration server (running Cruise Control and Selenium Grid)
1 Trunk king server
2 Team king servers
12 Serf servers
20 Remote desktops running Selenium 2 browsers
1 Dell SAN (providing ability to snapshot and clone disk volumes)

All of these servers are virtual machines.