Automated UI testing is difficult, especially at a company that moves as fast as TripAdvisor. It seems that every week we have new features and UIs rolling out the door. Unfortunately, with all this development, there are bound to be some bugs that escape our tests. TripAdvisor currently serves 37 points of sale with over 8 million locations, so comprehensive manual testing is frequently not an option. To aid us in delivering quality, bug-free updates, I’ve developed an internal tool called VDiff.
For automated UI testing there exists a spectrum of rigor. On one end, you have tests that look for the most basic of indicators. For example, a test could be: load a page and make sure its title is present. This end of the spectrum can catch your massive site-breaking issues but misses most of the more subtle ones. On the other end of the spectrum you have the stringent tests. These often exercise interactions and logic on the site, but have the fatal flaw that, in a changing world, they tend to break often. Ideally we’d want to be somewhere in the middle of this spectrum: tests that can find subtle errors involving interaction, but that don’t fail so often that everyone ignores them.
VDiff starts with a simple assumption: everything, as it currently exists, is correct. This assumption drastically simplifies the testing process since we no longer need to confirm that all of our interactions are correct; all we need to test is that nothing has changed. This is where our first tool comes in: PhantomJS. PhantomJS is a headless WebKit browser with a scriptable interface. Most importantly, it allows us to take screenshots of whatever page it is looking at. This makes up the core of VDiff: we invoke PhantomJS to crawl our website and take pictures, then compare those pictures to previous crawls of the website, and if they are different then we have a bug! This, of course, isn’t completely true. For example, the number of reviews a location has may increase, causing some pixels to be different. This is why the majority of the work that has gone into VDiff has been toward lowering the noise floor so we can quickly assess whether there is an issue.
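To make the crawl-and-screenshot idea concrete, here is a minimal sketch of how a driver might generate a PhantomJS script that loads a page and saves a screenshot. The function name and the wrapper are illustrative, not VDiff’s actual internals; only the PhantomJS `webpage` API calls (`page.open`, `page.render`) are real.

```python
# Sketch: produce the source of a PhantomJS script that screenshots a URL.
# make_screenshot_script is a hypothetical name, not part of VDiff.

def make_screenshot_script(url, out_path):
    """Return PhantomJS (JavaScript) source that renders `url` to `out_path`."""
    return """
var page = require('webpage').create();
page.open(%r, function (status) {
    if (status === 'success') {
        page.render(%r);  // save a PNG of the fully rendered page
    }
    phantom.exit(status === 'success' ? 0 : 1);
});
""" % (url, out_path)

script = make_screenshot_script("https://www.tripadvisor.com", "shot.png")
```

In practice a script like this would be written to disk and executed with `phantomjs script.js`, once per page in the crawl.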
First, before we can do anything, we need to tell VDiff exactly what to do on the website. To accomplish this, we have an interactive test editor in which each test is a directed acyclic graph of actions. For example:
Tests consist of a couple of nodes that set up the parameters of the test (for example, which points of sale to test) followed by a set of action nodes covering everything a user may do on the website (clicking, typing, loading pages, and so on). The interactive editor includes a preview pane that lets us try out new tests before they’re saved. Once we’re satisfied that a test has been written properly, we hit save and it’s automatically included in the next run.
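One way to picture the parameter-then-action structure is as a small data definition where each parameter node fans out into one concrete test case per value. The schema below (field names, selectors, paths) is entirely hypothetical, assumed only for illustration:

```python
# Illustrative encoding of a VDiff-style test: parameter nodes fan out,
# action nodes describe what the crawler does. Not VDiff's real format.
from itertools import product

test = {
    "params": {"point_of_sale": ["www.tripadvisor.com", "www.tripadvisor.co.uk"]},
    "actions": [
        {"do": "load", "path": "/Hotels"},          # hypothetical path
        {"do": "click", "selector": "#sort_button"}, # hypothetical selector
        {"do": "screenshot"},
    ],
}

def expand(test):
    """Yield one (parameter assignment, actions) pair per combination."""
    names = sorted(test["params"])
    for values in product(*(test["params"][n] for n in names)):
        yield dict(zip(names, values)), test["actions"]

cases = list(expand(test))
# Two points of sale -> two concrete test cases sharing the same actions.
```

Fanning tests out this way is what later lets the output be grouped back together by parameter.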
Now that we have our test cases, VDiff generates a set of PhantomJS scripts for them and executes those scripts. This produces our snapshot of the website. Next we need to compare it against the previous crawl, which is where our next key component comes in. If we did a simple image diff, we’d get a lot of false positives from portions of the site expected to update frequently, like review counts, booking confirmation numbers, and ratings. To solve this, we’ve introduced the concept of a pixel weight image, which records how frequently each pixel on the site changes. Over time VDiff has learned that the pixels around fields such as the review counts are expected to differ, and it weights them lower than the rest. To accomplish that, we needed our own custom image-diffing program that could factor in the pixel weight images. Originally a Python script performed this task, but we’ve recently ported it to Rust, which has netted us a 17x performance improvement.
Pixel weight image showing the booking confirmation number:
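The pixel-weight idea can be sketched in a few lines. Real VDiff operates on whole images (and now in Rust); this version uses flat lists of pixel values to stay self-contained, and the function names and decay factor are assumptions, not VDiff’s actual algorithm:

```python
# Sketch of weighted diffing: pixels that changed in past crawls accumulate
# weight, and the diff score discounts them. Flat lists stand in for images.

def update_weights(weights, old, new, decay=0.9):
    """After each crawl, raise the weight of every pixel that differed."""
    return [w * decay + (1.0 if a != b else 0.0)
            for w, a, b in zip(weights, old, new)]

def weighted_diff(weights, old, new):
    """Score a diff, discounting pixels that historically change often."""
    return sum(1.0 / (1.0 + w)
               for w, a, b in zip(weights, old, new) if a != b)

# A review-count pixel that flips every crawl soon carries a high weight,
# so a change there barely moves the score; a change in a historically
# stable pixel contributes close to 1.0.
```

This is why, over time, the system stops flagging the booking confirmation number while still catching changes in regions that are normally stable.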
The next way we reduce the noise floor is by grouping and sorting the images in the output. For example, a bug that made the background of every hotel page orange would produce hundreds of failed test cases in the output, increasing the risk that other bugs slip past while we’re focused on the hotel pages. To solve this, we group all the test cases by their parameters in the directed acyclic graph. For example, if a test case loads a page on all of our points of sale, the output shows only one of those tests (whichever had the most differing pixels) and allows our engineers to expand that test case to see the rest. This, combined with sorting the output by the number of differing pixels, allows our engineers to analyze the output of all of VDiff’s tests in only a couple of minutes.
After clicking to expand a group of tests:
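The group-then-sort step might look like the following sketch, where test cases that differ only in a fan-out parameter collapse to their worst offender. The record fields (`test`, `pos`, `diff_pixels`) are illustrative, not VDiff’s real output format:

```python
# Sketch: collapse each group of parameterized test cases to the one with
# the most differing pixels, then sort groups so the worst appear first.
from collections import defaultdict

results = [
    {"test": "hotel_page",  "pos": "www",   "diff_pixels": 5},
    {"test": "hotel_page",  "pos": "co.uk", "diff_pixels": 9123},
    {"test": "review_form", "pos": "www",   "diff_pixels": 40},
]

def group_results(results):
    groups = defaultdict(list)
    for r in results:
        groups[r["test"]].append(r)
    # Representative for each group = the case with the most differing pixels.
    summary = [max(g, key=lambda r: r["diff_pixels"]) for g in groups.values()]
    return sorted(summary, key=lambda r: r["diff_pixels"], reverse=True)

top = group_results(results)
```

An engineer scanning `top` sees one row per logical test, worst first, and can expand a row to inspect the sibling cases it hides.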
The last way we reduce noise for our engineers is by supporting diffs between any two arbitrary snapshots of the site. VDiff normally snapshots the site daily, but an engineer who would rather look at the output once a week can go into the interface and select a snapshot from a week ago to diff against. This ability to diff arbitrary snapshots also allows engineers to run VDiff against their own code before it’s even committed.
Menu to select any two snapshots to diff:
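Under the hood, snapshot selection amounts to keying crawls by date and letting the user pick any pair. The dates, store, and helper below are hypothetical, just to show the shape of the lookup:

```python
# Sketch: snapshots keyed by crawl date; an engineer picks any two to diff,
# e.g. today against a week ago. All names and dates are illustrative.
from datetime import date, timedelta

snapshots = {date(2014, 6, d): "snapshot-%02d" % d for d in range(1, 15)}

def pick_pair(snapshots, newer, days_back=7):
    """Return (older, newer) snapshot ids spanning `days_back` days."""
    older = newer - timedelta(days=days_back)
    return snapshots[older], snapshots[newer]

pair = pick_pair(snapshots, date(2014, 6, 14))
```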
VDiff has been a great success in its short life of less than a year. It takes over 2,000 pictures of the mobile website each day and has caught tens of bugs that would otherwise have gone live. In the near future we’re planning to expand it to the desktop website, add support for screenshotting specific elements instead of the whole page, and add distributed crawling to make it more convenient for engineers to VDiff their own servers.