Groundhog Day and A/B Testing

Jeff Atwood recently made a fascinating observation about the similarities between the classic film Groundhog Day and A/B Testing.

In case you’ve only recently emerged from a hermit-like existence, Groundhog Day is a film about Phil (played by Bill Murray). It seems that Phil has been doomed (or is it blessed?) to live the same day over and over again. No matter what he does during this day, he always wakes up at 6 am on Groundhog Day. In the film, we see the same day repeated over and over again, but only in bits and pieces (usually skipping repetitive parts). The director of the film, Harold Ramis, believes that by the end of the film, Phil has spent the equivalent of about 30 or 40 years reliving that same day.

Towards the beginning of the film, Phil does a lot of experimentation, and Atwood’s observation is that this often takes the form of an A/B test. The concept is perhaps a little more esoteric, but the principles are easy. Let’s take a simple example from the world of retail. You want to sell a new ring on a website. What should the main image look like? To keep things simple, let’s say you narrow it down to two concepts: a closeup of the ring all by itself, or a shot of a model wearing the ring. Which image do you use? We could speculate on the subject for hours and even rationalize some pretty convincing arguments one way or the other, but it’s ultimately not up to us. In retail, it’s all about the customer.

You could “test” the concept in a serial fashion (one image first, then the other), but ultimately the two sets of results would not be comparable. The ring is new, so whichever image is used first would get an unfair advantage, and so on. The solution is to show both images during the same timeframe. You do this by splitting your visitors into two segments (A and B), showing each segment a different version of the image, and then tracking the results. If the two images do, in fact, cause different outcomes, and if you get enough people to look at the images, it should come out in the data.
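For the more code-minded, here is a rough sketch of what "splitting your visitors into two segments" and "tracking the results" might look like in Python. Everything in it is made up for illustration: the experiment name `ring-image`, the `assign_variant` function, and the `ExperimentTally` class are all hypothetical, and a real site would persist these counts somewhere rather than keep them in memory. The one idea worth noticing is that the segment is derived from a hash of the visitor ID, so the same visitor always sees the same image.

```python
import hashlib
from collections import Counter


def assign_variant(visitor_id: str, experiment: str = "ring-image") -> str:
    """Deterministically place a visitor in segment A or B.

    Hashing the (hypothetical) experiment name together with the visitor ID
    means a returning visitor always lands in the same segment, and the
    split comes out close to 50/50 over many visitors.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


class ExperimentTally:
    """In-memory counts of views and purchases per segment (illustration only)."""

    def __init__(self) -> None:
        self.views = Counter()
        self.purchases = Counter()

    def record_view(self, visitor_id: str) -> str:
        # Returns the segment so the caller knows which image to show.
        variant = assign_variant(visitor_id)
        self.views[variant] += 1
        return variant

    def record_purchase(self, visitor_id: str) -> None:
        self.purchases[assign_variant(visitor_id)] += 1

    def conversion_rate(self, variant: str) -> float:
        views = self.views[variant]
        return self.purchases[variant] / views if views else 0.0
```

In this sketch, segment A would get the closeup and segment B the model shot (or the other way around, it doesn't matter), and at the end you would compare `conversion_rate("A")` with `conversion_rate("B")`.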

This is what Phil does in Groundhog Day. For instance, Phil falls in love with Rita (played by Andie MacDowell) and spends what seems like months compiling lists of what she likes and doesn’t like, so that he can construct the perfect relationship with her.

Phil doesn’t just go on one date with Rita, he goes on thousands of dates. During each date, he makes note of what she likes and responds to, and drops everything she doesn’t. At the end he arrives at — quite literally — the perfect date. Everything that happens is the most ideal, most desirable version of all possible outcomes on that date on that particular day. Such are the luxuries afforded to a man repeating the same day forever.

This is the purest form of A/B testing imaginable. Given two choices, pick the one that “wins”, and keep repeating this ad infinitum until you arrive at the ultimate, most scientifically desirable choice.

As Atwood notes, the interesting thing about this process is that even once Phil has constructed that perfect date, Rita still rejects Phil. From this example and presumably from experience with A/B testing, Atwood concludes that A/B testing is empty and that subjects can often sense a lack of sincerity behind the A/B test.

It’s an interesting point, but I’m not sure it’s applicable in all situations. Atwood does admit that A/B testing is good at smoothing out details, but there’s something more at work in Groundhog Day that he doesn’t mention: Phil is using A/B testing to misrepresent himself as the ideal mate for Rita. Yes, he’s done the experimentation to figure out what “works” and what doesn’t, but his initial testing was ultimately shallow. Rita didn’t reject him because he had all the right answers; she rejected him because he was attempting to deceive her. He was misrepresenting himself, and that certainly can lead to a feeling of emptiness.

If you look back at my example above about the ring being sold on a retail website, you’ll note that there’s no deception going on there. Somehow I doubt either image would leave the customer with a hollow feeling. Why is this different from Groundhog Day? Because neither image misrepresents the product, and one would assume that the website is pretty clear about the fact that you can buy things there. Of course, there are a million different variables you could test (especially once you get into text, marketing hooks, etc.) and some of those could be more deceptive than others, but most of the time, deception is not the goal. There is a simple choice to be made. Instead of constantly wondering about your product image and second-guessing yourself, why not A/B test it and see what customers like better?

There are tons of limitations to this approach, but I don’t think it’s as inherently flawed as Atwood seems to believe. Still, the data you get out of an A/B test isn’t always conclusive, and even if it is, whatever you learn from it isn’t necessarily applicable in all situations. For instance, what works for our new ring can’t necessarily be applied to all new rings. (This is a problem for me, as my employer has a high turnover rate for products. The simple example of the ring described above would not be a good test for my company unless the ring would be available for a very long time.) Furthermore, while you can sometimes pick a winner, it’s not always clear why it’s a winner. This is especially the case when the differences between A and B are significant. For instance, testing an entirely redesigned page might yield results, but you will not know which of the changes to the page actually caused said results. On the other hand, A/B testing is really the only way to accurately calculate ROI on significant changes like that.
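On the point about data not always being conclusive, one common way to check is a two-proportion z-test, which asks how likely a gap between two conversion rates would be if the images actually performed the same. The sketch below is my own illustration, not anything from Atwood’s post: the function name `two_proportion_z_test` and all of the numbers are hypothetical, and it leans on the usual normal approximation, which assumes reasonably large samples.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / conv_b are purchase counts and n_a / n_b are visitor counts
    for segments A and B. Uses the pooled standard error under the
    assumption that both segments share one true conversion rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (e.g. zero purchases): nothing to distinguish.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 6% vs 9% conversion with only 200 visitors each.
z_small, p_small = two_proportion_z_test(12, 200, 18, 200)

# The same rates with 20,000 visitors each.
z_large, p_large = two_proportion_z_test(1200, 20000, 1800, 20000)
```

With the made-up small sample, the p-value sits well above the conventional 0.05 threshold, so a 6% versus 9% split that looks like a clear winner is not actually conclusive. With the larger sample, the same rates produce a vanishingly small p-value. This is the "if you get enough people to look at the images" caveat in numerical form, and it says nothing about why the winner won.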

Obviously these limitations should be taken into account when conducting an A/B test, and I think what Phil runs into in Groundhog Day is a lack of conclusive data. One of the problems with inconclusive data is that it can be very tempting to rationalize it. Phil’s initial attempts to craft the perfect date for Rita fail because he’s really only scraping the surface of her needs and desires. In other words, he’s testing the wrong thing, misunderstanding the data, and thus getting inconclusive results.

The interesting thing about the Groundhog Day example is that, in the end, the movie is not a condemnation of A/B testing at all. Phil ultimately does manage to win the affections of Rita. Of course, it took him decades to do so, and that’s worth taking into account. Perhaps what the film is really saying is that A/B testing is often more complicated than it seems and that what you get out of it depends on what you put into it. A/B testing is not the easy answer it’s often portrayed as, and it should not be the only tool in your toolbox (e.g. forcing employees to prove that using 3, 4 or 5 pixels for a border is ideal is probably going a bit too far), but neither is it as empty as Atwood seems to be indicating. (And we didn’t even talk about multivariate tests! Let’s get Christopher Nolan on that. He’d be great at that sort of movie, wouldn’t he?)