This post explores the pitfalls of A/B testing in mobile games, and advises caution when interpreting the results of experiments that lack good experimental control.
I hear the phrase "A/B testing" on an almost daily basis. It's often touted as a cure-all for game design decision making: remove personal bias from the equation and make data-driven decisions, because "the numbers don't lie" (as in Mark Robinson's article here). Now I'm not saying that A/B testing can't work, or can't be effective... but as with a lot of things that cross my desk, the devil is in the details.
Consider the following: we have published an F2P racing game in which users earn soft currency by completing races, and new cars and upgrades cost soft currency to purchase. Users can enter only 10 races per day, each costing one 'action point', with the option to buy more action points or more soft currency via IAP. User retention is good, but UA is perhaps a little pricey given the game's relatively narrow target audience, so the execs are looking for a way to improve ARPU.
During a design meeting, the suggestion is made to change the UI so that the upgrade screen appears ahead of the currently prominent race screen in the main menu... but after some discussion, the team is divided. One side thinks this is a great idea: it will improve ARPU by increasing the visibility of the upgrade screen, a sink for in-game currency (IGC). The other side disagrees: downgrading the visibility of the race screen will make users run fewer races and therefore spend fewer of their action points, another important sink for IGC.
How does the team resolve this debate? How can we really know which decision is right? Invariably, the suggestion is made to A/B test. So the programmers get to work, and a change to the menu flow is shipped via DLC: 50% of new and existing users will now see the new menu highlighting upgrades, and the other half will see the old one highlighting races. In three weeks, the design team will have their answer... or maybe not.
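For illustration, here is a minimal sketch of how such a 50/50 split is often implemented: deterministic, hash-based bucketing, so the same user always lands in the same group across sessions. The function and experiment name are hypothetical, not taken from any particular analytics SDK.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "menu_flow_v1") -> str:
    """Deterministically assign a user to 'A' (old race-first menu) or
    'B' (new upgrade-first menu) by hashing their ID with the test name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map to [0.0, 1.0]
    return "A" if bucket < 0.5 else "B"

# The same user always sees the same menu, across sessions and devices.
print(assign_variant("user_12345"))
```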
Two days before the DLC change, the marketing team changed their UA strategy, using a different mix of advertisers and upping CPI bids in Tier 1 countries. A week into the test, the programming team fixed a server bug that had been slowing down the download speed of new racetracks. Finally, suppose a holiday like Thanksgiving fell within the test window, prompting an in-game Black Friday sale on all IAPs.
The assumption is that with enough users, the effects of everything but the menu change will even out: the so-called Law of Large Numbers. However, these events don't just add noise that averages away; they can interact with the variant itself. Both the Black Friday sale and the higher ratio of Tier 1 users might have changed which IAPs users were buying, and thus which in-game currency sinks were most accessible. And who's to say how much shortening the download time of new racetracks improved engagement with that feature? It now seems much harder to argue that the menu change is the real cause of any shift in ARPU or user behavior than it did before the test. In reality, there is almost no way to individually quantify the effect each of these coincidental changes had on the game.
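To make the interaction problem concrete, here is a toy simulation (all numbers invented for illustration) in which the new menu has a small true effect on ARPU, but the Black Friday sale lifts spend further for users who see the upgrade screen first. The naive A/B comparison lumps the two effects together.

```python
import random

random.seed(42)

def simulated_revenue(variant: str) -> float:
    """Toy per-user revenue model (all numbers invented). The new menu
    ('B') has a true effect of +$0.02 ARPU, but a mid-test Black Friday
    sale lifts spend for everyone AND lifts it further for users who see
    the upgrade screen first -- an interaction with the variant."""
    base = 0.10                                     # baseline ARPU ($)
    menu_effect = 0.02 if variant == "B" else 0.0   # the effect we want
    sale_lift = 0.05                                # sale hits both groups
    interaction = 0.04 if variant == "B" else 0.0   # sale x variant
    return max(0.0, base + menu_effect + sale_lift + interaction
               + random.gauss(0, 0.05))

n = 100_000
arpu_a = sum(simulated_revenue("A") for _ in range(n)) / n
arpu_b = sum(simulated_revenue("B") for _ in range(n)) / n
print(f"A: ${arpu_a:.3f}  B: ${arpu_b:.3f}  diff: ${arpu_b - arpu_a:.3f}")
# The measured diff (~$0.06) conflates the $0.02 menu effect with the
# $0.04 sale interaction; nothing in the aggregates separates them.
```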
At its core, any A/B test of a new game feature is really akin to running a psychological experiment with a treatment and control group, and then trying to determine if there was a statistically significant effect of the experimental manipulation. As a researcher in a psychology lab, you can take deliberate measures to ensure that your manipulation is the only difference between your treatment and control groups. As a game analyst, your experimental groups are the unfortunate victims of a whole host of factors outside of your control.
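In analytics terms, that comparison usually reduces to a two-sample significance test on some per-user metric. A minimal sketch using SciPy's Welch's t-test follows; the revenue arrays are placeholders, not real data, and since per-user revenue is heavily zero-inflated in practice, a t-test is at best a rough first pass.

```python
from scipy import stats

# Hypothetical per-user revenue samples for the test window
# (placeholder values, not real data).
revenue_a = [0.00, 0.00, 0.99, 0.00, 4.99, 0.00, 0.00, 0.99]  # control
revenue_b = [0.00, 0.99, 0.00, 0.99, 0.00, 9.99, 0.00, 0.00]  # treatment

# Welch's t-test: do the group means differ significantly?
t_stat, p_value = stats.ttest_ind(revenue_b, revenue_a, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# Even a significant p-value only says the groups differed during the
# window -- it cannot tell you the menu change is what caused it.
```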
So where does that leave us? My point is not that A/B testing can't be done, nor that data-driven decisions aren't the way to go in the F2P mobile ecosystem. Instead, I'm suggesting that before rushing to propose another A/B test, both analysts and designers should weigh the cost of achieving real experimental control - namely, sacrificing the ability to make almost any other change to the game for the duration of the test - or run the risk of trying to make good decisions with bad data.