Don’t Rush Your A/B Test!

If you have been working in game development for a while, I have no doubt that you’ve used A/B testing at least once. It’s a powerful tool that lets you check hypotheses against data before rolling a change out to the game’s entire user base. There is no room for gut feeling here; we should rely only on the numbers. Sounds pretty straightforward -- could anything go wrong?

As it turns out, it could. The world of A/B testing is full of pitfalls. Here is a list of the most common ones:

Running the A/B test without a clear hypothesis and metric to measure
Jumping to conclusions before getting statistical significance
Uneven distribution of users across the variants by geo, platform and other segments
Incorrect implementation of the A/B test conditions
Users being assigned to the wrong A/B test group
Too short of an A/B test (e.g. one day only), resulting in biased data

In this article we will be focusing on the issues caused by rushed decisions.

Here at Kongregate, we care a lot about data-driven decision making. Thus, whenever it makes sense, we try to test major changes on smaller cohorts before implementing them. This lets us check that the changes in metrics match our expectations. I’m going to walk through an example we ran into with one of our games and show how a decision, if made too early, could hurt your game instead of helping it.

One of the games in our portfolio that I work with is Office Space: Idle Profits (available on iOS and Android). Some time ago our team was working on improving its monetization metrics. They had developed a set of changes aimed at increasing player conversion. When all the backend work was done we set up an A/B test. It had two variants:

Control: the game as is, no changes
Variant: the users got a changed version of the game with a feature called an ad hoc sale. Players in the variant saw a pop-up window with a special deal that was active for only a limited amount of time.

We expected the changes surfaced in the variant to result in a better conversion rate. Since we were selling discounted items we understood that this could lower our ARPPU, but the assumption was that the conversion growth would more than cover the ARPPU losses. So at the end of the day we were expecting to see a higher ARPDAU.

Our target audience consisted of new users because we wanted to make sure we were comparing apples to apples. Since we were testing the conversion rate, which is typically low in freemium games, we had to get a lot of users into each variant to make sure our results were statistically significant. After getting the required number of installs we had to let the players age to see their metrics.
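To give a feel for why a conversion test needs so many installs, here is a minimal sketch of a standard sample-size estimate for comparing two conversion rates with a two-sided two-proportion z-test. The 2% baseline conversion and the 20% relative lift are illustrative assumptions, not numbers from our test, and the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_variant(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed in EACH variant to detect a relative lift
    in conversion with a two-sided two-proportion z-test."""
    p1 = base_rate                        # control conversion rate
    p2 = base_rate * (1 + relative_lift)  # variant conversion rate we hope to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Assumed inputs for illustration only: 2% baseline conversion, +20% relative lift.
print(users_per_variant(base_rate=0.02, relative_lift=0.20))
```

With a conversion rate in the low single digits, the estimate lands in the tens of thousands of users per variant, and it grows quickly as the lift you want to detect gets smaller.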

The A/B test is an awesome tool with numerous pros and some cons, too. The biggest issue is the time you need to spend in order to get the result. A/B testing something means that you defer the implementation of some awesome feature that could generate more money to some time in the future. In our case the most optimistic scenario was to wait 4 weeks.
Why wait so long, you might ask? Well, as I mentioned before, monetization tests require many users -- it took us about 2 weeks to get the number of installs needed. After that we had to wait another 2-4 weeks to let these users age before checking their KPIs.

Initially our A/B test was showing incredible results -- the control and variant groups had the same early retention, and the variant was slightly worse in terms of later retention, but the cumulative D10 monetization results were impressive -- the variant had +27% conversion. As expected, the ARPPU was slightly lower, but the ARPU for the variant was 9% higher than the control ARPU, which meant an additional 9% in revenue.
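These three numbers are tied together: if conversion is payers divided by users and ARPPU is revenue divided by payers, then ARPU = conversion × ARPPU. The sketch below uses that identity to back out the ARPPU change implied by the D10 conversion and ARPU figures. It assumes all three metrics are measured over the same users and the same window; the function name is hypothetical.

```python
def implied_arppu_change(conversion_change, arpu_change):
    """Relative ARPPU change implied by relative changes in conversion and ARPU.

    Relies on ARPU = conversion * ARPPU, so the variant/control ratios multiply:
    arpu_ratio = conversion_ratio * arppu_ratio.
    """
    return (1 + arpu_change) / (1 + conversion_change) - 1


# D10 figures from the test: +27% conversion, +9% ARPU.
change = implied_arppu_change(conversion_change=0.27, arpu_change=0.09)
print(f"Implied ARPPU change: {change:.1%}")  # prints "Implied ARPPU change: -14.2%"
```

In other words, under these assumptions each paying user in the variant spent noticeably less, and the result only looked good because many more users converted.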

We were excited but decided to check the D14 cumulative results as well. It turned out that the difference in ARPU between the control and the variant had decreased. In other words, the improvement was not as considerable as it had first seemed.

Still, the D14 cumulative ARPU was 4.7% higher. That is not that bad, so the team was feeling optimistic. Seeing the trend, however, I suggested holding off and looking at the metrics of older users. I suggested it mostly as a sanity check, to make sure we were not going to hurt older users’ monetization metrics, but the outcome was more than surprising. The variant didn’t look better than the control. For D30 it was actually worse!
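The check we ended up doing by hand can be sketched as a small helper that computes the variant's ARPU lift at each cohort age and only calls the test a win if the lift still holds at the longest horizon observed. The function names are hypothetical. The control is normalized to 1.0, the D10 and D14 values reproduce the +9% and +4.7% lifts above, and the D30 value is a made-up placeholder, since all that is stated here is that the variant was worse at D30.

```python
def lift_by_age(control_arpu, variant_arpu):
    """Relative cumulative-ARPU lift of the variant over the control, per cohort age in days."""
    return {
        day: variant_arpu[day] / control_arpu[day] - 1
        for day in sorted(control_arpu)
    }


def holds_at_longest_horizon(lifts, min_lift=0.0):
    """True only if the lift at the oldest cohort age exceeds min_lift."""
    longest = max(lifts)
    return lifts[longest] > min_lift


# Control normalized to 1.0 at every age. D30 for the variant is HYPOTHETICAL.
control = {10: 1.0, 14: 1.0, 30: 1.0}
variant = {10: 1.09, 14: 1.047, 30: 0.98}

lifts = lift_by_age(control, variant)
for day, lift in lifts.items():
    print(f"D{day}: {lift:+.1%}")
print("Ship it?", holds_at_longest_horizon(lifts))
```

A shrinking lift from one checkpoint to the next is the signal to keep waiting rather than to ship on the early number.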

After such results it was clear that we couldn't introduce this feature to the entire game population. Even though it took a lot of hours of development, we had to listen to the numbers and leave everything as is.

Now you see that not rushing an A/B test is just as important as getting all the calculations right.

After discussing the results with the team we came up with a couple of other similar ideas. We are planning to A/B test these ideas soon and will hopefully find a variation that achieves our goals!