Building a New A/B Testing Engine at Dashlane

November 27, 2019 | Julien Beaussier

Let's say you are updating some feature of your product, or adding a new one. You want to make sure it'll have a positive impact.

There are several ways to do that. The "Before/After" method is the simplest one: you release and then monitor results. Either the results are positive and you maintain your course, or the results are underwhelming and you roll back the changes. But sometimes you don't want to take that risk, and you need a way to measure the impact before you release your change to everyone.

A/B testing is one solution to that problem. At a high level, it consists of the following steps:

Define the metrics you want to monitor with the A/B test (clicks, conversion, retention, etc.)
Pick only a subset of your user base and split it randomly into two groups (the "randomization")
Expose the change (sometimes called the "experiment", "variant", or "treatment") to one of those groups
Expose the original, unchanged feature to the other group (the control group)
Compare your metrics for both those groups
Consider whether any confounding variables may have influenced your results
If the "experiment" has better results, release it to everybody
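
The randomization and comparison steps above can be sketched in a few lines of Python. This is a minimal illustration, not Dashlane's engine: the function names (split_randomly, two_proportion_z_test), the seed parameter, and the choice of a two-proportion z-test to compare conversion rates are all assumptions made for the example.

```python
import math
import random


def split_randomly(user_ids, sample_fraction, seed):
    """Pick a random subset of users and split it into two equal groups.

    Hypothetical helper: returns (control, variant) as two lists.
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    sample = rng.sample(user_ids, int(len(user_ids) * sample_fraction))
    half = len(sample) // 2
    return sample[:half], sample[half:]


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare two conversion rates.

    conv_a / n_a: conversions and users in the control group.
    conv_b / n_b: conversions and users in the variant group.
    Returns (z, two-sided p-value) under the pooled null hypothesis.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With made-up counts, two_proportion_z_test(500, 10_000, 600, 10_000) gives a z-score of about 3.1 and a p-value below 0.01, which would usually be read as a real difference rather than noise. The same function returns z = 0 when both groups convert at exactly the same rate.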

In this post, I'll discuss how we think about A/B testing at Dashlane, and why we recently invested in developing a new testing engine and framework in-house instead of purchasing a third-party solution.

A/B Testing at Dashlane

A/B testing is one of Dashlane's key growth levers. Each day, we have dozens of A/B tests running to identify the optimal user experience for our customers. We test a range of experiences, from the order we show customers our features to onboarding, in-app design, messaging, and more.

For example, at some point we wanted to improve the account creation flow with a new design.

We were so confident that the new experience was superior that we actually considered a full rollout. But instead of trusting our intuition, we ran an A/B test to get a quantifiable result. Much to our disappointment, we saw a 16% drop in the metric we were tracking. Our intuition was totally wrong, and without A/B testing we would only have witnessed this decline in performance after rolling the change out to 100% of our users, rather than to a smaller, randomized cohort.

Because A/B testing minimizes the risk of a new experience hurting your core metrics, it also unlocks a greater willingness to try lots of new things and see what works. Recently, as part of an effort to improve iOS account creation rates, we wanted to test whether external validation of Dashlane would increase customers' willingness to give our product a try. We tried to adopt an idea that is becoming industry best practice: highlighting some of the awards and recognition we received over the years. Our thinking was that this would give prospective customers more confidence that Dashlane was a safe and effective product. We figured external validation from recognizable brands and thousands of happy customers might do the trick.

Around 50k users were involved in the test, and in short order the experiment revealed a 4% increase in account creation compared to our control group. So we decided to roll out this slideshow to every new user!

The Old Engine

Until recently, Dashlane conducted tests with what we now call "The Old Engine." Built a few years ago, its initial goal was to test pricing and plans. Dashlane was a much smaller company at the time, operating on fewer platforms and with too few users to run multiple tests in parallel and reach statistical significance in a reasonable amount of time. The old engine became increasingly unreliable as we grew in size, and as we identified mistakes in our initial framework for A/B testing, which sometimes led to unreliable results or non-optimal user exposure. These mistakes included:

Users were assigned at account creation, meaning each one could only be involved in one test for their entire lifetime. This meant that if we wanted to test something happening at D120 (day 120 after account creation), we had to wait 120 days after the test was released to actually start gathering results.
We were too bound to pricing, and every single test was linked to a list of plans/special offers. This made any change a bit tricky and fearsome in the end... you really don't want to mess up the whole pricing framework just to test a new button!
The numbers of users in the control group and in each variant were not always on the same scale (10x to 100x more users in the control group). This occasionally made results unreliable or difficult to read.
Some users may have been exposed to treatments several times before we actually processed the results, making the results even more unreliable because of selection biases.
Every single test needed backend engineers to implement it, because randomization was more or less hard-coded into the API.

The New Engine

When we reached consensus to change our A/B testing engine, the first question was: "should we do it ourselves, or should we rely on a third party / open source one instead?"

We saw a few issues with using a third party platform:

Technical stacks were really different from what we were used to at Dashlane. We didn't want to risk putting one of our Server Engineers in the difficult situation of being the only person working one weekend and having to fix incidents on an unfamiliar tech stack!
It could have been complex to integrate in all our products/tools - for user support, API for clients, data for Analytics etc.
In terms of privacy and security for Dashlane and our customers, we preferred to have an in-house tool rather than a third party solution.

So we decided to do it on our own and benefit from all the cumulative knowledge and experiences from our Old Engine. Our main upfront constraints were the ability to conduct several tests per user (so that we could run more tests in parallel), and ensuring the reliability of our testing framework so we were confident in the results. To face this challenge, we needed two really important things: a code name and a motto.

Because we have a lot of imagination here on the Server team, we named it... "The New A/B Testing Engine."

The motto was a bit more scientifically chosen by the Data Analyst involved in the project: "Let's Compare Bananas with Bananas!"

Under the Hood

To avoid selection biases, we built our new engine with several precautions in mind:

Our new test engine was built in a way that lets us, if needed, ensure that users are not selected for several concurrently running tests.
In order to restart an A/B test properly, we made sure that users who had already been exposed to a treatment couldn't be part of the treatment again, and we exclude them from the test results.
To validate and QA the solution, we conducted some "A/A tests", where the variant and the control group are identical. This way, we could check that "orange CTA" never beats "orange CTA" with statistical significance. A visible difference would have meant that the randomization is biased and is not as random as it should be.
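
To make these precautions concrete, here is a minimal sketch of one common way to randomize users: hashing the user ID together with the test name. The post does not say how Dashlane's engine assigns users, so the hashing approach and every name below (bucket, assign, aa_split, BUCKETS, already_exposed) are assumptions for illustration only.

```python
import hashlib

BUCKETS = 10_000  # assignment resolution: one bucket is 0.01% of users


def bucket(user_id, test_name):
    """Map a user to a stable pseudo-random bucket, salted by the test name.

    Salting by test name means a user's bucket in one test says nothing
    about their bucket in another test.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign(user_id, test_name, exposure=1.0, already_exposed=frozenset()):
    """Return 'control', 'variant', or None if the user is not in the test.

    exposure: fraction of users targeted by the test (0.0 to 1.0).
    already_exposed: users who saw a treatment in a previous run of the
    test; they are kept out of the restarted test and its results.
    """
    if user_id in already_exposed:
        return None
    b = bucket(user_id, test_name)
    if b >= int(exposure * BUCKETS):
        return None
    return "variant" if b % 2 else "control"


def aa_split(user_ids, test_name):
    """A/A check: share of users assigned to 'variant' at full exposure.

    Both groups get the same experience, so this should be close to 0.5.
    """
    groups = [assign(user_id, test_name) for user_id in user_ids]
    return groups.count("variant") / len(groups)
```

In this sketch, an A/A run is simply a test where both labels lead to the same experience: if aa_split drifts far from 0.5, or the tracked metric differs significantly between the two groups, the randomization is suspect.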

To maximize the number of tests we can run in parallel, we reused parts of our Feature Flip engine (our tool for controlling progressive roll-outs) that make targeting specific segments a more flexible and customizable process.

All this work means:

We can now decide how many users are targeted
We have full control of the A/B test lifecycle: stop, start, reset, archive, eject people from it, all with very few clicks!
We no longer need to update the APIs to set up a hard-coded test.
We also took the opportunity to build nice tools for our Customer Support & Service team to troubleshoot and manually control users' exposure to A/B tests.
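
As a rough picture of what controlling the lifecycle can look like in code, here is a small sketch. It is not Dashlane's data model: the ABTest class, its status values, the generation counter, and the exact meaning given to each action are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ABTest:
    """Hypothetical test definition with the lifecycle actions listed above."""

    name: str
    exposure: float                  # fraction of users targeted (0.0 to 1.0)
    status: str = "draft"            # draft, running, stopped, or archived
    generation: int = 0              # bumped on reset to separate result sets
    ejected: set = field(default_factory=set)

    def start(self):
        if self.status == "archived":
            raise ValueError("an archived test cannot be started again")
        self.status = "running"

    def stop(self):
        if self.status == "archived":
            raise ValueError("an archived test cannot be stopped")
        self.status = "stopped"

    def reset(self):
        # Assumption: a reset stops the test and opens a new generation,
        # so results gathered before and after the reset are never mixed.
        self.stop()
        self.generation += 1

    def archive(self):
        self.status = "archived"

    def eject(self, user_id):
        # For example, used by a support agent to remove one user.
        self.ejected.add(user_id)

    def is_eligible(self, user_id):
        return self.status == "running" and user_id not in self.ejected
```

For instance, a test created with ABTest("new-onboarding", exposure=0.1) exposes nobody until start() is called, and a user passed to eject() stays out of the test even after it is reset and started again.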

Final Takeaways

A/B Testing is a science, and requires subject-matter expertise in both technology and statistics. Nothing would have been possible without the involvement of our Data team.
Don't be afraid of building a test engine yourself. If you can budget the time and engineering resources towards the initiative, there is a lot of upside to having full ownership instead of paying a third party.
We are just starting to use this new engine, so it will require a few months of follow-up and tuning to get the best out of it.

That's it. Thanks for reading, and if you have any share-worthy learnings on A/B testing, please share them in our comments section!

And did we mention that we're hiring engineers in New York, Paris, and Lisbon? Check out our careers page!
