Building on Ziad Bizri’s previous post about how we upgraded our back-end alerts system, we explore what we do when we want to create more advanced alerts.
How We Think About and Construct Advanced Alerts
To illustrate how we create advanced internal alerts that help us understand how Dashlane is performing, suppose we work in a factory that produces peanut butter and jelly sandwiches. We know our PB&J production process is not perfect, so we quality check our sandwiches to make sure each one has enough jelly before it is shipped out. Every day, we record the percentage of sandwiches that do not contain enough jelly. This percentage is known as the “defect rate.”
Understandably, we want to know when our defect rate is unusually high or low so that we can determine if it’s by chance or because of something we could otherwise control. So how would we set up an alert to learn this?
Here are the hypothetical defect rates for the past 30 days at our PB&J factory:
This data can be represented visually on a chart as follows:
Now, our task is to set lower and upper limits for the defect rate. Whenever the defect rate falls outside those limits, an alert is triggered.
Strictly from visual examination, where should we place the limits? Limits of 4% and 8% would have triggered four alerts during the past 30 days, while limits of 3% and 9% would have only triggered one alert.
So, which option is better? It’s not always easy to “eyeball” good limits, and doing so may introduce bias into the results.
How we used to make decisions about advanced alerts
Until recently at Dashlane, we relied on our analysts to manually set a range for each of our alerts through a combination of eyeballing, statistical analysis, and deep, firsthand knowledge of our product. This method was problematic for three reasons:
- It required time and background knowledge, which led different analysts to arrive at different results.
- Changes to our product often changed the typical range of values for each statistic. This created a never-ending cat-and-mouse game in which analysts were always trying to keep alerts up to date with the product.
- Constant limits didn’t account for variation in the number of users who were actively using Dashlane on a day-to-day basis (think Thanksgiving Day vs. a normal Thursday). If you’re wondering why this matters, I invite you to partake in a quick thought experiment with me at the bottom of this post.*
How we make decisions about advanced alerts now
To address those problems, we decided to use a statistical tool called the “control chart.”
Control charts are widely used by Quality Engineers and Six Sigma practitioners in manufacturing and other industries to help determine whether variation in a process is due to random chance or an assignable cause.
With control charts, it is possible to automate the recalculation of control chart limits to keep up with an ever-changing product (problems #1 and #2). On top of that, there are certain types of control charts (p-charts and u-charts) that account for variable subgroup size (problem #3) by creating “wiggly limits” based on that variation.
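As a rough illustration of how wiggly limits arise, the sketch below computes textbook 3-sigma p-chart limits for subgroups of different sizes. The function name `p_chart_limits` and the sample counts are hypothetical; they are not taken from Dashlane's code or data.

```python
import math


def p_chart_limits(defects, sizes, sigmas=3.0):
    """Return the center line and per-subgroup (lower, upper) p-chart limits.

    defects[i] is the number of defective items in subgroup i and
    sizes[i] is the total number of items inspected in that subgroup.
    """
    # Center line: overall proportion of defects across all subgroups.
    p_bar = sum(defects) / sum(sizes)
    limits = []
    for n in sizes:
        # Binomial standard error for a subgroup of size n.
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        # A proportion cannot leave the [0, 1] range, so clamp the limits.
        lower = max(0.0, p_bar - sigmas * sigma)
        upper = min(1.0, p_bar + sigmas * sigma)
        limits.append((lower, upper))
    return p_bar, limits


# Hypothetical data: three days with different numbers of sandwiches inspected.
defects = [60, 30, 66]
sizes = [1000, 500, 1000]
center, limits = p_chart_limits(defects, sizes)
for n, (lower, upper) in zip(sizes, limits):
    print(f"n={n}: lower={lower:.4f}, center={center:.4f}, upper={upper:.4f}")
```

The day with only 500 sandwiches gets wider limits than the days with 1,000, which is what makes the limits "wiggle" from point to point.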
This is what wiggly limits look like in one of Dashlane’s actual control charts:
In implementing control charts for alerts at Dashlane, we encountered one further problem: When the subgroup size used in calculating a single point on a control chart is relatively small, control charts work well. But as we learned, bigger data doesn’t necessarily make it easier to arrive at the correct conclusion.
How we accounted for big data
Something funny happens to control charts as subgroup size increases. The upper and lower limits of the chart begin to creep toward each other until they get so close together that almost every point on the chart appears to be outside the limits.
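To see why the limits close in, recall that a textbook p-chart places its limits three standard errors above and below the center line, and the standard error shrinks in proportion to the square root of the subgroup size. A minimal sketch, assuming a made-up average defect rate of 6% (the function name `p_chart_half_width` is hypothetical):

```python
import math


def p_chart_half_width(p_bar, n, sigmas=3.0):
    """Distance from the center line to either p-chart limit for subgroup size n."""
    return sigmas * math.sqrt(p_bar * (1 - p_bar) / n)


# Assumed average rate of 6%, evaluated at increasingly large subgroup sizes.
for n in (100, 10_000, 1_000_000):
    half_width = p_chart_half_width(0.06, n)
    print(f"n={n}: limits are center +/- {half_width:.5f}")
```

With these assumed numbers, the half-width drops from roughly 7 percentage points at 100 observations to roughly 0.07 percentage points at 1,000,000, so even small day-to-day shifts in a real process land outside the limits.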
This phenomenon is referred to as “overdispersion” and was historically circumvented by using a special type of control chart named the X-chart.
The X-chart is normally used to track processes where the subgroup size is one, so it does not account for variable subgroup size. For our application, this was not an adequate solution.
David B. Laney describes a solution to this problem in his paper Improved Control Charts for Attributes.1
Laney p’ (pronounced “p-prime”) and Laney u’ (pronounced “u-prime”) control charts are created through a combination of methods from the X-chart and the p-chart/u-chart. Laney’s control charts account for both within-subgroup variation (like the p-chart and u-chart) and between-subgroup variation (like the X-chart). You can think of subgroups as each representing a point on a control chart. Within-subgroup variation is composed of the variation inside the individual points on the control chart. Remember that each point on the control chart represents a sample of measurements during a time period. Between-subgroup variation is composed of the variation between points on the control chart.
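The calculation can be sketched as follows. This is a minimal illustration rather than Dashlane's actual script: the function name `laney_p_prime_limits` and the sample counts are hypothetical. It follows the usual formulation of the p' chart, in which each subgroup's rate is converted to a z-score using the binomial (within-subgroup) standard error, and the between-subgroup variation, sigma_z, is estimated from the average moving range of those z-scores divided by 1.128, the same constant used for an X-chart.

```python
import math


def laney_p_prime_limits(defects, sizes, sigmas=3.0):
    """Return (center, sigma_z, limits) for a Laney p' chart.

    defects[i] is the count of defective items in subgroup i and
    sizes[i] is the number of items in that subgroup.
    """
    if len(defects) != len(sizes) or len(sizes) < 2:
        raise ValueError("need at least two subgroups of matching length")
    p_bar = sum(defects) / sum(sizes)
    if not 0.0 < p_bar < 1.0:
        raise ValueError("overall rate must be strictly between 0 and 1")

    # Within-subgroup variation: binomial standard error per subgroup.
    sigma_p = [math.sqrt(p_bar * (1 - p_bar) / n) for n in sizes]

    # Standardize each subgroup's observed rate.
    z = [(d / n - p_bar) / s for d, n, s in zip(defects, sizes, sigma_p)]

    # Between-subgroup variation: average moving range of the z-scores,
    # divided by 1.128 (the d2 constant for moving ranges of size two).
    moving_ranges = [abs(b - a) for a, b in zip(z, z[1:])]
    sigma_z = (sum(moving_ranges) / len(moving_ranges)) / 1.128

    # Combine both sources of variation and clamp to the [0, 1] range.
    limits = [
        (max(0.0, p_bar - sigmas * s * sigma_z),
         min(1.0, p_bar + sigmas * s * sigma_z))
        for s in sigma_p
    ]
    return p_bar, sigma_z, limits


# Hypothetical daily counts with large, varying subgroup sizes.
defects = [6000, 7800, 5000, 6900]
sizes = [100_000, 120_000, 90_000, 110_000]
center, sigma_z, limits = laney_p_prime_limits(defects, sizes)
print(f"center={center:.4f}, sigma_z={sigma_z:.2f}")
for n, (lower, upper) in zip(sizes, limits):
    print(f"n={n}: lower={lower:.4f}, upper={upper:.4f}")
```

When sigma_z is close to 1, the limits match those of an ordinary p-chart. Values well above 1 indicate overdispersion and widen the limits accordingly.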
Through Laney’s adjustment, we were able to create control charts and alerts that truly work for big data with varying subgroup size.
To make creating advanced alerts accessible to all Dashlane employees who are SQL-proficient, we wrote a Python script which accepts a small SQL snippet that returns the recent history of any statistic of interest. The script automatically applies Laney’s control chart calculations and feeds the results into the architecture described in Ziad’s previous post in order to notify the employee via Slack (our internal communication tool) any time an alert is triggered for their metric.
I hope you now have a better understanding of how Dashlane thinks through alert creation, and how we build accurate alerts that identify outliers and help us deliver a better user experience.
If you have any questions or comments, please leave them below and I’ll do my best to answer them. Thanks for reading!
* Thought experiment:
- Imagine flipping a coin 10 times. Assuming the coin is fair, you may expect to see a 50-50 split (5 heads and 5 tails). However, it would not be especially unlikely to witness 6 or more heads. Over the long run, this will happen about 37.7% of the time the experiment is performed.
- Now imagine flipping the same coin 100 times. The chance of witnessing 60 or more heads is about 2.8%.
If we created an alert to notify us whenever the percentage of heads was 60% or greater, the triggering of that alert would be influenced by the number of flips.
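The two probabilities above can be checked by summing the binomial distribution directly. The helper name `prob_at_least` is just for illustration.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of seeing k or more heads in n flips of a coin with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


print(round(prob_at_least(6, 10), 3))    # 0.377
print(round(prob_at_least(60, 100), 3))  # 0.028
```

The threshold is 60% heads in both cases, yet the chance of crossing it falls sharply as the number of flips grows. The same applies to a fixed alert limit when the number of active users varies from day to day.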
1 David B. Laney (2002) Improved Control Charts for Attributes, Quality Engineering, 14:4, 531-537, DOI: 10.1081/QEN-120003555.