When Dashlane’s User Support team contacted me to tell me that our users couldn’t create accounts, I felt like an idiot.

I am, after all, in charge of Dashlane’s back-end systems.

But when I looked at the monitoring we set up, I saw that all our instances were healthy, the load balancers were in good shape, and there was no noticeable change in network traffic. And since our business dashboards are only updated once per day, I wouldn’t have seen anything suspicious for almost 24 hours.

If User Support hadn’t said anything to me, there would have been no way for me to know what was happening in real time. The metrics I was following closely, sometimes at one-second granularity, were not good indicators of critical business changes that required immediate attention.

It was at this point I realized that something had to change.

Building an Alerts System to Improve Business Reporting

So, my team decided to build our own system to alert us when any critical business metric was in need of attention.

This is how we did it:

  1. First, we listed indicators that were important for our business: account creation, new subscriptions (on each payment platform), successful automatic renewals, handling of receipts from external providers (Google Play, Apple App Store), etc.
  2. Then, we added usage monitoring based on anonymously collected data, so we could get alerts per platform and release version.
  3. Finally, we set up a process that alerted us after each production incident. That way, we’d know which incidents were covered and which were missing.

There was a problem though: The data for all those indicators was distributed across several sources.

For example, we use SQL databases for everything related to account creation, purchases, etc., but we use NoSQL databases for storing the anonymous usage logs.

A couple of other constraints we had to work within:

  • We use an internal chat tool and ideally we would get alerts and notifications sent directly to us via chat.
  • We couldn’t rely on yet another database structure for data collection and alerting, as that would mean additional maintenance and engineering resources that we don’t currently have (#startuplife).

So we decided to use ElastAlert, an open-source project by Yelp that generates alerts from data in Elasticsearch. It also integrates well with different chat clients.

Dashlane's New Alerts System

How Has It Worked Out?

We now use Elasticsearch extensively: we have a cluster handling more than a terabyte of data per week, and every engineer in the company uses Kibana to visualize the data collected by Logstash instances.

For this project, a process continuously polls our various databases and, every five minutes, stores the alert data points as Elasticsearch documents in a separate index. ElastAlert then queries this data, using Elasticsearch's count aggregation, to generate alerts. The stored documents carry pre-aggregated counts and have a reduced cardinality compared to the original data, so some information is lost, but there’s enough left to build useful alerts.

For instance, a typical document would be something like this:

{
    '@timestamp': 'April 11th 2018, 08:13:48.000',
    count: 76,
    browser: 'chrome',
    alertname: 'autofill_usage',
    autofilled: false,
    filledByUser: true,
    version: '5.10.0.18479-dmg'
}
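To make the pre-aggregation step more concrete, here is a minimal Python sketch of how raw rows could be collapsed into documents of this shape. It is an illustration only, not our production code: the function name `build_alert_documents`, its parameters, and the ISO timestamp format are all assumptions made for the example, and the step that actually writes the documents to Elasticsearch is left out.

```python
from collections import Counter
from datetime import datetime, timezone


def build_alert_documents(rows, alertname, dimensions, timestamp=None):
    """Collapse raw rows into one pre-aggregated document per combination
    of the chosen dimensions (hypothetical helper, for illustration)."""
    timestamp = timestamp or datetime.now(timezone.utc)

    # Count how many rows share the same values for the kept dimensions.
    # Every other field of the raw row is dropped, which is where the
    # loss of information (reduced cardinality) comes from.
    counts = Counter(tuple(row.get(d) for d in dimensions) for row in rows)

    docs = []
    for key, count in counts.items():
        doc = {
            "@timestamp": timestamp.isoformat(),
            "alertname": alertname,
            "count": count,
        }
        doc.update(zip(dimensions, key))
        docs.append(doc)
    return docs


rows = [
    {"browser": "chrome", "version": "5.10.0", "user": "a"},
    {"browser": "chrome", "version": "5.10.0", "user": "b"},
    {"browser": "firefox", "version": "5.9.1", "user": "c"},
]
for doc in build_alert_documents(rows, "autofill_usage", ["browser", "version"]):
    print(doc)
```

Three raw rows become two documents here: one per (browser, version) pair, each carrying its own count.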

ElastAlert then allows us to check for threshold values (e.g. no account creation), spike values (e.g. a big issue in the current release that needs to be reverted), or blacklists and whitelists (e.g. unexpected values), which gives us decent query expressiveness.
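The idea behind the threshold and spike checks can be sketched in a few lines of plain Python. This is not ElastAlert's implementation or configuration, just a simplified illustration of the logic applied to our pre-aggregated counts; the function names and parameters (`below_threshold`, `is_spike`, `spike_height`, `direction`) are assumptions chosen for the example.

```python
def below_threshold(counts, min_total):
    """Threshold check: True when the window's total count is under min_total
    (e.g. no account creation in the last window)."""
    return sum(counts) < min_total


def is_spike(reference_counts, current_counts, spike_height, direction="both"):
    """Spike check: compare the current window's total with a reference
    window's total. An upward spike means the current total is at least
    spike_height times the reference; a downward spike means it is at most
    1/spike_height of it. With an empty reference window there is no
    baseline, so no spike is reported."""
    reference = sum(reference_counts)
    current = sum(current_counts)
    if reference == 0:
        return False
    up = current >= reference * spike_height
    down = current * spike_height <= reference
    if direction == "up":
        return up
    if direction == "down":
        return down
    return up or down


# No account creation in the current window: the threshold check fires.
print(below_threshold([0, 0, 0], min_total=1))
# Error reports tripled after a release: the spike check fires.
print(is_spike([10, 10], [30, 40], spike_height=3))
```

In this sketch each list holds the `count` values of the documents that fall into one time window, so choosing the window length directly affects how sensitive each check is.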

Today, we have a very stable system that alerts us to most of the scenarios we planned for.

But that isn’t enough!

We get false positives when the time interval is too short and false negatives when it is too long. And because each alert needs some time before it triggers, we spend a lot of time fine-tuning individual alerts. Also, the system doesn’t allow us to compare trends over long periods of time or check that a value is within an acceptable range given our historical data.

That’s why we’re working to integrate more advanced queries into our new system to solve those problems.
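As an illustration of what "within an acceptable range given our historical data" could mean, here is one simple approach, sketched under assumptions: flag a value that sits more than a few standard deviations away from the historical mean. This is not what we have built; the function name `within_historical_range` and the `tolerance` parameter are hypothetical, and a real version would also need to account for daily and weekly seasonality.

```python
from statistics import mean, stdev


def within_historical_range(history, value, tolerance=3.0):
    """True when value lies within `tolerance` sample standard deviations
    of the mean of `history` (hypothetical helper, for illustration)."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    center = mean(history)
    spread = stdev(history)
    return abs(value - center) <= tolerance * spread


# Account creations for the same five-minute slot on previous days
# (made-up numbers for the example).
history = [100, 110, 90, 105, 95]
print(within_historical_range(history, 98))  # close to the usual level
print(within_historical_range(history, 20))  # far below the usual level
```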

Thanks for reading! If you have any questions or comments, please leave them below and I’ll do my best to answer them.