What Enhancing Observability Looks Like at Dashlane

Introduction

With our rapid business growth, like any other company in the industry, we face challenges with our platform's scalability from time to time. We always do our best to minimize the impact on our customers, but these problems are sometimes complex to solve.

Most recently, while working through one of these scalability challenges, we discovered an opportunity to enhance our observability tooling so that it provides rich, actionable, real-time information about platform availability and performance. With such a tool in place, we can act more quickly and proactively in the future. This article describes how we improved our system's scalability, the key lessons we learned, and the actions we took to mitigate future risks.

What happened

On July 11 and 12, 2022, our monitoring tool reported slow server response times that sporadically impacted some of our customers. The incident was immediately posted on our status page. At random, customers faced slowness where the applications appeared to hang while waiting for the backend to respond.

Root cause

A critical component of our backend performance is our rate limit mechanism, which throttles API requests when a client application issues an unexpectedly high number of them. As a result, the rate limit mechanism is involved in every single API request. It leverages a Redis in-memory database hosted on AWS ElastiCache. The cause of the incident was that the primary node of our Redis cluster was capped at 20% CPU and was no longer powerful enough to serve all the load. The cluster was running on cache.t3 nodes, which are bound to 20% CPU unless "CPU Credits" are consumed. If you're unfamiliar with this concept, you can find more details in the AWS reference documentation. We discovered that the primary node had run out of CPU Credits and was therefore capped at 20% CPU, which was not enough to serve traffic efficiently. This created a chain of events that resulted in intermittent slow response times in our apps.
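To illustrate why this mechanism touches every request, here is a minimal sketch of a fixed-window rate limiter. This is not our actual implementation: in production the counters live in Redis (typically an INCR plus EXPIRE per key), while a plain Map stands in here to keep the sketch self-contained, and the window and budget values are made up.

```javascript
// Illustrative fixed-window rate limiter. In production this state would
// live in Redis (e.g. INCR + EXPIRE per client/window key); a Map stands
// in here so the sketch is self-contained.
const WINDOW_MS = 60_000; // hypothetical 1-minute window
const MAX_CALLS = 100;    // hypothetical per-client budget

const counters = new Map();

function isAllowed(clientId, now = Date.now()) {
  // One counter per client per time window.
  const windowKey = `${clientId}:${Math.floor(now / WINDOW_MS)}`;
  const count = (counters.get(windowKey) ?? 0) + 1;
  counters.set(windowKey, count);
  return count <= MAX_CALLS;
}
```

Because a check like this runs on every API call, the Redis cluster backing it sits on the critical path of the entire platform, which is why its slowdown degraded all endpoints at once.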

The following sections of this article cover our post-mortem exercise to answer our questions about the root cause. We also used this exercise as an opportunity to improve our tools and practices to prevent future recurrence of this issue.

Timeline

Mid-May

CPU Credits start decreasing. At this point, we have no specific monitoring of CPU Credit usage.

May 31

⚠️ The first spike of high response time is reported by our monitoring tool.

It's a very short burst, and we don't see anything wrong in our system. We suspect some random massive JSON parsing added delay to our NodeJS event loop. We try to correlate the spike with some of our key metrics, such as DB and EC2 CPU usage and memory, without finding anything. In hindsight, this was the first time we ran out of CPU Credits, something we would only discover much later while reviewing this incident.

This incident is posted on Dashlane’s status page: Incident – Dashlane services partially disrupted.

June 8

Chrome Extension 6.2222.2 is released. This version includes a new phishing protection mechanism.

This mechanism calls the backend periodically, increasing the load on the rate limit cluster.

CPU Credit levels decrease at an accelerated rate, but there’s still no sign of user impact.

June 24

⚠️ Some warnings appear in our monitoring tool, but again, there is no real impact on the customer experience.

A study of DB usage highlights a newly introduced, heavy SQL request involved in the new phishing protection mechanism.

We immediately take action to improve the caching of this component, which brings back some stability.
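The caching fix can be pictured as a small TTL wrapper around the heavy query. This is only a sketch with assumed names (`cached`, `fetchDomainList`, the TTL value), not our actual code:

```javascript
// Illustrative TTL cache in front of an expensive query: the query runs
// at most once per TTL window, and every other caller gets the cached value.
function cached(fetchFn, ttlMs) {
  let value;
  let expiresAt = 0;
  return async (...args) => {
    const now = Date.now();
    if (now >= expiresAt) {
      value = await fetchFn(...args); // run the heavy SQL request once per TTL
      expiresAt = now + ttlMs;
    }
    return value;
  };
}

// Hypothetical usage: cache the phishing-protection domain list for 5 minutes.
// const getDomainList = cached(fetchDomainList, 5 * 60_000);
```

The trade-off is staleness: results can lag behind the database by up to one TTL, which is acceptable for data that changes slowly.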

June 27

Extension 6.2224.4 is released more quickly than usual, putting more load on the database.

⚠️ New warnings are raised by our monitoring, but again, we don’t see any real impact on the customer experience.

We start to suspect the overall load is responsible for our issues but assume it's caused by the accelerated rollout of the new extension version. In particular, when a new app version is released, it generates hundreds of millions of requests on the service responsible for updating the app version stored in our Device table.

We then take several more actions:

  • Discuss with client teams to adapt the pace of our extension’s progressive rollout
  • Offload updates of appVersion fields to some side workers to avoid burdening the APIs with that process
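The second action, offloading appVersion updates, follows a familiar pattern: the request path only enqueues the write, and a background worker flushes batches to the Device table. A simplified in-process version (the real system would use a durable queue; all names here are illustrative):

```javascript
// Sketch of offloading non-critical writes. The API handler only enqueues;
// a background worker drains the queue in batches.
const pending = [];

function recordAppVersion(deviceId, appVersion) {
  pending.push({ deviceId, appVersion }); // O(1), no DB work on the request path
}

async function flushWorker(writeBatch, batchSize = 500) {
  while (pending.length > 0) {
    const batch = pending.splice(0, batchSize);
    await writeBatch(batch); // e.g. one bulk UPDATE instead of N single-row writes
  }
}
```

The API stays fast even during a release spike, and the worker absorbs the burst at its own pace.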

July 5 to July 7

⚠️ Sustained high response time for a longer period, with minor customer impact. We receive a few complaints about sluggish app response time.

We start to rule out some potential causes of the poor response time by upgrading our DBs:

  • July 6 – Upgrade our MySQL AWS RDS database from db.r5.4xlarge to db.r5.8xlarge
  • July 7 – Upgrade our AWS DocumentDB cluster

We apply those changes in the morning, Paris time (CEST), and the results feel positive: we notice a decrease in response time.

But a few hours later, when North America comes online, our monitoring still shows problems.

July 11

The iOS team accidentally skips the progressive rollout, releasing a new version to all customers simultaneously. Regardless, we figure the backend should be fine because the load is manageable.

At the same time, Extension 6.2226.5 is released to 5% of our existing customers.

We still wonder if those releases contribute to our slow response time.

⚠️ New alerts :'( Incident is posted on Dashlane’s Status page: Incident – Slow Response Times on Server.

At this point, several backend team members are actively investigating and reviewing all our metrics again.

A misreading of AWS network bandwidth charts leads us to believe that our API instances' network is saturated, which could explain why all our endpoints are impacted.

Therefore, we double the number of EC2 hosts in our ECS cluster and limit the number of containers running on each host to spread the bandwidth.

July 12

⚠️ Another occurrence of slow response time. The incident is once again posted on the Status page: Incident – Slow Response Times on Server.

We launch a task force, gathering experienced engineers from different teams to focus on solving the incident. Because we recognize the difficulties in understanding the causes of our problems with our existing tooling, this team also has a secondary objective to explore solutions to enhance our observability and tooling around emergency procedures.

We manually instrument our API pipeline to measure the execution of every single step:

  • Routing
  • Input parsing
  • Authentication
  • Rate limit
  • Handler

We soon discover that our bottleneck is the rate limit. Even though this is the first real clue in several weeks, we don't understand why. We review the metrics of our rate limit cluster again, but nothing surfaces. At this point, we still consider 20% CPU usage safe. To confirm the rate limit is indeed the bottleneck, we disable it for a few minutes, and traffic appears to be fully restored.
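That manual instrumentation boils down to timing each stage of the pipeline so the slow one stands out. A minimal sketch of the idea (assumed shape, not our actual code):

```javascript
// Run a map of named pipeline steps in order and record how long each took.
async function runInstrumented(steps, ctx) {
  const timings = {};
  for (const [name, step] of Object.entries(steps)) {
    const start = process.hrtime.bigint();
    await step(ctx);
    timings[name] = Number(process.hrtime.bigint() - start) / 1e6; // ms
  }
  return timings;
}

// Hypothetical usage with the five stages above:
// const t = await runInstrumented(
//   { routing, inputParsing, authentication, rateLimit, handler }, ctx);
// console.log(t); // per-step durations; the rateLimit entry dominated for us
```

Once every request logs per-step durations, a bottleneck that is invisible in per-component dashboards becomes obvious in minutes.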

A second action we take is to block some background synchronizations between devices, with the hope of cutting some rate-limit-intensive flows. This adds some delay to the results of user actions, such as the propagation of new credentials, but overall, stability is restored, and all endpoints are back to normal. This is the moment when we successfully mitigate the incident.

Without understanding the root cause yet, we have two known workarounds:

  • Option 1: Partially slow down our synchronization process, with some controlled impacts on customer experience
  • Option 2: Disable the rate limit

Because rate limiting is an important mechanism in our security architecture, we implement Option 1 until we understand the root cause and can fully remediate it for the long term.

Our backend gets back on track, and overall capabilities are restored, except for a slightly increased delay in some actions that trigger background synchronizations.

But one question remains unanswered: Why does our ElastiCache cluster have such latency even when CPU, memory, and network consumption look fine?

July 15

We understand the root cause of the issue!

Reviewing all aspects of this cluster, we finally realize it is hosted on cache.t3 nodes. We make the connection with the topic of CPU Credits and AWS burst features. Looking for CPU Credit charts, we realize they are NOT available in the ElastiCache Monitoring overviews. We had spent the whole week looking at ElastiCache's built-in monitoring view, only to find out now that it does not include the CPU Credit metric charts. Those metrics are only available through CloudWatch itself.

Indeed, our CPUCreditBalance was at 0 when the incident occurred.

We replace the cluster with a cache.m5 one, and the problem is completely solved. Everything makes sense now:

  • Our traffic increased progressively over several years, up to the point we reached 20% CPU from time to time.
  • Each time we release a new version of one of our clients, it has a cost on the backend. The system was designed to handle this correctly, except that this Redis cluster, sitting at its 20% limit too often, was no longer accumulating enough credits to cover the overhead generated by releases, let alone regular peak usage hours.

July 31

We scale down everything we had scaled up for remediation: the DB and the APIs.

We replace all T3 ElastiCache clusters with M5 ones.

Lessons learned

A word about CPU Credit and T3 instance family on AWS

By default, T3 instances are capped at a baseline CPU percentage (20% for our node size). While you stay below the baseline, you accumulate CPU Credits. When you reach it, the instance enters a "burst" mode, consuming those credits to go above the baseline.

What happens when the instance runs out of CPU Credits differs between EC2 and ElastiCache. On EC2, an "unlimited mode", enabled by default, makes you pay more instead of capping your machine at 20%. On ElastiCache, however, there is no unlimited mode, and the instance is capped at 20%.
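The credit math makes the failure mode concrete. Below is a back-of-the-envelope simulation using illustrative numbers for a 2-vCPU node that earns 24 credits per hour with a 20% baseline (one credit is one vCPU at 100% for one minute; check the AWS documentation for your exact instance type):

```javascript
// Toy model of a T3 CPU Credit balance over time. All defaults are
// illustrative, not official AWS figures for any specific node size.
function simulateCredits({ hours, avgCpuPct, vcpus = 2, earnPerHour = 24, start = 0 }) {
  let balance = start;
  for (let h = 0; h < hours; h++) {
    // Credits burned in one hour: utilization × vCPUs × 60 minutes.
    const spent = (avgCpuPct / 100) * vcpus * 60;
    balance = Math.max(0, balance + earnPerHour - spent);
  }
  return balance;
}
```

With these numbers, a sustained 25% CPU burns 30 credits per hour while earning only 24, so even a healthy balance drains by 6 credits per hour until it hits zero and the node is pinned at its baseline, which is exactly the slow decline we saw from mid-May to July.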

About our rate limit T3 cluster

In 2019, when we spawned that cluster, we knew about the T3 CPU Credit constraint, but at that time the cluster was heavily overprovisioned. The T3 instance family was therefore tempting, since it costs much less than the M5 family and we were far from the 20% limit. We never thought we would reach this threshold, which remained true for years, until we finally hit it in July.

We could have prevented July’s incident had we chosen a different node family back in 2019, but we still believe that our decision was appropriate given the context at that time.

Even if our original decision to use T3 was appropriate, we should have created alarms on CPU Credit usage, just in case we ended up running out of them one day, even years later.
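Such an alarm is straightforward to define on the CPUCreditBalance metric, which lives in CloudWatch under the AWS/ElastiCache namespace. A hedged example using the AWS CLI (alarm name, cluster ID, threshold, and SNS topic are all placeholders to adapt to your setup):

```shell
# Alert when the node's CPU Credit balance runs low, well before it
# gets capped at its baseline. All identifiers below are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name ratelimit-redis-low-cpu-credits \
  --namespace AWS/ElastiCache \
  --metric-name CPUCreditBalance \
  --dimensions Name=CacheClusterId,Value=ratelimit-redis-001 \
  --statistic Minimum \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 50 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall-alerts
```

With an alarm like this in place, the mid-May credit decline would have paged someone weeks before the first customer-visible spike.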

In retrospect, one hint we missed during our investigations was that our CPU usage had become progressively flat, a sign that it was being capped under the hood. Below is a chart showing the CPU utilization flattening progressively until we replaced the cluster.

We have now replaced all cache.t3 clusters with cache.m5 clusters. Given today's context, paying for more CPU to avoid slow response times caused by a lack of CPU Credits makes sense.

Observability and APM

Without an Application Performance Monitoring (APM) system in place, finding the root cause of the incident took time because we had to connect information from different places ourselves. Our monitoring was triggering alerts, and we were watching. But without observing the application's performance as a whole, it was hard to spot the root cause of the incident.

We relied on per-component metrics to know how our platform behaved, effectively asking questions like, “What’s the status of the DB? What’s the CPU usage on that cache cluster?” But we weren’t measuring the actual time spent on those components during a request.

The real hint appeared when we finally added some manual instrumentation to our pipeline.

Good monitoring alone is not enough once you start thinking about observability as a practice. While monitoring alerts you when a pre-defined threshold is crossed, observability is what you really need to debug quickly and restore service. Observability is a set of tools and practices that give you an outside-in view into the operation and behavior of your systems. APM is just one piece of the observability puzzle.

Elastic APM

We decided to invest in tooling that brings visibility to our bottlenecks and integrates an APM solution.

Many powerful APM tools exist, including New Relic, Dynatrace, and Datadog. Since we were already leveraging Elasticsearch as our logging stack, we decided to go with Elastic APM, which is free as long as you host the infrastructure yourself, which was already true in our case. The overall integration was fairly simple: it took us a few hours to set up an APM Server cluster, configure Logstash and Elasticsearch accordingly, and expose those APM servers to our backend. It took another few hours to embed the NodeJS APM agent and adjust its configuration. We find it especially useful that this agent auto-instruments most well-known NodeJS libraries, like Express, MySQL, Redis, and more.
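For reference, wiring the agent in is mostly configuration. A sketch of the initialization (values are placeholders; the agent must be required before anything else so it can patch the libraries it auto-instruments):

```javascript
// Configuration sketch for the Elastic APM NodeJS agent. Load it first,
// before Express, mysql, redis, etc., so auto-instrumentation can hook them.
const apm = require('elastic-apm-node').start({
  serviceName: 'backend-api',                       // placeholder
  serverUrl: 'http://apm-server.internal:8200',     // placeholder
  transactionSampleRate: 0.1, // sample 10% of transactions to limit overhead
});
```

From there, transactions and spans for instrumented libraries show up in Kibana without further code changes.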

One question people often ask about Elastic APM is its overhead. Yes, it does add some overhead to our APIs, but we've found it fairly negligible when configured correctly with sampling and filtering. In our case, the only noticeable impact was on Elasticsearch and Logstash, which had to store and process many more documents.

But we’ve determined that is a tradeoff we’re happy with. Below is an example of analysis provided by Elastic APM where we can clearly see a call to Redis taking much more time than expected.

In conclusion

We regret the disruption our customers experienced due to the sporadic slow response times. At the same time, we’re confident that these particular issues won’t reoccur because the team has taken this opportunity to improve our systems and monitoring to become more resilient.

It was a good reminder about the T3 instance family and how we should have turned the CPU Credit constraint into a monitored metric from day one.

Also, it was an efficient showcase of how important it is to invest time and resources into strong observability practices, particularly APM tooling. Happily, we discovered how surprisingly easy it can be to set up!

We value transparency, so when disruptions occur, we post them on status.dashlane.com to make customers aware. While we recognize that incidents can sometimes occur, we're continuously looking to improve so customers can always have a great experience using Dashlane.