Parallelizing and auto-scaling bcrypt

Damian Schenkelman
Published in Auth0 Engineering
Sep 20, 2016 · 6 min read

How we greatly scaled our ability to perform login transactions

tl;dr

Keeping passwords safe involves using hash functions that are CPU-intensive and intentionally slow. This presents an interesting scaling challenge when you have to handle thousands of login transactions per second.

Moving to an automatically scalable, distributed worker pool model allowed us to avoid bottlenecks, save money and not have to worry about variable load.

In the beginning

One of the features we have had for a really long time is what we call Database Connections. This means that instead of using Google, Facebook or an Active Directory instance as an identity provider, we store the emails/usernames and password hashes for users and they log in using those credentials.

This is one of our most used features. About 70% of users come from Database Connections. Making sure the components involved are constantly optimized is always top of mind.

As our customer base started growing, performance was one of the things we had to pay more attention to. The more customers you have, the more varied their use cases and the higher the load you get. Making sure we could handle that load became one of our top priorities.

In the extreme, bad performance can be interpreted as downtime.

Performance tests

We usually run many instances of each of our services in our cloud deployments. For the performance tests, we decided to run a single instance of each service to find the most fundamental bottlenecks first.

When performing a test like this, the expected response time distribution usually:

  • Has a minimum greater than 0; no request takes 0 ms.
  • Has relatively similar counts across the lower range of response times.
  • Has a constantly decreasing tail; some requests just take a bit longer.

A good example could look like this:

Example response time distribution

We set up a separate environment for the tests, created a couple of JMeter scripts to simulate users logging in and got this as the response time distribution:

Real response time distribution

As you can see, the shapes of the curves are quite different. In reality, the majority of the requests were taking a very long time to complete. This pointed to a clear bottleneck.
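The actual tests were JMeter scripts. Purely for illustration, a bare-bones Node sketch of the same idea (the endpoint, port, and credentials below are made up) that fires concurrent login requests and buckets the response times could look like this:

```javascript
// Illustration only: the real tests were JMeter scripts. This fires
// concurrent POSTs at a made-up local /login endpoint and prints a
// rough response-time histogram (100 ms buckets).
const http = require('http');

function login() {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    const req = http.request(
      {
        host: 'localhost',
        port: 3000,
        path: '/login',
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
      },
      (res) => {
        res.resume(); // drain the response body
        res.on('end', () => resolve(Date.now() - start));
      }
    );
    req.on('error', reject);
    req.end(JSON.stringify({ username: 'user@example.com', password: 'hunter2' }));
  });
}

(async () => {
  const timings = await Promise.all(Array.from({ length: 200 }, login));
  const histogram = {};
  for (const ms of timings) {
    const bucket = `${Math.floor(ms / 100) * 100}ms`;
    histogram[bucket] = (histogram[bucket] || 0) + 1;
  }
  console.log(histogram);
})();
```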

Enter bcrypt

Conceptually, our setup looks something like this:

  1. A client performs a request to authenticate against our Auth API.
  2. Auth0 verifies the user's credentials (active authentication).
  3. Eventually the client gets a token as a response.

Note: The database and caches are queried while the above is taking place.
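In code, the client's side of that flow is a single HTTP call that exchanges credentials for a token. The sketch below is illustrative only; the endpoint path and payload shape are simplified, not the exact Auth0 API. The point is that each such request implies one bcrypt comparison on our side:

```javascript
// Illustrative only: the tenant, endpoint path and payload shape are
// simplified, not the exact Auth0 API. One successful call ends with a
// token for the client and one bcrypt comparison on the server.
const https = require('https');

function authenticate(username, password) {
  return new Promise((resolve, reject) => {
    const body = JSON.stringify({ username, password });
    const req = https.request(
      {
        host: 'example.auth0.com', // hypothetical tenant
        path: '/oauth/token',      // illustrative path
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
      },
      (res) => {
        let data = '';
        res.on('data', (chunk) => (data += chunk));
        res.on('end', () => resolve(JSON.parse(data))); // e.g. { access_token: '...' }
      }
    );
    req.on('error', reject);
    req.end(body);
  });
}

authenticate('user@example.com', 'hunter2').then((result) => console.log(result));
```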

The first thing you hope for in situations like these is that you are just missing an index and adding it will make everything better, but that just wasn't the case.

Our next hypothesis was that a CPU-intensive task (or set of tasks) was blocking the event loop and causing requests to queue up. Instead of going through our entire code base, we used flame graphs to try to find the cause of the issue. Fortunately, we got something that looked like the one in the following figure when profiling Auth0:

Example flame graph

Most of our time was being spent on bcrypt. bcrypt is the algorithm we use for hashing passwords. It is CPU-intensive and intentionally slow, and the number of iterations it performs (the cost factor) can be increased to keep pace with faster cores in the future.

All of that is great from a security perspective, but a scaling and performance nightmare.
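To make that concrete, here is a minimal sketch (using the npm bcrypt module; exact timings depend on your hardware) of how the configurable work factor makes every single hash deliberately expensive:

```javascript
// Sketch: measuring bcrypt cost at different work factors.
// Assumes the npm "bcrypt" module; absolute timings depend on hardware.
const bcrypt = require('bcrypt');

async function timeHash(rounds) {
  const start = process.hrtime.bigint();
  await bcrypt.hash('correct horse battery staple', rounds);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`cost=${rounds}: ${elapsedMs.toFixed(1)} ms`);
}

(async () => {
  // Each +1 on the cost factor roughly doubles the work.
  for (const rounds of [8, 10, 12]) {
    await timeHash(rounds);
  }
})();
```

Each unit increase of the cost factor roughly doubles the hashing time, which is exactly the property that turns thousands of logins per second into a CPU problem.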

The alternatives

  1. Faster hash: This is a simple alternative (at least for new users), but it was out of the question for us. We like the security properties of bcrypt and were not willing to give them up.
  2. Caching: This does not fix the issue. We are trying to scale real logins; people don't log in more than once in a while, so there's no temporal locality. Caching would not save us this time.
  3. Scaling up: This was definitely the way to go for a quick win. You get more cores, run more processes, and you are good. The problem is that in a cloud environment you are going to hit a ceiling at some point.
  4. Scaling out: That's what you want in general, but which VMs do you use? With single-core VMs you are not wasting any money, but every time an authentication request comes in, all other requests on that instance stall; that's not good :( If you get VMs with more cores and use async mode (which basically means using libuv's thread pool; see the sketch after this list), cores sit idle during the periods when no authentication requests are coming in. This looks like a step in the right direction, but finer granularity seems better.
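To illustrate the async-mode point from option #4, here is a minimal sketch (npm bcrypt module assumed) contrasting the blocking and non-blocking APIs:

```javascript
// Sketch: the difference "async mode" makes on a multi-core VM.
// Assumes the npm bcrypt module.
const bcrypt = require('bcrypt');

// Blocking: hashSync runs on the event loop, so every other request on
// this process stalls for the duration of the hash.
// const hash = bcrypt.hashSync('hunter2', 10);

// Non-blocking: hash is executed on libuv's thread pool (default size 4,
// tunable via the UV_THREADPOOL_SIZE environment variable), so the event
// loop stays free... but those threads sit idle when no logins arrive.
function hashPassword(password) {
  return bcrypt.hash(password, 10); // resolves with the hash
}

hashPassword('hunter2').then((hash) => console.log(hash));
```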

Introducing BaaS

We liked a lot of the benefits that solution #4 brought to the table, but wanted more control over the bcrypt operations. That's how we came up with BaaS (bcrypt as a service).

We created a TCP service that uses Protocol Buffers to handle two types of messages:

  • Comparing a password to a hash
  • Hashing a password

Each BaaS instance is like a worker in a distributed thread pool (a kind of "distributed" async mode). The trivially parallelizable nature of the aforementioned operations is an ideal fit for this model.
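A heavily simplified sketch of such a worker is shown below. The real service speaks Protocol Buffers over TCP; for brevity this sketch uses newline-delimited JSON, and the message field names are our own invention:

```javascript
// Sketch of a BaaS-style worker: a TCP service that only does bcrypt work.
// The real service uses Protocol Buffers; newline-delimited JSON and the
// field names below are simplifications for illustration.
const net = require('net');
const readline = require('readline');
const bcrypt = require('bcrypt');

const server = net.createServer((socket) => {
  // One JSON message per line; the real service frames Protocol Buffers instead.
  const rl = readline.createInterface({ input: socket });
  rl.on('line', async (line) => {
    const msg = JSON.parse(line);
    if (msg.type === 'hash') {
      const hash = await bcrypt.hash(msg.password, 10);
      socket.write(JSON.stringify({ id: msg.id, hash }) + '\n');
    } else if (msg.type === 'compare') {
      const ok = await bcrypt.compare(msg.password, msg.hash);
      socket.write(JSON.stringify({ id: msg.id, ok }) + '\n');
    }
  });
});

server.listen(9000, () => console.log('bcrypt worker listening on 9000'));
```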

Auth0 nodes connect to a load balancer (actually two, for high availability), which distributes the load across the different BaaS instances.

Note (added 2016-09-21): the traffic between the Auth0 nodes and the BaaS nodes is encrypted using TLS.

High level BaaS setup

Since the characterizing metrics of the BaaS service are very specific (e.g., requests/sec, CPU usage), we can do more granular capacity planning and implement automatic scaling to handle variable levels of load.
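A minimal sketch of collecting those two signals from inside a worker (the metric name, the 10-second window, and the plain console sink are our own illustration):

```javascript
// Sketch: sampling the two signals mentioned above, requests/sec and CPU
// load, from inside a worker. Metric name, window size and the console
// sink are illustrative.
const os = require('os');

let requestsInWindow = 0;

// Call this once per bcrypt message the worker handles.
function countRequest() {
  requestsInWindow += 1;
}

setInterval(() => {
  const reqPerSec = requestsInWindow / 10;
  requestsInWindow = 0;
  // 1-minute load average normalized by core count.
  const cpuLoad = os.loadavg()[0] / os.cpus().length;
  console.log(JSON.stringify({ service: 'baas', reqPerSec, cpuLoad }));
  // An autoscaling policy can add or remove workers based on these numbers.
}, 10000);

module.exports = { countRequest };
```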

Failing gracefully

Whenever a dependency on an external service is introduced, the possibility of failure has to be handled gracefully.


There are different, non-exclusive options, but the end goal of all of them is to prevent degrading the users' experience:

  1. Retrying the operation against the same or different destination
  2. Falling back to an alternative implementation
  3. Returning an error

In our case we went with option #2. If a failure occurs when trying to call BaaS, we perform the hash calculation locally. This is not ideal, but it gives our team time to figure out what is going on and get a fix in place.
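A sketch of that fallback logic, assuming a hypothetical baasClient with a compare operation and the npm bcrypt module as the local implementation:

```javascript
// Sketch: fall back to local hashing when the remote bcrypt service fails.
// baasClient is a hypothetical client for the BaaS workers; the npm bcrypt
// module is the local fallback implementation.
const bcrypt = require('bcrypt');

async function verifyPassword(baasClient, password, storedHash) {
  try {
    // Preferred path: offload the comparison to a BaaS worker.
    return await baasClient.compare(password, storedHash);
  } catch (err) {
    // Degraded path: compute locally so logins keep working while the
    // team figures out what is going on with BaaS.
    console.warn('BaaS unavailable, falling back to local bcrypt:', err.message);
    return bcrypt.compare(password, storedHash);
  }
}

module.exports = { verifyPassword };
```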

Nowadays

We are really happy with the outcome; BaaS has been running smoothly in production for some time now.

Since this is a critical component in our infrastructure, we frequently look for improvement opportunities. For example, we want to give Avro a try, as we have seen some performance improvements in internal tests.

The explanation of BaaS provided in this post is a simplified version. You can keep up with the latest changes in the GitHub repo.
