Performance has always been an obsession of mine. I enjoy the challenge of understanding why things take as long as they do. In the process, I often discover that there's a way to make things faster by removing bottlenecks. Today I will go over some changes we recently made to Privy that resulted in our production application sending emails 150x faster per node!
Understanding the problem
When we starting exploring performance in our email queueing system, all our nodes were near their maximum memory limit. It was clear that we were running as many workers as we could per machine, but the CPU utilization was extremely low, even when all workers were busy.
Anyone with experience will immediately recognize that this means these systems were almost certainly I/O bound. There's a couple obvious ways to fix this. One is to perform I/O asynchronously. Since these were already supposed to be asynchronous workers, this didn't seem intuitively like the right answer.
The other option is to run more workers. But how do you run more workers on a machine already running as many workers as can fit in memory?
Adding more workers
We added more workers per node by moving from Resque to Sidekiq. For those who don't know, Resque is a process-based background queuing system. Sidekiq, on the other hand, is thread-based. This is important, because Resque's design means a copy of the application code is duplicated across every one of its worker processes. If we wanted two Resque workers, we would use double the memory of a single worker (because of the copy-on-write nature of forked process memory in linux, this isn't strictly true, but it was quite close in our production systems due to the memory access patterns of our application and the ruby runtime).
Making this switch to Sidekiq allowed us to immediately increase the number of workers per node by a factor of roughly 6x. All the Sidekiq workers are able to more tightly share operating system resources like memory, network connections, and database access handles.
How did we do?
This one change resulted in a performance change of nearly 30x (as in, 3000% as fast).
How did running more workers also result in a performance increase of 500% per worker? I had to do some digging. As it turns out, there's a number of things that make Resque workers slower:
- Each worker process forks a child process before starting each job. This takes time, even on a copy-on-write system like linux.
- Then, since there are now two processes sharing the same connection to redis, the child has to reopen the connection.
- Now, the parent will have to wait on the child process to exit before it can check the queue for the next job to do.
When we compounded all of these across every worker, it turns out these were, on average, adding a multiple-seconds-long penalty to every job. There is almost certainly something wrong here (and no, it wasn't paging). I'm sure this could've been tuned and improved, but I didn't explore since it was moot at this point anyway.
Let's do better - with Computer ScienceTM
In the course of rewriting this system, we noticed some operations were just taking longer than felt right. One of these was the scheduling system: we schedule reminder emails to be sent out in redis itself, inserting jobs into a set that is sorted by time. Sometimes things happen that require removing scheduled emails (for example, if the user performs the action we were trying to nudge them to do).
While profiling the performance of these email reminders, I noticed an odd design: whenever the state of a claimed offer changes (including an email being sent), all related scheduled emails are removed and re-inserted (based on what makes sense for this new state). Obviously, this is a good way to make sure that anything unnecessary is removed without having to know what those things are. I had a hunch: If the scheduled jobs are sorted by time, how long would it take to find jobs that aren't keyed on time?
It turns out that the time it took to send an email depended linearly on how many emails were waiting to be sent. This is not a recipe for high scalability.
We did some work to never remove scheduled jobs out of order - instead, scheduled jobs check their validity during runtime and no-op if there is nothing to do. Since no operations depend linearly on the size of the queue any more, its a much more scalable design.
By making this change, we saw an increase in performance of more than 5x in production.
- Moving from process-based to thread-based workers: ~6x more workers per node.
- Moving from forking workers to non-forking workers: 5x faster.
- Removing O(n) operations from the actual email send job: 5x faster.
- Total speedup: Roughly 150x performance improvement.