Privy Engineering

Building Privy.com

Database Concurrency, Part 1

This is part one of a series I’ll be writing about database concurrency. Since this first post is a broad overview, I have simplified many concepts here.

High performance databases must support concurrency. As in many other software systems, databases can use read/write locks to maintain consistency under concurrent use (MyISAM in MySQL does this, for example). Conceptually, this is pretty simple: 1) there can be multiple simultaneous readers, 2) readers block writers, and 3) writers block other writers as well as readers.
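
To see those blocking rules in action outside of a database, here is a minimal Ruby sketch using the ReadWriteLock from the concurrent-ruby gem. This is purely illustrative and not from any of the systems discussed here:

    require "concurrent"

    lock  = Concurrent::ReadWriteLock.new
    table = []

    # Any number of readers can hold the lock at the same time...
    readers = 4.times.map do
      Thread.new { lock.with_read_lock { table.size } }
    end

    # ...but a writer is exclusive: it waits for current readers to finish and
    # blocks new readers (and other writers) until it is done.
    writer = Thread.new { lock.with_write_lock { table << "row" } }

    (readers + [writer]).each(&:join)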

Modern database concurrency control has taken this concept pretty far, and protocols like strict two-phase locking can give you both concurrency and serializability guarantees. However, any system that depends on this sort of concurrency control has an inherent scalability problem: read/write locks prevent readers and writers from running simultaneously, by design. As read and write volume scales up, you run into unnecessary queueing, or either the readers or the writers get starved. MyISAM, for example, prioritizes writes ahead of reads; get enough write volume and your reads will block forever, because write operations will perpetually “cut in line.”

There’s no easy solution here. Prioritize readers ahead of writers? Now you’re going to suffer the opposite problem [1]. Set up a read-only slave? Enjoy dealing with your replication lag. Sharding? That almost makes NoSQL look attractive.

Most reasonably large systems have a lot of read and write transactions going on at once, so it’s not something we can really sweep under the rug.

MVCC is MVP

An interesting concurrency strategy in modern database systems is called multiversion concurrency control (MVCC). The fundamental insight behind MVCC is that readers and writers in different transactions will never block each other if we allow reality to briefly diverge and converge. Why is this useful?

  • Every transaction starts isolated from every other transaction: each one can pretend it is the only one running.
  • We can now perform multiple operations and commit them as an all-or-nothing operation, guaranteeing the operations succeed or fail together.
  • We can now read data from a consistent snapshot of the database, even as it continues to change in the background.
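
To make those guarantees concrete, here is a rough sketch of what they look like from application code, assuming a Rails-style Account model with a balance column and a Postgres backend (the model and column names are hypothetical):

    # All-or-nothing: both writes become visible together at commit, or neither does.
    payer = Account.find(1)
    payee = Account.find(2)
    Account.transaction do
      payer.update!(balance: payer.balance - 100)
      payee.update!(balance: payee.balance + 100)
    end

    # Consistent snapshot: under REPEATABLE READ, both sums read the same version
    # of the table, even if other transactions commit new transfers in between.
    Account.transaction(isolation: :repeatable_read) do
      Account.sum(:balance)   # snapshot established by the first query
      sleep 5                 # concurrent commits can happen now...
      Account.sum(:balance)   # ...but this still returns the same total
    end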

Allowing multiple versions of the database to exist simultaneously means we can provide all these guarantees under high concurrency. This is actually pretty incredible, if you think about it. But it’s not all rainbows and sunshine.

“Eventual” consistency? No, sorry, we need a production database

Astute readers will probably realize that supporting multiple consistent versions of reality complicates a lot of things that would otherwise be simple. For example, here are just a few of the complications that affect even ordinary read queries:

  • You can no longer quickly count the rows in a table. This makes many people confused and angry, because counting rows seems unbelievably simple. But remember: any number of transactions could have made INSERTs or DELETEs that are currently invisible to your SELECT query, so the actual count depends on which operations are visible to the current transaction (see the sketch after this list). An index is no silver bullet either, because the database still has to find and skip all those pesky invisible rows.
  • In addition, a DELETE statement doesn’t necessarily delete a row, even after you commit. What if there is another transaction that started when that row still existed, and hasn’t finished yet?
  • UPDATE statements don’t update rows in place; each UPDATE writes a new version of the row. The old version has to be kept around for transactions that haven’t seen the update yet, or in case the transaction that wrote the UPDATE rolls back.
  • If you do anything involving range queries, such as SELECT * FROM accounts WHERE balance > 1000, the database has to do all kinds of acrobatics with range locking, next-key locking, etc. to ensure that gosh darnit, no insert or update operation in any other transaction can change this result until the transaction completes.
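
As promised in the first bullet, here is a rough sketch of why a “simple” count depends on who is asking. It assumes the same hypothetical Account model and Postgres defaults, with the two concurrent sessions shown as comments:

    # Session A
    Account.count                               # => 100

    # Session B, running at the same time
    Account.transaction do
      Account.create!(email: "new@example.com")
      Account.count                             # => 101, but only inside here
      sleep 30                                  # ...still not committed
    end

    # Session A, while B's transaction is still open
    Account.count                               # => still 100: B's new row exists
                                                #    on disk but is invisible to A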

Which brings me to the elephant in the room: how to reconcile different versions of reality. Because eventually, you’re going to encounter the database equivalent of a merge conflict:

  • What happens if you try to UPDATE a row that a different transaction has updated with more recent data?
  • What is the most recent data? The most recent commit? The most recent uncommitted change?
  • How should constraint violations be handled? For example, what happens if two transactions try to claim the same unique username?

And the worst part is that, under the default settings in Postgres and MySQL/InnoDB, these anomalies can silently corrupt your data with lost updates (an update gets overwritten by a transaction that never saw it) or write skew (two transactions read overlapping data, then each writes an update that is valid on its own but breaks an invariant when combined).
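
For example, here is a sketch of the classic lost update, again with a hypothetical Account model, when two processes run the same read-modify-write concurrently under Postgres’s default READ COMMITTED level:

    # Both processes execute this at the same time:
    Account.transaction do
      account = Account.find(1)                        # both read balance = 100
      account.update!(balance: account.balance + 10)   # both write 110
    end
    # Final balance: 110, not 120. One increment was silently lost.

    # One common fix: take a row lock so the second reader waits for the first commit.
    Account.transaction do
      account = Account.lock.find(1)                   # SELECT ... FOR UPDATE
      account.update!(balance: account.balance + 10)
    end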

Next up: the different transaction isolation levels available in MVCC. Update: read part 2 here.

[1] Yes, you could use some sort of fairness algorithm, but that still doesn’t solve the queueing problem.

How we sped up our background processing 150x

Performance has always been an obsession of mine. I enjoy the challenge of understanding why things take as long as they do. In the process, I often discover that there’s a way to make things faster by removing bottlenecks. Today I will go over some changes we recently made to Privy that resulted in our production application sending emails 150x faster per node!

Understanding the problem

When we started exploring performance in our email queueing system, all our nodes were near their maximum memory limit. It was clear that we were running as many workers as we could per machine, but CPU utilization was extremely low, even when all workers were busy.

Anyone with experience will immediately recognize that this means these systems were almost certainly I/O bound. There are a couple of obvious ways to fix this. One is to perform I/O asynchronously. Since these were already supposed to be asynchronous workers, that didn’t intuitively seem like the right answer.

The other option is to run more workers. But how do you run more workers on a machine already running as many workers as can fit in memory?

Adding more workers

We added more workers per node by moving from Resque to Sidekiq. For those who don’t know, Resque is a process-based background queueing system. Sidekiq, on the other hand, is thread-based. This is important, because Resque’s design means a copy of the application code is duplicated across every one of its worker processes. If we wanted two Resque workers, we would use double the memory of a single worker (because of the copy-on-write nature of forked process memory in Linux, this isn’t strictly true, but it was quite close in our production systems due to the memory access patterns of our application and the Ruby runtime).

Making this switch to Sidekiq allowed us to immediately increase the number of workers per node by roughly 6x. All the Sidekiq workers are able to more tightly share operating system resources like memory, network connections, and database connection handles.
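
For context, the worker code itself barely changes between the two libraries; what changes is how it runs. Here is a rough sketch with a made-up email job (the class and queue names are hypothetical; the APIs are each library’s standard ones):

    # Resque: a plain class with a queue and a class-level perform method;
    # each job is processed by a freshly forked child process.
    class SendEmailJob
      @queue = :emails

      def self.perform(email_id)
        # ... load the email and send it ...
      end
    end

    # Sidekiq: the same job as a thread-based worker, with all worker threads
    # living inside one long-lived process that loads the application code once.
    require "sidekiq"

    class SendEmailWorker
      include Sidekiq::Worker
      sidekiq_options queue: :emails

      def perform(email_id)
        # ... load the email and send it ...
      end
    end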

How did we do?

This one change resulted in a performance improvement of nearly 30x (as in, 3000% as fast).

Wait, what?

Plot twist!

How did running more workers also result in a roughly 5x speedup per worker? I had to do some digging. As it turns out, there are a number of things that make Resque workers slower:

  • Each worker process forks a child process before starting each job. This takes time, even on a copy-on-write system like Linux.
  • Then, since there are now two processes sharing the same connection to Redis, the child has to reopen the connection.
  • Now the parent has to wait for the child process to exit before it can check the queue for the next job.

Compounded across every worker, these costs were, on average, adding a multi-second penalty to every job. There is almost certainly something wrong here (and no, it wasn’t paging). I’m sure this could’ve been tuned and improved, but I didn’t dig further since it was moot at this point anyway.

Let’s do better – with Computer Science™

In the course of rewriting this system, we noticed some operations were just taking longer than felt right. One of these was the scheduling system: we schedule reminder emails in Redis itself, inserting jobs into a set that is sorted by send time. Sometimes things happen that require removing scheduled emails (for example, if the user performs the action we were trying to nudge them to do).

While profiling the performance of these email reminders, I noticed an odd design: whenever the state of a claimed offer changes (including an email being sent), all related scheduled emails are removed and re-inserted (based on what makes sense for this new state). Obviously, this is a good way to make sure that anything unnecessary is removed without having to know what those things are. I had a hunch: If the scheduled jobs are sorted by time, how long would it take to find jobs that aren’t keyed on time?

O(n). Whoops!

It turns out that the time it took to send an email depended linearly on how many emails were waiting to be sent. This is not a recipe for high scalability.
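
To make the shape of the problem concrete, here is a simplified sketch using the redis gem (the key name and payload format are made up for illustration). Scheduling is cheap; it is the cancel path that has to walk the whole set:

    require "json"
    require "redis"

    redis = Redis.new

    # Scheduling: a sorted-set insert keyed by send time is O(log n).
    payload = { worker: "ReminderEmailWorker", offer_id: 42 }.to_json
    redis.zadd("schedule", Time.now.to_f + 3600, payload)

    # Cancelling every reminder for offer 42 is not: the set is sorted by time,
    # not by offer, so the only way to find the jobs is to scan all n of them.
    redis.zscan_each("schedule") do |member, _score|
      redis.zrem("schedule", member) if JSON.parse(member)["offer_id"] == 42
    end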

We did some work to never remove scheduled jobs out of order. Instead, scheduled jobs check their own validity at run time and no-op if there is nothing to do. Since no operation depends linearly on the size of the queue any more, it’s a much more scalable design.
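
In code, the fix is roughly this shape (hypothetical model, method, and mailer names): the scheduled job stays in the queue no matter what, and decides for itself at run time whether it still applies.

    require "sidekiq"

    class ReminderEmailWorker
      include Sidekiq::Worker

      def perform(claimed_offer_id)
        offer = ClaimedOffer.find_by(id: claimed_offer_id)

        # The offer may have been deleted, redeemed, or otherwise moved on since
        # this reminder was scheduled. If so, exit cheaply instead of requiring
        # anyone to dig the job out of the sorted set ahead of time.
        return if offer.nil? || !offer.reminder_still_needed?

        ReminderMailer.nudge(offer).deliver_now
      end
    end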

By making this change, we saw an increase in performance of more than 5x in production.

Summing up

  • Moving from process-based to thread-based workers: ~6x more workers per node.
  • Moving from forking workers to non-forking workers: 5x faster.
  • Removing O(n) operations from the actual email send job: 5x faster.
  • Total speedup: Roughly 150x performance improvement.