Privy Engineering


Updates to our list of excluded security issues

We’ve been excited to receive a number of vulnerability reports from security researchers all over the world since launching our security disclosure page earlier this year, and we’ve learned a lot about this process since we published it. Privy is a more secure platform today because of the many reports we received.

One thing that’s become clear is that we need to help researchers understand what we consider to be valid reports. The first version of our security page referred to two classes of excluded issues: XSS/Self-XSS and campaign asset information disclosure. We wanted to nip those in the bud, because we had been receiving a steady stream of such reports even before we had an official channel for reporting security issues. I’d like to briefly elaborate on why these two issue types are excluded:

XSS/Self-XSS: Many parts of the Privy dashboard expose advanced inputs that allow JavaScript code to be entered, and this code executes under various conditions. If a user enters malicious code in these inputs, then bad things can happen. However, this requires that an authorized user deliberately enter the code themselves, so it is not a security exploit. The only users the code can affect are other users on the account and the merchant’s own website visitors.

Campaign asset information disclosure: We occasionally receive security reports that campaign metadata or assets can be read under various unintended conditions (for example, getting a coupon code by inspecting network requests, instead of actually signing up for an offer with an email address). A merchant’s promotional campaigns are intended for public consumption, so we don’t make any guarantees about their privacy. Thus, exposure of assets like photos, coupon codes, and free downloads is not eligible for a security bounty or acknowledgment on our credits page.

These long-standing policies have served us well by allowing us to focus our time and energy on security vulnerability reports that are likely to pose the biggest threats to our merchants and our platform. Today, we’re updating the bounty exclusions with five new categories: theoretical attacks, brute force attacks, attacks requiring physical access, account email management, and clickjacking.

In short, we’ve found that reports that fall in these categories tend to offer low returns on our effort, and sometimes run counter to our business goals. For example, we’ve seen multiple times that closing the reported vulnerabilities required breaking third party integrations, or important scenarios for many of our customers. Many reports in these areas targeted a series of deliberate and well-considered design tradeoffs we made, not an oversight. In others, we considered the likelihood of the various attack scenarios, the severity of the issue, and the potential breadth of the impact, and concluded the risk was small enough to ignore (attacks that require gaining physical access to a victim’s device fall under this category).

There are valid scenarios in which combining multiple vulnerabilities can lead to actionable issues. Our goal is to evaluate those reports on a case-by-case basis, but we will generally not act on a straightforward vulnerability that falls under one of the categories listed above. We hope that these updated guidelines save time for both our team and researchers. Happy hunting.

Our Commitment to Candidates

Interviewing for a startup job can be grueling, confusing, and demoralizing even when the process is going smoothly.

Last year, we made an internal effort to adopt more of the practices that we, as candidates, would like to see. Now that we’re hiring again, we’re publicly committing to these standards:


  • We’re going to give all candidates the opportunity to present their best selves. We understand not all candidates work well under high pressure live coding exercises or whiteboarding problems. Therefore, every candidate will have the option to assemble code samples, personal project demos, open source contributions, or take home problems to present as part of a job application.
  • We’re going to remove some biases from our system – this includes anonymizing schools (and in some cases, prior companies) so we can focus on the training and experience that a candidate has.
  • We’ll do our best to personally respond to every applicant. This includes applicants that we decline to interview for an initial screen.
  • Once you begin interviewing, we’ll do our best to communicate at least once every 7 days, so you always know where you stand.

These commitments will consume time and energy. They will be harder to keep than to skip, and will therefore require conscious effort. At the end of the day, we want Privy to be known for being a great place to work and interview. We hope that this gives all candidates the confidence to apply.

November 25, 2016 Outage Postmortem

On Friday, November 25th, beginning at 1:32PM eastern US time, the platform suffered an outage lasting roughly 3 hours.

During this incident, the Privy dashboard was completely unavailable – including signing up for a new account, logging in to an existing account, and managing your campaigns.

A large proportion of campaigns (fluctuating between ~20% and ~70%) failed to load on our customers’ websites. When they did load, they often took up to 30 seconds to do so. Users could opt into these campaigns, but form submissions were slow to process and returned with an error message, even though the submissions were successful. Thank-you pages did not display on a successful signup.

Our email and contact sync systems were unaffected. All successful signups synced to their configured destinations, and emails (autoresponders and scheduled sequences) sent as usual to their recipients.

The proximate cause of this issue was that our database systems were overwhelmed. The engineering team at Privy made preparations for the Black Friday weekend, resulting in roughly 4 times the usual computing resources being available. However, there were a few unanticipated performance problems that became magnified under the stress. In addition, a bug caused by incompatible third party code resulted in a subset of our accounts unintentionally sending up to 20 times more activity data than they should have. Together, these issues generated a workload that our systems could not handle.

Privy engineering immediately investigated these issues and deployed emergency workarounds to restore full availability. By about 4:41PM, our systems began to recover. Today, I am happy to report that all of these workarounds have been removed, and that the identified performance issues have been addressed.

However, in our focus on solving the issue at hand, the engineering team initially failed to communicate the impact, expected time to resolution, and other important details of the incident to our customer support team. This resulted in a lack of details, contradictory information, and an overall frustrating experience for both our support team and customers.

Here are the things we have done to ensure this doesn’t happen again:

  • Updated our incident handling documentation to more quickly identify, communicate, and resolve common problems.
  • Changed our engineering roadmap to ensure that in the future, we can broadcast important news and status updates to our customers, instead of in one-on-one conversations.
  • Significantly improved key bottlenecks in our platform to handle more load concurrently.

Despite all our preparations, we fell short on one of the most important days of the year, and we’ll do everything we can to ensure that this doesn’t happen again. Thank you for using Privy.

Intercom Conversation Stats: an Open-Source Tool by Privy

This is a guest post by Andrew Knollmeyer.

Introducing Intercom Conversation Stats, a tool developed by Privy which is free for anyone to use! This app allows you to gather information about your conversations in Intercom and store it in a Google Sheets document on a regular basis. The provided build aggregates data on conversation tags, but it can be customized to work with any other data from your conversations as well.

Why Build it?

When we first started using Intercom, we were onboarding 5-10 new users per day. With our “all hands” approach to support, it was easy to quantify issues and feature requests that were coming up in support chats. But that didn’t last long.

As we continue to scale, we’re now onboarding 250+ new users per day, and chats have climbed to 30 or 40 per day. Along the way, it’s become a lot harder to quantify and truly internalize trends in user feedback as these numbers continue to grow.

This tool was designed in order for us to better evaluate which areas of our product need to be worked on. During development, we realized that the support for this kind of system was not all there, and that no one else had publicly released a tool which makes this sort of thing easy to build, so we decided to make it open-source as well.

How Does it Work?

Intercom Conversation Stats is a Rails app which integrates with the Google and Intercom APIs, and uses Sidekiq to automate all of its processes. First, a webhook is sent from Intercom to the application whenever a new conversation is created by one of your users. Intercom Conversation Stats then stores the ID of that conversation in a table, so that conversation’s data can be accessed later. By default, at the end of the week, the application pulls the tags from each of this week’s conversations and counts how many times each tag appeared. The count for each tag is then stored in its respective row in Google Sheets, along with the percent of tag mentions each tag accounted for.
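The weekly aggregation step described above can be sketched in a few lines. This is only an illustration written in JavaScript, not the app's actual Rails code; the shape of the `conversations` input and the `summarizeTags` name are assumptions made for the example.

```javascript
// Hypothetical input shape: each conversation carries an array of tag names.
function summarizeTags(conversations) {
  const counts = new Map();
  let total = 0;
  for (const conversation of conversations) {
    for (const tag of conversation.tags) {
      counts.set(tag, (counts.get(tag) || 0) + 1);
      total += 1;
    }
  }
  // One row per tag: its count and its share of all tag mentions.
  return [...counts].map(([tag, count]) => ({
    tag,
    count,
    percent: total === 0 ? 0 : (count / total) * 100,
  }));
}

const rows = summarizeTags([
  { tags: ['billing', 'bug'] },
  { tags: ['billing'] },
  { tags: [] },
  { tags: ['feature-request'] },
]);
console.log(rows);
```

Each element of `rows` corresponds to one spreadsheet row: the tag, how often it appeared this week, and the percentage of tag mentions it accounted for.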

How Can I Use it?

The GitHub repository can be accessed here. Instructions on setup and customization can be found in the README on that page. If you would like to request instructions for setup on a platform other than Heroku, feel free to contact us via

Building a BellBot

Ever feel like ringing a bell requires too much effort? Ever wish you could automate it to ring when something – like a sale – happens? If you responded “yes” to at least one of these questions, fret not. There is now a solution: BellBot.

This tutorial will guide you through building your own BellBot to automate bell ringing! The BellBot system is made of:

  1. A node.js server hosted on a local machine.
  2. An Arduino Uno connected to the machine via USB.
  3. A servo connected to the Arduino.
  4. An ordinary call bell.

Required Hardware

Arduino Uno

The Arduino Uno is a microcontroller board based on the ATmega328P chip. It’s a popular board due to its low cost, ease of use, open-source nature, and strong community. We will use it as the interface between our node.js server and the servo.


Servo

A servo is a device that has an output shaft whose angle can be precisely controlled through an input line. Our servo will be responsible for striking the bell. To do so we attach an arm to the servo’s output shaft and align its end with the bell, as seen in the video. Using a pencil for the servo’s arm is a fast and easy option for prototyping.

Breadboard & Jumper Cables

A small basic breadboard and some jumper cables are perfect for fast prototyping. In this project we’ll be using the power rails to provide common +5V and GND (ground) nodes.


Capacitors

We’ll be powering the system via USB in this project. Thus, adding smoothing capacitors should help stabilize the circuit and protect against current spikes. A 200uF capacitor works well.

Set up the circuit

Follow the diagram and schematic below to set up the circuit.

  1. Connect the servo’s Vin to the Arduino’s +5V pin.
  2. Connect the servo’s GND and any of the Arduino’s GND pins to a common ground.
  3. Connect the servo’s control line to one of the digital I/O pins on the Arduino. This tutorial uses pin 8.
  4. Add a capacitor or two in parallel to the +5V pin and the common GND to protect against possible current spikes. Remember that capacitors, unlike resistors, add up in parallel.
  5. Finally, connect the Arduino to your computer with a USB cable. This will power up the circuit and enable communication with the host computer.

Diagram for a USB-powered BellBot.

Schematic for a USB-powered BellBot.

Note: If you’d rather power the system through an external power supply, read Using an External Power Supply under the Notes section.

BellBot in action

Write the Arduino code

The Arduino code will consist of the StandardFirmata example included in the IDE. The Arduino IDE is the environment in which we write and upload sketch code to the Arduino Uno. You can download the IDE here.

  1. In the Arduino IDE, go to File > Examples > Firmata > StandardFirmata to open the standard Firmata sketch. The sketch implements a communication protocol between the Arduino and any host computer. With this protocol and the Johnny Five module, we can control the Arduino within node.js.
  2. Save the StandardFirmata sketch, then select your board from Tools > Board.
  3. Finally, upload the sketch to the Arduino.

Writing our own Arduino sketch to control the servo is also an option, but we would lose the abstraction and simplicity of controlling the hardware within node.js. Check out Sketch code under the Notes section if you’re interested in writing your own Arduino code.

Write the node.js code

Install node.js

We will be using node.js as our local web server. It will listen for incoming POST requests to the /ring route and relay them to the Arduino. Node.js describes itself as “a JavaScript runtime built on Chrome’s V8 JavaScript engine. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient. Node.js’ package ecosystem, npm, is the largest ecosystem of open source libraries in the world.” You can read more about it here.

  1. If you haven’t already, download Node.js here.

Install project dependencies

  1. Once node.js is installed, run the following command in your project’s directory to create a new package.json file. This file maintains info about the project and its dependencies.
    npm init
  2. Next, install the following node modules into the auto-generated local node_modules folder. By including the --save option in the following commands, npm updates package.json automatically.
    express – A popular framework we will use to set up a web server.
    npm install express --save
    body-parser – A tool used by express to parse requests’ bodies
    npm install body-parser --save
    ngrok – A tool to tunnel localhost to a public URL hosted by ngrok
    npm install ngrok --save
    johnny-five – A vast library to control the Arduino and hardware within node.js
    npm install johnny-five --save

If you cloned the latest BellBot code, you only need to run,

npm install

npm will look at the package.json file and install all the required modules.

Write args.js

This module parses command line arguments that we supply. Optional arguments include:

  • proto – the application protocol to use. Default: ‘http’
  • port – the networking port number for the server. Default: 3000
  • start – the starting and resting angle of the servo arm. Default: 75.
  • strike – the angle of the servo arm when striking the bell. Default: 90
  • pin – the Arduino output pin that controls the servo’s angle. Default: 8
  • sdelay – the amount of time in milliseconds to wait before resetting the servo’s arm after it has struck the bell. Default: 500.
  • pdelay – the amount of time in milliseconds to wait before processing the next command. Default: 1000
  • key – the optional key to authenticate requests with. Default: null

If the BELLBOT_KEY environment variable exists and no key is supplied in the arguments, the server will authenticate incoming requests using the BELLBOT_KEY. We explain how to set up a BELLBOT_KEY later.
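As a rough idea of what such a module could look like, here is a minimal sketch. It assumes arguments are passed as name=value tokens (for example port=3001). The `parseArgs` name, the `DEFAULTS` table, and the choice to silently ignore unknown options are assumptions for illustration; the actual BellBot args.js may differ.

```javascript
// Defaults taken from the option list above.
const DEFAULTS = {
  proto: 'http',
  port: 3000,
  start: 75,
  strike: 90,
  pin: 8,
  sdelay: 500,
  pdelay: 1000,
  key: null,
};

// Options whose values are parsed as integers.
const NUMERIC = new Set(['port', 'start', 'strike', 'pin', 'sdelay', 'pdelay']);

// Parse tokens such as ['port=3001', 'pin=9'] into a settings object.
function parseArgs(argv, env) {
  const args = Object.assign({}, DEFAULTS);
  for (const token of argv) {
    const eq = token.indexOf('=');
    if (eq === -1) continue; // not a name=value token
    const name = token.slice(0, eq);
    const value = token.slice(eq + 1);
    if (!Object.prototype.hasOwnProperty.call(DEFAULTS, name)) continue;
    if (NUMERIC.has(name)) {
      const n = parseInt(value, 10);
      if (!Number.isNaN(n)) args[name] = n; // keep the default on bad input
    } else {
      args[name] = value;
    }
  }
  // Fall back to BELLBOT_KEY only when no key argument was supplied.
  if (args.key === null && env.BELLBOT_KEY) {
    args.key = env.BELLBOT_KEY;
  }
  return args;
}

const args = parseArgs(process.argv.slice(2), process.env);
console.log(args);
```

Passing `process.env` in as a parameter, rather than reading it inside the function, keeps the parser easy to exercise with made-up environments.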

Write config.js

This module creates an object that describes the BellBot’s behavior. It uses our args.js module to override default settings. Note that the URL is initially blank. We later set this URL to the unique public URL that ngrok supplies us.

Write servo_controller.js

This module provides a constructor for a servo controller. Note that the controller’s timeout property corresponds to the processing delay – that is, the delay between sending requested commands.
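The following is one possible shape for such a constructor, not the actual BellBot module. It assumes the servo object exposes a `to(angle)` method (johnny-five's Servo does), that ring requests are queued and processed one at a time, and that sdelay is shorter than pdelay. The `schedule` parameter is a hypothetical hook that defaults to `setTimeout`.

```javascript
// servo: any object with a to(angle) method.
// options: { start, strike, sdelay, pdelay } as described in args.js.
// schedule: optional replacement for setTimeout (hypothetical testing hook).
function ServoController(servo, options, schedule) {
  this.servo = servo;
  this.start = options.start;    // resting angle
  this.strike = options.strike;  // angle that hits the bell
  this.sdelay = options.sdelay;  // ms before the arm returns to rest
  this.timeout = options.pdelay; // ms between processing commands
  this.schedule = schedule || setTimeout;
  this.pending = 0;              // ring requests waiting to be processed
  this.busy = false;
}

// Queue one ring; start processing if the controller is idle.
ServoController.prototype.ring = function () {
  this.pending += 1;
  if (!this.busy) this.processNext();
};

// Strike the bell once, reset the arm, then wait before the next command.
ServoController.prototype.processNext = function () {
  if (this.pending === 0) {
    this.busy = false;
    return;
  }
  this.busy = true;
  this.pending -= 1;
  const schedule = this.schedule;
  this.servo.to(this.strike);
  schedule(() => this.servo.to(this.start), this.sdelay);
  schedule(() => this.processNext(), this.timeout);
};
```

Queuing requests means a burst of incoming POSTs produces evenly spaced rings instead of the arm being told to strike again while it is still mid-swing.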

Write index.js

This is the main entry point of the app. The file:

  1. Requires all the needed components.
  2. Connects to the Arduino.
  3. Starts up an ngrok process.
  4. Starts the express server.

Add a .bellbotrc file

If you want to add some basic authentication to verify requests with a password, create a .bellbotrc file in your home directory with a specified BELLBOT_KEY:

  1. Create the .bellbotrc file:
    touch ~/.bellbotrc
  2. Add the following line to the .bellbotrc file:
    export BELLBOT_KEY=<your_password>
  3. In the command line, run
    source ~/.bellbotrc
    to source the newly added environment variable.

Next time you start your BellBot server, it will authenticate requests with that specified password. The incoming POST requests must have a key or token body parameter that matches the password.
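The check itself can be very small. The sketch below assumes the request body has already been parsed into an object (for example by body-parser); the `isAuthorized` name is hypothetical and the comparison is a plain string equality, which is enough for a toy project but not a hardened authentication scheme.

```javascript
// body: the parsed POST body; key: the configured password, or null.
function isAuthorized(body, key) {
  // No key configured means authentication is disabled.
  if (key === null || key === undefined) return true;
  // Accept the password in either the `key` or the `token` parameter.
  const supplied = (body && (body.key || body.token)) || '';
  return supplied === key;
}

console.log(isAuthorized({ token: 'my-secret-key' }, 'my-secret-key')); // true
console.log(isAuthorized({ token: 'wrong' }, 'my-secret-key'));         // false
```

A route handler would call this first and respond with an error status instead of ringing the bell when it returns false.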

Test it out

Boot up the BellBot by running

node index.js

in the command line.

Pass optional arguments if you wish. For example,
node index.js port=3001 pin=9 key=my-secret-key

will start the BellBot with:

  1. The server listening on port 3001
  2. The Arduino outputting on digital I/O pin 9
  3. Authentication using my-secret-key as the password

It is important to make sure BellBot is behaving properly before unleashing it upon the world. After starting your BellBot,

  1. Use curl or a program like Postman to build and send test POST requests. If you are authenticating requests, pass token: <BELLBOT_KEY> as a body parameter.
  2. Test locally by sending POST requests to localhost from your machine.
  3. Next, verify that the public address supplied by ngrok is hitting your local server by using the ngrok URL as the POST destination.

Make it a Slack-controlled BellBot

A neat way to automate bell-ringing is to use Slack. In Slack, you can add and configure outgoing webhooks. For example, let’s add a webhook that sends a customized POST request whenever a message beginning with :bell: comes in.

Creating a new Slack Outgoing WebHook.

To do this:

  1. Create a new outgoing webhook: Apps & Integrations > Manage > Custom Integrations > Outgoing WebHooks > Add Configuration.
  2. Select the Slack channel that the webhook will be listening in. For example, a “sales” or “bots” channel.
  3. Set the trigger word that will cause the POST request to be sent. For example, the :bell: emoji.
  4. Set the URL to the public ngrok URL that maps to your local server.
  5. If your BellBot is configured to require authentication, set the Token to your secret BELLBOT_KEY.

Now all that’s left is to make sure that the specified channel receives an automated message that triggers the webhook whenever a sale occurs.

A sales Slack channel that listens for incoming :bell: messages.



Notes

A servo uses error-sensing negative feedback to correct and maintain its shaft’s angle. The signal sent to the input is a PWM voltage signal. The width of the pulse is what controls the angle of the shaft. Check out this Seattle Robotics guide for more details.
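To make the pulse-width relationship concrete, here is a small sketch. It assumes a servo that maps 0 to 180 degrees linearly onto pulses of 1000 to 2000 microseconds, a common nominal range for hobby servos; the real limits vary by model, so treat the default values as placeholders and check your servo’s data sheet. The `pulseWidthMicros` name is made up for this example.

```javascript
// Convert a target angle in degrees to a pulse width in microseconds.
// minUs and maxUs are assumed nominal limits, not values from a data sheet.
function pulseWidthMicros(angle, minUs = 1000, maxUs = 2000) {
  const clamped = Math.min(180, Math.max(0, angle)); // stay within 0..180
  return minUs + (clamped / 180) * (maxUs - minUs);
}

console.log(pulseWidthMicros(90)); // 1500, the midpoint of the range
```

With Firmata and Johnny Five this conversion happens for you; the sketch only shows what the angle you request turns into on the control line.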

Using an External Power Supply

Servos are notorious for drawing lots of power. When using USB power only, the 5V output on the Arduino is what powers the servo. The high current drawn during the servo’s rotation may exceed the limits of the USB port (often ~500mA at 5V). When the current exceeds this threshold, the host computer may disable the port. Additionally, the Arduino may reset and behave erratically due to voltage sags. Check the servo’s data sheet for its idle, running, and stall currents. You want them to be low enough to be supplied via USB. Adding a capacitor in parallel to the 5V output and ground can help protect against these undesired effects.

To avoid these concerns altogether, follow the diagram and schematic below to set up an externally-powered BellBot.

  1. Power the servo and Arduino through an external power supply. Four 1.5V AA batteries provide ~6V, which should be enough to power both components reliably.
  2. Connect the voltage source’s output to the Vin input on the Arduino.
  3. Connect the voltage source’s output to the Vin of the servo.
  4. Connect the voltage source, servo, and Arduino to a common ground.
  5. Connect the servo’s control line to one of the digital I/O pins on the Arduino.
  6. Connect the Arduino to your computer using a USB cable. The USB connection’s role is reduced to serial communication. The Arduino will use the battery power as long as it’s switched on, even when the USB is connected. If you switch off the battery, the USB connection will power the Arduino but not the servo.
Diagram for BellBot running on an External Power Supply.

Schematic for BellBot running on an External Power Supply.

Sketch code

An alternative option for BellBot’s software is to write your own Arduino code. For example, the following sketch listens for the ‘R’ character to ring the bell.

The node.js server code would need refactoring. It would no longer require the whole Johnny Five module, and would use just the Serialport module to send data to the Arduino in a more barebones fashion.

Estimated cost

Item             Cost ($)
Arduino Uno         24.95
Feetech Servo       12.95
Breadboard           4.95
Jumper Pack          1.95
Capacitor Pack       6.95
Total               51.75

Next Steps

Enhance your BellBot by implementing these ideas:

Add more routes

More routes to your express server would expand the possible types of requests. For example, a /sweep route could cause the servo to begin sweeping. Or, if there are many bells and servos, a REST API could allow a user to control each servo.

Create a web interface

A web app or mobile phone app for this project could allow users to monitor or control the states of the servos. Admins or users could analyze or act on statistics and data logs of requests.

Use more hardware

LEDs could light up whenever the servo is working or a request comes in. A network of servos could handle different types of requests. Light sensors could control the operation state of the servos (on if the lights are on). These are just some ideas!

Build a black box

Rather than exposing the system as a prototype, you could build a professional PCB and enclosure to abstract away the system.

Use the Raspberry Pi

It should be straightforward to port this project to a Raspberry Pi. The node.js server would run on the Pi itself, with the servo connected to its GPIO pins. No USB connection would be necessary.


Excuses not to Test

At Privy, one of our values is pragmatism – and that means avoiding dogmas that are inflexible and impractical. So naturally, we don’t enforce test-driven-development or 100% unit test coverage, even if those things are valuable in the abstract.

OK, so Privy is “pragmatic” and therefore doesn’t require formal proofs of correctness and all-du-paths coverage to check in code, because it’s not cost effective. This is such a common reaction to such a widely accepted belief that it essentially conveys no information at all; outside of extraordinary operations (like NASA), no one operates like that anyway. So how do we determine what to test, and how much?


How do I know if it needs tests?

First, it helps to understand some models we use to form the basis of our intuitions. The most important is the Pareto Principle: the general observation that — most of the time — 80% of a result comes from 20% of the effort. This means we should concentrate our testing on areas that are most likely to make development more efficient, and prevent the majority of the really nasty bugs. Naturally, that means things like complicated but critical core classes, and modules that are emotionally fraught or have difficult-to-untangle side effects like billing and subscription management.

The second is the idea that testing is (among other things) a form of protection from downside risks – lost users/customers, negative press/brand value, bad builds that waste engineer time, etc. Like insurance, you want your protection to be proportionate to the amount of downside risk you are exposed to. A young startup with very few customers/revenue/engineers has, in absolute terms, very little downside. It should act accordingly. Sometimes, at Privy that means we have a lot of code that we haven’t gotten around to testing yet, or even consciously decide that it will not be tested for the foreseeable future.

Third, there is a lot of context to consider. How fault-tolerant are your users? How experienced is your team? Are these variables going to trend up or trend down over time? Generally, a less experienced engineer should write more tests, for the same reason an inexperienced driver should be more deliberate and unfailing in using turn signals. It will also have beneficial side effects, like making it obvious what code is hard to test, and therefore highly likely to be architecturally suspect. On the other end of the spectrum, if you are a star engineer tasked with inventing the modern internet and have six weeks to do it, you will probably decide to skip a unit test here or there.

“You mean you don’t have 95% code coverage as a minimum requirement?” some say indignantly, “That’s not software engineering – that’s hobbyist amateurism; I hope you’re just building toy projects.” Actually, our enterprise marketing platform powers hundreds of millions of monthly page views, and is growing at double digit percentages on slow months.

But enough of that. Now we have a high level mental model to use as a general framework for deciding whether tests will be useful; below, I’ve put together a list of some finer grained risk factor dimensions that may be useful for evaluating specific modules or classes.

What are the risk factors?

Note that many of these dimensions are not truly independent variables – erring on the side of completeness, there are sure to be some in this list that overlap or are causally related:

Maturity. How mature is the product?
Newer products, services, modules and classes can probably get away with less testing, as a result of the large amount of churn that is likely in the code, and the relatively lower downside risk – fewer users, downstream dependencies, etc.

Impact. If it were to fail, how bad would the result be?
Financial, reputational, or otherwise. Problems arising from defects that are hard or impossible to unwind (e.g., security compromises or data loss) deserve extra scrutiny. Maturity and impact go hand in hand – most young enterprises need to be focused on solving problems and building value, rather than protecting what little they have from downside risks.

Release Cadence. How quickly can an identified defect be resolved in production?
The faster you can deploy changes, the less risk overall that any given bug will have a material impact (some exceptions apply…think security). Continuous integration and deployment with highly effective test coverage is the surest way to have a fast release cadence, making defects in production less impactful.

Downstream dependencies. How many other modules/services depend on it?
More downstream dependencies means more risk, due to the coordination problem. It also implies your interfaces are stable and thus cheaper to test.

Upstream dependencies. How many other modules/services does it depend on?
More unstable upstream dependencies means more risk of breakage. If your dependencies are themselves not well tested, then testing “transitively” might be worthwhile.

Noisiness. If it were to fail, how soon would you notice?
Silent failures that take longer to detect deserve more scrutiny, because it’s usually harder to correct something the more time passes before it is discovered. Logs are rotated out, servers come out of service, repro steps are forgotten, etc.

Churn. How often is the code changing?
The more code is likely to change and be refactored, the more likely bugs will inadvertently be introduced. Conversely, the more interfaces are likely to change and be refactored, the more expensive tests will be to maintain, potentially tipping the cost/benefit equation.

Wrapping up

This is not meant to be an indictment of testing. Testing is — and should be — part of what it means to develop software professionally. This is more an attempt to formalize the real tradeoffs we’re balancing every day under heavy pressure, rather than providing a neat set of post-hoc rationalizations.

On the contrary, I think this helps highlight when we are making excuses instead of well-reasoned judgments. I’ve sometimes fallen into the trap of using one criterion from this list to argue for one approach or another, while ignoring another one that didn’t support my position. I hope that by laying out and curating this list over time, we’ll be able to make more balanced and consistent decisions around testing.

Database Concurrency, Part 2

This is part two of a series on database concurrency. Read the introduction at Database Concurrency, Part 1.

Last time, I talked about multi-version concurrency control, or MVCC, and how it enables highly concurrent database systems while still guaranteeing transaction isolation. This is because MVCC allows reality (from the perspective of two distinct transactions) to diverge, giving us the unique advantage that readers and writers don’t have to block each other. But how does it achieve this in practice, and what are the caveats?

Let me step back a bit and define some terms:

Transaction: A unit of work with well defined start and end boundaries, composed of a number of operations.

Isolation: The property that makes concurrent transactions appear as if they were executing serially.

Isolation in practice

Because the isolation property requires concurrent transactions to appear as if they are executing serially (one after the other), they must not interfere with each other by definition. Unfortunately, isolation in SQL is not a boolean property, but one of degree; some workloads don’t benefit from full isolation, which is pretty slow. Here are some common isolation levels:

Read uncommitted: I like to think of this as “effectively, no isolation” because dirty reads are allowed: uncommitted changes are globally visible. For example, you might have two transactions, T1 and T2:

T1: Subtract $50 from A
T1: Add $50 to B
T2: Read balances from A and B

In this case it’s possible T2 executes in the middle of T1, and finds that $50 has disappeared without explanation. This makes it nearly impossible to reason about anything under concurrency.
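The interleaving can be played out with a toy in-memory model. This is only an illustration of the anomaly, not how a database stores data, and the starting balances of $100 each are an assumption made for the example.

```javascript
// Shared, uncommitted state visible to every transaction.
const balances = { A: 100, B: 100 };

// The transfer transaction, split into its two writes.
const t1 = [
  () => { balances.A -= 50; },
  () => { balances.B += 50; },
];

// The reading transaction: sums both balances.
const t2 = () => balances.A + balances.B;

t1[0]();               // the transfer subtracts $50 from A
const seenByT2 = t2(); // the read runs mid-transfer and sees the dirty write
t1[1]();               // the transfer adds $50 to B
const afterT1 = t2();  // a read after the transfer completes

console.log(seenByT2, afterT1); // 150 200
```

From T2's point of view the total briefly dropped to $150, even though no money ever left the system.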

Read committed: This level disallows dirty reads, so a transaction can only see committed changes. However, it still allows non-repeatable reads, an anomaly in which a transaction re-reads data and finds that it has changed because another transaction committed changes in between the two reads.

Repeatable read: This level disallows non-repeatable read. Unfortunately, it still allows phantom read, a different phenomenon where the same SELECT might return a different set of rows.

Wait – “non-repeatable” and “phantom” reads?

These two anomalies might at first seem identical, but according to the ANSI SQL spec, a read is “repeatable” or “non-repeatable” at the row level. Repeatable reads guarantee the same row will always have the same data.

A “phantom” read is a phenomenon at the result set level. This means the set might be different, even if no single row has changed. The simplest case is when an INSERT is committed in another transaction, causing it to be returned in a new read query.

Confusion here is justified, because this particular definition of “non-repeatable read” is not obvious: repeatable reads do not guarantee repeatable results. And there are a lot of sources where “phantom read” is said to be a special case of “non-repeatable read,” including Wikipedia[1], in contradiction to the spec. Aren’t databases fun?

Sometimes you need a sledgehammer

Ideally, we’d have an isolation level where dirty read, non-repeatable read, and phantom read weren’t allowed. So where does this lead us?

The good news: it turns out there are actually two distinct isolation levels that guarantee this, snapshot isolation and serializable.

The bad news: disallowing all the above anomalies is not enough to guarantee serializability, which is why there are two. We used to think snapshot isolation would prevent read anomalies, until some folks proved it couldn’t. Also, both Oracle and PostgreSQL have at one point or another called snapshot isolation “serializable,” even though it isn’t, and the SQL spec itself seems to assume that preventing these three phenomena is equivalent to serializable isolation, even though this is easy to disprove.

So far this seems pedantic, so let’s look at an example. Imagine two empty tables A and B, and two concurrent transactions that insert a count of the rows in the other table:

INSERT INTO B SELECT count(*) FROM A;  -- T1
INSERT INTO A SELECT count(*) FROM B;  -- T2

If these two transactions run at the same time under snapshot isolation, they will both insert a single row with a 0. This makes sense, because each transaction has its own snapshot of the database, in which it sees the other table is empty. However, of the two possible serial orderings [T1, T2] and [T2, T1], neither of them is consistent with what actually happened. So these transactions are not serializable; but they also didn’t suffer any of the three anomalies we’ve defined so far. =(

We’re seeing a new anomaly not defined in the spec: write skew.
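To make this concrete, here is a small simulation of the two-table example. It assumes a very simplified model of snapshot isolation (each transaction reads from a private copy taken at start, and commits succeed because the writes touch different tables); the function names are hypothetical.

```python
import copy


def run_snapshot():
    """Both transactions run concurrently, each against its own snapshot."""
    db = {"A": [], "B": []}
    snap1 = copy.deepcopy(db)        # T1's snapshot
    snap2 = copy.deepcopy(db)        # T2's snapshot
    t1_value = len(snap1["A"])       # T1 counts rows in A
    t2_value = len(snap2["B"])       # T2 counts rows in B
    db["B"].append(t1_value)         # T1 commits its insert into B
    db["A"].append(t2_value)         # T2 commits its insert into A
    return db


def run_serial(order):
    """Run the same two transactions one after the other."""
    db = {"A": [], "B": []}
    for txn in order:
        if txn == "T1":
            db["B"].append(len(db["A"]))
        else:
            db["A"].append(len(db["B"]))
    return db


print(run_snapshot())              # {'A': [0], 'B': [0]}
print(run_serial(["T1", "T2"]))    # {'A': [1], 'B': [0]}
print(run_serial(["T2", "T1"]))    # {'A': [0], 'B': [1]}
```

The concurrent result matches neither serial ordering, which is exactly what makes it non-serializable.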

Enforcing isolation

The ANSI SQL spec doesn’t actually prescribe how to implement each isolation level; it only describes “prohibited phenomena.” In practice, most databases use a combination of version tracking and row locking. And most row locking implementations use some variation of reader-writer locks, which are common in many software systems. Readers don’t interfere with other readers, so they don’t block each other, but they block writers. Writers block readers, as well as each other.
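The reader-writer rules can be written as a single compatibility check. This is only a sketch of the decision logic, not a real lock: `can_grant` is a hypothetical helper, and there is no waiting or wake-up here.

```python
def can_grant(requested, held):
    """Decide whether a lock request is compatible with locks others hold.

    requested: "read" or "write"
    held: list of modes ("read" / "write") currently held by other transactions
    """
    if requested == "read":
        # Readers share with other readers, but not with a writer.
        return "write" not in held
    # Writers need exclusive access: no readers and no other writers.
    return len(held) == 0


print(can_grant("read", ["read", "read"]))   # True
print(can_grant("write", ["read"]))          # False
print(can_grant("read", ["write"]))          # False
print(can_grant("write", []))                # True
```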

Here is a table that shows when locks are needed at each isolation level, when using lock-based concurrency control:

| Isolation level  | Write Operation | Read Operation | Range Operation |
|------------------|-----------------|----------------|-----------------|
| Read Uncommitted | Statement       | Statement      | Statement       |
| Read Committed   | Until Commit    | Statement      | Statement       |
| Repeatable Read  | Until Commit    | Until Commit   | Statement       |
| Serializable     | Until Commit    | Until Commit   | Until Commit    |

As you might have guessed, stronger isolation guarantees need more locking, decreasing performance. Note that at the lowest level of isolation, we release locks after every statement, which is very performant, but as we’ve seen, extremely unsafe. At the high end of serializable isolation, every lock we acquire to execute a statement is held until commit. This strategy has a special name: two phase locking.

To guarantee serializability, two phase locking has (you guessed it) two phases: an acquiring phase (releasing locks is not allowed), and a releasing phase (acquiring new locks is not allowed). Strict two phase locking goes further, and waits until the very end to release all locks at once during commit, which is helpful for preventing cascading aborts, although it costs concurrency and is more prone to deadlocking.
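As a rough sketch, the two-phase rule comes down to one flag per transaction. The class below only tracks which phase a transaction is in; it does not block or detect deadlocks, and the names (`TwoPhaseTransaction`, `commit`) are made up for illustration.

```python
class TwoPhaseTransaction:
    """Tracks lock ownership and enforces the two-phase rule."""

    def __init__(self):
        self.held = set()
        self.shrinking = False   # False = acquiring phase, True = releasing phase

    def acquire(self, resource):
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot acquire after releasing")
        self.held.add(resource)

    def release(self, resource):
        # The first release ends the acquiring phase for good.
        self.shrinking = True
        self.held.discard(resource)

    def commit(self):
        # Strict 2PL: nothing is released early, everything goes at commit.
        self.shrinking = True
        released = sorted(self.held)
        self.held.clear()
        return released


txn = TwoPhaseTransaction()
txn.acquire("row:1")
txn.acquire("row:2")
print(txn.commit())   # ['row:1', 'row:2']
```

A transaction that only ever calls `acquire` and `commit` follows the strict variant; calling `release` early is still two-phase, but not strict.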

The big picture

The point of all this is that transaction isolation in MVCC is not easy to get right. Even at snapshot isolation — the highest[2] isolation level provided by some databases — it’s easy to run into bugs that will silently corrupt your data.

One mindset is to (stick your fingers in your ears and) 1) allow data to be randomly corrupted, or 2) forfeit performance by running your database under true serializable isolation. The other is to solve these problems at the application level, by deliberately structuring your transactions to avoid these problems.

Next up: How to work at read committed / repeatable read isolation and still prevent common anomalies at these levels!

[1] And some databases even treat “repeatable read” as equivalent to “serializable.”
[2] Technically though, snapshot isolation is not a superset of repeatable read, because SI allows anomalies that do not occur under RR. See “A5B Write Skew” in A Critique of ANSI SQL Isolation Levels.

Reactive Systems: Part 1 – An Overview

This post is Part 1 of our series on Reactive Systems.

At Privy, many of our services are fundamentally event-driven. Indeed, our core product value lies in helping merchants capture arbitrary user interactions and react to opportunities as they arise in a tangible and timely manner.

A key criterion for new components and systems at Privy is that they must be elastic. We must be able to out-scale our fastest growing merchants, if we are to continue to provide an acceptable level of service.

In addition to scalability, our systems must be fault tolerant or resilient in that failure of one component should not affect the overall integrity of our system.

Reactive Systems

Systems that are responsive, resilient, elastic and message-driven are also known as Reactive Systems. The Reactive Manifesto, a community distillation of best-practices, provides a concise vocabulary for discussing reactive systems.

Systems built as Reactive Systems are more flexible, loosely-coupled and scalable. This makes them easier to develop and amenable to change. They are significantly more tolerant of failure and when failure does occur they meet it with elegance rather than disaster. Reactive Systems are highly responsive, giving users effective interactive feedback.

Being Reactive in Practice

It is important to note that the Reactive philosophy is independent of any specific application layer; these general requirements can be realized throughout the stack.

Indeed, there are many open source frameworks that can be used to build Reactive systems. Front-end examples include Backbone.js, Facebook’s React, and Elm. These specific examples essentially handle input events and their effects as process networks.

A similar but distinct concept is the Actor model, which often arises in the context of highly concurrent and distributed background operations. The Actor model is a computational model that treats computational entities as primitives called actors. Actor behavior is defined by the ability to respond to messages received from other actors, the ability to send messages to other actors, and the ability to spawn new actors.
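The three abilities (respond, send, spawn) fit in a few lines. The sketch below is a deliberately tiny, single-threaded toy rather than how Akka, Celluloid, or Erlang work internally; `ActorSystem` and the two behaviors are hypothetical names.

```python
from collections import deque


class ActorSystem:
    def __init__(self):
        self.mailboxes = {}   # actor name -> deque of pending messages
        self.behaviors = {}   # actor name -> callable(system, name, message)

    def spawn(self, name, behavior):
        self.mailboxes[name] = deque()
        self.behaviors[name] = behavior
        return name

    def send(self, name, message):
        self.mailboxes[name].append(message)

    def run(self):
        """Deliver one message per actor per pass until all mailboxes are empty."""
        progressed = True
        while progressed:
            progressed = False
            for name in list(self.mailboxes):
                box = self.mailboxes[name]
                if box:
                    self.behaviors[name](self, name, box.popleft())
                    progressed = True


log = []


def logger(system, me, message):
    log.append(message)            # respond to a message


def greeter(system, me, message):
    child = system.spawn(me + "/child", logger)    # spawn a new actor
    system.send(child, "hello " + message)         # send to the child
    system.send("logger", "greeted " + message)    # send to an existing actor


system = ActorSystem()
system.spawn("logger", logger)
system.spawn("greeter", greeter)
system.send("greeter", "world")
system.run()
print(sorted(log))   # ['greeted world', 'hello world']
```

Actors never touch each other's state directly; everything goes through a mailbox, which is what makes the model suit concurrent and distributed work.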

Actors saw much success in Erlang, a language originally designed for building telecommunication systems. For use-cases more specifically related to web applications, two popular Actor-based frameworks are Akka and Celluloid.

Celluloid is the underlying actor system used in Sidekiq, a background task framework for Ruby. Sidekiq is an integral component of the Privy backend – most of our asynchronous Ruby behavior occurs within Sidekiq Workers.

We’re also in the early stages of deploying an Akka app for Business Intelligence, which happens to be the primary motivator for this series of blog posts.

Handling Streams of Data: Part 2

In the next post in this series, we’ll examine methods for dealing with streams of data in an asynchronous and reactive, specifically responsive and elastic, manner.

Database Concurrency, Part 1

This is part one of a series I’ll be writing about database concurrency. Since this first post is a broad overview, I have simplified many concepts here.

High performance databases must support concurrency. As in many other software systems, databases can use read/write locks to maintain consistency under concurrent use (MyISAM in MySQL does this, for example). Conceptually – this is pretty simple: 1) There can be multiple readers. 2) Readers block writers. 3) Writers block each other as well as readers.

Modern database concurrency control has taken this concept pretty far, and protocols like strict 2-phase locking can give you concurrency and strong serializability guarantees. However, any system that depends on this sort of concurrency control suffers an inherent scalability problem: read/write locks prevent readers and writers from running simultaneously – by design. As the volume of reads/writes scales up, you run into situations where you have unnecessary queueing, or either the readers or writers get starved. MyISAM for example prioritizes writes ahead of reads; get enough write volume and your reads will block forever, because write operations will perpetually “cut in line.”
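Starvation under write priority is easy to reproduce in a toy scheduler. This is a sketch under simplifying assumptions (one operation served per tick, a fixed arrival rate), not a model of MyISAM's actual lock manager; `simulate` is a hypothetical name.

```python
from collections import deque


def simulate(ticks, writes_per_tick):
    """Writer-priority scheduler that serves exactly one operation per tick."""
    reads = deque(["r0"])     # a single read has been waiting from the start
    writes = deque()
    served = []
    for tick in range(ticks):
        for i in range(writes_per_tick):
            writes.append(f"w{tick}.{i}")   # new writes keep arriving
        if writes:
            served.append(writes.popleft())  # writes always cut in line
        elif reads:
            served.append(reads.popleft())
    return served


print("r0" in simulate(ticks=1000, writes_per_tick=1))   # False: the read starves
print(simulate(ticks=3, writes_per_tick=0))              # ['r0']
```

As long as at least one write arrives per tick, the read is never served, no matter how long the simulation runs.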

There’s no easy solution here. Prioritize readers ahead of writers? Now you’re going to suffer the opposite problem [1]. Set up a read-only slave? Enjoy dealing with your replication lag. Sharding? That almost makes NoSQL look attractive.

Most reasonably large systems have a lot of read and write transactions going on at once, so it’s not something we can really sweep under the rug.


An interesting concurrency strategy in modern database systems is called multiversion concurrency control (MVCC). The fundamental insight behind MVCC is that readers and writers in different transactions will never block each other if we allow reality to briefly diverge and converge. Why is this useful?

  • Every transaction starts isolated from every other transaction, so they can all pretend they are the only ones running.
  • We can now perform multiple operations and commit them as an all-or-nothing operation, guaranteeing the operations succeed or fail together.
  • We can now read data from a consistent snapshot of the database, even as it continues to change in the background.

Allowing multiple versions of the database to exist simultaneously means we can provide all these guarantees under high concurrency. This is actually pretty incredible, if you think about it. But it’s not all rainbows and sunshine.
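One way to picture "multiple versions" is a store that never overwrites: each committed write appends a new version, and each reader picks the newest version no later than its snapshot. This is a minimal sketch with a hypothetical `VersionedStore`, using a simple counter as the commit timestamp; real engines differ considerably in the details.

```python
class VersionedStore:
    def __init__(self):
        self.clock = 0
        self.versions = {}   # key -> list of (commit_ts, value), oldest first

    def commit_write(self, key, value):
        """Append a new version instead of overwriting the old one."""
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        """A reader's snapshot is just the clock value when it started."""
        return self.clock

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before the snapshot."""
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None


store = VersionedStore()
store.commit_write("balance", 100)
snap = store.snapshot()               # a reader starts here
store.commit_write("balance", 50)     # a writer commits later
print(store.read("balance", snap), store.read("balance", store.snapshot()))  # 100 50
```

The earlier reader keeps seeing 100 without ever blocking the writer, and the writer never waits for the reader.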

“Eventual” consistency? No, sorry, we need a production database

Astute readers will probably realize supporting multiple consistent versions of reality complicates a lot of things that would otherwise be simple. For example, here are just a few complications to account for read queries:

  • You can no longer quickly count the rows in a table. This seems to make many people both confused and angry, because it seems so unbelievably simple. The reality is that MVCC is very, very complicated. Remember, any number of transactions could have made INSERTS or DELETES that are currently invisible to your SELECT query, so the actual count depends on what operations are visible to the current transaction. The database index is no silver bullet because you still have to find and ignore all those pesky invisible rows.
  • In addition, a DELETE statement doesn’t necessarily delete a row, even after you commit. What if there is another transaction that started when that row still existed, and hasn’t finished yet?
  • UPDATE statements don’t update – they write a new row. The old row has to be kept around for transactions that haven’t seen the update yet, or in case the transaction that wrote the UPDATE rolls back.
  • If you do anything involving range queries, such as SELECT * from accounts where balance > 1000, the database has to do all kinds of acrobatics with range locking, next-key locking, etc to ensure that gosh darnit, no insert or update operation in any other transaction can change this result until the transaction completes.
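The row-counting problem in particular can be sketched in a few lines. Assume, for illustration only, that each stored row version records when it was created and (optionally) deleted; `RowVersion` and `visible_count` are hypothetical names, and the timestamps are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    value: str
    created_at: int                    # commit timestamp of the INSERT
    deleted_at: Optional[int] = None   # commit timestamp of the DELETE, if any


def visible_count(rows, snapshot_ts):
    """COUNT(*) as seen by a transaction whose snapshot is snapshot_ts."""
    count = 0
    for row in rows:   # every stored version has to be inspected
        created = row.created_at <= snapshot_ts
        deleted = row.deleted_at is not None and row.deleted_at <= snapshot_ts
        if created and not deleted:
            count += 1
    return count


rows = [
    RowVersion("a", created_at=1),
    RowVersion("b", created_at=2, deleted_at=4),   # deleted, but still stored
    RowVersion("c", created_at=5),
]
print([visible_count(rows, ts) for ts in (0, 3, 4, 5)])   # [0, 2, 1, 2]
```

There is no single "row count" to cache: the answer depends on who is asking, and the deleted row still has to be stored and skipped.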

Which brings me to the elephant in the room: how to reconcile different versions of reality. Because eventually, you’re going to encounter the database equivalent of a merge conflict:

  • What happens if you try to UPDATE a row that a different transaction has updated with more recent data?
  • What is the most recent data? The most recent commit? The most recent uncommitted change?
  • How should constraint violations be handled? For example, what happens if two transactions try to claim the same unique username?

And the worst part of it is, under the default settings in Postgres and MySQL/InnoDB, these anomalies can silently corrupt your data with lost updates (an update gets reverted accidentally by a transaction that never knew about it) or write skew (two transactions read data and then write consistent updates that conflict when merged).
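A lost update takes only four steps to reproduce in a toy model. The sketch below assumes two transactions doing an unprotected read-modify-write on the same value; the counter and the increments are invented for illustration.

```python
def lost_update():
    db = {"counter": 10}

    t1_read = db["counter"]        # T1 reads 10
    t2_read = db["counter"]        # T2 reads 10, before T1 has written

    db["counter"] = t1_read + 5    # T1 writes 15 and commits
    db["counter"] = t2_read + 1    # T2 writes 11 and commits, undoing T1's +5

    return db["counter"]


print(lost_update())   # 11, whereas either serial order would give 16
```

Neither transaction sees an error; T1's update simply vanishes.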

Next up: the different transaction isolation levels available in MVCC. Update: read part 2 here.

[1] Yes, you could use some sort of fairness algorithm, but that still doesn’t solve the queueing problem.

How we sped up our background processing 150x

Performance has always been an obsession of mine. I enjoy the challenge of understanding why things take as long as they do. In the process, I often discover that there’s a way to make things faster by removing bottlenecks. Today I will go over some changes we recently made to Privy that resulted in our production application sending emails 150x faster per node!

Understanding the problem

When we started exploring performance in our email queueing system, all our nodes were near their maximum memory limit. It was clear that we were running as many workers as we could per machine, but the CPU utilization was extremely low, even when all workers were busy.

Anyone with experience will immediately recognize that this means these systems were almost certainly I/O bound. There are a couple of obvious ways to fix this. One is to perform I/O asynchronously. Since these were already supposed to be asynchronous workers, this didn’t seem intuitively like the right answer.

The other option is to run more workers. But how do you run more workers on a machine already running as many workers as can fit in memory?

Adding more workers

We added more workers per node by moving from Resque to Sidekiq. For those who don’t know, Resque is a process-based background queuing system. Sidekiq, on the other hand, is thread-based. This is important, because Resque’s design means a copy of the application code is duplicated across every one of its worker processes. If we wanted two Resque workers, we would use double the memory of a single worker (because of the copy-on-write nature of forked process memory in Linux, this isn’t strictly true, but it was quite close in our production systems due to the memory access patterns of our application and the Ruby runtime).

Making this switch to Sidekiq allowed us to immediately increase the number of workers per node by a factor of roughly 6x. All the Sidekiq workers are able to more tightly share operating system resources like memory, network connections, and database access handles.

How did we do?

This one change resulted in a performance change of nearly 30x (as in, 3000% as fast).

Wait, what?

Plot twist!

How did running more workers also result in a performance increase of roughly 5x per worker? I had to do some digging. As it turns out, there are a number of things that make Resque workers slower:

  • Each worker process forks a child process before starting each job. This takes time, even on a copy-on-write system like Linux.
  • Then, since there are now two processes sharing the same connection to Redis, the child has to reopen the connection.
  • Now, the parent will have to wait on the child process to exit before it can check the queue for the next job to do.

When we compounded all of these across every worker, it turns out these were, on average, adding a multiple-seconds-long penalty to every job. There is almost certainly something wrong here (and no, it wasn’t paging). I’m sure this could’ve been tuned and improved, but I didn’t explore since it was moot at this point anyway.

Let’s do better – with Computer Science™

In the course of rewriting this system, we noticed some operations were just taking longer than felt right. One of these was the scheduling system: we schedule reminder emails to be sent out in Redis itself, inserting jobs into a set that is sorted by time. Sometimes things happen that require removing scheduled emails (for example, if the user performs the action we were trying to nudge them to do).

While profiling the performance of these email reminders, I noticed an odd design: whenever the state of a claimed offer changes (including an email being sent), all related scheduled emails are removed and re-inserted (based on what makes sense for this new state). Obviously, this is a good way to make sure that anything unnecessary is removed without having to know what those things are. I had a hunch: If the scheduled jobs are sorted by time, how long would it take to find jobs that aren’t keyed on time?

O(n). Whoops!

It turns out that the time it took to send an email depended linearly on how many emails were waiting to be sent. This is not a recipe for high scalability.
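The asymmetry shows up even in a plain sorted list. The sketch below is not Redis or Sidekiq code; it assumes a schedule of hypothetical `(run_at, offer_id)` tuples with integer ids, sorted by time, and compares a time lookup with a lookup by offer.

```python
import bisect


def due_jobs(schedule, now):
    """Jobs with run_at <= now. Binary search works because of the time order."""
    cut = bisect.bisect_right(schedule, (now, float("inf")))
    return schedule[:cut]


def remove_for_offer(schedule, offer_id):
    """The time order says nothing about offer_id, so every entry is examined."""
    examined = 0
    kept = []
    for job in schedule:
        examined += 1
        if job[1] != offer_id:
            kept.append(job)
    return kept, examined


schedule = [(t, t % 3) for t in range(10)]     # made-up jobs for offers 0, 1, 2
print(len(due_jobs(schedule, 4)))              # 5
kept, examined = remove_for_offer(schedule, 1)
print(len(kept), examined)                     # 7 10
```

The `examined` count always equals the queue length, which is the linear dependence described above.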

We did some work to never remove scheduled jobs out of order – instead, scheduled jobs check their validity during runtime and no-op if there is nothing to do. Since no operations depend linearly on the size of the queue any more, it’s a much more scalable design.
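A sketch of that "check validity at runtime" idea, using an in-memory heap instead of Redis. The class, its method names, and the `expected_state` field are assumptions made for illustration, not Privy's actual job code.

```python
import heapq


class ReminderScheduler:
    def __init__(self):
        self.heap = []     # entries: (run_at, seq, offer_id, expected_state)
        self.state = {}    # offer_id -> current state
        self.seq = 0       # tie-breaker so equal run_at values pop in insert order

    def schedule(self, run_at, offer_id, expected_state):
        heapq.heappush(self.heap, (run_at, self.seq, offer_id, expected_state))
        self.seq += 1

    def set_state(self, offer_id, state):
        # A state change touches nothing in the queue; stale jobs stay put.
        self.state[offer_id] = state

    def run_due(self, now):
        sent = []
        while self.heap and self.heap[0][0] <= now:
            _, _, offer_id, expected = heapq.heappop(self.heap)
            if self.state.get(offer_id) == expected:
                sent.append(offer_id)   # still valid: send the reminder
            # otherwise the job is stale and we no-op
        return sent


s = ReminderScheduler()
s.set_state(1, "claimed")
s.set_state(2, "claimed")
s.schedule(10, 1, "claimed")
s.schedule(10, 2, "claimed")
s.set_state(2, "redeemed")    # offer 2 no longer needs a reminder
print(s.run_due(now=10))      # [1]
```

Jobs are only ever removed from the front of the queue, so the cost of a state change no longer depends on how many emails are waiting.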

By making this change, we saw an increase in performance of more than 5x in production.

Summing up

  • Moving from process-based to thread-based workers: ~6x more workers per node.
  • Moving from forking workers to non-forking workers: 5x faster.
  • Removing O(n) operations from the actual email send job: 5x faster.
  • Total speedup: Roughly 150x performance improvement.
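The arithmetic behind the total, assuming the three gains are independent and therefore multiply (the variable names are made up; the factors are the rounded figures quoted above):

```python
more_workers_per_node = 6   # process-based -> thread-based workers
non_forking_speedup = 5     # forking -> non-forking workers
no_linear_scan_speedup = 5  # O(n) removal taken out of the send job

after_sidekiq = more_workers_per_node * non_forking_speedup
total = after_sidekiq * no_linear_scan_speedup
print(after_sidekiq, total)   # 30 150
```

The intermediate 30x is the figure reported for the Sidekiq switch alone.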