Privy Engineering

Building Privy.com

Week one report – Reef Loretto

Reef Loretto joined the Privy engineering team in August. We ask for a new engineer’s observations as part of their onboarding process, and Reef submitted his in essay format, so we decided to post it here. It has been lightly edited for the audience.

This week I started a new job at Privy. Already, there are lots of things I’ve noticed which make me incredibly excited to work, learn, and grow with the team. First among these is the clear and visible value placed upon a smooth and enjoyable onboarding process. On day one, I came to my desk and was able to go from “zero” to “functional dev environment” before lunch. The team maintains carefully written onboarding documentation, which includes a very useful bash script to get a Docker environment up and running. The script ran with no issues, and then all it took was a simple `docker-compose up` to get the entire application running locally. I learned immediately that the Docker configuration greatly reduces the pain of trying to create a local environment that resembles staging and production (which was a significant cause of stress in previous projects/teams).

On my last co-op, which is the only other professional engineering experience I’ve had up to this point, this same process of getting “up and running” was prolonged unnecessarily: I didn’t initially have admin privileges on my machine (and as a result couldn’t install anything properly); there was a process of “purchasing” software through an internal sales portal that required a sequence of approvals (any one of which could turn into a multi-day bottleneck); and similar things which I presume were the result of scale and company maturity. Thankfully, I am not faced with such issues currently.

I got code pushed to `master` on day one. Yes, it was just one line, but the emphasis placed on jumping right into the team’s development flow is great; and getting a really trivial PR in on day one is almost like a greeting (“helloooo everyone I made it 👋”) to the rest of the engineering team. It’s a good idea.

The rest of the week has been a gradual process of getting ramped up in the company’s codebase. The head engineer tagged a couple `starter` stories for me to work on, which have sort of been a compass for me to use while navigating through the massive number of classes, tests, configuration files, etc. which make up the core product.

The low point of this week occurred on Thursday when I learned about some internal changes that had occurred since I signed my offer letter in March. The changes themselves weren’t negative per se, but they represented another set of “new things” I had to deal with in parallel to everything else. I experienced a minor shock but got over it pretty quickly.

There have been other good things, but I want to get to something else that’s been on my mind. In my one-on-one discussion yesterday with the head engineer, I realized that I may have been overly concerned about trying to make my first week as productive as possible. This meant that I didn’t spend as much time with the recommended resources on Ruby, Rails, and other guides which are intended to help developers new to the language/framework write good code. In week two, I intend to focus on both development work and tasks that optimize for long-term productivity, which at this point means learning as much as possible about the company, its team, and the technologies I’ll be using both now and in the future.

Fixer Currency Gem

Emily Wilson is an engineering intern with Privy for the summer of 2018. She is part of the Georgia Institute of Technology’s class of 2021.

Earlier this month we published a new Ruby Gem that handles fetching updated currency conversion rates. We previously used the GoogleCurrency gem to fetch the exchange rates, but the Google endpoint that the gem relies on is no longer supported. This caused errors when we attempted to exchange currencies. After looking at replacements for the gem, we decided it would be best to fork the GoogleCurrency gem and modify it to meet our needs.

Why it was needed

The biggest feature we found lacking when looking at alternative gems was support for lazy loading. Without this feature, we would have needed to place the call to fetch exchange rates in an initializer so that rates would be present should currency conversion be necessary. As a result, on every server spin-up — in production or development — an API call would have been made to get the most recent rates from the bank. This is problematic because most of these gems make requests to services that limit the number of API calls an account can make in one month. Due to this restriction, we could have run out of API calls during a production server spin-up, resulting in initialization errors.

After looking at the code where currency conversion was required, we decided that lazy loading was absolutely necessary in our case. Most user interactions don’t require currency exchange, so it would have been wasteful to make so many eager API calls. Furthermore, the dependency on a third-party service during server initialization would be a big risk given the probability of network errors — we wouldn’t want our servers’ uptime to be affected by a failing third-party service.

Building the gem

After forking the GoogleCurrency gem, we modified the source code to read currency exchange rates from fixer.io: an API for current and historical exchange rates. Using our new gem, we only make requests to Fixer when we need to exchange currencies, caching the results for up to 24 hours. This way, we never make eager API calls, saving us from many of the problems discussed earlier.

The gem is published on rubygems.org, and the code can be found on the Privy GitHub page.

Updates to our list of excluded security issues

We’ve been excited to receive a number of vulnerability reports from security researchers all over the world since launching our security disclosure page earlier this year, and we’ve learned a lot about the process along the way. Privy is a more secure platform today because of the many reports we’ve received.

One thing that’s become clear is that we need to help researchers understand what we consider to be valid reports. The first version of our security page made a reference to two classes of excluded issues: XSS/Self-XSS and campaign asset information disclosure. We wanted to nip those in the bud, because we had been receiving a steady stream of such reports even before we had an official channel for reporting security issues. I’d like to briefly elaborate on why these two issue types are excluded:

XSS/Self-XSS: Many parts of the Privy dashboard expose advanced inputs that allow JavaScript code to be entered, and this code executes under various conditions. If a user enters malicious code in these inputs, then bad things can happen. However, this requires that an authorized user deliberately enter the code themselves, so it is not a security exploit. The only users the code can affect are other users on the account, and the merchant’s own website visitors.

Campaign asset information disclosure: We occasionally receive security reports that campaign metadata or assets can be read under various unintended conditions (for example, getting a coupon code by inspecting network requests, instead of actually signing up for an offer with an email address). A merchant’s promotional campaigns are intended for public consumption, so we don’t make any guarantees about their privacy. Thus, exposure of assets like photos, coupon codes, and free downloads is not eligible for a security bounty or acknowledgment on our credits page.

These long-standing policies have served us well by allowing us to focus our time and energy on security vulnerability reports that are likely to pose the biggest threats to our merchants and our platform. Today, we’re updating the bounty exclusions with five new categories: theoretical attacks, brute force attacks, attacks requiring physical access, account email management, and clickjacking.

In short, we’ve found that reports that fall in these categories tend to offer low returns on our effort, and sometimes run counter to our business goals. For example, we’ve seen multiple times that closing the reported vulnerabilities would have required breaking third-party integrations or scenarios that are important to many of our customers. Many reports in these areas targeted a series of deliberate and well-considered design tradeoffs we made, not an oversight. In others, we considered the likelihood of the various attack scenarios, the severity of the issue, and the potential breadth of the impact, and concluded the risk was small enough to ignore (attacks that require gaining physical access to a victim’s device fall under this category).

There are valid scenarios in which combining multiple vulnerabilities can lead to actionable issues. Our goal is to evaluate those reports on a case-by-case basis, but we will generally not act on a straightforward vulnerability that falls under one of the categories listed above. We hope that these updated guidelines save time — for both our team and researchers. Happy hunting.

Our Commitment to Candidates

Interviewing for a startup job can be grueling, confusing, and demoralizing even when the process is going smoothly.

Last year, we made an internal effort to adopt more of the practices that we, as candidates, would like to see. Now that we’re hiring again, we’re publicly committing to these standards:

Interviewing

  • We’re going to give all candidates the opportunity to present their best selves. We understand that not all candidates work well under high-pressure live coding exercises or whiteboarding problems. Therefore, every candidate will have the option to assemble code samples, personal project demos, open-source contributions, or take-home problems to present as part of a job application.
  • We’re going to remove some biases from our system – this includes anonymizing schools (and in some cases, prior companies) so we can focus on the training and experience that a candidate has.

Communication

  • We’ll do our best to personally respond to every applicant. This includes applicants whom we decline to invite to an initial screen.
  • Once you begin interviewing, we’ll do our best to communicate at least once every 7 days, so you always know where you stand.

These commitments will consume time and energy. Keeping them will be harder than not keeping them, and will therefore require conscious effort. At the end of the day, we want Privy to be known for being a great place to work and interview. We hope that this gives all candidates the confidence to apply.

November 25, 2016 Outage Postmortem

On Friday, November 25th, beginning at 1:32PM eastern US time, the Privy.com platform suffered an outage lasting roughly 3 hours.

During this incident, the Privy dashboard was completely unavailable – including signing up for a new account, logging in to an existing account, and managing your campaigns.

A large proportion of campaigns (fluctuating between ~20% and ~70%) failed to load on our customers’ websites. When they did load, they often took up to 30 seconds to do so. Users could opt into these campaigns, but form submissions were slow to process and returned an error message, even though the submissions were successful. Thank-you pages did not display on a successful signup.

Our email and contact sync systems were unaffected. All successful signups synced to their configured destinations, and emails (autoresponders and scheduled sequences) sent as usual to their recipients.

The proximate cause of this issue was that our database systems were overwhelmed. The engineering team at Privy made preparations for the Black Friday weekend, resulting in roughly 4 times the usual computing resources being available. However, there were a few unanticipated performance problems that became magnified under the stress. In addition, a bug caused by incompatible third party code resulted in a subset of our accounts unintentionally sending up to 20 times more activity data than they should have. Together, these issues generated a workload that our systems could not handle.

Privy engineering immediately investigated these issues and deployed emergency workarounds to restore full availability. By about 4:41PM, our systems began to recover. Today, I am happy to report that all of these workarounds have been removed, and that the identified performance issues have been addressed.

However, in our focus on solving the issue at hand, the engineering team initially failed to communicate the impact, expected time to resolution, and other important details of the incident to our customer support team. This resulted in a lack of details, contradictory information, and an overall frustrating experience for both our support team and our customers.

Here are the things we have done to ensure this doesn’t happen again:

  • Updated our incident handling documentation to more quickly identify, communicate, and resolve common problems.
  • Changed our engineering roadmap to ensure that, in the future, we can broadcast important news and status updates to our customers instead of relaying them in one-on-one conversations.
  • Significantly improved key bottlenecks in our platform to handle more load concurrently.

Despite all our preparations, we fell short on one of the most important days of the year, and we’ll do everything we can to ensure that this doesn’t happen again. Thank you for using Privy.

Intercom Conversation Stats: an Open-Source Tool by Privy

Andrew Knollmeyer is an engineering intern with Privy for summer 2016.

Introducing Intercom Conversation Stats, a tool developed by Privy which is free for anyone to use! This app allows you to gather information about your conversations in Intercom and store it in a Google Sheets document on a regular basis. The provided build aggregates data on conversation tags, but it can be customized to work with any other data from your conversations as well.

Why Build it?

When we first started using Intercom, we were onboarding 5-10 new users per day. With our “all hands” approach to support, it was easy to quantify issues and feature requests that were coming up in support chats. But that didn’t last long.

As we continue to scale, we’re now onboarding 250+ new users per day, and chats have climbed to 30 or 40 per day. Along the way, it’s become a lot harder to quantify and truly internalize trends in user feedback as these numbers continue to grow.

We designed this tool to better evaluate which areas of our product need work. During development, we realized that existing support for this kind of system was limited, and that no one else had publicly released a tool that makes this sort of reporting easy to build, so we decided to make it open source as well.

How Does it Work?

Intercom Conversation Stats is a Rails app which integrates with the Google and Intercom APIs, and uses Sidekiq to automate all of its processes. First, a webhook is sent from Intercom to the application whenever a new conversation is created by one of your users. Intercom Conversation Stats then stores the ID of that conversation in a table, so that conversation’s data can be accessed later. By default, at the end of the week, the application pulls the tags from each of this week’s conversations and counts how many times each tag appeared. The count for each tag is then stored in its respective row in Google Sheets, along with the percent of tag mentions each tag accounted for.

How Can I Use it?

The GitHub repository can be accessed here. Instructions on setup and customization can be found in the README on that page.

Building a BellBot

Ever feel like ringing a bell requires too much effort? Ever wish you could automate it to ring when something – like a sale – happens? If you responded “yes” to at least one of these questions, fret not. There is now a solution: BellBot.

This tutorial will guide you through building your own BellBot to automate bell ringing! The BellBot system is made of:

  1. A node.js server hosted on a local machine.
  2. An Arduino Uno connected to the machine via USB.
  3. A servo connected to the Arduino.
  4. An ordinary call bell.

Required Hardware

Arduino Uno

The Arduino Uno is a microcontroller board based on the ATmega328P chip. It’s a popular board due to its low cost, ease of use, open-source nature, and strong community. We will use it as the interface between our node.js server and the servo.

Servo

A servo is a device that has an output shaft whose angle can be precisely controlled through an input line. Our servo will be responsible for striking the bell. To do so we attach an arm to the servo’s output shaft and align its end with the bell, as seen in the video. Using a pencil for the servo’s arm is a fast and easy option for prototyping.

Breadboard & Jumper Cables

A small basic breadboard and some jumper cables are perfect for fast prototyping. In this project we’ll be using the power rails to provide common +5V and GND (ground) nodes.

Capacitors

We’ll be powering the system via USB in this project. Thus, adding smoothing capacitors should help stabilize the circuit and protect against current spikes. A 200uF capacitor works well.

Set up the circuit

Follow the diagram and schematic below to set up the circuit.

  1. Connect the servo’s Vin to the Arduino’s +5V pin.
  2. Connect the servo’s GND and any of the Arduino’s GND pins to a common ground.
  3. Connect the servo’s control line to one of the digital I/O pins on the Arduino. This tutorial uses pin 8.
  4. Add a capacitor or two in parallel to the +5V pin and the common GND to protect against possible current spikes. Remember that capacitors, unlike resistors, add up in parallel.
  5. Finally, connect the Arduino to your computer with a USB cable. This will power up the circuit and enable communication with the host computer.

Diagram for a USB-powered BellBot.

Schematic for a USB-powered BellBot.

Note: If you’d rather power the system through an external power supply, read Using an External Power Supply under the Notes section.

BellBot in action

Write the Arduino code

The Arduino code will consist of the standard Firmata example included in the IDE. The Arduino IDE is the environment in which we write and upload sketch code to the Arduino Uno. You can download the IDE here.

  1. In the Arduino IDE, go to File > Examples > Firmata to find the standard Firmata code. The code implements a communication protocol between the Arduino and any host computer. With this protocol and the Johnny-Five module, we can control the Arduino from within node.js.
  2. Save the standard Firmata code and then select your board from Tools > Board.
  3. Finally, upload the sketch to the Arduino.

Writing our own Arduino sketch to control the servo is also an option, but we would lose the abstraction and simplicity of controlling the hardware within node.js. Check out Sketch code under the Notes section if you’re interested in writing your own Arduino code.

Write the node.js code

Install node.js

We will be using node.js as our local web server. It will listen for incoming POST requests to the /ring route and relay them to the Arduino. Node.js describes itself as “a JavaScript runtime built on Chrome’s V8 JavaScript engine. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient. Node.js’ package ecosystem, npm, is the largest ecosystem of open source libraries in the world.” You can read more about it here.

  1. If you haven’t already, download Node.js here.

Install project dependencies

  1. Once node.js is installed, run the following command in your project’s directory to create a new package.json file. This file maintains info about the project and its dependencies.
    npm init
  2. Next, install the following node modules into the auto-generated local node_modules folder. By including the --save option in the following commands, npm updates package.json automatically.
    express – A popular framework we will use to set up a web server.
    npm install express --save
    body-parser – A tool used by express to parse requests’ bodies
    npm install body-parser --save
    ngrok – A tool to tunnel localhost to a public URL hosted by ngrok
    npm install ngrok --save
    johnny-five – A vast library to control the Arduino and hardware within node.js
    npm install johnny-five --save

If you cloned the latest BellBot code, you only need to run,

npm install

npm will look at the package.json file and install all the required modules.

Write args.js

This module parses command line arguments that we supply (a minimal sketch of the module appears at the end of this section). Optional arguments include:

  • proto – the application protocol to use. Default: ‘http’
    --proto=<application_protocol>
  • port – the networking port number for the server. Default: 3000
    --port=<port_number>
  • start – the starting and resting angle of the servo arm. Default: 75.
    --start=<starting_angle>
  • strike – the angle of the servo arm when striking the bell. Default: 90
    --strike=<striking_angle>
  • pin – the arduino output pin that controls the servo’s angle. Default: 8
    --pin=<arduino_pin>
  • sdelay – the amount of time in milliseconds to wait before resetting the servo’s arm after it has struck the bell. Default: 500.
    --sdelay=<servo_delay>
  • pdelay – the amount of time in milliseconds to wait before processing the next command. Default: 1000
    --pdelay=<process_delay>
  • key – the optional key to authenticate request with. Default: null
    --key=<authentication_key>

If the BELLBOT_KEY environment variable exists and no key is supplied in the arguments, the server will authenticate incoming requests using the BELLBOT_KEY. We explain how to set up a BELLBOT_KEY later.
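
The module’s original source isn’t reproduced in this post, but as a rough sketch (assuming the `--key=value` convention above and the BELLBOT_KEY fallback; the exact parsing logic in the published code may differ), args.js could look something like this:

    // args.js -- parse optional command line arguments, falling back to the
    // defaults listed above. `key` falls back to the BELLBOT_KEY env variable.
    const defaults = {
      proto: 'http',
      port: 3000,
      start: 75,
      strike: 90,
      pin: 8,
      sdelay: 500,
      pdelay: 1000,
      key: process.env.BELLBOT_KEY || null
    };

    function parse(argv) {
      const args = Object.assign({}, defaults);
      argv.slice(2).forEach(function (raw) {
        // Accept both --port=3001 and port=3001 forms.
        const match = raw.match(/^(?:--)?([^=]+)=(.*)$/);
        if (!match || !(match[1] in defaults)) return;
        const value = match[2];
        // Convert numeric options; leave everything else as a string.
        args[match[1]] = value !== '' && !isNaN(Number(value)) ? Number(value) : value;
      });
      return args;
    }

    module.exports = parse(process.argv);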

Write config.js

This module creates an object that describes the BellBot’s behavior. It uses our args.js module to override default settings. Note that the URL is initially blank. We later set this URL to the unique public URL that ngrok supplies us.
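
As an illustrative sketch (the field names here are assumptions, not the published code), config.js might simply repackage the parsed arguments:

    // config.js -- central settings object for the BellBot server.
    // `url` starts blank and is later set to the public URL that ngrok returns.
    const args = require('./args');

    module.exports = {
      proto: args.proto,
      port: args.port,
      startAngle: args.start,
      strikeAngle: args.strike,
      pin: args.pin,
      servoDelay: args.sdelay,
      processDelay: args.pdelay,
      key: args.key,
      url: ''
    };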

Write servo_controller.js

This module provides a constructor for a servo controller. Note that the controller’s timeout property corresponds to the processing delay – that is, the delay between sending requested commands.
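
Here is one possible shape for such a controller using Johnny-Five; the method and property names are illustrative rather than taken from the published source:

    // servo_controller.js -- wraps a johnny-five Servo. The `timeout`
    // property is the processing delay between accepted ring commands.
    const five = require('johnny-five');

    function ServoController(config) {
      this.timeout = config.processDelay;
      this.busy = false;
      this.config = config;
      // Assumes a johnny-five Board is already ready when this runs.
      this.servo = new five.Servo({ pin: config.pin, startAt: config.startAngle });
    }

    // Swing to the strike angle, then return to rest after the servo delay.
    ServoController.prototype.ring = function () {
      if (this.busy) return false;   // drop commands while mid-swing
      this.busy = true;
      this.servo.to(this.config.strikeAngle);
      setTimeout(() => this.servo.to(this.config.startAngle), this.config.servoDelay);
      // Accept the next command only after the processing delay has elapsed.
      setTimeout(() => { this.busy = false; }, this.timeout);
      return true;
    };

    module.exports = ServoController;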

Write index.js

This is the main entry point of the app (an illustrative sketch follows the list). The file:

  1. `require`s all the needed components.
  2. Connects to the Arduino.
  3. Starts up an ngrok process.
  4. Starts the express server.
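
As a sketch of how those pieces could fit together (the `/ring` route and key check follow the conventions described in this post, but the published file may differ; the ngrok call uses the promise API of recent versions of the ngrok package):

    // index.js -- entry point: connects to the Arduino, starts the express
    // server with a POST /ring route, and opens an ngrok tunnel.
    const express = require('express');
    const bodyParser = require('body-parser');
    const ngrok = require('ngrok');
    const five = require('johnny-five');
    const config = require('./config');
    const ServoController = require('./servo_controller');

    const app = express();
    app.use(bodyParser.json());
    app.use(bodyParser.urlencoded({ extended: false }));

    const board = new five.Board();

    board.on('ready', function () {
      const controller = new ServoController(config);

      // Ring the bell; if a key is configured, the request must carry it.
      app.post('/ring', function (req, res) {
        const token = req.body.key || req.body.token;
        if (config.key && token !== config.key) {
          return res.status(401).send('Invalid key');
        }
        controller.ring();
        res.send('Ding!');
      });

      app.listen(config.port, function () {
        // Expose the local server at a public ngrok URL.
        ngrok.connect(config.port).then(function (url) {
          config.url = url;
          console.log('BellBot listening at ' + url);
        });
      });
    });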

Add a .bellbotrc file

If you want to add some basic authentication to verify requests with a password, create a .bellbotrc file in your home directory with a specified BELLBOT_KEY:

  1. Create the .bellbotrc file:
    touch ~/.bellbotrc
  2. Add the following line to the .bellbotrc file:
    export BELLBOT_KEY=<your_password>
  3. In the command line, run
    source ~/.bellbotrc
    to source the newly added environment variable.

Next time you start your BellBot server, it will authenticate requests with that specified password. The incoming POST requests must have a key or token body parameter that matches the password.

Test it out

Boot up the BellBot by running

node index.js

in the command line.

Pass optional arguments if you wish. For example,
node index.js --port=3001 --pin=9 --key=my-secret-key

will start the BellBot with:

  1. The server listening on port 3001
  2. The arduino outputting on digital I/O pin 9
  3. Authentication using my-secret-key as the password

It is important to make sure BellBot is behaving properly before unleashing it upon the world. After starting your BellBot,

  1. Use curl or a program like Postman to build and send test POST requests. If you are authenticating requests, pass token: <BELLBOT_KEY> as a body parameter.
  2. Test locally by sending POST requests to localhost from your machine.
  3. Next, verify that the public address supplied by ngrok is hitting your local server by using the ngrok URL as the POST destination.
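
For example, assuming the default port and a configured authentication key, a quick local test (per steps 1 and 2) could be:

    curl -X POST -d "token=<BELLBOT_KEY>" http://localhost:3000/ring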

Make it a Slack-controlled BellBot

A neat way to automate bell-ringing is to use Slack. In Slack, you can add and configure outgoing webhooks. For example, let’s add a webhook that sends a customized POST request whenever a message beginning with :bell: comes in.

Creating a new Slack Outgoing WebHook.

To do this:

  1. Create a new outgoing webhook: Apps & Integrations > Manage > Custom Integrations > Outgoing WebHooks > Add Configuration.
  2. Select the slack channel that the webhook will be listening in. For example, a “sales” or “bots” channel.
  3. Set the trigger word that will cause the POST request to be sent. For example, the :bell: emoji.
  4. Set the URL to the public ngrok URL that maps to your local server.
  5. If your BellBot is configured to require authentication, set the Token to your secret BELLBOT_KEY.

Now all that’s left is to make sure that the specified channel receives an automated message that triggers the webhook whenever a sale occurs.

A sales Slack channel that listens for incoming :bell: messages.

Notes

Servos

A servo uses error-sensing negative feedback to correct and maintain its shaft’s angle. The signal sent to the input is a PWM voltage signal. The width of the pulse is what controls the angle of the shaft. Check out this Seattle Robotics guide for more details.

Using an External Power Supply

Servos are notorious for drawing lots of power. When using USB power only, the 5V output on the Arduino is what powers the servo. The high current drawn during the servo’s rotation may exceed the limits of the USB ports (often ~500mA at 5V). When the current exceeds this threshold, the host computer may disable the port. Additionally, the Arduino may reset and behave erratically due to voltage sags. Check the servo’s data sheet for its idle, running, and stall currents. You want them to be low enough to be supplied via USB. Adding a capacitor in parallel to the 5V output and ground can help protect against these undesired effects.

To avoid these concerns altogether, follow the diagram and schematic below to set up an externally-powered BellBot.

  1. Power the servo and Arduino through an external power supply. Four AA 1.5V batteries provide ~6V, which should be enough to power both components reliably.
  2. Connect the voltage source’s output to the Vin input on the Arduino.
  3. Connect the voltage source’s output to the Vin of the servo.
  4. Connect the voltage source, servo, and Arduino to a common ground.
  5. Connect the servo’s control line to one of the digital I/O pins on the Arduino.
  6. Connect the Arduino to your computer using a USB cable. The USB connection’s role is reduced to serial communication. The Arduino will use the battery power as long as it’s switched on, even when the USB is connected. If you switch off the battery, the USB connection will power the Arduino but not the servo.

Diagram for BellBot running on an External Power Supply.

Schematic for BellBot running on an External Power Supply.

Sketch code

An alternative option for BellBot’s software is to write your own Arduino code. For example, a simple sketch could listen over the serial connection for the ‘R’ character and ring the bell when it arrives.

The node.js server code would need refactoring. It would no longer require the whole Johnny Five module, and would use just the Serialport module to send data to the Arduino in a more barebones fashion.

Estimated cost

Item             Cost ($)
Arduino Uno         24.95
Feetech Servo       12.95
Breadboard           4.95
Jumper Pack          1.95
Capacitor Pack       6.95
Total               51.75

Next Steps

Enhance your BellBot by implementing these ideas:

Add more routes

More routes to your express server would expand the possible types of requests. For example, a /sweep route could cause the servo to begin sweeping. Or, if there are many bells and servos, a REST API could allow a user to control each servo.

Create a web interface

A web app or mobile phone app for this project could allow users to monitor or control the states of the servos. Admins or users could analyze or act on statistics and data logs of requests.

Use more hardware

LEDs could light up whenever the servo is working or a request comes in. A network of servos could handle different types of requests. Light sensors could control the operation state of the servos (on if the lights are on). These are just some ideas!

Build a black box

Rather than exposing the system as a prototype, you could build a professional PCB and enclosure to abstract away the system.

Use the Raspberry Pi

It should be straightforward to port this project to a Raspberry Pi. The node.js server would run on the Pi itself, with the servo connected to its GPIO pins. No USB connection would be necessary.

Excuses not to Test

At Privy, one of our values is pragmatism – so we don’t require formal proofs of correctness and all-du-paths coverage to check in code, because it’s not cost effective (even if those things are valuable in the abstract). But this is such a widely accepted belief that it essentially conveys no information at all; outside of extraordinary operations (like NASA), no one requires 100% path coverage. So how do we determine what and how much to test?

How do I know if it needs tests?

First, it helps to understand some models we use to form the basis of our intuitions. The most important is the Pareto Principle: the general observation that — most of the time — 80% of a result comes from 20% of the effort. This means we should concentrate our testing on areas that are most likely to make development more efficient, and prevent the majority of the really nasty bugs. Naturally, that means things like complicated but critical core classes, and modules that are emotionally fraught or have difficult-to-untangle side effects like billing and subscription management.

The second is the idea that testing is (among other things) a form of protection from downside risks – lost users/customers, negative press/brand value, bad builds that waste engineer time, etc. Like insurance, you want your protection to be proportionate to the amount of downside risk you are exposed to. A young startup with very few customers/revenue/engineers has, in absolute terms, very little downside. It should act accordingly. At Privy, that sometimes means we have a lot of code that we haven’t gotten around to testing yet, or that we consciously decide will not be tested for the foreseeable future.

Third, there is a lot of context to consider. How fault-tolerant are your users? How experienced is your team? Are these variables going to trend up or trend down over time? Generally, a less experienced engineer should write more tests, for the same reason an inexperienced driver should be more deliberate and unfailing in using turn signals. It will also have beneficial side effects, like making it obvious what code is hard to test, and therefore highly likely to be architecturally suspect. On the other end of the spectrum, if you are a star engineer tasked with inventing the modern internet and have six weeks to do it, you will probably decide to skip a unit test here or there.

But enough of that. Now we have a high-level mental model to use as a general framework for deciding whether tests will be useful; below, I’ve put together a list of some finer-grained risk factor dimensions that may be useful for evaluating specific modules or classes.

What are the risk factors?

Note that many of these dimensions are not truly independent variables – erring on the side of completeness, there are sure to be some in this list that overlap or are causally related:

Maturity. How mature is the product?
Newer products, services, modules and classes can probably get away with less testing, as a result of the large amount of churn that is likely in the code, and the relatively lower downside risk – fewer users, downstream dependencies, etc.

Impact. If it were to fail, how bad would the result be?
Financial, reputational, or otherwise. Problems arising from defects that are hard or impossible to unwind (e.g., security compromises or data loss) deserve extra scrutiny. Maturity and impact go hand in hand – most young enterprises need to be focused on solving problems and building value, rather than protecting what little they have from downside risks.

Release Cadence. How quickly can an identified defect be resolved in production?
The faster you can deploy changes, the less risk overall that any given bug will have a material impact (some exceptions apply…think security). Continuous integration and deployment with highly effective test coverage is the surest way to have a fast release cadence, making defects in production less impactful.

Downstream dependencies. How many other modules/services depend on it?
More downstream dependencies means more risk, due to the coordination problem. It also implies your interfaces are stable and thus cheaper to test.

Upstream dependencies. How many other modules/services does it depend on?
More unstable upstream dependencies means more risk of breakage. If your dependencies are themselves not well tested, then testing “transitively” might be worthwhile.

Noisiness. If it were to fail, how soon would you notice?
Silent failures that take longer to detect deserve more scrutiny, because it’s usually harder to correct something the more time passes before it is discovered. Logs are rotated out, servers come out of service, repro steps are forgotten, etc.

Churn. How often is the code changing?
The more code is likely to change and be refactored, the more likely bugs will inadvertently be introduced. Conversely, the more interfaces are likely to change and be refactored, the more expensive tests will be to maintain, potentially tipping the cost/benefit equation.

Wrapping up

This is not meant to be an indictment of testing. Testing is — and should be — part of what it means to develop software professionally. This is more an attempt to formalize the real tradeoffs we’re balancing every day under heavy pressure, rather than providing a neat set of post-hoc rationalizations.

On the contrary, I think this helps highlight when we are making excuses instead of well-reasoned judgments — I’ve sometimes fallen into the trap of using one criterion from this list to argue for one approach or another, while ignoring another that didn’t support my position. I hope that by laying out and curating this list over time, we’ll be able to make more balanced and consistent decisions around testing.

Database Concurrency, Part 2

This is part two of a series on database concurrency. Read the introduction at Database Concurrency, Part 1.

Last time, I talked about multi-version concurrency control, or MVCC, and how it enables highly concurrent database systems while still guaranteeing transaction isolation. This is because MVCC allows reality (from the perspective of two distinct transactions) to diverge, giving us the unique advantage that readers and writers don’t have to block each other. But how does it achieve this in practice, and what are the caveats?

Let me step back a bit and define some terms:

Transaction: A unit of work with well defined start and end boundaries, composed of a number of operations.

Isolation: The property that makes concurrent transactions appear as if they were executing serially.

Isolation in practice

Because the isolation property requires concurrent transactions to appear as if they are executing serially (one after the other), they must not interfere with each other by definition. Unfortunately, isolation in SQL is not a boolean property, but one of degree; some workloads don’t benefit from full isolation, which is pretty slow. Here are some common isolation levels:

Read uncommitted: I like to think of this as “effectively, no isolation” because dirty read is allowed: uncommitted changes are globally visible. For example, you might have transactions:

T1:
Subtract $50 from A
Add $50 to B
T2:
Read balances from A and B

In this case it’s possible T2 executes in the middle of T1, and finds that $50 has disappeared without explanation. This makes it nearly impossible to reason about anything under concurrency.

Read committed: This level disallows dirty read, so a transaction can only see data from committed transactions. However, it still allows non-repeatable read, an anomaly in which a transaction re-reads data and finds that it has changed, if another transaction committed changes in between the two reads.

Repeatable read: This level disallows non-repeatable read. Unfortunately, it still allows phantom read, a different phenomenon where the same SELECT might return a different set of rows.

Wait – “non-repeatable” and “phantom” reads?

These two anomalies might at first seem identical, but according to the ANSI SQL spec, a read is “repeatable” or “non-repeatable” at the row level. Repeatable reads guarantee the same row will always have the same data.

A “phantom” read is a phenomenon at the result set level. This means the set might be different, even if no single row has changed. The simplest case is when an INSERT is committed in another transaction, causing it to be returned in a new read query.

Confusion here is justified, because this particular definition of “non-repeatable read” is not obvious: repeatable reads do not guarantee repeatable results. And there are a lot of sources where “phantom read” is said to be a special case of “non-repeatable read,” including Wikipedia[1], in contradiction to the spec. Aren’t databases fun?

Sometimes you need a sledgehammer

Ideally, we’d have an isolation level where dirty read, non-repeatable read, and phantom read weren’t allowed. So where does this lead us?

The good news: it turns out there are actually two distinct isolation levels that guarantee this, snapshot isolation and serializable.

The bad news: disallowing all the above anomalies is not enough to guarantee serializability, which is why there are two. We used to think snapshot isolation would prevent read anomalies, until some folks proved it couldn’t. Also, both Oracle and PostgreSQL have at one point or another called snapshot isolation “serializable,” even though it isn’t, and the SQL spec itself seems to assume that preventing these three phenomena is equivalent to serializable isolation, even though this is easy to disprove.

So far this seems pedantic, so let’s look at an example. Imagine two empty tables A and B, and two concurrent transactions that insert a count of the rows in the other table:

T1:
INSERT INTO B SELECT count(*) FROM A;
T2:
INSERT INTO A SELECT count(*) FROM B;

If these two transactions run at the same time under snapshot isolation, they will both insert a single row with a 0. This makes sense, because each transaction has its own snapshot of the database, in which it sees the other table is empty. However, of the two possible serial orderings [T1,T2], and [T2, T1], neither of them is consistent with what actually happened. So these transactions are not serializable; but they also didn’t suffer any of the three anomalies we’ve defined so far. =(

We’re seeing a new anomaly not defined in the spec: write skew.

Enforcing isolation

The ANSI SQL isolation levels don’t actually prescribe how each level should be implemented; the spec only describes “prohibited phenomena.” In practice, most databases use a combination of version tracking and row locking. And most row locking implementations use some variation of reader-writer locks, which are common in many software systems. Readers don’t interfere with other readers, so they don’t block each other, but they block writers. Writers block readers, as well as each other.

Here is a table that shows when locks are needed at each isolation level, when using lock-based concurrency control:

Isolation level     Write Operation    Read Operation    Range Operation
Read Uncommitted    Statement          Statement         Statement
Read Committed      Until Commit       Statement         Statement
Repeatable Read     Until Commit       Until Commit      Statement
Serializable        Until Commit       Until Commit      Until Commit

As you might have guessed, stronger isolation guarantees need more locking, decreasing performance. Note that at the lowest level of isolation, we release locks after every statement, which is very performant, but as we’ve seen, extremely unsafe. At the high end of serializable isolation, every lock we acquire to execute a statement is held until commit. This strategy has a special name: two phase locking.

To guarantee serializability, two phase locking has (you guessed it) two phases: an acquiring phase (releasing locks is not allowed), and a releasing phase (acquiring new locks is not allowed). Strict two phase locking goes further, and waits until the very end to release all locks at once during commit, which is helpful for preventing cascading aborts, although it costs concurrency and is more prone to deadlocking.

The big picture

The point of all this is that transaction isolation in MVCC is not easy to get right. Even at snapshot isolation — the highest[2] isolation level provided by some databases — it’s easy to run into bugs that will silently corrupt your data.

One mindset is to either 1) stick your fingers in your ears and allow data to be randomly corrupted, or 2) forfeit performance by running your database under true serializable isolation. The other is to solve these problems at the application level, by deliberately structuring your transactions to avoid them.

Next up: How to work at read committed / repeatable read isolation and still prevent common anomalies at these levels!

[1] And some databases even treat “repeatable read” as equivalent to “serializable.”
[2] Technically though, snapshot isolation is not a superset of repeatable read, because SI allows anomalies that do not occur under RR. See “A5B Write Skew” in A Critique of ANSI SQL Isolation Levels.

Reactive Systems: Part 1 – An Overview

This post is Part 1 of our series on Reactive Systems.

At Privy, many of our services are fundamentally event-driven. Indeed, our core product value lies in helping merchants capture arbitrary user interactions and react to opportunities as they arise in a tangible and timely manner.

A key criterion for new components and systems at Privy is that they must be elastic. We must be able to out-scale our fastest-growing merchants if we are to continue to provide an acceptable level of service.

In addition to scalability, our systems must be fault tolerant or resilient in that failure of one component should not affect the overall integrity of our system.

Reactive Systems

Systems that are responsive, resilient, elastic and message-driven are also known as Reactive Systems. The Reactive Manifesto, a community distillation of best-practices, provides a concise vocabulary for discussing reactive systems.

Systems built as Reactive Systems are more flexible, loosely-coupled and scalable. This makes them easier to develop and amenable to change. They are significantly more tolerant of failure and when failure does occur they meet it with elegance rather than disaster. Reactive Systems are highly responsive, giving users effective interactive feedback.

Being Reactive in Practice

It is important to note that the Reactive philosophy is independent of any specific application layer; these general requirements can be realized throughout the stack.

Indeed, there are many open source frameworks that can be used to build Reactive systems. Front-end examples include Backbone.js, Facebook’s React, and Elm. These specific examples essentially handle input events and their effects as process networks.

A similar but distinct concept is the Actor model, which often arises in the context of highly concurrent and distributed background operations. The Actor model is a computational model that treats computational entities as primitives called actors. Actor behavior is defined by the ability to respond to messages received from other actors, the ability to send messages to other actors, and the ability to spawn new actors.

Actors saw much success in Erlang, a language originally designed for building telecommunication systems. For use-cases more specifically related to web applications, two popular Actor-based frameworks are Akka and Celluloid.

Celluloid is the underlying actor system used in Sidekiq, a background task framework for Ruby. Sidekiq is an integral component of the Privy backend – most of our asynchronous Ruby behavior occurs within Sidekiq Workers.

We’re also in the early stages of deploying an Akka app for Business Intelligence, which happens to be the primary motivator for this series of blog posts.

Handling Streams of Data: Part 2

In the next post in this series, we’ll examine methods for dealing with streams of data in an asynchronous and reactive manner, with a particular focus on being responsive and elastic.