Privy Engineering

Building Privy.com

Our Commitment to Candidates

Interviewing for a startup job can be grueling, confusing, and demoralizing even when the process is going smoothly.

Last year, we made an internal effort to adopt more of the practices that we, as candidates, would like to see. Now that we’re hiring again, we’re publicly committing to these standards:

Interviewing

  • We’re going to give all candidates the opportunity to present their best selves. We understand that not all candidates work well under high-pressure live coding exercises or whiteboarding problems. Therefore, every candidate will have the option to assemble code samples, personal project demos, open source contributions, or take-home problems to present as part of a job application.
  • We’re going to remove some biases from our system – this includes anonymizing schools (and in some cases, prior companies) so we can focus on the training and experience that a candidate has.

Communication

  • We’ll do our best to personally respond to every applicant. This includes applicants we decline to interview, even for an initial screen.
  • Once you begin interviewing, we’ll do our best to communicate at least once every 7 days, so you always know where you stand.

These commitments will take time and energy. They will be harder to keep than to abandon, and will therefore require conscious effort. At the end of the day, we want Privy to be known as a great place to work and interview. We hope this gives all candidates the confidence to apply.

November 25, 2016 Outage Postmortem

On Friday, November 25th, beginning at 1:32PM eastern US time, the Privy.com platform suffered an outage lasting roughly 3 hours.

During this incident, the Privy dashboard was completely unavailable – including signing up for a new account, logging in to an existing account, and managing your campaigns.

A large proportion of campaigns (fluctuating between ~20% and ~70%) failed to load on our customers’ websites, and when they did load, they often took up to 30 seconds to do so. Users could still opt into these campaigns, but form submissions were slow to process and returned an error message even though the submissions had succeeded. Thank-you pages did not display on a successful signup.

Our email and contact sync systems were unaffected. All successful signups synced to their configured destinations, and emails (autoresponders and scheduled sequences) sent as usual to their recipients.

The proximate cause of this issue was that our database systems were overwhelmed. The engineering team at Privy made preparations for the Black Friday weekend, resulting in roughly 4 times the usual computing resources being available. However, there were a few unanticipated performance problems that became magnified under the stress. In addition, a bug caused by incompatible third party code resulted in a subset of our accounts unintentionally sending up to 20 times more activity data than they should have. Together, these issues generated a workload that our systems could not handle.

Privy engineering immediately investigated these issues and deployed emergency workarounds to restore full availability. By about 4:41PM, our systems began to recover. Today, I am happy to report that all of these workarounds have been removed, and that the identified performance issues have been addressed.

However, in our focus on solving the issue at hand, the engineering team initially failed to communicate the impact, expected time to resolution, and other important details of the incident to our customer support team. This resulted in a lack of details, contradictory information, and an overall frustrating experience for both our support team and our customers.

Here are the things we have done to ensure this doesn’t happen again:

  • Updated our incident handling documentation to more quickly identify, communicate, and resolve common problems.
  • Changed our engineering roadmap to ensure that in the future, we can broadcast important news and status updates to our customers, instead of relaying them in one-on-one conversations.
  • Significantly improved key bottlenecks in our platform to handle more load concurrently.

Despite all our preparations, we fell short on one of the most important days of the year, and we’ll do everything we can to ensure that this doesn’t happen again. Thank you for using Privy.

Intercom Conversation Stats: an Open-Source Tool by Privy

This is a guest post by Andrew Knollmeyer.

Introducing Intercom Conversation Stats, a tool developed by Privy which is free for anyone to use! This app allows you to gather information about your conversations in Intercom and store it in a Google Sheets document on a regular basis. The provided build aggregates data on conversation tags, but it can be customized to work with any other data from your conversations as well.

Why Build it?

When we first started using Intercom, we were onboarding 5-10 new users per day. With our “all hands” approach to support, it was easy to quantify issues and feature requests that were coming up in support chats. But that didn’t last long.

We’re now onboarding 250+ new users per day, and support chats have climbed to 30 or 40 per day. As these numbers continue to grow, it’s become a lot harder to quantify and truly internalize trends in user feedback.

We designed this tool to help us better evaluate which areas of our product need work. During development, we realized that the support for this kind of system wasn’t all there, and that no one else had publicly released a tool that makes this sort of thing easy to build, so we decided to make it open-source as well.

How Does it Work?

Intercom Conversation Stats is a Rails app which integrates with the Google and Intercom APIs, and uses Sidekiq to automate all of its processes. First, a webhook is sent from Intercom to the application whenever a new conversation is created by one of your users. Intercom Conversation Stats then stores the ID of that conversation in a table, so that conversation’s data can be accessed later. By default, at the end of the week, the application pulls the tags from each of this week’s conversations and counts how many times each tag appeared. The count for each tag is then stored in its respective row in Google Sheets, along with the percent of tag mentions each tag accounted for.
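To make that flow concrete, here is a minimal sketch of the weekly aggregation step in the same stack (Rails + Sidekiq). The class and method names are illustrative rather than the app’s actual code, and the Intercom and Google Sheets calls are hidden behind hypothetical wrappers:

  require 'sidekiq'

  # Illustrative sketch only. Assumes a TrackedConversation model that stores the
  # conversation IDs received from the Intercom webhook.
  class WeeklyTagStatsWorker
    include Sidekiq::Worker

    def perform
      conversations = TrackedConversation.where('created_at >= ?', 1.week.ago)

      # Tally how many times each tag appeared across this week's conversations.
      counts = Hash.new(0)
      conversations.find_each do |conversation|
        fetch_tag_names(conversation.intercom_id).each { |tag| counts[tag] += 1 }
      end

      total = counts.values.sum
      counts.each do |tag, count|
        percent = (100.0 * count / total).round(1)
        append_sheet_row(tag, count, percent) # one row per tag in the spreadsheet
      end
    end

    private

    # Hypothetical wrapper around the Intercom API: returns the tag names
    # attached to a conversation.
    def fetch_tag_names(conversation_id)
      []
    end

    # Hypothetical wrapper around the Google Sheets API.
    def append_sheet_row(tag, count, percent)
    end
  end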

How Can I Use it?

The GitHub repository can be accessed here. Instructions on setup and customization can be found in the README on that page. If you would like to request instructions for setup on a platform other than Heroku, feel free to contact us at intercom@privy.com.

Excuses not to Test

At Privy, one of our values is pragmatism – and that means avoiding dogmas that are inflexible and impractical. So naturally, we don’t enforce test-driven development or 100% unit test coverage, even if those things are valuable in the abstract.

OK, so Privy is “pragmatic” and therefore doesn’t require formal proofs of correctness and all-du-paths coverage to check in code, because it’s not cost effective. This is such a common reaction to such a widely accepted belief that it essentially conveys no information at all; outside of extraordinary operations (like NASA), no one operates like that anyway. So how do we determine what to test, and how much?

How do I know if it needs tests?

First, it helps to understand some models we use to form the basis of our intuitions. The most important is the Pareto Principle: the general observation that — most of the time — 80% of a result comes from 20% of the effort. This means we should concentrate our testing on areas that are most likely to make development more efficient, and prevent the majority of the really nasty bugs. Naturally, that means things like complicated but critical core classes, and modules that are emotionally fraught or have difficult-to-untangle side effects like billing and subscription management.

The second is the idea that testing is (among other things) a form of protection from downside risks – lost users/customers, negative press/brand value, bad builds that waste engineer time, etc. Like insurance, you want your protection to be proportionate to the amount of downside risk you are exposed to. A young startup with very few customers/revenue/engineers has, in absolute terms, very little downside. It should act accordingly. Sometimes, at Privy that means we have a lot of code that we haven’t gotten around to testing yet, or even consciously decide that it will not be tested for the foreseeable future.

Third, there is a lot of context to consider. How fault-tolerant are your users? How experienced is your team? Are these variables going to trend up or trend down over time? Generally, a less experienced engineer should write more tests, for the same reason an inexperienced driver should be more deliberate and unfailing in using turn signals. It will also have beneficial side effects, like making it obvious what code is hard to test, and therefore highly likely to be architecturally suspect. On the other end of the spectrum, if you are a star engineer tasked with inventing the modern internet and have six weeks to do it, you will probably decide to skip a unit test here or there.

“You mean you don’t have 95% code coverage as a minimum requirement?” some say indignantly, “That’s not software engineering – that’s hobbyist amateurism; I hope you’re just building toy projects.” Actually, our enterprise marketing platform powers hundreds of millions of monthly page views, and is growing at double digit percentages on slow months.

But enough of that. Now we have a high-level mental model to use as a general framework for deciding whether tests will be useful; below, I’ve put together a list of finer-grained risk factors that may be useful for evaluating specific modules or classes.

What are the risk factors?

Note that many of these dimensions are not truly independent variables – erring on the side of completeness, there are sure to be some in this list that overlap or are causally related:

Maturity. How mature is the product?
Newer products, services, modules and classes can probably get away with less testing, as a result of the large amount of churn that is likely in the code, and the relatively lower downside risk – fewer users, downstream dependencies, etc.

Impact. If it were to fail, how bad would the result be?
Financial, reputational, or otherwise. Problems arising from defects that are hard or impossible to unwind (e.g., security compromises or data loss) deserve extra scrutiny. Maturity and impact go hand in hand – most young enterprises need to be focused on solving problems and building value, rather than protecting what little they have from downside risks.

Release Cadence. How quickly can an identified defect be resolved in production?
The faster you can deploy changes, the less risk overall that any given bug will have a material impact (some exceptions apply…think security). Continuous integration and deployment with highly effective test coverage is the surest way to have a fast release cadence, making defects in production less impactful.

Downstream dependencies. How many other modules/services depend on it?
More downstream dependencies means more risk, due to the coordination problem. It also implies your interfaces are stable and thus cheaper to test.

Upstream dependencies. How many other modules/services does it depend on?
More unstable upstream dependencies means more risk of breakage. If your dependencies are themselves not well tested, then testing “transitively” might be worthwhile.

Noisiness. If it were to fail, how soon would you notice?
Silent failures that take longer to detect deserve more scrutiny, because it’s usually harder to correct something the more time passes before it is discovered. Logs are rotated out, servers come out of service, repro steps are forgotten, etc.

Churn. How often is the code changing?
The more code is likely to change and be refactored, the more likely bugs will inadvertently be introduced. Conversely, the more interfaces are likely to change and be refactored, the more expensive tests will be to maintain, potentially tipping the cost/benefit equation.

Wrapping up

This is not meant to be an indictment of testing. Testing is — and should be — part of what it means to develop software professionally. This is more an attempt to formalize the real tradeoffs we’re balancing every day under heavy pressure, rather than providing a neat set of post-hoc rationalizations.

On the contrary, I think this helps highlight when we are making excuses instead of well-reasoned judgments — I’ve sometimes fallen into the trap of using one criterion from this list to argue for one approach or another, while ignoring another that didn’t support my position. I hope that by laying out and curating this list over time, we’ll be able to make more balanced and consistent decisions around testing.

Database Concurrency, Part 2

This is part two of a series on database concurrency. Read the introduction at Database Concurrency, Part 1.

Last time, I talked about multi-version concurrency control, or MVCC, and how it enables highly concurrent database systems while still guaranteeing transaction isolation. This is because MVCC allows reality (from the perspective of two distinct transactions) to diverge, giving us the unique advantage that readers and writers don’t have to block each other. But how does it achieve this in practice, and what are the caveats?

Let me step back a bit and define some terms:

Transaction: A unit of work with well defined start and end boundaries, composed of a number of operations.

Isolation: The property that makes concurrent transactions appear as if they were executing serially.

Isolation in practice

Because the isolation property requires concurrent transactions to appear as if they are executing serially (one after the other), they must not interfere with each other by definition. Unfortunately, isolation in SQL is not a boolean property, but one of degree; some workloads don’t benefit from full isolation, which is pretty slow. Here are some common isolation levels:

Read uncommitted: I like to think of this as “effectively, no isolation” because dirty read is allowed: uncommitted changes are globally visible. For example, you might have transactions:

T1:
  Subtract $50 from A
  Add $50 to B
T2:
  Read balances from A and B

In this case it’s possible that T2 executes in the middle of T1 and finds that $50 has disappeared without explanation. This makes it nearly impossible to reason about anything under concurrency.

Read committed: This level disallows dirty read, so a transaction can only see committed data. However, it still allows non-repeatable read, an anomaly in which a transaction re-reads data and finds that it has changed, because another transaction committed changes in between the two reads.

Repeatable read: This level disallows non-repeatable read. Unfortunately, it still allows phantom read, a different phenomenon where the same SELECT might return a different set of rows.

Wait – “non-repeatable” and “phantom” reads?

These two anomalies might at first seem identical, but according to the ANSI SQL spec, a read is “repeatable” or “non-repeatable” at the row level. Repeatable reads guarantee the same row will always have the same data.

A “phantom” read is a phenomenon at the result set level. This means the set might be different, even if no single row has changed. The simplest case is when an INSERT is committed in another transaction, causing it to be returned in a new read query.
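A sketch may make the distinction more concrete (ActiveRecord syntax, with an assumed Account model; behavior also varies by database – PostgreSQL’s repeatable read, for instance, is really snapshot isolation and prevents phantoms as well):

  Account.transaction(isolation: :repeatable_read) do
    balance = Account.find(1).balance                # row-level read
    rich    = Account.where('balance > 1000').count  # result-set-level read

    # ...another session inserts a new account with balance 5000 and commits...

    Account.find(1).balance               # must still equal `balance` (reads are repeatable)
    Account.where('balance > 1000').count # per the ANSI definition, may now differ from
                                          # `rich` – a phantom read
  end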

Confusion here is justified, because this particular definition of “non-repeatable read” is not obvious: repeatable reads do not guarantee repeatable results. And there are a lot of sources where “phantom read” is said to be a special case of “non-repeatable read,” including Wikipedia[1], in contradiction to the spec. Aren’t databases fun?

Sometimes you need a sledgehammer

Ideally, we’d have an isolation level where dirty read, non-repeatable read, and phantom read weren’t allowed. So where does this lead us?

The good news: it turns out there are actually two distinct isolation levels that guarantee this, snapshot isolation and serializable.

The bad news: disallowing all the above anomalies is not enough to guarantee serializability, which is why there are two. We used to think snapshot isolation would prevent read anomalies, until some folks proved it couldn’t. Also, both Oracle and PostgreSQL have at one point or another called snapshot isolation “serializable,” even though it isn’t, and the SQL spec itself seems to assume that preventing these three phenomena is equivalent to serializable isolation, even though this is easy to disprove.

So far this seems pedantic, so let’s look at an example. Imagine two empty tables A and B, and two concurrent transactions that insert a count of the rows in the other table:

T1:
  INSERT INTO B SELECT count(*) FROM A;
T2:
  INSERT INTO A SELECT count(*) FROM B;

If these two transactions run at the same time under snapshot isolation, they will each insert a single row containing 0. This makes sense, because each transaction has its own snapshot of the database, in which it sees the other table as empty. However, of the two possible serial orderings, [T1, T2] and [T2, T1], neither is consistent with what actually happened – in either ordering, the second transaction would see one row in the other table and insert a 1. So these transactions are not serializable; but they also didn’t suffer any of the three anomalies we’ve defined so far. =(

We’re seeing a new anomaly not defined in the spec: write skew.
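For completeness, here is a hedged sketch of what the sledgehammer looks like from application code, using the toy tables above (assuming each has a single integer column n) on a database with true serializability, such as PostgreSQL 9.1+: one of the two concurrent transactions is aborted with a serialization failure, and the caller retries it.

  begin
    ActiveRecord::Base.transaction(isolation: :serializable) do
      count = ActiveRecord::Base.connection.select_value('SELECT count(*) FROM a').to_i
      ActiveRecord::Base.connection.execute("INSERT INTO b (n) VALUES (#{count})")
    end
  rescue ActiveRecord::SerializationFailure
    retry # the database detected the write skew; rerun the aborted transaction
  end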

Enforcing isolation

The ANSI SQL standard doesn’t actually prescribe how to implement each isolation level; it only describes the “phenomena” each level prohibits. In practice, most databases use a combination of version tracking and row locking. And most row locking implementations use some variation of reader-writer locks, which are common in many software systems. Readers don’t interfere with other readers, so they don’t block each other, but they do block writers. Writers block readers, as well as each other.

Here is a table that shows how long locks are held for each kind of operation at each isolation level, when using lock-based concurrency control:

Isolation level  | Write operation | Read operation | Range operation
Read Uncommitted | Statement       | Statement      | Statement
Read Committed   | Until Commit    | Statement      | Statement
Repeatable Read  | Until Commit    | Until Commit   | Statement
Serializable     | Until Commit    | Until Commit   | Until Commit

As you might have guessed, stronger isolation guarantees need more locking, decreasing performance. Note that at the lowest level of isolation, we release locks after every statement, which is very performant but, as we’ve seen, extremely unsafe. At the high end of serializable isolation, every lock we acquire to execute a statement is held until commit. This strategy has a special name: two-phase locking.

To guarantee serializability, two-phase locking has (you guessed it) two phases: an acquiring phase (releasing locks is not allowed) and a releasing phase (acquiring new locks is not allowed). Strict two-phase locking goes further and waits until the very end to release all locks at once during commit, which is helpful for preventing cascading aborts, although it costs concurrency and is more prone to deadlocking.
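From the application’s point of view, explicit row locks give you the same “hold everything until commit” behavior. A sketch using ActiveRecord’s pessimistic locking, with an assumed Account model:

  ActiveRecord::Base.transaction do
    from = Account.lock.find(1)  # SELECT ... FOR UPDATE: exclusive lock on this row
    to   = Account.lock.find(2)  # locks are only acquired, never released, mid-transaction
    from.update!(balance: from.balance - 50)
    to.update!(balance: to.balance + 50)
  end                            # all locks are released together here, at commit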

The big picture

The point of all this is that transaction isolation in MVCC is not easy to get right. Even at snapshot isolation — the highest[2] isolation level provided by some databases — it’s easy to run into bugs that will silently corrupt your data.

One mindset is to stick your fingers in your ears and either 1) accept that data will occasionally be silently corrupted, or 2) forfeit performance by running your database under true serializable isolation. The other is to solve these problems at the application level, by deliberately structuring your transactions to avoid these anomalies.

Next up: How to work at read committed / repeatable read isolation and still prevent common anomalies at these levels!

[1] And some databases even treat “repeatable read” as equivalent to “serializable.”
[2] Technically though, snapshot isolation is not a superset of repeatable read, because SI allows anomalies that do not occur under RR. See “A5B Write Skew” in A Critique of ANSI SQL Isolation Levels.

Database Concurrency, Part 1

This is part one of a series I’ll be writing about database concurrency. Since this first post is a broad overview, I have simplified many concepts here.

High performance databases must support concurrency. As in many other software systems, databases can use read/write locks to maintain consistency under concurrent use (MyISAM in MySQL does this, for example). Conceptually – this is pretty simple: 1) There can be multiple readers. 2) Readers block writers. 3) Writers block each other as well as readers.
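The same three rules are easy to see at the application level. Here is a minimal sketch using concurrent-ruby’s ReadWriteLock (not database code, just the locking rules in miniature):

  require 'concurrent'

  lock  = Concurrent::ReadWriteLock.new
  table = { a: 100, b: 50 }

  readers = Array.new(3) do
    Thread.new { lock.with_read_lock { table.values.sum } }  # readers run concurrently
  end
  writer = Thread.new do
    lock.with_write_lock { table[:a] += 10 }  # waits for readers, excludes everyone else
  end

  (readers + [writer]).each(&:join)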

Modern database concurrency control has taken this concept pretty far, and protocols like strict 2-phase locking can give you concurrency and strong serializability guarantees. However, any system that depends on this sort of concurrency control suffers an inherent scalability problem: read/write locks prevent readers and writers from running simultaneously – by design. As the volume of reads and writes scales up, you run into situations with unnecessary queueing, or where either readers or writers get starved. MyISAM, for example, prioritizes writes ahead of reads; get enough write volume and your reads will block forever, because write operations will perpetually “cut in line.”

There’s no easy solution here. Prioritize readers ahead of writers? Now you’re going to suffer the opposite problem [1]. Set up a read-only slave? Enjoy dealing with your replication lag. Sharding? That almost makes NoSQL look attractive.

Most reasonably large systems have a lot of read and write transactions going on at once, so it’s not something we can really sweep under the rug.

MVCC is MVP

An interesting concurrency strategy in modern database systems is called multiversion concurrency control (MVCC). The fundamental insight behind MVCC is that readers and writers in different transactions will never block each other if we allow reality to briefly diverge and converge. Why is this useful?

  • Every transaction starts isolated from every other transaction; each can pretend it is the only one running.
  • We can now perform multiple operations and commit them as an all-or-nothing operation, guaranteeing the operations succeed or fail together.
  • We can now read data from a consistent snapshot of the database, even as it continues to change in the background.

Allowing multiple versions of the database to exist simultaneously means we can provide all these guarantees under high concurrency. This is actually pretty incredible, if you think about it. But it’s not all rainbows and sunshine.

“Eventual” consistency? No, sorry, we need a production database

Astute readers will probably realize supporting multiple consistent versions of reality complicates a lot of things that would otherwise be simple. For example, here are just a few complications to account for read queries:

  • You can no longer quickly count the rows in a table. This seems to make many people both confused and angry, because it seems so unbelievably simple. The reality is that MVCC is very, very complicated. Remember, any number of transactions could have made INSERTs or DELETEs that are currently invisible to your SELECT query, so the actual count depends on which operations are visible to the current transaction (see the sketch after this list). An index is no silver bullet, because you still have to find and ignore all those pesky invisible rows.
  • In addition, a DELETE statement doesn’t necessarily delete a row, even after you commit. What if there is another transaction that started when that row still existed, and hasn’t finished yet?
  • UPDATE statements don’t update in place – they write a new row version. The old version has to be kept around for transactions that haven’t seen the update yet, or in case the transaction that wrote the UPDATE rolls back.
  • If you do anything involving range queries, such as SELECT * from accounts where balance > 1000, the database has to do all kinds of acrobatics with range locking, next-key locking, etc to ensure that gosh darnit, no insert or update operation in any other transaction can change this result until the transaction completes.
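Here is the counting problem from the first bullet in miniature (PostgreSQL-style snapshot behavior, with an assumed Widget model and illustrative numbers):

  # Session 1: a repeatable-read transaction reads from one consistent snapshot.
  ActiveRecord::Base.transaction(isolation: :repeatable_read) do
    Widget.count  # => 100
    # Session 2 inserts 10 widgets and commits here; those row versions exist,
    # but are invisible to this snapshot.
    Widget.count  # => still 100
  end

  Widget.count    # => 110, once we read with a newer snapshot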

Which brings me to the elephant in the room: how to reconcile different versions of reality. Because eventually, you’re going to encounter the database equivalent of a merge conflict:

  • What happens if you try to UPDATE a row that a different transaction has updated with more recent data?
  • What is the most recent data? The most recent commit? The most recent uncommitted change?
  • How should constraint violations be handled? For example, what happens if two transactions try to claim the same unique username?

And the worst part of it is, under the default settings in Postgres and MySQL/InnoDB, these anomalies can silently corrupt your data with lost updates (an update gets reverted accidentally by a transaction that never knew about it) or write skew (two transactions read data and then write consistent updates that conflict when merged).

Next up: the different transaction isolation levels available in MVCC. Update: read part 2 here.

[1] Yes, you could use some sort of fairness algorithm, but that still doesn’t solve the queueing problem.

How we sped up our background processing 150x

Performance has always been an obsession of mine. I enjoy the challenge of understanding why things take as long as they do. In the process, I often discover that there’s a way to make things faster by removing bottlenecks. Today I will go over some changes we recently made to Privy that resulted in our production application sending emails 150x faster per node!

Understanding the problem

When we started exploring performance in our email queueing system, all our nodes were near their maximum memory limit. It was clear that we were running as many workers as we could per machine, but CPU utilization was extremely low, even when all workers were busy.

Anyone with experience will immediately recognize that this means these systems were almost certainly I/O bound. There are a couple of obvious ways to fix this. One is to perform the I/O asynchronously. Since these were already supposed to be asynchronous workers, this didn’t intuitively seem like the right answer.

The other option is to run more workers. But how do you run more workers on a machine already running as many workers as can fit in memory?

Adding more workers

We added more workers per node by moving from Resque to Sidekiq. For those who don’t know, Resque is a process-based background queuing system. Sidekiq, on the other hand, is thread-based. This is important, because Resque’s design means a copy of the application code is duplicated across every one of its worker processes. If we wanted two Resque workers, we would use double the memory of a single worker (because of the copy-on-write nature of forked process memory in Linux, this isn’t strictly true, but it was quite close in our production systems due to the memory access patterns of our application and the Ruby runtime).

Making this switch to Sidekiq allowed us to immediately increase the number of workers per node by a factor of roughly 6x. All the Sidekiq workers are able to more tightly share operating system resources like memory, network connections, and database access handles.
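The worker code itself barely changes between the two libraries; what changes is the execution model. A sketch (EmailReminderJob, EmailReminderWorker, and EmailSender are illustrative names, not our actual classes):

  require 'resque'
  require 'sidekiq'

  # Resque: process-based. A child process is forked to run each job.
  class EmailReminderJob
    @queue = :email

    def self.perform(claimed_offer_id)
      EmailSender.deliver_reminder(claimed_offer_id) # stand-in for the real delivery code
    end
  end

  # Sidekiq: thread-based. Many jobs run as threads inside one copy of the app.
  class EmailReminderWorker
    include Sidekiq::Worker
    sidekiq_options queue: :email

    def perform(claimed_offer_id)
      EmailSender.deliver_reminder(claimed_offer_id)
    end
  end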

How did we do?

This one change resulted in a performance change of nearly 30x (as in, 3000% as fast).

Wait, what?

Plot twist!

How did running more workers also make each worker roughly 5x faster? I had to do some digging. As it turns out, there are a number of things that make Resque workers slower:

  • Each worker process forks a child process before starting each job. This takes time, even on a copy-on-write system like Linux.
  • Then, since there are now two processes sharing the same connection to Redis, the child has to reopen the connection.
  • Now, the parent will have to wait on the child process to exit before it can check the queue for the next job to do.

When we compounded all of these across every worker, it turned out they were, on average, adding a multiple-seconds-long penalty to every job. There is almost certainly something wrong here (and no, it wasn’t paging). I’m sure this could’ve been tuned and improved, but I didn’t explore further, since it was moot at this point anyway.

Let’s do better – with Computer Science™

In the course of rewriting this system, we noticed some operations were just taking longer than felt right. One of these was the scheduling system: we schedule reminder emails in Redis itself, inserting jobs into a set that is sorted by send time. Sometimes things happen that require removing scheduled emails (for example, if the user performs the action we were trying to nudge them toward).

While profiling the performance of these email reminders, I noticed an odd design: whenever the state of a claimed offer changes (including an email being sent), all related scheduled emails are removed and re-inserted (based on what makes sense for this new state). Obviously, this is a good way to make sure that anything unnecessary is removed without having to know what those things are. I had a hunch: If the scheduled jobs are sorted by time, how long would it take to find jobs that aren’t keyed on time?

O(n). Whoops!

It turns out that the time it took to send an email depended linearly on how many emails were waiting to be sent. This is not a recipe for high scalability.

We did some work to never remove scheduled jobs out of order – instead, scheduled jobs check their validity at runtime and no-op if there is nothing to do. Since no operation depends linearly on the size of the queue any more, it’s a much more scalable design.
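Here is a sketch of that pattern (ClaimedOffer and ReminderMailer are illustrative names): the scheduled job re-checks the current state when it runs and exits quietly if it no longer applies, so nothing ever has to be plucked out of the middle of the sorted set.

  class ReminderEmailWorker
    include Sidekiq::Worker

    def perform(claimed_offer_id, reminder_type)
      offer = ClaimedOffer.find_by(id: claimed_offer_id)

      # No-op instead of an O(n) removal from the schedule: if the offer is gone
      # or the user already did what we were nudging them toward, just exit.
      return if offer.nil? || offer.completed?

      ReminderMailer.reminder(offer, reminder_type).deliver_now
    end
  end

  # Scheduling stays a cheap sorted-set insert, keyed by time:
  ReminderEmailWorker.perform_in(24.hours, 1234, 'nudge')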

By making this change, we saw an increase in performance of more than 5x in production.

Summing up

  • Moving from process-based to thread-based workers: ~6x more workers per node.
  • Moving from forking workers to non-forking workers: 5x faster.
  • Removing O(n) operations from the actual email send job: 5x faster.
  • Total speedup: Roughly 150x performance improvement.