Privy Engineering

Building Privy.com

November 25, 2016 Outage Postmortem

On Friday, November 25th, beginning at 1:32 PM Eastern US time, the Privy.com platform suffered an outage lasting roughly 3 hours.

During this incident, the Privy dashboard was completely unavailable: customers could not sign up for a new account, log in to an existing account, or manage their campaigns.

A large proportion of campaigns (fluctuating between ~20% and ~70%) failed to load on our customers’ websites. When they did load, they often took up to 30 seconds to do so. Users could opt into these campaigns, but form submissions were slow to process and returned an error message even though the submissions had succeeded. Thank-you pages did not display after a successful signup.

Our email and contact sync systems were unaffected. All successful signups synced to their configured destinations, and emails (autoresponders and scheduled sequences) sent as usual to their recipients.

The proximate cause of this issue was that our database systems were overwhelmed. The engineering team at Privy had prepared for the Black Friday weekend by making roughly 4 times the usual computing resources available. However, a few unanticipated performance problems were magnified under the load. In addition, a bug caused by incompatible third-party code resulted in a subset of our accounts unintentionally sending up to 20 times more activity data than they should have. Together, these issues generated a workload that our systems could not handle.
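To give a rough sense of how quickly a 20-fold multiplier on a subset of accounts can consume 4-fold headroom, here is a back-of-the-envelope sketch. It is illustrative only: it assumes every account normally contributes equal load, that affected accounts all hit the full 20x upper bound, and that baseline traffic is otherwise unchanged (which was certainly not the case on Black Friday). The function names are hypothetical and the actual fraction of affected accounts is not stated here.

```python
def relative_load(affected_fraction: float, multiplier: float = 20.0) -> float:
    """Total activity load relative to a normal baseline of 1.0.

    Assumes (hypothetically) that all accounts contribute equal load and that
    affected accounts send `multiplier` times their usual activity data.
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    return (1.0 - affected_fraction) + affected_fraction * multiplier


def breakeven_fraction(capacity: float = 4.0, multiplier: float = 20.0) -> float:
    """Fraction of affected accounts at which load equals provisioned capacity.

    Solves (1 - f) + f * multiplier = capacity for f.
    """
    return (capacity - 1.0) / (multiplier - 1.0)


if __name__ == "__main__":
    f = breakeven_fraction()
    print(f"Break-even affected fraction: {f:.1%}")
    print(f"Relative load at that fraction: {relative_load(f):.2f}x")
```

Under these simplified assumptions, the extra capacity would be used up once somewhat under one in six accounts was sending 20 times its usual data, before accounting for any holiday traffic increase or the other performance problems.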

Privy engineering immediately investigated these issues and deployed emergency workarounds to restore full availability. At about 4:41 PM, our systems began to recover. Today, I am happy to report that all of these workarounds have been removed, and that the identified performance issues have been addressed.

However, while focused on solving the issue at hand, the engineering team initially failed to communicate the impact, expected time to resolution, and other important details of the incident to our customer support team. This resulted in a lack of details, contradictory information, and an overall frustrating experience for both our support team and our customers.

Here are the things we have done to ensure this doesn’t happen again:

  • Updated our incident handling documentation so that we can more quickly identify, communicate, and resolve common problems.
  • Changed our engineering roadmap to ensure that in the future, we can broadcast important news and status updates to all of our customers, instead of relaying them in one-on-one conversations.
  • Significantly reduced key bottlenecks in our platform so that it can handle more concurrent load.

Despite all our preparations, we fell short on one of the most important days of the year, and we’ll do everything we can to ensure that this doesn’t happen again. Thank you for using Privy.