At Privy, one of our values is pragmatism – so we don’t require formal proofs of correctness and all-du-paths coverage to check in code, because it’s not cost effective (even if those things are valuable in the abstract). But this is such a widely accepted belief that it essentially conveys no information at all; outside of extraordinary operations (like NASA), no one requires 100% path coverage. So how do we determine what to test, and how much?
How do I know if it needs tests?
First, it helps to understand some models we use to form the basis of our intuitions. The most important is the Pareto Principle: the general observation that, most of the time, 80% of a result comes from 20% of the effort. This means we should concentrate our testing on the areas where it is most likely to make development more efficient and prevent the majority of the really nasty bugs. Naturally, that means things like complicated but critical core classes, and modules like billing and subscription management that are emotionally fraught or have difficult-to-untangle side effects.
The second is the idea that testing is (among other things) a form of protection from downside risks – lost users/customers, negative press/brand value, bad builds that waste engineer time, etc. Like insurance, you want your protection to be proportionate to the amount of downside risk you are exposed to. A young startup with few customers, little revenue, and few engineers has, in absolute terms, very little downside. It should act accordingly. At Privy, that sometimes means we have a lot of code that we haven’t gotten around to testing yet, or that we have consciously decided not to test for the foreseeable future.
Third, there is a lot of context to consider. How fault-tolerant are your users? How experienced is your team? Are these variables going to trend up or trend down over time? Generally, a less experienced engineer should write more tests, for the same reason an inexperienced driver should be more deliberate and unfailing in using turn signals. It will also have beneficial side effects, like making it obvious what code is hard to test, and therefore highly likely to be architecturally suspect. On the other end of the spectrum, if you are a star engineer tasked with inventing the modern internet and have six weeks to do it, you will probably decide to skip a unit test here or there.
But enough of that. Now we have a high-level mental model to use as a general framework for deciding whether tests will be useful; below, I’ve put together a list of some finer-grained risk factor dimensions that may be useful for evaluating specific modules or classes.
What are the risk factors?
Note that many of these dimensions are not truly independent variables – I have erred on the side of completeness, so some items in this list are sure to overlap or be causally related:
Maturity. How mature is the product?
Newer products, services, modules and classes can probably get away with less testing, as a result of the large amount of churn that is likely in the code, and the relatively lower downside risk – fewer users, downstream dependencies, etc.
Impact. If it were to fail, how bad would the result be?
Financial, reputational, or otherwise. Problems arising from defects that are hard or impossible to unwind (e.g., security compromises or data loss) deserve extra scrutiny. Maturity and impact go hand in hand – most young enterprises need to be focused on solving problems and building value, rather than protecting what little they have from downside risks.
Release Cadence. How quickly can an identified defect be resolved in production?
The faster you can deploy changes, the lower the overall risk that any given bug will have a material impact (some exceptions apply; think security). Continuous integration and deployment with highly effective test coverage is the surest way to have a fast release cadence, making defects in production less impactful.
Downstream dependencies. How many other modules/services depend on it?
More downstream dependencies means more risk, due to the coordination problem. It also implies your interfaces are stable and thus cheaper to test.
Upstream dependencies. How many other modules/services does it depend on?
More unstable upstream dependencies means more risk of breakage. If your dependencies are themselves not well tested, then testing “transitively” might be worthwhile.
Noisiness. If it were to fail, how soon would you notice?
Silent failures that take longer to detect deserve more scrutiny, because it’s usually harder to correct something the more time passes before it is discovered. Logs are rotated out, servers come out of service, repro steps are forgotten, etc.
Churn. How often is the code changing?
The more code is likely to change and be refactored, the more likely bugs will inadvertently be introduced. Conversely, the more interfaces are likely to change and be refactored, the more expensive tests will be to maintain, potentially tipping the cost/benefit equation.
This is not meant to be an indictment of testing. Testing is, and should be, part of what it means to develop software professionally. This is more an attempt to formalize the real tradeoffs we’re balancing every day under heavy pressure than to provide a neat set of post-hoc rationalizations.
On the contrary, I think this helps highlight when we are making excuses instead of well-reasoned judgments. I’ve sometimes fallen into the trap of using one criterion from this list to argue for one approach or another, while ignoring another one that didn’t support my position. I hope that by laying out and curating this list over time, we’ll be able to make more balanced and consistent decisions around testing.