This post is a rant about the observability and debuggability of modern software. I’m coming from many years of working on CockroachDB, so I’m biased towards “cloud-native” distributed systems, and SaaS in particular. You can probably guess that I’m not very happy with how we interact with such systems; that unhappiness motivates the product we’re building at Data Ex Machina.
On software observability
In an old talk that really stayed with me over the years, Bryan Cantrill, one of the authors of DTrace, went metaphysical on the reasons why software observability is fundamentally hard — namely that software doesn’t “look like anything”: it doesn’t emit heat, it doesn’t attract mass. So, if software is not naturally observable, how can we observe it? When a program is not working as expected, figuring out why is frequently an ad-hoc and laborious process. If, say, a car engine is misbehaving, you can open the hood and watch the physical processes with your own eyes; no such luck with software.
There are, of course, different ways in which we do get some observability into our programs. We might add different types of instrumentation, which is analogous to baking in some questions that we anticipate having in the future so that the software is able to answer them on demand: what’s the rate of different operations? What are the distributions of latencies of different operations? Has this particular error condition happened recently? What are the stack traces for my threads? We can also get other hints about what the software is doing by observing the environment – CPU usage, memory usage, I/O. I’m talking about metrics, logging, traces, and built-in platform instrumentation. All these techniques are suitable for monitoring — verifying that the software continues to work within desired parameters. But observability should be about more than monitoring; it should be about unknown unknowns, about the ability to react to and understand whatever comes up, and to service our software in the field.
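To make this concrete, here’s a minimal Go sketch of what “baking in questions” looks like in practice, using the Prometheus client library (the myapp_* metric names and the handler are invented for illustration). A counter and a histogram can answer “what’s the request rate?” and “how are latencies distributed?”, but only because someone anticipated those questions when the code was written.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Each metric is a question we anticipated: "how many requests?" and
// "how are latencies distributed?". (Names are made up for this sketch.)
var (
	requestCount = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "myapp_requests_total",
		Help: "Total number of requests handled.",
	})
	requestLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "myapp_request_duration_seconds",
		Help:    "Distribution of request latencies.",
		Buckets: prometheus.DefBuckets,
	})
)

func handle(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() {
		// Record the answers to the pre-baked questions.
		requestCount.Inc()
		requestLatency.Observe(time.Since(start).Seconds())
	}()
	// ... the actual work would go here ...
	w.Write([]byte("ok"))
}

func main() {
	prometheus.MustRegister(requestCount, requestLatency)
	http.HandleFunc("/", handle)
	// The /metrics endpoint is how the baked-in answers get read back out.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Any question that isn’t expressible in terms of these pre-registered metrics (or the logs and traces next to them) simply can’t be answered by this program after the fact.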
We can bake more elaborate questions and answers into our software — custom debug pages, for example. The more you invest in this, the more information you have to work with when observability is needed. But, generally speaking, debugging (getting answers to questions) takes intelligence, skill, and deep expertise. It frequently feels like solving a mystery by going on indirect clues and hunches, when it really should feel more like having a conversation with your system (ask a question, get a straight answer). To torture another car repair analogy, Seinfeld had a bit about your reaction when a car breaks down on the side of the road: you open the hood and hope to see a big on/off switch turned off. In the software world, you open a log file and hope to see an error message printed over and over again. If you see that, then that error probably has something to do with your problem. If you don’t… well, things just got interesting.
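Circling back to custom debug pages for a moment: they’re just another pre-baked answer. In Go, one can be as simple as an HTTP handler that dumps some internal state; the /debug/queue endpoint and the queueState struct below are hypothetical, a sketch of the pattern rather than anyone’s actual code.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"

	// Imported for its side effect: registers the standard /debug/pprof/
	// pages, the same "pre-baked answer" pattern provided by the stdlib.
	_ "net/http/pprof"
)

// queueState is stand-in internal state that we decided, ahead of time,
// to expose. In a real program, other code would update these fields.
type queueState struct {
	mu      sync.Mutex
	Pending int `json:"pending"`
	Failed  int `json:"failed"`
}

var state queueState

func main() {
	// A custom debug page: it can only answer the questions we thought to
	// wire up when we wrote it.
	http.HandleFunc("/debug/queue", func(w http.ResponseWriter, r *http.Request) {
		state.mu.Lock()
		defer state.mu.Unlock()
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(&state)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```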
When instrumentation fails
So, if your program doesn’t have the right instrumentation offering the information you want, what can you do? Well, if the software is running on your laptop and you’re free to modify it, you might add a couple of printf’s, producing new logs. Then, you recompile, re-run, and hope that you can reproduce whatever you were looking at before. Unfortunately, changing the source code for such quick-and-dirty instrumentation does not work “in production” — it’s generally impractical to release a new version of the software for such modifications. Depending on your organization, releasing a new software version can take anywhere between minutes and months. In particular, if you’re shipping software to your users, the update cycles can be very long. Even when you’re hosting all the software yourself, you still need to write a patch, get it approved, trigger a release process which may involve other people, etc. At best, this cycle is discombobulating when you’re in the middle of an investigation. At worst, it’s completely infeasible. Not to mention that instrumentation isn’t free: anticipating the cost of a log statement is not trivial, and adding instrumentation meant for production is a fraught proposition because you have to weigh the risk of logs being too spammy or otherwise expensive.
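For illustration, here’s roughly what that quick-and-dirty loop looks like in Go; processBatch, Item, and process are all made up. On a laptop the cycle is cheap (edit, rebuild, rerun); in production, every iteration of it means shipping a new binary.

```go
package main

import (
	"errors"
	"log"
)

// Item and process stand in for whatever code we suspect is misbehaving.
type Item struct{ ID int }

func process(it Item) error {
	if it.ID%7 == 0 {
		return errors.New("unlucky item")
	}
	return nil
}

// processBatch, with temporary log lines dropped in to narrow down a failure.
func processBatch(items []Item) error {
	log.Printf("processBatch: starting, %d items", len(items))
	for i, it := range items {
		if err := process(it); err != nil {
			// The new log line we hope will explain what's going on.
			log.Printf("processBatch: item %d (%+v) failed: %v", i, it, err)
			return err
		}
	}
	log.Printf("processBatch: done")
	return nil
}

func main() {
	if err := processBatch([]Item{{ID: 3}, {ID: 7}, {ID: 11}}); err != nil {
		log.Printf("batch failed: %v", err)
	}
}
```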
That’s why engineers hate being on-call or on the escalation path for production systems — they look like fools in front of customers when they’re powerless to figure out what’s going on. When an escalation gets to you, you join a Zoom with 20 angry people who already think you’re incompetent because the software you wrote is not working right and, in my experience, it’s not often that you get to change their minds. If you’re not the first engineer to be brought in, the other poor bastard who called you in has probably promised them that the next person will really know what they’re doing, so you’d better deliver.
The real killer is the frequent inability to ask even basic questions of your system: you cannot get information out of your program besides what the program was instrumented to offer. So, besides figuring out what questions you want to ask, you also have to figure out what indirect signals you might use to guess the answer. In fact, the “drunk man’s streetlight effect” tends to take over: you have some limited signals, so you erroneously focus on them instead of actually thinking about the problem from first principles. I find that I almost always have a next step in mind for what data I want to gather and what theory I want to test, at least abstractly, but mapping that onto the instrumentation I have at hand can be very hard. A lot of the time it feels like this mapping should be mechanical, but instead it’s an art.
Before starting Data Ex Machina, Andrew and I were working on CockroachDB, the distributed SQL database. We experienced the pain of trying to debug and understand this sophisticated system over many years. Even though CockroachDB has extensive instrumentation built in (logs, time series, a tracing system, profilers, dedicated debug pages, telemetry-gathering tools, etc.), too often we couldn’t ask the questions we wanted to ask. As a result, debugging was too damn hard! And, by the way, we’re not thinking strictly about “debugging” as in finding the cause of a bug that we just encountered. Again, observability is about much more than hunting specific problems; it’s about generally understanding your software, testing it, developing a relationship with it, validating theories, getting familiar with a new code base, getting insights into a new workload or environment, etc. Seeing what your software is doing is important and beneficial in myriad ways. For example, being able to show a colleague how a program is running does wonders for education and for de-mystifying systems. Opaque systems intimidate people and create folklore. Transparent systems foster a healthy community — we should strive for openness.
Thus, what I have long wanted is a more natural way of asking an arbitrary question of a program and quickly getting an answer. There used to be a tool for that — the debugger. Alas, debuggers don’t quite apply to modern production environments any more (at least not yet, hint hint). I’ll have more to say on the topic in my next post.
In the meantime, if you’re writing Go and are curious about Side-Eye, install the agent or try our demo sandboxes. Or, if anything here sounded interesting, please get in touch! Join our Discord or write to us at contact@dataexmachina.dev.