In my previous post, I was ranting about how hard it is to debug a complex system based on the instrumentation we’ve programmed into the software. What I’ve long wanted is a more natural way of asking the program an arbitrary question; I’d like to quickly get an answer, then ask the next question, until we get to the root of the matter. An example of a tool that works like this is a debugger: it has an interactive, conversational mode of use where you converse with the program and iteratively dig deeper into the problem. Crucially, the debugger doesn’t rely on any particular instrumentation being pre-programmed; it can answer questions that the software is not programmed to answer.

A debugger used to be considered a basic part of the programmer’s daily toolbox: a programming platform would come with a text editor, a compiler and a debugger. But somehow debuggers got lost along the way. They seem to have stayed stuck in the desktop software era and don’t quite work for the cloud computing / SaaS era. Traditionally, debuggers have a pretty rigid usage model where you attach to one process, which you then stop in order to poke around in its memory. There’s a human in the loop responsible for pausing the program, thinking for a while and typing some commands, then resuming the program and maybe stopping it again later.

That feels a bit quaint these days. More and more, the systems we’re debugging are distributed across multiple processes, containers and machines. You can’t “pause” any of the processes for any noticeable amount of time; if you did, you’d cause a big disruption to your service, and you’d also quickly destroy the very state that you were trying to observe. Some debuggers, such as GDB or Delve, allow you to script the interactions with the target program; even so, these interactions are way too slow. In short, existing debuggers simply don’t apply to production environments.
Alternatives to debuggers
I’m not the first to observe this, of course. Advanced technology for debugging production software has existed in various forms for years. DTrace was invented at Sun Microsystems in 2004, letting people instrument running systems. You’d insert probes at arbitrary locations in your program (and also in the OS kernel) and run some custom code to collect information whenever execution reaches a probe. You can, for example, see how often a certain function is called and with what arguments. Instead of the classic approach, where a debugger stops your program at a breakpoint so that a user can read its memory, DTrace would “dynamically instrument” a running program by changing the program text on the fly, effectively injecting new code into it. This is akin to asking a question and getting an answer, particularly a question about control flow. Although revolutionary, DTrace stayed mostly in the Solaris space. Still, its authors were really on to something!
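To make the "how often is this function called, and with what arguments" question concrete, here is a rough analogy in Python. It is not DTrace and it does not rewrite program text; it leans on the interpreter's `sys.setprofile` hook to observe calls in code that contains no instrumentation of its own. The names `trace_calls`, `handle_request` and `workload` are made up for this sketch, and it assumes the arguments are hashable.

```python
import sys
from collections import Counter


def trace_calls(target_name, workload):
    """Run workload() and count calls to functions named target_name, keyed by arguments."""
    calls = Counter()

    def profiler(frame, event, arg):
        # The "call" event fires on entry to every Python function.
        if event == "call" and frame.f_code.co_name == target_name:
            code = frame.f_code
            # Positional parameters are the first co_argcount local variable names.
            names = code.co_varnames[:code.co_argcount]
            args = tuple(frame.f_locals[name] for name in names)
            calls[args] += 1  # assumes hashable arguments

    sys.setprofile(profiler)
    try:
        workload()
    finally:
        sys.setprofile(None)  # always detach the "probe"
    return calls


# Hypothetical program under observation; note that it has no logging or counters.
def handle_request(path, user):
    return f"{user} fetched {path}"


def workload():
    handle_request("/home", "alice")
    handle_request("/home", "alice")
    handle_request("/admin", "bob")


if __name__ == "__main__":
    for args, count in trace_calls("handle_request", workload).items():
        print(args, count)
```

The point of the analogy is that the question ("who calls `handle_request`, and how?") is decided at observation time, not when the program was written.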
In the Linux world, various similar tools were developed over time. In the last few years, eBPF, a new technology[1], has taken the Linux world by storm. It allows for the execution of small programs at various hooks in the Linux kernel. It also allows for these programs to be run in response to dynamic probes placed in user space programs, at arbitrary program locations. These probes are programs that can read memory from the process they’re injected into. Notably, it’s all “safe” (i.e. the execution of the host process cannot be affected) and performant. To give a taste of the world of possibilities, one can, for example, insert a program at the beginning of a function serving RPCs or HTTP requests; the probe can inspect the request and, if it matches a filter, make a note that the current thread is being “traced”, causing another program inserted at the function’s return point to output information about the result.

eBPF by itself is not an observability solution or product; executing eBPF programs is a low-level capability of the Linux kernel. There are observability tools built on eBPF, such as bpftrace, Pixie, and Polar Signals. Out of these, bpftrace is the closest to a general-purpose debugger, as it allows you to do things like attach to arbitrary functions and read the function arguments.

Another venerable dynamic instrumentation tool for Linux is SystemTap. Like bpftrace, SystemTap lets you write scripts that run in response to events generated by the kernel or by user space programs. The scripts can collect and export data, such as variables, by decoding memory according to the target program’s debug information (similar to how a debugger does it). Historically, SystemTap scripts were compiled into a new kernel module that was then loaded into the kernel; more recently there is an eBPF execution backend allowing scripts to be compiled into eBPF programs.
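The request-filtering example from earlier in this section (an entry probe that marks the thread, a return probe that reports the result) can be sketched in plain Python. This is only a simulation of the logic: there is no eBPF here, the "probes" are attached by wrapping a function instead of patching a running binary, and every name (`FilteredTracer`, `handle`, the request fields) is hypothetical. The set of thread IDs stands in for the kind of shared map a pair of real probes would use to communicate.

```python
import threading


class FilteredTracer:
    """Simulates an entry probe and a return probe that cooperate through shared state."""

    def __init__(self, matches):
        self.matches = matches   # filter predicate applied to each request
        self.traced = set()      # IDs of threads currently marked as "traced"
        self.events = []         # results emitted by the return probe
        self.lock = threading.Lock()

    def on_entry(self, request):
        # Entry probe: inspect the request and mark the thread if it matches.
        if self.matches(request):
            with self.lock:
                self.traced.add(threading.get_ident())

    def on_return(self, result):
        # Return probe: emit the result only if the entry probe marked this thread.
        tid = threading.get_ident()
        with self.lock:
            if tid in self.traced:
                self.traced.discard(tid)
                self.events.append(result)

    def attach(self, func):
        """Return func wrapped with both probes (a stand-in for runtime injection)."""
        def probed(request):
            self.on_entry(request)
            try:
                result = func(request)
            except BaseException:
                # Clear the mark so a later request on this thread is not misreported.
                with self.lock:
                    self.traced.discard(threading.get_ident())
                raise
            self.on_return(result)
            return result
        return probed


# Hypothetical request handler under observation.
def handle(request):
    status = 404 if request["path"] == "/missing" else 200
    return {"path": request["path"], "status": status}


if __name__ == "__main__":
    tracer = FilteredTracer(lambda r: r.get("user") == "alice")
    handle = tracer.attach(handle)
    handle({"path": "/home", "user": "bob"})       # does not match the filter
    handle({"path": "/missing", "user": "alice"})  # matches, so the result is recorded
    print(tracer.events)
```

Keying the shared state by thread ID is what lets two otherwise independent probes agree on which invocation they are talking about.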
So, there is basic technology out there for bringing debuggers into modernity. But the main sins of bpftrace and SystemTap are that a) they only operate at the level of a single machine, b) they’re not user-friendly (you have to script your way to the data that you want instead of clicking around), and c) they don’t encourage maintaining any history of the data that you or your colleagues have collected (it’s up to you to save your scripts somewhere for reuse and share them with others). Not to mention that these tools are mostly geared toward observing the kernel rather than user space, and are generally focused on C/C++ programs. I think it’s fair to say that they target experts and that none of them has truly delivered dynamic instrumentation capabilities to the masses. No tool I’ve seen in this space feels user-friendly enough to use frequently[2].
As a result of all these issues, we would almost never reach for dynamic instrumentation tooling when debugging a CockroachDB issue. Which raises the question: what tool would we have used? I’ll talk about that in my next post.
[1] Technically, BPF was invented in the 90s. In the past few years, though, it has been radically expanded and gained widespread use.
[2] Except in the web applications world. The browser platform, and modern observability tools built for it, are powerful and user-friendly. We should all demand such nice things.