A next-generation debugger

If you buy the premise of my previous post about how existing debuggers don’t quite work for modern cloud software, particularly in production environments, what would it take to build a debugger that does work? We’ve been building one for Go called Side-Eye, so we have some ideas. At a low level, half of what makes debuggers tick still applies just fine: using the debug information produced by compilers to make sense of a program’s memory (find variable locations, decode structs, etc.). The other half needs re-engineering: we need new mechanisms for extracting data from programs without “pausing” them. A traditional debugger uses ptrace to stop the target program and take control, then reads its memory using other system calls; this is all very slow. Instead, we should be using dynamic instrumentation techniques (Side-Eye generates eBPF programs that read the data from the target process’ memory based on the compiler-produced debug information).
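
To make the first half concrete, here is a minimal sketch of reading compiler-produced debug information using only Go’s standard-library debug/elf and debug/dwarf packages. It finds a function by name and lists its parameters and top-level local variables; the binary path and function name are hypothetical placeholders, and a real debugger would go further and decode each variable’s DWARF location expression to learn where it lives at runtime.

    // dwarfvars.go: a sketch of using compiler-produced debug information
    // (DWARF) to find a function and list its parameters and local variables.
    // The binary path and function name are hypothetical placeholders.
    package main

    import (
        "debug/dwarf"
        "debug/elf"
        "fmt"
        "log"
    )

    func main() {
        const binary = "./myserver"       // hypothetical target binary
        const target = "main.handleQuery" // hypothetical function of interest

        f, err := elf.Open(binary)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        d, err := f.DWARF() // fails if the binary was stripped
        if err != nil {
            log.Fatal(err)
        }

        r := d.Reader()
        for {
            e, err := r.Next()
            if err != nil {
                log.Fatal(err)
            }
            if e == nil {
                log.Fatalf("function %s not found", target)
            }
            if e.Tag != dwarf.TagSubprogram {
                continue
            }
            if name, _ := e.Val(dwarf.AttrName).(string); name != target {
                continue
            }
            fmt.Println("function:", target)
            if !e.Children {
                return
            }
            // The function's parameters and locals are its child entries;
            // a null entry (Tag == 0) marks the end of the children.
            for {
                c, err := r.Next()
                if err != nil {
                    log.Fatal(err)
                }
                if c == nil || c.Tag == 0 {
                    return
                }
                switch c.Tag {
                case dwarf.TagFormalParameter, dwarf.TagVariable:
                    name, _ := c.Val(dwarf.AttrName).(string)
                    // A real debugger would also decode AttrLocation here to
                    // learn where the variable lives (register, stack slot, ...).
                    fmt.Println("  variable:", name)
                default:
                    // Skip nested scopes (lexical blocks, etc.) for simplicity.
                    if c.Children {
                        r.SkipChildren()
                    }
                }
            }
        }
    }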

The low-level stuff is only part of the story. The rest is about packaging powerful raw capabilities into a product that works at scale and that people actually reach for. So, making GDB or Delve work across multiple processes is not the goal; the goal is to build a tool that is both more powerful and friendlier. The goal is to re-invent the debugger for the cloud age.

When thinking about what such a debugger should look like, we need to address the scale dimensions that are missing from other tools:

  • The footprint of the target service(s): we have to monitor many processes, possibly belonging to multiple services and running on many machines, and we have to collect data from all of them.
  • The size of the team: our investigations shouldn’t start with a blinking cursor; they should start from the work previously done by us and our colleagues. What data to collect, and how that data should be analyzed and reported, should be built up over time by different people who focus on the parts of the code they know best. Once collected, data should be shared with the team – no need to paste screenshots on Slack.
  • Time: similarly, the point is not to start from scratch every time, to acknowledge that a debugging session might span days or even weeks, and to be resilient to the source code evolving.

The first point is the easiest to see: our services run as collections of processes executing on different machines. Moreover, these collections are dynamic as, say, K8s starts and stops pods. Existing debuggers don’t have anything to say about that; they’re narrowly focused on one process or one machine. In contrast, monitoring tools like Prometheus and Grafana have thought about multiple processes from the beginning.

The second point, about team size, hints at how the debugger of the future cannot be geared towards a single-player experience. We need to acknowledge that the software will be observed by many different people who have expertise in different parts of the system. Here again we should look at a product like Grafana — when you open up Grafana to inspect your services, you do not start from scratch. You start from dashboards that were put together by you and your colleagues over time. These dashboards bring high-level metrics to the forefront or let you deep-dive into specific subsystems. The point is that information has been curated and organized over time. Debuggers traditionally have not had such concepts, but they should.

When talking about software versioning, Russ Cox had a nice way of describing software engineering:

Software engineering is what happens to programming
when you add time and other programmers.

That speaks to the importance of keeping software evolution in mind when designing tools. We’re still thinking about how a debugger should incorporate the idea of time. For one, as the software evolves, the debugger should support you in evolving the prior instrumentation with it and in growing the body of instrumentation data that can be collected. Separately, “debugging sessions” can also span quite a bit of time, so the debugger could acknowledge that, for example, by keeping a list of things that we’re currently trying to understand and letting you ask questions of the form “next time this happens, give me a report that includes …”.

Ask me anything

What makes a debugger a debugger is the general ability to answer questions beyond what the program was instrumented to answer. So, the first thing that our debugger should let you do is “ask questions”. Asking questions should be easy (and asking questions that we’ve asked before should be even easier). For example, the following questions should all have quick answers:

  • What are all the queries/requests that are currently executing, across all my servers? The answer could be a report listing the API requests that are currently in flight (these might be HTTP requests, gRPC requests, SQL queries if you’re a database, etc.). Ideally, such a report would also include information about how long each request has been executing for and which user issued the respective request, and it’d allow easy analysis like getting request counts per user, or sorting by duration.
  • A user request or background job appears to be stuck somewhere in the system; what is its state of execution? The answer to this question could be a backtrace from the thread(s) executing that particular request, together with the values of variables relevant to understanding the current state.
  • A request is blocked trying to acquire a lock; who is holding that lock and why? The answer here would be the ability to navigate between threads that are connected by such dynamic conditions (e.g. lock conflicts).
  • Requests from one server to another are slow; what is the slow server doing? The answer here would be the ability to navigate between machines and processes (e.g. following an inflight request from a client to a server), and getting a report about the current activity on a server.
  • Does this code ever run? Or, how often does it run? Or, when this code executes, what are the values of some variables? The answer here would be dynamically injected logs or events emitted when the respective code runs. The respective events would capture local variables or function arguments.
  • I see a high error rate for a certain operation; I’d like to see an example of a request that resulted in an error. Or, are all the errors triggered by a single user? Here, we’d want to speculatively record every request as it comes in and, later, if the request encounters an error, emit an event. Such an event could contain a full “trace” of the request’s execution.
  • Some requests take longer than I want; can I have information about the execution of a slow request? This is another example of “tail-based” trace sampling – we’d like to cheaply record some information about the execution of requests, but only output information about requests that prove interesting later on (see the sketch after this list).
  • One component of my service is misbehaving; can I get detailed logs about what it’s doing? Here, we’d like to instrument the respective component dynamically, so that we only pay the instrumentation cost (slowing down the code, storing the observability data) when we’ve explicitly requested it.
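
To make the “tail-based” idea from the last few bullets concrete, here is a minimal, self-contained Go sketch: events are recorded speculatively into a per-request buffer, and the buffer is only emitted if the request turns out to be interesting (it failed or was slow). The recorder type and the thresholds are hypothetical illustrations of the technique, not Side-Eye’s actual implementation (which instruments processes from the outside rather than requiring code changes).

    // tailsample.go: a sketch of tail-based sampling. Events are recorded
    // cheaply for every request, but only emitted for requests that turn out
    // to be interesting (errored or slow). All names and thresholds here are
    // hypothetical, for illustration only.
    package main

    import (
        "errors"
        "fmt"
        "log"
        "time"
    )

    // recorder speculatively buffers events for a single request.
    type recorder struct {
        start  time.Time
        events []string
    }

    func newRecorder() *recorder {
        return &recorder{start: time.Now()}
    }

    // record appends an event to the buffer; this is the cheap part that
    // happens for every request.
    func (r *recorder) record(format string, args ...any) {
        r.events = append(r.events, fmt.Sprintf(format, args...))
    }

    // finish decides after the fact whether the buffered events are worth
    // keeping: only errored or slow requests produce output.
    func (r *recorder) finish(err error, slowThreshold time.Duration) {
        dur := time.Since(r.start)
        if err == nil && dur < slowThreshold {
            return // uninteresting request: drop the buffer
        }
        log.Printf("request took %v, err=%v; trace:", dur, err)
        for _, e := range r.events {
            log.Printf("  %s", e)
        }
    }

    // handleRequest simulates a request handler instrumented with a recorder.
    func handleRequest(user string, fail bool) {
        rec := newRecorder()
        var err error
        defer func() { rec.finish(err, 50*time.Millisecond) }()

        rec.record("request from user %q", user)
        time.Sleep(10 * time.Millisecond) // simulated work
        rec.record("looked up account for %q", user)
        if fail {
            err = errors.New("permission denied")
            rec.record("authorization failed")
        }
    }

    func main() {
        handleRequest("alice", false) // fast and successful: nothing is emitted
        handleRequest("bob", true)    // errored: its buffered trace is emitted
    }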

Side-Eye is aiming to do all these, and more. We’ll talk more about it in the future, but briefly: for starters, Side-Eye lets you explore the “current state” of a distributed system through a web application. You can capture “snapshots” of multiple processes at once, where a snapshot includes backtraces of all the goroutines currently alive (both goroutines running on CPU and blocked goroutines), plus select variables and expressions evaluated on different stack frames (like a smarter core dump). Stack frames from different goroutines can be tied together based on data relationships — e.g. a client-side goroutine blocked on a gRPC call to a remote server is linked to the server-side goroutine serving the respective request. The team can iterate over time on what data gets collected and included in snapshots. Side-Eye is meant for servers and parallel programs where a lot of things are happening at the same time, so helping users figure out what to look at among all the goroutines in a snapshot is important. Side-Eye organizes the collected data into tables that can be queried and joined using SQL. Derived tables can be defined, refining the raw data into higher-level “reports”. A snapshot can also be exposed as a Grafana data source so that dashboards can be built over time, summarizing the data and drawing attention to interesting operations or goroutines.

If you’re curious about Side-Eye, install the agent or try our demo sandboxes. Or, if anything here sounded interesting, please join our Discord or write to us at contact@dataexmachina.dev.
