The great promise of DevOps is that organisations can reap the benefits of an automated workflow that takes a developer’s commit from test to production with deliberate speed. Inevitably, problems arise in the process of setting up and continuously improving a DevOps workflow. Some problems are cultural or organisational in nature, but some are technical.
This post outlines three patterns that can be applied when debugging difficult failure modes in a DevOps environment.
You Don’t Need a Debugger for Debugging
While IDEs give developers the convenience of an environment that integrates code and debugging tools, nothing says you can’t inspect running code in a staging or production environment. Deploying an IDE or another heavy developer-centric package on a live environment can be difficult (or impossible) for operational reasons, but there are lightweight CLI tools you can use to diagnose issues such as a hanging process on a staging system. Tools such as ltrace, SystemTap/DTrace, and even plain old lsof can reveal a lot about what is actually happening. If you need more visibility into what is in memory, you can use gcore to make a running process generate a core dump without killing it, so that the dump can subsequently be analysed offline with gdb.
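As a minimal sketch of this idea, the snippet below takes a non-intrusive snapshot of a possibly hung process straight from procfs, with no debugger attached. It assumes a Linux system; the `snapshot` function name and the returned field names are illustrative, not from any particular tool.

```python
import os

def snapshot(pid):
    """Collect a quick, read-only snapshot of a process from Linux procfs."""
    base = f"/proc/{pid}"
    with open(f"{base}/status") as f:
        # The line looks like "State:\tS (sleeping)"
        state = next(line for line in f if line.startswith("State:")).split(None, 1)[1].strip()
    fd_count = len(os.listdir(f"{base}/fd"))  # number of open file descriptors
    with open(f"{base}/wchan") as f:
        # Kernel function the process is blocked in, or "0" if runnable
        wchan = f.read().strip()
    return {"state": state, "fds": fd_count, "wchan": wchan}

# Example: inspect the current process
print(snapshot(os.getpid()))
```

A process stuck in the `D` (uninterruptible sleep) state with a suggestive `wchan` value, or one whose fd count keeps climbing, often tells you where to point the heavier tools next.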
In the Java world, tools such as jvmtop leverage the instrumentation capability built into the virtual machine (VM) to offer a live view of the VM’s threads, while jmap and VisualVM can be used to generate and analyse a heap dump, respectively.
Quantify Aggressively
While rubber duck debugging is frequently useful, some failure modes do not lend themselves to a dialectic approach. This is particularly true of intermittent failures seen on a live system whose state is not fully known. If you find yourself thinking “this shouldn’t happen, but it does”, consider a different approach: aggressive quantification. A spreadsheet program can, in fact, be a debugging tool!
Gather timings, file descriptor counts, event counts, request timestamps, and so on across a variety of environments – not just the one where the problem is seen. This can be achieved by adding logging or instrumentation to your code or tooling, or by more passive means such as running tshark or polling the information in procfs for the processes of interest. Once acquired, transform the data into CSV files, import them into your spreadsheet, and plot them as a time series and/or as a statistical distribution. Pay attention to the outliers. What else was happening when that blip occurred? And how does that fit in with your working hypothesis regarding the nature of the issue?
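The passive-gathering step above can be sketched in a few lines. The snippet below periodically samples open-fd counts from Linux procfs for a set of PIDs and writes them to a CSV file ready for spreadsheet import; the function name, column names, and output file name are illustrative assumptions, not an existing tool’s interface.

```python
import csv
import os
import time

def poll_fd_counts(pids, interval=1.0, samples=5, out_path="fd_counts.csv"):
    """Sample open-fd counts for each PID into a CSV time series (Linux only)."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "pid", "open_fds"])
        for _ in range(samples):
            now = time.time()
            for pid in pids:
                try:
                    count = len(os.listdir(f"/proc/{pid}/fd"))
                except FileNotFoundError:
                    count = -1  # process exited between samples
                writer.writerow([f"{now:.3f}", pid, count])
            time.sleep(interval)
    return out_path

# Sample our own process a few times; the resulting CSV can be imported
# into a spreadsheet and plotted as a time series or a distribution.
poll_fd_counts([os.getpid()], interval=0.2, samples=3)
```

The same skeleton works for any per-process metric exposed under /proc – thread counts, voluntary context switches, RSS – just swap the sampling expression.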
When All You Have Is a Hammer
Got a tricky issue that only occurs very intermittently? A suspicion that there is some sort of race condition between multiple streams of code execution, possibly in different processes or on different systems, that results in things going wrong, but only sometimes? If so, it’s hammer time! Inducing extreme load, or “hammering the system”, is an effective way to reproduce these bugs. Try increasing various factors by an order of magnitude or more above what is typically seen in regular integration testing environments. Doing so artificially lengthens the window during which certain conditions hold true – a window to which other threads or programs might be sensitive. For instance, by repeatedly serialising ten or a hundred times as many objects to/from a backing database, you’ll increase the time during which other DB clients have to wait for their transactions to run, possibly revealing pathological behaviours in the process.
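A minimal sketch of this pattern, using SQLite as a stand-in backing database: the harness below spins up many concurrent writers, each committing one row per transaction to maximise lock contention, and records any lock errors. The `hammer` function, table name, and the worker/row counts are all illustrative; in practice you would crank `workers` and `rows_per_worker` well past what your normal test environment uses.

```python
import sqlite3
import threading

def hammer(db_path, workers=8, rows_per_worker=200):
    """Hammer a SQLite database with many concurrent writers, returning lock errors."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (worker INTEGER, seq INTEGER)")
    conn.commit()
    conn.close()

    errors = []  # list.append is thread-safe in CPython

    def worker(worker_id):
        # Each thread gets its own connection; a generous busy timeout
        # makes writers queue behind the lock instead of failing at once.
        local = sqlite3.connect(db_path, timeout=30)
        for seq in range(rows_per_worker):
            try:
                with local:  # one transaction per insert: maximal contention
                    local.execute("INSERT INTO events VALUES (?, ?)", (worker_id, seq))
            except sqlite3.OperationalError as exc:  # e.g. "database is locked"
                errors.append((worker_id, seq, str(exc)))
        local.close()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

if __name__ == "__main__":
    errs = hammer("hammer_demo.db")
    print(f"{len(errs)} writes failed under load")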
Applying this debugging pattern goes against the natural inclinations of both developers and operations folks, as both would rather see code run at a scale that is supported! That’s precisely what makes it valuable, as it can reveal unconscious assumptions made about the expected behaviour of systems and environments.
By Matt Boyer, Ammeon