The great promise of DevOps is that organisations can reap the benefits of an automated workflow that takes a developer’s commit from test to production with deliberate speed. Inevitably, problems arise in the process of setting up and continuously improving a DevOps workflow. Some problems are cultural or organisational in nature, but some are technical.

This post outlines three patterns that can be applied when debugging difficult failure modes in a DevOps environment.

You Don’t Need a Debugger for Debugging

While IDEs provide developers with the convenience of an environment that integrates code and debugging tools, there's nothing that says you can't inspect running code in a staging or production environment. Deploying an IDE or other heavy developer-centric package on a live environment can be difficult (or impossible) for operational reasons, but there are lightweight CLI tools you can use to aid in the diagnosis of issues, such as a hanging process on a staging system. Tools such as ltrace and SystemTap/DTrace, and even plain old lsof, can reveal a lot about what's actually happening. If you need more visibility into what's in memory, you can use gcore to make a running process generate a core dump without killing it, so that the dump can subsequently be analysed offline with gdb.
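
As a minimal sketch of this kind of debugger-free inspection on a Linux host, assuming a hung process whose PID you know (the current shell stands in for it here, and the gcore step is shown commented out since it needs gdb and ptrace rights):

```shell
# Inspect a process without attaching a debugger.
# PID is a stand-in (the current shell); substitute the hung process's PID.
PID=$$

# What state is it in, and how many file descriptors does it hold open?
grep '^State:' "/proc/$PID/status"
ls "/proc/$PID/fd" | wc -l

# When you need memory contents, snapshot a core without killing the process:
# gcore -o /tmp/hung "$PID"
# gdb /path/to/binary "/tmp/hung.$PID"
```

The procfs reads are entirely passive, so they are safe to run against a production process that you don't want to disturb.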

In the Java world, tools such as jvmtop leverage the instrumentation capability built into the Java virtual machine (VM) to offer a live view of the VM's threads, while jmap and VisualVM can be used to generate and analyse a heap dump, respectively.
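
As an illustration, the standard HotSpot jmap invocation for dumping live objects can be wrapped in a small helper; the PID below is purely hypothetical:

```shell
# Print the jmap command that dumps a JVM's live heap to a binary .hprof file.
# The PID passed in (12345 below) is illustrative, not a real process.
heap_dump_cmd() {
    echo "jmap -dump:live,format=b,file=/tmp/heap.$1.hprof $1"
}
heap_dump_cmd 12345
```

The resulting /tmp/heap.&lt;pid&gt;.hprof file can then be loaded into VisualVM for offline analysis of retained objects.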

Quantify It

While it is frequently useful to practice rubber duck debugging, some failure modes do not lend themselves to a dialectic approach. This is particularly true of intermittent failures seen on a live system whose state is not fully known. If you find yourself thinking "this shouldn't happen, but it does", consider a different approach: aggressive quantification. A spreadsheet program can, in fact, be a debugging tool!

Gather timings, file descriptor counts, event counts, request timestamps, etc. on a variety of environments – not just where the problem is seen. This can be achieved by adding logging or instrumentation to your code or tooling, or by more passive means such as running tshark or polling the information in procfs for certain processes. Once acquired, transform the data into CSV files, import them into a spreadsheet, and plot them as time series and/or as statistical distributions. Pay attention to the outliers. What else was happening when that bleep occurred? And how does that fit in with your working hypothesis regarding the nature of the issue?
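
A minimal sketch of such passive gathering, assuming a Linux host with procfs; the PID, sample count, and interval are all placeholders to adjust to the failure's timescale:

```shell
# Poll a process's open-file-descriptor count into a CSV time series.
PID=$$                # placeholder: the current shell stands in for the target
OUT=/tmp/fd_counts.csv
SAMPLES=5

echo "timestamp,fd_count" > "$OUT"
for i in $(seq 1 "$SAMPLES"); do
    echo "$(date +%s),$(ls "/proc/$PID/fd" | wc -l)" >> "$OUT"
    sleep 0.2         # sampling interval; tune to the failure's timescale
done
```

The resulting CSV imports directly into a spreadsheet, where the fd count can be plotted against time and lined up with whatever else the system was doing.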

When All You Have Is a Hammer

Got a tricky issue that only occurs very intermittently? A suspicion that there is some sort of race condition between multiple streams of code execution, possibly in different processes or on different systems, that results in things going wrong, but only sometimes? If so, it's hammer time! Inducing extreme load, or "hammering" the system, is an effective way to reproduce these bugs. Try increasing various factors by an order of magnitude or more above what is typically seen in regular integration testing environments. This can artificially lengthen the window of time during which certain conditions are true, to which other threads or programs might be sensitive. For instance, by repeatedly serialising ten or a hundred times as many objects to/from a backing database, you'll increase the time during which other DB clients have to wait for their transactions to run, possibly revealing pathological behaviours in the process.

Applying this debugging pattern goes against the natural inclinations of both developers and operations folks, as both would rather see code run at a scale that is supported! That’s precisely what makes it valuable, as it can reveal unconscious assumptions made about the expected behaviour of systems and environments.

By Matt Boyer, Ammeon

