The great promise of DevOps is that organisations can reap the benefits of an automated workflow that takes a developer’s commit from test to production with deliberate speed. Inevitably, problems arise in the process of setting up and continuously improving a DevOps workflow. Some problems are cultural or organisational in nature, but some are technical.

This post outlines three patterns that can be applied when debugging difficult failure modes in a DevOps environment.

You Don’t Need a Debugger for Debugging

IDEs give developers the convenience of an environment that integrates code and debugging tools, but nothing says you can’t inspect running code in a staging or production environment. Deploying an IDE or a heavy developer-centric package to a live environment can be difficult (or impossible) for operational reasons, but there are lightweight CLI tools that can help diagnose issues such as a hanging process on a staging system. Tools such as ltrace, SystemTap, DTrace and even plain old lsof can reveal a lot about what’s actually happening. If you need more visibility into what’s in memory, you can use gcore to make a running process generate a core dump without killing it, so that the dump can be analysed offline with gdb.

In the Java world, tools such as jvmtop leverage the instrumentation capability built into the virtual machine (VM) to offer a view of the VM’s threads, while jmap and VisualVM can be used to generate and analyse a heap dump, respectively.
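A service written in Python can expose a comparable thread view without any debugger attached. The following is a minimal sketch, assuming a Unix-like host and that SIGUSR1 is not already used by the application; the helper name install_stack_dump_handler is hypothetical, while faulthandler is part of the Python standard library.

```python
import faulthandler
import signal
import sys


def install_stack_dump_handler(signum=signal.SIGUSR1, stream=sys.stderr):
    """Dump the traceback of every thread to `stream` whenever `signum` arrives.

    The process keeps running after the dump, so this can be left enabled
    on a staging system. `stream` must be a real file (it needs a fileno()).
    Assumption: `signum` is free for diagnostic use in this application.
    """
    faulthandler.register(signum, file=stream, all_threads=True)


if __name__ == "__main__":
    install_stack_dump_handler()
    # From another shell: kill -USR1 <pid> prints the stacks to stderr.
    signal.pause()
```

Called once at start-up, this lets an operator ask a seemingly hung process where each thread currently is, at the cost of one signal handler.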

Quantify It

While it is frequently useful to practise rubber duck debugging, some failure modes do not lend themselves to a dialectic approach. This is particularly true of intermittent failures seen on a live system whose state is not fully known. If you find yourself thinking “this shouldn’t happen, but it does”, consider a different approach: aggressive quantification. A spreadsheet program can, in fact, be a debugging tool!

Gather timings, file descriptor counts, event counts, request timestamps, etc. in a variety of environments – not just where the problem is seen. This can be achieved by adding logging or instrumentation to your code or tooling, or by more passive means such as running tshark or polling the information in procfs for certain processes. Once acquired, transform the data into CSV files, import them into a spreadsheet and plot the values as time series and/or as a statistical distribution. Pay attention to the outliers. What else was happening when that blip occurred? And how does that fit in with your working hypothesis regarding the nature of the issue?
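As one way to do the passive gathering described above, the sketch below polls procfs for a process’s open file descriptor count and writes the samples to a CSV file ready for a spreadsheet. It assumes a Linux host and permission to read the target’s /proc entry. The function names, the CSV column names and the outlier rule (a modified z-score based on the median) are illustrative choices, not something the approach prescribes.

```python
import csv
import os
import statistics
import time


def count_open_fds(pid, proc_root="/proc"):
    """Count the entries in /proc/<pid>/fd (one entry per open descriptor)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def sample_to_csv(pid, csv_path, samples=60, interval=1.0, proc_root="/proc"):
    """Write `samples` rows of (timestamp, open_fds) for `pid` to `csv_path`."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["timestamp", "open_fds"])
        for i in range(samples):
            writer.writerow([f"{time.time():.3f}", count_open_fds(pid, proc_root)])
            handle.flush()  # keep the file usable if the run is interrupted
            if i < samples - 1:
                time.sleep(interval)


def outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    Uses the median and the median absolute deviation (MAD), which are
    less distorted by the outliers themselves than mean and stdev.
    """
    values = list(values)
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return [v for v in values if v != median]
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]
```

Running sample_to_csv against the same service in several environments gives comparable files to plot side by side, and outliers() offers a quick first pass at the samples worth cross-checking against logs.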

When All You Have Is a Hammer

Got a tricky issue that only occurs very intermittently? A suspicion that there is some sort of race condition between multiple streams of code execution, possibly in different processes or on different systems, that results in things going wrong, but only sometimes? If so, it’s hammer time! Inducing extreme load, or “hammering the system”, is an effective way to reproduce these bugs. Try increasing various factors by an order of magnitude or more above what is typically seen in regular integration testing environments. This can artificially lengthen the period of time during which certain conditions hold, conditions to which other threads or programs might be sensitive. For instance, by repeatedly serialising ten or a hundred times as many objects to/from a backing database, you’ll increase the time during which other DB clients have to wait for their transactions to run, possibly revealing pathological behaviours in the process.
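The idea can be sketched in a few lines. The example below is an illustration under stated assumptions rather than a general load-testing tool: hammer() and UnsafeCounter are hypothetical names, threads stand in for whatever concurrent clients the real system has, and the time.sleep(0) call is there only to widen the race window on purpose.

```python
import threading
import time


def hammer(operation, workers=50, calls_per_worker=200):
    """Call `operation` from many threads at once.

    Returns (total_calls, errors), where `errors` lists the exceptions raised.
    """
    errors = []
    errors_lock = threading.Lock()
    start = threading.Barrier(workers)

    def run():
        start.wait()  # release every worker together to maximise contention
        for _ in range(calls_per_worker):
            try:
                operation()
            except Exception as exc:
                with errors_lock:
                    errors.append(exc)

    threads = [threading.Thread(target=run) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return workers * calls_per_worker, errors


class UnsafeCounter:
    """Deliberately racy read-modify-write, standing in for the suspect code."""

    def __init__(self):
        self.value = 0

    def increment(self):
        current = self.value
        time.sleep(0)  # yield so that other threads can interleave here
        self.value = current + 1


if __name__ == "__main__":
    counter = UnsafeCounter()
    total, errors = hammer(counter.increment)
    print(f"expected {total}, got {counter.value}, {len(errors)} errors")
```

With one worker the counter always matches the expected total. As workers and calls_per_worker are raised, any shortfall between the two numbers is the lost-update race showing itself, which is the kind of evidence that is hard to obtain at normal load.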

Applying this debugging pattern goes against the natural inclinations of both developers and operations folks, as both would rather see code run at a scale that is supported! That’s precisely what makes it valuable, as it can reveal unconscious assumptions made about the expected behaviour of systems and environments.

By Matt Boyer, Ammeon
