Conquering Network Deployment Complexity

An inevitable consequence of the growth in scale of any IT environment is that the effort of managing the infrastructure quickly outpaces the linear growth of the equipment list. Perhaps a version of the ‘network effect’ comes into force – the management complexity rises with something approaching the square of the number of components.

The need to reduce costs, together with the demands on the time of skilled staff, eventually makes investment in the automation of routine activities more attractive. Another common observation is that as an organization grows, its activities settle into more established and repeatable patterns. Following the 80/20 principle, roughly 20% of these activities typically consume 80% of the time and resources.

Growth is a permanent feature of successful companies and with it comes ever-increasing sophistication in how they approach the challenges of system management. Solutions become complex entities in their own right, entailing careful planning and selection.

Over the course of the last ten years, we at Ammeon have seen our fair share of complex IT environments, ranging from test labs with tens of computing nodes to public networks with many thousands. To give you a flavor of what we do, I’ll outline (over a series of blog posts) an example of a system that automates the management and auditing of a large array of heterogeneous routers and promises to reduce deployment tasks from weeks of manual labor to mere hours.

Overview of the Problem

In this example, a customer with a large array of routers (several thousand), deployed over a similarly large number of sites, was looking to carry out frequent configuration updates. The configuration of each router is unique, and changes need to be sensitive to its current state.

The same system should also be used to investigate problems with the current network configuration, potentially deploying targeted updates to small sub-groups of the full router set. In some cases, the expectation is to pull data off the routers and render it into formatted output, usually so that it can be used to provision some other central system.

A variety of router models from at least two different suppliers are involved; further differences that need to be addressed include firmware versions and even the means of remote access.

The first step in developing an optimal solution is to identify any existing technologies which might permit a quicker and more cost-effective solution than a custom development.

Puppet – the Obvious Choice

My first port of call is Puppet. This is a system which allows you to centrally define your IT environment – hardware provisioning, software load, and system configuration – and have it applied and maintained across a set of networked nodes. Once set up, Puppet is automatic, monitoring and enforcing the defined state. Combine it with Cobbler and you can provision bare metal up to a completely installed and running system.
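
To give a flavor of this declarative style, here is a minimal sketch of a Puppet manifest (the node, package, and file names are hypothetical, chosen purely for illustration); Puppet converges each managed node to this description and keeps enforcing it on every run:

    # Hypothetical node definition: keep a package installed,
    # its service running, and a file at the expected content.
    node 'webserver01.example.com' {

      package { 'ntp':
        ensure => installed,
      }

      service { 'ntp':
        ensure  => running,
        enable  => true,
        require => Package['ntp'],   # only manage the service once the package exists
      }

      file { '/etc/motd':
        ensure  => file,
        content => "Managed by Puppet\n",
      }
    }

The point is that the manifest states the desired end state rather than the steps to reach it; Puppet works out what (if anything) needs to change on each node.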

We have used Puppet very successfully as the core configuration management tool for a CI/CD platform, and it satisfies a wide range of requirements. Puppet is most closely associated with server management, and its strengths clearly lie in this area. It has offered some support for network device management since 2.7.0 (released in 2011), which is hardly surprising given that Cisco was an early investor. Quite recently PuppetLabs (the company that develops both the open source and enterprise editions of Puppet) posted an update on their partnership with Cisco to improve device support.
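
As a rough illustration of how that device support works (a sketch based on the 2.7-era approach; the hostname and credentials below are placeholders), routers are declared in device.conf and managed via the puppet device command rather than a local agent, with manifests using network resource types such as vlan and interface:

    # /etc/puppet/device.conf -- one stanza per managed router (placeholder values)
    [router1.example.com]
    type cisco
    url ssh://admin:secret@router1.example.com/

    # site.pp -- desired state for that router
    node 'router1.example.com' {
      vlan { '200':
        ensure      => present,
        description => 'engineering',
      }
      interface { 'FastEthernet0/1':
        description => 'uplink to engineering switch',
      }
    }

    # Applied periodically by running:  puppet device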

Puppet fulfills a lot of the requirements, and indeed would not only offer an elegant solution but would also act as a great base for further automation and centralization of the infrastructure. However, a few limitations arising from the uniqueness of this particular problem mean that Puppet would not be the best-fit solution, at least at this point in time:

  1. The network device support is a little less developed than the server support. Although it started out in 2010, with Cisco support added in 2011, PuppetLabs conceded in their recent update that the support has not been maintained consistently since release.
  2. Our customer makes use of multiple router suppliers, not all of which publish support for Puppet. While Cisco-only support might be interesting in its own right in the long run, at this point in time the customer is not going to consolidate on a single vendor.
  3. The requirements are in some respects broader than the goals that Puppet satisfies. The current environment is (by definition) brownfield, meaning there is a large existing deployment, and any changes need to be applied without any impact on service availability or capability. The expectations are not just around centralized configuration, but also around more complex updates that depend on the existing router state at the time. Data extraction is an additional requirement.

Disappointingly, we have to examine alternative approaches. On a more general level, there is a valid question as to whether an essentially proprietary approach – albeit one based on an open source distribution – is preferable to more standardized, domain-specific initiatives such as Software Defined Networking (SDN) and OpenFlow specifically. I’ll examine (in a future post) whether NETCONF might be a suitable solution.

In my next post, I’ll look at three more options before moving on to the chosen solution.
