It is more important to understand the problem than the solution.
Decades ago, engineers who wanted to build more resilient networks coined the term “self-healing system” to describe computer networks that would be able to remedy error conditions without human intervention. Since then the self-healing concept has been researched for application in areas such as robotics, control systems, programming languages, software architectures, fault-tolerant computing, and neural networks.
What does it mean? Manufacturers are freely creating their own definitions of “self-healing,” which makes a single overarching definition difficult. For the purposes of this article, self-healing is the ability of a software system to adapt at run time to changing user needs, system faults, and resource variability. A functional self-healing system would locate and isolate problems and then execute a remedy. The goal of self-healing computer networks is to be fault-tolerant and high performing.
A worm is the epitome of a self-healing system: cut it in half and the head end will usually survive, regenerating a new tail end for itself. Of course, the more complicated the organism, the less likely it is that “self-healing” can be achieved.
Try to imagine the organic equivalent of today’s computer infrastructure. You might imagine it constructed with, say, a pig’s heart, raccoon’s body, duck feet, a turkey’s brain, and a weak immune system, with every part made by a different manufacturer. Clearly, the self-healing gauntlet is not a simple one.
In order for a self-healing computer system to do its job it must subscribe to one of two approaches for creating diagnoses: it must either learn the appropriate reaction to a stimulus/problem or it must use a pre-defined set of instructions for reacting to a stimulus/problem.
More specifically, the de facto objective among industry contributors seems to be to create an infrastructure that can:
- Define/discover itself and export that definition to external systems; for example, the Intelligent Platform Management Interface (IPMI), which defines hardware platform specifications for management, or the Web-Based Enterprise Management initiative of the Distributed Management Task Force (WBEM/CIM), which defines the data model for management data and the APIs for the exchange of this data.
- Detect faults and publish these faults via standard mechanisms; for example, Simple Network Management Protocol (SNMP) or WBEM/CIM.
- Take unattended (i.e., machine-automated, without relying on human interaction) yet auditable corrective actions based on either:
  - faults and performance events “published” by the infrastructure components
  - new demands (e.g., additional business load)
- Know the system’s historical response to new business demand; for example, today statistical baseline models of historical norms at different times of day and/or month are used as reference points for flagging ‘out of norm’ conditions for corrective actions. As the data updates the statistical model the baseline models grow more refined.
- Replace resources that are defective without prompting; for example, fault-tolerant systems such as Stratus and HP’s Himalaya have redundant components that automatically fail over to a spare if a primary component fails.
- Adapt to peaks and valleys of demands; for example, workload balancers sense the transaction activity on a multi-resource system and can distribute that workload across those resources as priorities dictate. The tricky part here is the setting of priorities because priorities can change with business circumstances that are not necessarily related to the machine resource infrastructure.
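The historical-baseline item in the list above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name `HourlyBaseline`, the three-standard-deviation threshold, and the sample transaction rates are all assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean, stdev


class HourlyBaseline:
    """Keeps per-hour history of a metric and flags out-of-norm readings."""

    def __init__(self, threshold=3.0, min_samples=5):
        self.threshold = threshold        # allowed deviation, in standard deviations
        self.min_samples = min_samples    # history needed before judging a reading
        self.history = defaultdict(list)  # hour of day -> past readings

    def is_out_of_norm(self, hour, value):
        samples = self.history[hour]
        if len(samples) < self.min_samples:
            return False  # not enough history to define a norm yet
        mu, sigma = mean(samples), stdev(samples)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold

    def observe(self, hour, value):
        """Check a reading against the baseline, then fold it into the history."""
        flagged = self.is_out_of_norm(hour, value)
        self.history[hour].append(value)
        return flagged


baseline = HourlyBaseline()
for load in [100, 104, 98, 101, 97, 103]:  # hypothetical 9 a.m. transaction rates
    baseline.observe(9, load)

print(baseline.observe(9, 102))  # close to the 9 a.m. history
print(baseline.observe(9, 250))  # far outside the 9 a.m. history
```

Every reading is added to the history, so the baseline refines itself as data arrives. A production system would probably treat flagged readings more carefully, for instance by excluding them so that one spike does not distort the norm.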
Today’s State of the Art
Problems that self-healing systems are designed to solve are classified in two categories: those that cause fault events and those that cause performance events.
Faults normally can be identified and located with well-known correlation techniques. The diagnosis, however, can be challenging because the exact context for the fault is not always available. Therefore, the state of the art for fault control today avoids root cause diagnosis altogether by simply replacing the failed component with another component. This doesn’t work, however, for software that fails due to unexpected data conditions. For these situations, companies such as InCert deliver specialized products that can trace and package up the set of data that led to a failure and then ship it over the Internet to a diagnostic site for analysis.
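The replace-rather-than-diagnose approach can be illustrated with a small sketch. The `Component` and `FailoverGroup` classes below are hypothetical stand-ins invented for this example; real fault-tolerant platforms do this in hardware and system software, not application code. The audit log reflects the requirement that unattended actions remain auditable.

```python
import time


class Component:
    """Hypothetical stand-in for a redundant hardware or software component."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} failed")
        return f"{self.name} handled {request}"


class FailoverGroup:
    """Routes work to a primary and promotes a spare when the primary fails."""

    def __init__(self, primary, spares):
        self.active = primary
        self.spares = list(spares)
        self.audit_log = []  # record of every unattended corrective action

    def handle(self, request):
        while True:
            try:
                return self.active.handle(request)
            except RuntimeError as err:
                if not self.spares:
                    raise  # no spare left: a human has to step in
                # No root cause analysis: swap the failed part and retry.
                failed, self.active = self.active, self.spares.pop(0)
                self.audit_log.append(
                    (time.time(), f"{err}; replaced {failed.name} with {self.active.name}")
                )


group = FailoverGroup(Component("node-a", healthy=False), [Component("node-b")])
print(group.handle("order-42"))
print(group.audit_log[0][1])
```

Note that the sketch never asks why `node-a` failed. That is exactly the limitation described above: if the failure were caused by the request data itself, the spare would fail in the same way.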
The state of the art for performance bottlenecks is problem location and isolation. Software is available today that measures real and synthetic transactions across distributed environments to isolate transaction bottlenecks and failures. OpenView Transaction Analyzer (OVTA) from Hewlett-Packard is a good example of transaction management software. Flamenco Networks is another company in the Web services management space that can trace transactions over a distributed environment.
In order to comprehend the complexity of self-healing technology, you must first have a deep understanding of the causes of unplanned application downtime. Independent estimates verify my own experience: 20 percent of all downtime is due to hardware, OS, network, and environmental factors; 40 percent is due to bugs and performance issues at the software layer; and 40 percent is caused by operator errors.
|Figure 1. Causes of Unplanned Downtime: Each of the three general categories comprises scores of variables, all of which must be tracked and analyzed in order for a diagnosis to be made.|
In a typical business environment the number of variables that can affect any of these three categories of error is myriad. Therefore, the full set of possible problems and possible diagnoses rises factorially as opposed to linearly, making autonomous self-healing a challenging vision. Fortunately, some breakthrough technology exists today and, through research and development, it is being improved and extended to achieve even higher goals of machine-based systems management.
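To get a feel for the difference between linear and factorial growth, consider the following short calculation. It rests on a simplifying assumption made only for illustration: that every distinct ordering of n interacting variables counts as a separate scenario to be diagnosed. The function name `scenario_count` is likewise hypothetical.

```python
import math


def scenario_count(n):
    """Orderings of n interacting variables (illustrative assumption only)."""
    return math.factorial(n)


for n in (5, 10, 15, 20):
    print(f"{n:>2} variables: linear = {n:>2}, factorial = {scenario_count(n):,}")
```

Under that assumption, 10 variables already yield 3,628,800 orderings, while the linear count is still just 10.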
The Building Reality
The next evolutionary step in the self-healing vision is diagnosis, which up until now has been the weak link in most management systems.
Diagnosis involves analyzing and correlating all critical system activity and changes in state related to a problem and selecting the best corrective option.
Of course, in order to do a diagnosis of an event a system must first do an analysis. Customers are looking for ways to simplify the analysis process and to get just the data they need to help them keep the mission-critical components of their business running, which really means up and performing correctly. To accomplish this requires viewing the threads of execution across a series of components in a distributed environment and then correlating fault and bottleneck location data with detailed CPU, memory and I/O metrics.
Savvy CTOs will recognize the importance of a system that provides “just enough” analysis and diagnosis data to comprehend and prioritize systems and network operations problems. Exposing too much data can be counterproductive. When computer resources are well instrumented and documented (with error information, performance information, and workarounds for every type of problem readily accessible), it is easy to slip into pedantry. Similarly, when critical failures occur, it’s more important to repair the fault quickly than it is to let the business process languish while employees search for the root source of the problem. Businesses need to be able to prioritize issues and a good self-healing system will facilitate that effort rather than work against it.
Ultimately, self-healing will become so advanced that it will achieve “business virtualization,” a term used to describe an infrastructure that evolves intelligently as needed to make the system resistant to faults and performance-related downtime. There are already examples of these kinds of infrastructures in use. A new class of utility data center software and hardware can virtualize the physical resources of the infrastructure by allowing a developer to choose from a menu the hardware, operating system, middleware, and applications in use, which causes all the necessary components for operation and fault and performance management to download. By masking the complexity of building and running n-tier enterprises, business virtualization constitutes a genuine tactical advantage for CTOs and IT departments.
|Figure 2. Toward Self-healing Systems: Most of the current research and development effort around self-healing systems is in the area of location, isolation, and diagnosis. Some technology exists today, but enterprises lack a management system that is fully diagnosis-capable.|
What to Look For
There are many software management solutions in the marketplace today that can easily recognize a fault or performance problem. The next logical technical plateau of management systems is locating, isolating, and diagnosing problems in a distributed environment.
As a CTO or IT manager your first step is to decide whether (and when) the advantages of self-healing are required in your organization and whether they can provide a reasonable return on your investment. The larger the resource infrastructure and the wider the distribution of those resources, the more important it is to consider the benefits of self-managing features. Then you must determine what level of sophistication your requirements demand.
In most legacy corporate IT environments, human brain power is still the predominant method of isolating and fixing network elements that cause failures or performance degradation. A person with sufficient domain knowledge will use hunches, educated guesses, past experience, and the knowledge of co-workers to test hypotheses and rule out failed assumptions. To date, this process has not been extensively duplicated by computers, even through the use of knowledge bases and artificial intelligence. From our experience so far, we can suggest CTOs take their next step toward self-healing systems by looking for vendors with a track record of improving problem location, isolation and diagnosis, while concurrently advancing the state of control architectures like fault-tolerant systems and utility data centers. Experience also suggests:
- The most basic self-healing system should be expected to locate and isolate faults and performance bottlenecks in mission-critical applications. Performance bottleneck analysis requires the system to track a transaction’s execution across a distributed environment and determine how much time is spent at each node. Once the problems are located, the system must be able to “drill down” and expose further information about the problem at that location without unduly taxing performance. An accepted industry heuristic is that management solutions should impose less than 5 percent overhead on a managed node.
- Products from some vendors offer valuable workarounds for more detailed drill-down analysis. Look, for example, to see if there are features that can re-initialize failed nodes or add capacity to bottlenecked nodes.
- Look for products that include correlation engines. Correlation engines can analyze problems from the bottom-up in addition to top-down by using CPU, I/O, and memory data to create well-known problem sets that can aid diagnosis.
- Ask the vendor about implementing a pilot solution. The sophisticated technologies used in service-centric self-managing systems are costly and complex, yet justifiable for large enterprises. A vendor should be willing to do the legwork to prove the ROI of the system it wants to sell. Small to mid-sized businesses will likely have simpler needs and require less complexity and more ease of use. These companies typically manage fewer than 500 nodes and need to isolate fundamental network problems quickly within concentrated, local area networks in just a few geographic locations. As such, solutions that offer limited history, network mapping, and basic network diagnosis, in conjunction with alerting capabilities and some basic performance graphing, will demonstrate a better ROI than more complex management solutions. As the company grows, however, the infrastructure becomes more complex and application-specific management tools will be required.
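The first item in the list above (tracking a transaction across nodes and finding where the time goes) can be sketched as follows. The trace records, node names, and function names are hypothetical; commercial transaction-management products collect this data through their own instrumentation. The `overhead_ok` helper simply encodes the 5 percent heuristic as a comparison between timings taken with and without management enabled.

```python
from collections import defaultdict

# Hypothetical trace records: (transaction id, node name, start, end) in seconds.
spans = [
    ("tx1", "web", 0.00, 0.05),
    ("tx1", "app", 0.05, 0.30),
    ("tx1", "db", 0.30, 1.40),
    ("tx2", "web", 0.00, 0.04),
    ("tx2", "app", 0.04, 0.25),
    ("tx2", "db", 0.25, 1.10),
]


def time_per_node(spans):
    """Total time each node spent servicing the traced transactions."""
    totals = defaultdict(float)
    for _tx, node, start, end in spans:
        totals[node] += end - start
    return dict(totals)


def bottleneck(spans):
    """Node that accounts for the largest share of traced time."""
    totals = time_per_node(spans)
    node = max(totals, key=totals.get)
    return node, totals[node] / sum(totals.values())


def overhead_ok(managed_seconds, unmanaged_seconds, limit=0.05):
    """Check the 5 percent overhead heuristic for a managed node."""
    return (managed_seconds - unmanaged_seconds) / unmanaged_seconds < limit


node, share = bottleneck(spans)
print(f"bottleneck: {node} ({share:.0%} of traced time)")
```

With the sample data, the database node dominates the traced time, which is where a drill-down into CPU, I/O, and memory metrics would start.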
The operational-management software industry is focusing on problem location, isolation, and diagnosis and making great strides. Self-healing technology is reaching an advanced state. Still, customers should not assume that self-healing systems will find and fix every possible fault and bottleneck because the possibilities are literally astronomical. A dose of reality will insulate you, and your enterprise, against the negative effects of over-excited expectations.