To do its job, a self-healing computer system must take one of two approaches to diagnosis: either it learns the appropriate reaction to a stimulus or problem, or it follows a predefined set of instructions for reacting to it.
More specifically, the de facto objective among industry contributors seems to be to create an infrastructure that can:
- Define or discover itself and export that definition to external systems. Examples include the Intelligent Platform Management Interface (IPMI), which defines hardware platform specifications for management, and the Distributed Management Task Force's Web-Based Enterprise Management initiative with its Common Information Model (WBEM/CIM), which defines the data model for management data and the APIs for exchanging that data.
- Detect faults and publish these faults via standard mechanisms; for example, Simple Network Management Protocol (SNMP) or WBEM/CIM.
- Take unattended (i.e., machine-automated, with no reliance on human interaction) yet auditable corrective actions based on either:
  - faults and performance events "published" by the infrastructure components
  - new demands (e.g., additional business load)
- Know the system's historical response to new business demand. For example, statistical baseline models of historical norms at different times of day or month serve today as reference points for flagging out-of-norm conditions that call for corrective action. As new data updates the statistical model, the baselines grow more refined.
- Replace defective resources without prompting; for example, fault-tolerant systems such as Stratus and HP's Himalaya have redundant components that automatically fail over to a spare if a primary component fails.
- Adapt to peaks and valleys in demand; for example, workload balancers sense the transaction activity on a multi-resource system and can distribute that workload across those resources as priorities dictate. The tricky part is setting the priorities, because they can change with business circumstances that are not necessarily related to the machine resource infrastructure.
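The statistical baseline idea in the list above can be made concrete with a small sketch. It assumes one baseline per hour of day, kept as a running mean and variance (Welford's online algorithm), and flags a reading that falls more than a chosen number of standard deviations from the norm. The class name `HourlyBaseline`, the three-sigma threshold, and the minimum sample count are illustrative assumptions, not taken from any product mentioned here.

```python
import math
from collections import defaultdict


class HourlyBaseline:
    """Running mean/variance per hour of day (Welford's online algorithm).

    Hypothetical illustration: names and thresholds are assumptions.
    """

    def __init__(self, threshold_sigmas=3.0, min_samples=5):
        self.threshold_sigmas = threshold_sigmas
        self.min_samples = min_samples
        # hour -> [count, mean, sum of squared deviations from the mean]
        self._stats = defaultdict(lambda: [0, 0.0, 0.0])

    def update(self, hour, value):
        """Fold a new observation into the baseline for that hour."""
        stats = self._stats[hour]
        stats[0] += 1
        delta = value - stats[1]
        stats[1] += delta / stats[0]
        stats[2] += delta * (value - stats[1])

    def is_out_of_norm(self, hour, value):
        """Return True if value deviates too far from the hour's history."""
        count, mean, m2 = self._stats[hour]
        if count < self.min_samples:
            return False  # not enough history to judge
        std = math.sqrt(m2 / (count - 1))
        if std == 0.0:
            return value != mean
        return abs(value - mean) > self.threshold_sigmas * std


baseline = HourlyBaseline()
for load in [100, 104, 98, 101, 97, 102]:  # made-up 9 a.m. load samples
    baseline.update(9, load)

print(baseline.is_out_of_norm(9, 103))  # close to the historical norm
print(baseline.is_out_of_norm(9, 180))  # far outside it
```

Because every new observation passes through `update`, the baseline grows more refined as data accumulates, which is the behavior the list item describes.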
Today's State of the Art
The problems that self-healing systems are designed to solve fall into two categories: those that cause fault events and those that cause performance events.
Faults can normally be identified and located with well-known correlation techniques. Diagnosis, however, can be challenging because the exact context for the fault is not always available. Therefore, the state of the art for fault control today avoids root cause diagnosis altogether by simply replacing the failed component with another component. This doesn't work, however, for software that fails due to unexpected data conditions. For these situations, companies such as InCert deliver specialized products that can trace and package up the set of data that led to a failure and then ship it over the Internet to a diagnostic site for analysis.
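The replace-rather-than-diagnose approach described above might be sketched as follows. This is a minimal illustration under stated assumptions: `Component` and `RedundantSlot` are hypothetical names, health is reduced to a boolean flag, and the audit log is a plain list standing in for whatever auditable record a real system would keep.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for a hardware or software component."""
    name: str
    healthy: bool = True


@dataclass
class RedundantSlot:
    """One logical resource backed by an active component and spares."""
    active: Component
    spares: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def handle_fault(self):
        """Swap out the active component if it has failed.

        No root-cause diagnosis is attempted: the failed part is retired
        and the next healthy spare takes over. Each step is logged so the
        unattended action stays auditable.
        """
        if self.active.healthy:
            return self.active
        failed = self.active
        while self.spares:
            candidate = self.spares.pop(0)
            if candidate.healthy:
                self.active = candidate
                self.audit_log.append(
                    f"replaced {failed.name} with {candidate.name}")
                return self.active
            self.audit_log.append(f"skipped unhealthy spare {candidate.name}")
        self.audit_log.append(f"no healthy spare for {failed.name}")
        raise RuntimeError(f"no healthy spare available for {failed.name}")
```

Note what the sketch cannot do: if a fault comes from unexpected input data, the spare receives the same data and fails the same way, which is why that class of failure needs the trace-and-ship approach instead.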
The state of the art for performance bottlenecks is problem location and isolation. Software is available today that measures real and synthetic transactions across distributed environments to isolate transaction bottlenecks and failures. OpenView Transaction Analyzer (OVTA) from Hewlett-Packard is a good example of transaction management software. Flamenco Networks, a company in the Web services management space, can also trace transactions across a distributed environment.
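To illustrate the idea of locating a bottleneck with a synthetic transaction, here is a minimal sketch that times each hop of a transaction and reports the slowest one. It is not how OVTA or any other product named above works internally; the function names and the representation of a hop as a `(name, callable)` pair are assumptions made for the example.

```python
import time


def time_hops(hops, clock=time.perf_counter):
    """Run each (name, callable) hop in order and record its elapsed time.

    The clock is injectable so the function can be exercised
    deterministically in tests.
    """
    timings = []
    for name, step in hops:
        start = clock()
        step()
        timings.append((name, clock() - start))
    return timings


def find_bottleneck(timings):
    """Return the (name, seconds) pair of the slowest hop."""
    return max(timings, key=lambda pair: pair[1])


# Hypothetical three-tier synthetic transaction; sleeps stand in for real work.
synthetic_transaction = [
    ("web tier", lambda: time.sleep(0.01)),
    ("app tier", lambda: time.sleep(0.05)),
    ("database", lambda: time.sleep(0.02)),
]

slowest_name, slowest_seconds = find_bottleneck(time_hops(synthetic_transaction))
print(f"slowest hop: {slowest_name} ({slowest_seconds:.3f}s)")
```

Locating the slow hop is only isolation. Explaining why that hop is slow is the diagnosis step, which remains the harder problem.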
To comprehend the complexity of self-healing technology, you must first understand the causes of unplanned application downtime in depth. Independent estimates confirm my own experience: 20 percent of all downtime is due to hardware, OS, network, and environmental factors; 40 percent is due to bugs and performance issues at the software layer; and 40 percent is caused by operator errors.
In a typical business environment, a myriad of variables can affect any of these three categories of error. Therefore, the full set of possible problems and possible diagnoses grows factorially rather than linearly, making autonomous self-healing a challenging vision. Fortunately, some breakthrough technology exists today and, through research and development, it is being improved and extended to achieve even higher goals of machine-based systems management.
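A quick calculation shows how fast factorial growth outruns linear growth. The sketch assumes, purely for illustration, that every distinct ordering of n interacting variables counts as a separate scenario to diagnose; `ordering_count` is a hypothetical helper name.

```python
import math


def ordering_count(n_variables):
    """Number of distinct orderings of n variables (n factorial)."""
    return math.factorial(n_variables)


for n in (3, 5, 10):
    print(n, ordering_count(n))
# prints:
# 3 6
# 5 120
# 10 3628800
```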