

Are You Ready for Enterprise Systems That Fix Themselves?

Technology to monitor and maintain distributed computer systems is going through a metamorphosis. The dream: computer systems that can quickly heal themselves in response to a wide array of fault and performance conditions, and even to changing business conditions. Get a realistic appraisal of today's self-healing technology.





What to Look For
There are many software management solutions in the marketplace today that can easily recognize a fault or performance problem. The next logical technical plateau of management systems is locating, isolating, and diagnosing problems in a distributed environment.

As a CTO or IT manager your first step is to decide whether (and when) the advantages of self-healing are required in your organization and whether they can provide a reasonable return on your investment. The larger the resource infrastructure and the wider the distribution of those resources, the more important it is to consider the benefits of self-managing features. Then you must determine what level of sophistication your requirements demand.

In most legacy corporate IT environments, human brainpower is still the predominant method of isolating and fixing the network elements that cause failures or performance degradation. A person with sufficient domain knowledge uses hunches, educated guesses, past experience, and the knowledge of co-workers to test hypotheses and rule out failed assumptions. To date, computers have not extensively duplicated this process, even with the help of knowledge bases and artificial intelligence. Based on our experience so far, we suggest that CTOs take their next step toward self-healing systems by looking for vendors with a track record of improving problem location, isolation, and diagnosis while also advancing the state of control architectures such as fault-tolerant systems and utility data centers. Experience also suggests:

  • The most basic self-healing system should be expected to locate and isolate faults and performance bottlenecks in mission-critical applications. Performance bottleneck analysis requires the system to track a transaction's execution across a distributed environment and determine how much time is spent at each node. Once a problem is located, the system must be able to "drill down" and expose further information about the problem at that location, without unduly taxing performance. An accepted industry heuristic is that a management solution should add less than 5 percent overhead to a managed node.
  • Some vendors' products go beyond detailed drill-down analysis and offer valuable workarounds. Look, for example, for features that can re-initialize failed nodes or add capacity to bottlenecked nodes.
  • Look for products that include correlation engines. Correlation engines can analyze problems from the bottom up as well as from the top down, using CPU, I/O, and memory data to build sets of well-known problems that aid diagnosis.
  • Ask the vendor about implementing a pilot solution. The sophisticated technologies used in service-centric self-managing systems are costly and complex, yet justifiable for large enterprises. A vendor should be willing to do the legwork to prove the ROI of the system it wants to sell. Small to mid-sized businesses will likely have simpler needs and require less complexity and more ease of use. These companies typically manage fewer than 500 nodes and need to isolate fundamental network problems quickly within concentrated local area networks in just a few geographic locations. For them, solutions that offer limited history, network mapping, and basic network diagnosis, together with alerting capabilities and some basic performance graphing, will demonstrate a better ROI than more complex management solutions. As the company grows, however, its infrastructure becomes more complex and application-specific management tools become necessary.
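The transaction-level bottleneck analysis described in the first point above can be illustrated with a short Python sketch. It is not any vendor's implementation. It assumes each monitored hop arrives as a hypothetical `Span` record (transaction ID, node, start and end time in milliseconds), and that hops within a transaction do not overlap, so summing durations per node gives each node's share of total time. The overhead check is also only one possible reading of the 5 percent heuristic: it treats overhead as the management agent's CPU time divided by the node's total CPU time.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """One hop of a tracked transaction (hypothetical record format)."""
    transaction_id: str
    node: str
    start_ms: float
    end_ms: float


def time_per_node(spans):
    """Sum the time spent at each node across all tracked transactions.

    Assumes hops within a transaction do not overlap (no nested calls).
    """
    totals = defaultdict(float)
    for span in spans:
        totals[span.node] += span.end_ms - span.start_ms
    return dict(totals)


def find_bottleneck(spans):
    """Return (node, share) for the node that accounts for the most total time."""
    totals = time_per_node(spans)
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("no transaction time recorded")
    node = max(totals, key=totals.get)
    return node, totals[node] / grand_total


def within_overhead_budget(agent_cpu_s, total_cpu_s, budget=0.05):
    """Check a management agent's CPU cost against the 5 percent heuristic."""
    return agent_cpu_s / total_cpu_s < budget


# Two transactions passing through web, app and db tiers (made-up timings).
spans = [
    Span("t1", "web", 0, 40), Span("t1", "app", 40, 90), Span("t1", "db", 90, 400),
    Span("t2", "web", 0, 35), Span("t2", "app", 35, 80), Span("t2", "db", 80, 350),
]
node, share = find_bottleneck(spans)
print(f"bottleneck: {node} ({share:.0%} of transaction time)")
print(within_overhead_budget(agent_cpu_s=3.0, total_cpu_s=100.0))
```

Once the dominant node is identified, a management system would drill down on that node alone, which is one way to keep its own overhead within budget.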
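A correlation engine of the kind recommended in the list above can be reduced, for illustration only, to rule matching: compare a node's raw CPU, I/O-wait, and memory readings against a table of known problem signatures. This bottom-up direction starts from resource data rather than from a failing transaction. The signature names and thresholds below are hypothetical placeholders, not values taken from any product, and commercial correlation engines are considerably more sophisticated.

```python
# Hypothetical signatures: each known problem is a predicate over one node's
# readings, all expressed as percentages (CPU busy, I/O wait, memory used).
SIGNATURES = {
    "memory exhaustion / swapping": lambda m: m["memory"] > 90 and m["io_wait"] > 30,
    "CPU saturation": lambda m: m["cpu"] > 95 and m["io_wait"] < 10,
    "disk I/O bottleneck": lambda m: m["io_wait"] > 40 and m["cpu"] < 50,
}


def correlate(metrics):
    """Return the names of all known problems whose signature matches."""
    return [name for name, matches in SIGNATURES.items() if matches(metrics)]


# Bottom-up: start from raw resource readings on each node (made-up values).
readings = {
    "db01": {"cpu": 35, "io_wait": 55, "memory": 70},
    "app01": {"cpu": 99, "io_wait": 5, "memory": 60},
}
for node, metrics in readings.items():
    print(node, correlate(metrics) or ["no known problem matched"])
```

Keeping the signatures as data rather than code means a newly diagnosed problem can be added to the known set without changing the matching logic.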

The operational-management software industry is focusing on problem location, isolation, and diagnosis and making great strides. Self-healing technology is reaching an advanced state. Still, customers should not assume that self-healing systems will find and fix every possible fault and bottleneck because the possibilities are literally astronomical. A dose of reality will insulate you—and your enterprise—against the negative effects of over-excited expectations.

Mike Short is a Technology Planning Manager for HP OpenView. His responsibilities include business planning, transaction management, product development, acquisitions, and new business creation. He spent 12 years as an IT architect working for himself as well as for firms such as EDS and Cable Data Corp. Reach him at mike_short@hp.com.