Feature Articles: Network Technology for Digital Society of the Future—Toward Advanced, Smart, and Environmentally Friendly Operations
Failure Point Estimation Using Rule-based Learning
NTT Access Network Service Systems Laboratories aims to achieve smart and advanced network operations that support the digital transformation of the NTT Group. This article introduces a failure point estimation technique using rule-based learning that immediately presents candidate failure points when a failure occurs. The technique is based on technology that autonomously derives cause-and-effect relationships (rules) between failure points and alarms.
Keywords: rule-based learning, failure point estimation, Network-AI
1. Introduction
A failure in a large-scale network generates many alarms. A skilled maintenance operator must then analyze these alarms and isolate the failure point through testing or other means. We are researching and developing failure point estimation technology using rule-based learning to shorten this analysis and troubleshooting work, enable prompt failure recovery, and thereby reduce the burden of maintenance tasks (Fig. 1). This technology is expected to reduce operating expenses.
2. Failure point estimation using rule-based learning
In this section, we describe the key features of our failure point estimation technology.
2.1 Reduction of operator analysis/troubleshooting work
Failure point estimation using rule-based learning makes decisions by applying rules. A rule is an if-then construct of the form “if condition then conclusion,” which states the conclusion to be drawn when a given condition holds. When rules are applied to network failures, the if portion specifies a combination of events (an event group), such as alarms and log entries generated by network equipment at the time of a failure, and the then portion specifies the cause and location of that failure. When a failure occurs, matching the observed alarms against these rules efficiently yields candidate points in the network that may be causing the failure. A maintenance operator can then respond to the failure on the basis of these failure point candidates. This reduces the workload of time-consuming alarm analysis and troubleshooting diagnosis and could make failure response less dependent on individual operator skill.
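To make the matching step concrete, here is a minimal Python sketch. It assumes that events are identified by alarm names and that a rule fires when every event in its condition appears among the observed events. The `Rule` class, the `estimate` function, and all alarm, cause, and location names are hypothetical illustrations. The article does not describe the rule format of the commercial rule engine. Ordering candidates by condition size, with more specific rules first, is also an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if every event in `condition` is observed, conclude `cause` at `location`."""
    condition: frozenset[str]  # event group, e.g., alarm names
    cause: str
    location: str


def estimate(rules: list[Rule], observed_events: list[str]) -> list[tuple[str, str]]:
    """Return (cause, location) candidates for every rule whose condition is satisfied.

    Candidates are ordered by condition size so that more specific rules come first
    (an assumption of this sketch, not something the article specifies).
    """
    observed = set(observed_events)
    # A rule matches when its event group is a subset of the observed events.
    matching = [rule for rule in rules if rule.condition <= observed]
    matching.sort(key=lambda rule: len(rule.condition), reverse=True)
    return [(rule.cause, rule.location) for rule in matching]


# Hypothetical rule base and alarm names, for illustration only.
rules = [
    Rule(frozenset({"LOS", "LINK_DOWN"}), "fiber cut", "span A-B"),
    Rule(frozenset({"LINK_DOWN"}), "port failure", "switch B"),
    Rule(frozenset({"POWER_FAIL"}), "power supply failure", "ONU-3"),
]

print(estimate(rules, ["LINK_DOWN", "LOS", "TEMP_HIGH"]))
# → [('fiber cut', 'span A-B'), ('port failure', 'switch B')]
```

The unrelated `TEMP_HIGH` alarm does not match any rule, so it does not affect the result. Only rules whose entire event group was observed become candidates.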
We constructed a highly accurate failure point estimation system by combining this technology with a commercially available rule engine (an engine that performs processing based on if-then rules), as shown in Fig. 2. The system maintains configuration information on the managed network as topology data in a format it can analyze. When a failure occurs in the target environment, the system takes an event group consisting of alarm and log data as input and presents the operator with failure points estimated on the basis of the rules. If no rule matching the current failure case has been registered, the operator can enter the correct cause of the failure through a graphical user interface. That case is then saved as a past failure example for use in rule learning.
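The workflow described in this paragraph can be sketched as follows, assuming a simple in-memory rule list and case store. `EstimationSystem`, `handle_failure`, `FailureCase`, and the `ask_operator` callback are hypothetical names. The callback stands in for the graphical user interface. Neither the topology data nor the commercial rule engine is modeled, and the subset-matching test is an illustrative assumption.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class FailureCase:
    """A stored past failure example: the event group plus the confirmed cause and point."""
    events: frozenset[str]
    cause: str
    location: str


# Hypothetical rule representation: (condition event group, cause, location).
RuleTuple = tuple[frozenset[str], str, str]


@dataclass
class EstimationSystem:
    rules: list[RuleTuple]
    case_store: list[FailureCase] = field(default_factory=list)

    def handle_failure(
        self,
        events: list[str],
        ask_operator: Callable[[frozenset[str]], tuple[str, str]],
    ) -> list[tuple[str, str]]:
        """Estimate failure points; if no rule applies, record the operator's answer as a past case."""
        observed = frozenset(events)
        candidates = [(cause, location) for condition, cause, location in self.rules
                      if condition <= observed]
        if candidates:
            return candidates
        # No registered rule matches: the operator supplies the correct cause
        # (through a GUI in the described system), and the case is saved for rule learning.
        cause, location = ask_operator(observed)
        self.case_store.append(FailureCase(observed, cause, location))
        return []


system = EstimationSystem(rules=[(frozenset({"LOS"}), "fiber cut", "span A-B")])
print(system.handle_failure(["LOS", "TEMP_HIGH"], ask_operator=lambda ev: ("unknown", "unknown")))
# → [('fiber cut', 'span A-B')]
system.handle_failure(["POWER_FAIL"], ask_operator=lambda ev: ("power supply failure", "ONU-3"))
print(len(system.case_store))  # → 1
```

The operator is consulted only when the rule base has nothing to offer. Each consultation leaves behind a labeled example, which is what rule learning later consumes.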
In this context, rule learning does more than add a new rule. It also checks whether the rule set, including the newly added rules, still yields correct judgments for all stored past failure examples. Each past failure example consists of an event group made up of alarm and log data, together with the cause and point of failure for that case. Because the know-how of the maintenance operators who perform actual failure analysis and troubleshooting is learned in the form of rules, the system can also help convert operators' failure-response know-how into explicit knowledge.
3. Future outlook
Going forward, we plan to study ways of improving the accuracy of failure point estimation by using enhanced learning algorithms and to expand the application scope of the proposed technology.