WO2016026510A1 - Hardware fault identification management in a network - Google Patents


Info

Publication number
WO2016026510A1
Authority
WO
WIPO (PCT)
Prior art keywords
hardware
data
unit
heuristic
network
Application number
PCT/EP2014/067569
Other languages
French (fr)
Inventor
Sidath Handurukande
Anne-Marie Bosneag
Ming-xue WANG
Original Assignee
Telefonaktiebolaget L M Ericsson (Publ)
Application filed by Telefonaktiebolaget L M Ericsson (Publ)
Priority to PCT/EP2014/067569
Publication of WO2016026510A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

Definitions

  • the present invention relates to a method and apparatus for managing hardware fault identification in a network.
  • the present invention also relates to a computer program product configured, when run on a computer, to carry out a method for managing hardware fault identification in a network.
  • Networks, including for example telecommunication networks, comprise a range of different hardware devices distributed throughout the coverage area of the network. Faults or failures of such hardware devices have a significant impact on the health and operation of the network and on the provision of network services. Identifying and resolving hardware faults is therefore important for the smooth operation of the network.
  • a common approach to fault identification is the use of real time fault monitoring systems, including for example the fault management function of network management systems such as Operations Support Systems (OSSs).
  • Fault monitoring systems use a range of different mechanisms for identifying hardware faults, including receiving Simple Network Management Protocol (SNMP) trap messages from devices, receiving self-diagnostic information from devices, remote monitoring of thresholds, receiving syslog messages, etc.
  • the fault monitoring system or a network management system notifies a network operator when a fault is detected by the fault monitoring system.
  • Diagnostics components are able to run hardware checks to verify the health of individual hardware devices, including for example power-on self-tests, out-of-service tests, in-service monitoring functions, etc. Diagnostics components may be built into a hardware board or implemented as a separate component. When a fault is identified in the network, diagnostics components are used to help determine the cause of the fault, allowing appropriate remedial action to be taken. Most often, remedial action includes removing and replacing the faulty hardware. The faulty hardware may then be sent to a repair centre or returned to the manufacturer for additional diagnostics and repair.
  • Data associated with the faulty hardware may be sent with the faulty hardware to assist in fault analysis and diagnostics.
  • Identifying the root cause of a network problem is, however, becoming more difficult.
  • a perceived fault identified within the network may be caused by a hardware component fault, a software component fault, a system configuration fault or a combination of different faults and incompatibilities of the interconnected hardware components.
  • Existing fault monitoring systems are unable to account for this vast range of possible fault causes.
  • a method in a hardware fault management unit, for managing hardware fault identification in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit, the method comprising collecting operational data for the hardware devices, analysing the collected data, generating, based on the analysis, a heuristic for hardware device fault identification, and communicating the generated heuristic to at least one of the local hardware diagnostic units.
  • the hardware devices may be distributed throughout the network.
  • the hardware devices may comprise any network hardware device which may be installed or positioned in any location over the geographic coverage area of the network.
  • Local hardware diagnostic units may also be distributed throughout the network, installed or positioned in any location over the geographic coverage area of the network.
  • A local hardware diagnostic unit may be associated with a single hardware device, for example if the local hardware diagnostic unit is implemented in the device.
  • a single local hardware diagnostic unit may be associated with multiple hardware devices, for example if the local hardware diagnostic unit is implemented in a device management unit such as an OSS or Element Management System (EMS).
  • operational data is collected for a plurality of hardware devices.
  • data may be collected from the devices themselves and/or from management systems such as an OSS or EMS.
  • a centralised analysis is then conducted in a hardware fault management unit permitting the generation of a heuristic for hardware device fault identification.
  • This heuristic is then disseminated to at least one local hardware diagnostic unit.
  • Aspects of the present invention thus facilitate fault diagnosis in local hardware diagnostic units on the basis of centralised analysis of data collected for a plurality of hardware devices.
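The four steps of the centralised method (collect operational data, analyse it, generate a heuristic, communicate the heuristic to local units) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the claimed implementation: the `DeviceRecord` fields, the alarm codes, the use of a confirmed-fault label from offsite analysis, and the choice of a per-model "fraction of alarm occurrences followed by a confirmed hardware fault" as the heuristic are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List

Heuristic = Dict[str, float]  # hypothetical: alarm code -> estimated P(hardware fault)

@dataclass
class DeviceRecord:
    """Hypothetical operational data record for one hardware device."""
    device_id: str
    model: str
    alarms: List[str]
    confirmed_fault: bool  # assumed label, e.g. from offsite (repair centre) analysis

@dataclass
class LocalDiagnosticUnit:
    """Stand-in for a local hardware diagnostic unit that stores received heuristics."""
    name: str
    heuristics: Dict[str, Heuristic] = field(default_factory=dict)

    def receive(self, model: str, heuristic: Heuristic) -> None:
        self.heuristics[model] = heuristic

class HardwareFaultManagementUnit:
    def __init__(self) -> None:
        self.records: List[DeviceRecord] = []

    def collect(self, records: List[DeviceRecord]) -> None:
        # Step 1: collect operational data for the hardware devices
        self.records.extend(records)

    def analyse(self) -> Dict[str, Dict[str, List[int]]]:
        # Step 2: per model and alarm, count [confirmed faults, occurrences]
        stats: Dict[str, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(lambda: [0, 0]))
        for record in self.records:
            for alarm in set(record.alarms):
                stats[record.model][alarm][0] += int(record.confirmed_fault)
                stats[record.model][alarm][1] += 1
        return stats

    def generate_heuristics(self, stats: Dict[str, Dict[str, List[int]]]) -> Dict[str, Heuristic]:
        # Step 3: heuristic = fraction of alarm occurrences with a confirmed fault
        return {model: {alarm: faults / total for alarm, (faults, total) in alarms.items()}
                for model, alarms in stats.items()}

    def communicate(self, heuristics: Dict[str, Heuristic], units: List[LocalDiagnosticUnit]) -> None:
        # Step 4: disseminate every heuristic to each local unit
        for unit in units:
            for model, heuristic in heuristics.items():
                unit.receive(model, heuristic)

fmu = HardwareFaultManagementUnit()
fmu.collect([
    DeviceRecord("d1", "RU-A", ["LINK_DOWN"], True),
    DeviceRecord("d2", "RU-A", ["LINK_DOWN"], False),
    DeviceRecord("d3", "RU-A", ["TEMP_HIGH"], True),
])
site = LocalDiagnosticUnit("site-1")
fmu.communicate(fmu.generate_heuristics(fmu.analyse()), [site])
print(site.heuristics["RU-A"])  # {'LINK_DOWN': 0.5, 'TEMP_HIGH': 1.0}
```

Broadcasting every heuristic to every unit keeps the sketch short; a deployment would more plausibly send each unit only the heuristics for the device models it manages.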
  • a heuristic comprises an experience based technique, and may for example include behaviour patterns, configuration patterns, device models, behaviour descriptors, classifiers etc.
  • the operational data collected for the hardware devices may be pre-processed before it is analysed.
  • analysing the collected data may comprise applying at least one of a data mining, machine learning or analytics algorithm to the collected data.
  • algorithms may for example include text mining algorithms, clustering algorithms, pattern recognition algorithms, classification algorithms etc.
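As one example of the clustering algorithms mentioned, the sketch below groups devices by numeric operational features using plain k-means (Lloyd's algorithm). The source does not say which clustering algorithm or features are used; the features (alarms per day and mean board temperature) and the explicitly supplied initial centroids are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def kmeans(points: Sequence[Point], initial_centroids: Sequence[Point],
           max_iterations: int = 20) -> Tuple[List[int], List[Point]]:
    """Plain k-means (Lloyd's algorithm) with caller-supplied initial centroids."""
    centroids: List[Point] = [tuple(c) for c in initial_centroids]
    labels: List[int] = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each device joins its nearest centroid (Euclidean distance)
        labels = [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members (keep it if empty)
        updated: List[Point] = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            updated.append(tuple(sum(d) / len(members) for d in zip(*members)) if members else old)
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return labels, centroids

# Hypothetical features per device: (alarms per day, mean board temperature in C)
features = [(0.1, 40.0), (0.2, 42.0), (5.0, 70.0), (5.5, 72.0)]
labels, centroids = kmeans(features, initial_centroids=[features[0], features[2]])
print(labels)  # [0, 0, 1, 1]
```

Because k-means uses raw Euclidean distance, features on very different scales would normally be normalised first; the example values are chosen so that this does not matter.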
  • operational data may comprise at least one of operational configuration data for the hardware devices, or results data for offsite hardware fault analysis conducted on the hardware devices or on related hardware devices.
  • the hardware devices or related hardware devices on which fault analysis has been conducted may include devices that were returned to a manufacturer or sent to a repair centre. Such devices may include devices which were found to have a hardware fault and devices suspected of having a hardware fault but in which no hardware fault was identified.
  • offsite hardware fault analysis may include any hardware fault analysis conducted in a location different from the deployment location for the device. Such locations may include for example repair and analysis centres, device manufacturing sites or network locations different from the deployment location, for example a network location at which a hardware expert is located.
  • Results data for hardware fault analysis may further comprise device operational data available for the analysed hardware device prior to hardware fault analysis, and may also comprise conclusions drawn from the hardware fault analysis.
  • related hardware devices may comprise devices of the same make and/or model or may comprise devices having the same operational configuration.
  • operational configuration data may comprise at least one of neighbouring device connectivity, operational software information, operational firmware information, or device operational purpose.
  • Software and firmware information may for example include developer information, software identification and version number.
  • Device operational purpose may for example include services delivered over the device according to a particular operational configuration.
  • operational data may comprise at least one of device configuration data, device health data, device performance data, or service performance data for services provided over the device.
  • Device configuration data may for example include, for each device, what the device is, a device identifier, device manufacturer, device software/firmware installed at manufacture etc.
  • Device health data may for example include results of self-diagnostic tests conducted by the device, and device logs.
  • Device performance data may for example include performance information for the device and messages generated by device management units such as an OSS or EMS.
  • Service performance data for services provided over the device may for example include service performance records or customer/engineer feedback.
  • the method may further comprise providing an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit, and communicating the extraction tool to the at least one of the local hardware diagnostic units with the heuristic for hardware device fault identification.
  • the extraction tool may comprise a software component and may comprise an executable file or executable program (an executable) or a filter.
  • the extraction tool may enable a local hardware diagnostic unit to extract from all the data available to it, that data which is required as an input to the received heuristic.
  • the extraction tool may enable the transformation of data from a form in which it is received at the local hardware diagnostic unit into a form suitable for processing by the received heuristic.
  • providing the extraction tool may comprise a combination of automated processes and input from a human expert. Templates and/or libraries of extraction tools may be tested against the generated heuristic and refined to provide an extraction tool that outputs data of a kind and in a format suitable for input to the generated heuristic.
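An extraction tool of the "filter" kind described above can be sketched as a function that selects, from everything available at the local unit, only the data a given heuristic needs, and transforms it into the expected input form. The raw log line format, the inventory mapping and the output dictionary shape are hypothetical assumptions for illustration.

```python
import re
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical raw log format at the local unit: "<timestamp> <device_id> ALARM <code>"
ALARM_LINE = re.compile(r"^\S+\s+(?P<device>\S+)\s+ALARM\s+(?P<code>[A-Z_]+)\s*$")

def make_extraction_tool(required_alarms: Set[str]) -> Callable[..., Dict[str, object]]:
    """Build a filter that turns raw local data into the input a heuristic expects."""
    def extract(device_id: str, raw_lines: Iterable[str],
                inventory: Dict[str, str]) -> Dict[str, object]:
        alarms: List[str] = []
        for line in raw_lines:
            match = ALARM_LINE.match(line)
            # Filter: keep only this device's alarms that the heuristic actually uses
            if match and match.group("device") == device_id and match.group("code") in required_alarms:
                alarms.append(match.group("code"))
        # Transform: attach the device model and de-duplicate the alarm codes
        return {"model": inventory.get(device_id, "UNKNOWN"), "alarms": sorted(set(alarms))}
    return extract

extract = make_extraction_tool({"LINK_DOWN", "TEMP_HIGH"})
logs = [
    "2014-08-18T10:00:00 d7 ALARM LINK_DOWN",
    "2014-08-18T10:01:00 d7 INFO heartbeat",
    "2014-08-18T10:02:00 d8 ALARM TEMP_HIGH",
    "2014-08-18T10:03:00 d7 ALARM FAN_SLOW",
    "2014-08-18T10:04:00 d7 ALARM LINK_DOWN",
]
print(extract("d7", logs, {"d7": "RU-A"}))  # {'model': 'RU-A', 'alarms': ['LINK_DOWN']}
```

Building the tool from the set of alarms the heuristic consumes mirrors the idea that the tool is shipped together with, and matched to, a particular heuristic.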
  • the method may further comprise extracting, from the analysed data, performance statistics for network feedback messages and supplying the extracted performance statistics to a network feedback tool for refinement of the network feedback messages.
  • Network feedback messages may in some examples comprise network management or other alarms, hardware messages or log messages which may be sent and received as part of network or device health or performance data.
  • a network feedback tool may comprise any entity or element that manages the underlying logic for such messages and alarms, which logic, according to examples of the present invention, may be refined and improved on the basis of the extracted performance statistics for the network feedback messages.
  • Performance statistics may be accuracy statistics for alarms, log messages, etc. and/or other performance information, for example showing under which combinations of circumstances certain alarms or messages have greater or lesser accuracy.
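One plausible accuracy statistic for a network feedback message is the fraction of its occurrences that were later confirmed as genuine hardware faults, broken down by a circumstance such as software version. The sketch assumes events have already been joined with offsite analysis outcomes; the tuple layout and the choice of software version as the grouping circumstance are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, Tuple

def alarm_accuracy(events: Iterable[Tuple[str, str, bool]]) -> Dict[Tuple[str, str], float]:
    """events: (alarm code, software version, hardware fault confirmed?).
    Returns the share of occurrences confirmed as genuine, per (alarm, version)."""
    hits: DefaultDict[Tuple[str, str], int] = defaultdict(int)
    totals: DefaultDict[Tuple[str, str], int] = defaultdict(int)
    for alarm, sw_version, confirmed in events:
        key = (alarm, sw_version)
        totals[key] += 1
        hits[key] += int(confirmed)
    return {key: hits[key] / totals[key] for key in totals}

events = [
    ("LINK_DOWN", "R1", True), ("LINK_DOWN", "R1", True),
    ("LINK_DOWN", "R2", False), ("LINK_DOWN", "R2", False), ("LINK_DOWN", "R2", True),
]
for key, accuracy in sorted(alarm_accuracy(events).items()):
    print(key, round(accuracy, 2))
# ('LINK_DOWN', 'R1') 1.0
# ('LINK_DOWN', 'R2') 0.33
```

A result like this would suggest that the same alarm is far less reliable under one software version, which is the kind of insight a network feedback tool could use to refine the alarm logic.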
  • generating a heuristic for hardware device fault identification may comprise generating a first version of a heuristic for hardware device fault identification and refining the generated first version heuristic.
  • the first version heuristic may be generated in an entirely automated manner on the basis of the analysed collected data.
  • the first version heuristic may be refined in an automated manner and/or on the basis of input from a human expert.
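The split between an automatically generated first version and a refined version can be illustrated with a very simple learner. The sketch swaps in a decision stump (a single learned threshold) as the automated first version, and models expert refinement as a list of software versions whose readings an expert has marked as untrustworthy. The metric (CRC errors per hour), the labels and the exclusion rule are all hypothetical.

```python
from typing import Callable, FrozenSet, Sequence, Tuple

def learn_threshold(values: Sequence[float], faulty: Sequence[bool]) -> Tuple[float, float]:
    """Automated first version: pick the threshold t for the rule 'value >= t means
    hardware fault' that classifies the labelled examples most accurately."""
    best_t, best_acc = values[0], -1.0
    for t in sorted(set(values)):
        correct = sum((v >= t) == f for v, f in zip(values, faulty))
        accuracy = correct / len(values)
        if accuracy > best_acc:  # ties keep the lowest threshold
            best_t, best_acc = t, accuracy
    return best_t, best_acc

def build_heuristic(threshold: float,
                    excluded_sw: FrozenSet[str] = frozenset()) -> Callable[[float, str], bool]:
    """Wrap the threshold as a heuristic; excluded_sw is a (hypothetical) expert
    refinement listing software versions whose readings are not trusted."""
    def heuristic(value: float, sw_version: str) -> bool:
        if sw_version in excluded_sw:
            return False
        return value >= threshold
    return heuristic

crc_errors_per_hour = [1, 2, 3, 10, 12, 15]
fault_confirmed = [False, False, False, True, True, True]
threshold, accuracy = learn_threshold(crc_errors_per_hour, fault_confirmed)
print(threshold, accuracy)  # 10 1.0

first_version = build_heuristic(threshold)
refined = build_heuristic(threshold, frozenset({"R2"}))
print(first_version(12, "R2"), refined(12, "R2"), refined(12, "R1"))  # True False True
```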
  • the at least one local hardware diagnostic unit may be at least partially implemented in a device management unit.
  • A device management unit may for example comprise an EMS or an OSS.
  • the at least one local hardware diagnostic unit may be at least partially implemented in a hardware device.
  • a local hardware diagnostic unit may thus be implemented entirely within a device management unit such as an EMS or OSS, entirely within a hardware device or may be shared between the device management unit and the hardware device.
  • the network comprises a telecommunication network.
  • a method in a local hardware diagnostic unit, for identifying hardware faults in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit. The method comprises identifying hardware devices associated with the local hardware diagnostic unit, receiving, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices, receiving at least one of device configuration data, device health data or device performance data for the identified devices, applying the received heuristic to the received data, and outputting a result of the applied heuristic.
  • The result of the applied heuristic may be an indication of the likely hardware health status of the device, and may be accompanied by an explanation of the basis for the indication and a recommendation as to the appropriate action to be taken in light of the indicated likely health status.
  • the recommendation may be device replacement.
  • the recommendation may include additional diagnostic or onsite repair options.
  • the method may further comprise requesting the heuristic corresponding to the identified devices from the hardware fault management unit.
  • the method may further comprise checking the received at least one of device configuration data, device health data or device performance data for a trigger and applying the received heuristic to the received data on detection of the trigger.
  • the received heuristic may be applied only when indicated by certain triggers within the received device data. For example, on receipt of an error message or other message indicating a perceived hardware fault, the local hardware diagnostic unit may run the received heuristic to determine whether the perceived hardware fault is likely to be caused by an actual hardware fault or is likely to be caused by some other issue resulting from the operational configuration of the network.
  • the trigger may comprise an item of received data indicating a perceived hardware fault.
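The local method (check received data for a trigger, apply the received heuristic only when a trigger is present, output a result with an explanation and a recommendation) can be sketched as follows. The trigger codes, the heuristic format (alarm code mapped to an estimated probability of a genuine hardware fault), the 0.5 decision threshold and the recommendation wording are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical alarm codes treated as triggers (data indicating a perceived hardware fault)
TRIGGER_CODES = {"HW_FAULT_SUSPECTED", "LINK_DOWN"}

@dataclass
class DiagnosticResult:
    device_id: str
    likely_hardware_fault: bool
    explanation: str
    recommendation: str

class LocalHardwareDiagnosticUnit:
    def __init__(self, devices: Dict[str, str], threshold: float = 0.5) -> None:
        self.devices = devices  # identified devices: device_id -> model
        self.heuristics: Dict[str, Dict[str, float]] = {}
        self.threshold = threshold  # assumed decision threshold

    def receive_heuristic(self, model: str, heuristic: Dict[str, float]) -> None:
        self.heuristics[model] = heuristic

    def on_data(self, device_id: str, alarms: List[str]) -> Optional[DiagnosticResult]:
        # Check the received data for a trigger; without one the heuristic is not applied
        if not any(alarm in TRIGGER_CODES for alarm in alarms):
            return None
        heuristic = self.heuristics.get(self.devices.get(device_id, ""), {})
        # Apply the heuristic: strongest association among all observed alarms
        score, cause = max((heuristic.get(alarm, 0.0), alarm) for alarm in alarms)
        if score >= self.threshold:
            return DiagnosticResult(
                device_id, True,
                f"{cause} was associated with a confirmed hardware fault in {score:.0%} of analysed cases",
                "replace device")
        return DiagnosticResult(
            device_id, False,
            f"no observed alarm reaches {self.threshold:.0%} association with confirmed hardware faults",
            "run additional diagnostics or check configuration before replacing hardware")

unit = LocalHardwareDiagnosticUnit({"d7": "RU-A"})
unit.receive_heuristic("RU-A", {"LINK_DOWN": 0.2, "TEMP_HIGH": 0.9})
print(unit.on_data("d7", ["HEARTBEAT"]))  # None
print(unit.on_data("d7", ["LINK_DOWN", "TEMP_HIGH"]).recommendation)  # replace device
```

A `LINK_DOWN` alarm on its own would be triggered but scored at 20%, so the sketch would point towards a non-hardware cause rather than device replacement.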
  • the local hardware diagnostic unit may be at least partially implemented in a device management unit.
  • the local hardware diagnostic unit may be at least partially implemented in a hardware device.
  • the network may comprise a telecommunication network.
  • a computer program product configured, when run on a computer, to carry out a method according to any one of the preceding aspects of the invention.
  • a hardware fault management unit configured for managing hardware fault identification in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit.
  • the hardware fault management unit comprises a data unit configured to collect operational data for the hardware devices, an analysis unit configured to analyse the collected data, a heuristic unit configured to generate, based on analysis conducted by the analysis unit, a heuristic for hardware device fault identification, and a communication unit configured to communicate the generated heuristic to at least one of the local hardware diagnostic units.
  • the analysis unit may be configured to apply at least one of a data mining, machine learning or analytics algorithm to the data collected by the data unit.
  • the hardware fault management unit may further comprise an extraction unit configured to provide an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit, and the communication unit may be further configured to communicate the extraction tool to the at least one of the local hardware diagnostic units with the heuristic for hardware device fault identification.
  • the hardware fault management unit may further comprise a performance unit configured to extract, from the analysed data of the analysis unit, performance statistics for network feedback messages, and a feedback unit configured to supply the extracted performance statistics to a network feedback tool for refinement of the network feedback messages.
  • The heuristic unit may comprise an automated sub-unit configured to generate a first version of a heuristic for hardware device fault identification, and a refining sub-unit configured to refine the generated first version heuristic.
  • the refining sub-unit may be managed by a human expert operator, whose input may be incorporated into the refined heuristic.
  • a local hardware diagnostic unit configured for identifying hardware faults in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit.
  • the local hardware diagnostic unit comprises an identifying unit configured to identify hardware devices associated with the local hardware diagnostic unit, a heuristic unit configured to receive, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices, a data unit configured to receive at least one of device configuration data, device health data or device performance data for the identified devices, an analysis unit configured to apply the received heuristic to the received data, and an output unit configured to output a result of the applied heuristic.
  • the analysis unit may further comprise a trigger sub-unit configured to check the data received by the data unit for a trigger, and the analysis unit may be configured to apply the received heuristic to the received data on detection of the trigger by the trigger sub-unit.
  • the local hardware diagnostic unit may be at least partially implemented in a device management unit. According to some examples, the local hardware diagnostic unit may be at least partially implemented in a hardware device.
  • a hardware fault management unit configured for managing hardware fault identification in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit.
  • the hardware fault management unit comprises a processor and a memory, the memory containing instructions executable by the processor whereby the hardware fault management unit is operative to collect operational data for the hardware devices, analyse the collected data, generate, based on the analysis, a heuristic for hardware device fault identification, and communicate the generated heuristic to at least one of the local hardware diagnostic units.
  • the hardware fault management unit may be further operative to apply at least one of a data mining, machine learning or analytic algorithm to the collected data.
  • operational data may comprise at least one of operational configuration data for the hardware devices, or results data for offsite hardware fault analysis conducted on the hardware devices or on related hardware devices.
  • operational configuration data may comprise at least one of neighbouring device connectivity, operational software information, operational firmware information, or device operational purpose.
  • Device operational purpose may include services provided over the device.
  • operational data may comprise at least one of device configuration data, device health data, device performance data, or service performance data for services provided over the device.
  • the hardware fault management unit may be further operative to provide an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit, and communicate the extraction tool to at least one of the local hardware diagnostic units with the heuristic for hardware device fault identification.
  • the hardware fault management unit may be further operative to extract, from the analysed data, performance statistics for network feedback messages, and supply the extracted performance statistics to a network feedback tool for refinement of the network feedback messages.
  • the hardware fault management unit may be further operative to generate a first version of a heuristic for hardware device fault identification, and refine the generated first version heuristic.
  • the network comprises a telecommunication network.
  • a local hardware diagnostic unit configured for identifying hardware faults in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit.
  • the local hardware diagnostic unit comprises a processor and a memory, the memory containing instructions executable by the processor whereby the local hardware diagnostic unit is operative to identify hardware devices associated with the local hardware diagnostic unit, receive, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices, receive at least one of device configuration data, device health data or device performance data for the identified devices, apply the received heuristic to the received data, and output a result of the applied heuristic.
  • the local hardware diagnostic unit may be further operative to check the received at least one of device configuration data, device health data or device performance data for a trigger, and apply the received heuristic to the received data on detection of the trigger.
  • The trigger may comprise an item of received data indicating a perceived hardware fault.
  • the local hardware diagnostic unit may be at least partially implemented in a device management unit. According to some examples, the local hardware diagnostic unit may be at least partially implemented in a hardware device.
  • the network comprises a telecommunication network.
  • Figure 1 illustrates elements in a network facilitating fault identification management
  • Figure 2 is a flow chart illustrating process steps in a method for managing fault identification in a network
  • Figure 3 is a flow chart illustrating process steps in another example of a method for managing fault identification in a network
  • Figure 4 is a block diagram illustrating functionality in a hardware fault management unit
  • Figure 5 is a flow chart illustrating process steps in a method for identifying hardware faults in a network
  • Figure 6 is a flow chart illustrating process steps in another example of a method for identifying hardware faults in a network
  • Figure 7 is a block diagram illustrating functionality in a local hardware diagnostic unit
  • Figure 8 is a block diagram illustrating alternative implementations of a local hardware diagnostic unit
  • Figure 9 is a block diagram illustrating connectivity of an example local hardware diagnostic unit
  • Figure 10 is a block diagram illustrating functional units in a hardware fault management unit
  • Figure 11 is a block diagram illustrating functional units in a local hardware diagnostic unit
  • Figure 12 is a block diagram illustrating functional units in another example of a hardware fault management unit
  • Figure 13 is a block diagram illustrating functional units in another example of a local hardware diagnostic unit.
  • aspects of the present invention propose a distributed, analytics driven solution to the challenge of hardware fault identification.
  • Methods according to the present invention process a wide range of information which may be collected over the lifetime of hardware devices to develop diagnostics tools permitting accurate identification of hardware faults. Examples of information which may be taken into account in developing hardware diagnostic tools include perceived hardware faults, hardware node configurations, hardware and software versions, details of device interconnection, hardware device return data and test results for analysis of returned hardware devices suspected of being faulty.
  • a hardware fault management unit applies a portfolio of analytics algorithms to identify key patterns and advanced diagnostic information relevant to hardware failure analytics.
  • The output of the hardware fault management unit is communicated in the form of a diagnostic heuristic to a plurality of local hardware diagnostic units, each of which may then use the received heuristic to analyse suspected hardware faults. Because it is developed on the basis of a wide range of operational data, the received heuristic offers greater accuracy in diagnosing hardware faults than existing fault management systems.
  • the hardware fault management unit applies a portfolio of analytic algorithms to the wide range of operational data collected for the plurality of hardware devices distributed throughout the network.
  • the algorithms may include for example N-gram analysis (text-analytics based), clustering, classification and regression analysis.
  • The hardware fault management unit may also include functionality enabling the receipt of input from a human expert, for example in the form of additional analysis code or execution modules.
  • the result of the analysis conducted in the hardware fault management unit is one or more heuristics including evidence based hardware diagnostic descriptors and models.
  • the hardware fault management unit is coupled with a dissemination system, over which such heuristics may be communicated to the plurality of local hardware diagnostic units.
  • the local hardware diagnostic units may then apply the received heuristics to distinguish between perceived hardware faults arising from a genuine hardware problem, and perceived hardware faults that are in fact caused by an issue unrelated to the hardware itself.
  • the local hardware diagnostic units may be distributed throughout the network and may perform diagnostic procedures on hardware devices and network systems suspected of having a fault. When a new suspected fault is detected in the network, a local hardware diagnostic unit may use its received heuristics to determine the likelihood of the perceived fault being linked to the implicated hardware. This information may help operators and technicians to make informed decisions about how to resolve the perceived fault, including when to replace a hardware device and when to attempt to resolve the perceived fault via alternative actions.
  • the analysis of operational data by the hardware fault management unit may yield accuracy statistics and performance insights for network alarms, hardware messages and other log messages generated for example by network management elements such as an OSS or EMS.
  • This accuracy and performance information may be fed back to network field engineers and also to design and development, testing and bug fixing units enabling such units to refine and improve the underlying logic and code that create the network alarms and messages.
  • An example telecommunication network comprises a plurality of hardware devices 2 distributed throughout the network in a range of geographic locations.
  • the hardware devices 2 are managed via EMS/OSS elements 4.
  • Each EMS/OSS 4 is associated with a local hardware diagnostic unit 6.
  • a centralised hardware fault management unit 8 is in communication with the local hardware diagnostic units 6.
  • a data warehouse 10, which may be within or outside the network, receives hardware device data from the EMS/OSS elements 4.
  • the data warehouse 10 also receives information from a repair centre 12 treating returned hardware suspected of being faulty.
  • a human expert at the repair centre may augment the data provided by the repair centre to the data warehouse.
  • Data from the data warehouse 10 is accessible to the hardware fault management unit 8.
  • The hardware fault management unit 8 applies analytics algorithms to data stored in the data warehouse 10, received from the EMS/OSS elements 4 and the repair centre 12.
  • The hardware fault management unit 8 may receive certain data directly from the EMS/OSS elements 4, from the hardware devices 2 and/or from the repair centre 12, without passing through the intermediary of the data warehouse 10.
  • the hardware fault management unit 8 generates heuristics in the form of hardware diagnostic descriptors and models which are disseminated to the local hardware diagnostic units.
  • The generated heuristics may be continuously updated and refined on the basis of newly arriving data. For example, as hardware devices are sent to the repair centre 12, new data concerning perceived hardware faults is generated at the repair centre 12 and made available to the hardware fault management unit 8 via the data warehouse 10. The hardware fault management unit may take this new data into account in refining the heuristics sent to the local hardware diagnostic units.
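Continuous refinement from newly arriving repair-centre data can be illustrated with a running estimate that is updated one report at a time, so that heuristics can be regenerated without reprocessing all history. The (model, symptom) key and the add-alpha (Laplace) smoothing are swapped-in assumptions, not details given in the source.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple

Key = Tuple[str, str]  # hypothetical key: (device model, reported symptom)

class IncrementalFaultRate:
    """Running estimate of P(genuine hardware fault | model, symptom), refined as
    repair-centre reports arrive. Add-alpha (Laplace) smoothing keeps keys with little
    evidence near 0.5 instead of jumping straight to 0 or 1."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha
        self.faults: DefaultDict[Key, int] = defaultdict(int)
        self.returns: DefaultDict[Key, int] = defaultdict(int)

    def add_repair_report(self, model: str, symptom: str, hardware_fault_found: bool) -> None:
        key = (model, symptom)
        self.returns[key] += 1
        self.faults[key] += int(hardware_fault_found)

    def estimate(self, model: str, symptom: str) -> float:
        key = (model, symptom)
        return (self.faults.get(key, 0) + self.alpha) / (self.returns.get(key, 0) + 2 * self.alpha)

rates = IncrementalFaultRate()
print(rates.estimate("RU-A", "LINK_DOWN"))  # 0.5 (no evidence yet)
for found in (False, False, True):  # three returned units, one genuinely faulty
    rates.add_repair_report("RU-A", "LINK_DOWN", found)
print(rates.estimate("RU-A", "LINK_DOWN"))  # 0.4
```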
  • The repair centre or centres 12 collect data from returned hardware devices, including for example the nature of the hardware device, device performance while in the network, symptoms of unexpected behaviour, conclusions as to whether or not the device presented a hardware fault and, if a hardware fault was identified, the type of repair action taken.
  • This data is passed to the data warehouse 10, which also receives network alarms and messages and relevant configuration details for hardware devices including network topology and device interconnections, which may be obtained from the relevant EMS/OSS.
  • The hardware fault management unit 8 extracts value from the data available in the data warehouse 10 by conducting data analysis. Examples of data analysis which may be conducted on the data include frequency counts of terms and of term combinations, clustering, regression, classification algorithms, identification of correlations and extraction of heuristics and patterns. This analysis enables the identification of relationships between the different types of data and records in the data warehouse, and hence the generation of one or more heuristics, or experience based techniques, for diagnosing perceived hardware faults. These heuristics may be enhanced by a human operator with domain knowledge. The heuristics generated by the hardware fault management unit 8 are then communicated to the local hardware diagnostic units 6, enabling the local hardware diagnostic units 6 to make recommendations based on the data present in the field and the received heuristics. Actions at the hardware fault management unit 8 and local hardware diagnostic units 6 are discussed in detail below.
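Frequency counts of terms and term combinations (the N-gram, text-analytics style of analysis mentioned earlier) might look like the sketch below: contiguous N-grams are counted over free-text log messages, and unordered pairs of alarm codes are counted across records. Whitespace tokenisation, lower-casing and the sample messages are simplifying assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Contiguous term combinations of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequencies(messages: Iterable[str], max_n: int = 2) -> Counter:
    """Count single terms and contiguous term combinations (up to max_n) in log text."""
    counts: Counter = Counter()
    for message in messages:
        tokens = message.lower().split()
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts

def co_occurrence(records: Iterable[Iterable[str]]) -> Counter:
    """Count unordered pairs of codes reported together in the same record."""
    pairs: Counter = Counter()
    for codes in records:
        pairs.update(combinations(sorted(set(codes)), 2))
    return pairs

logs = ["Fan speed low", "fan speed low restart", "link down"]
tf = term_frequencies(logs)
print(tf[("fan", "speed")], tf[("low",)])  # 2 2
reports = [["LINK_DOWN", "TEMP_HIGH"], ["TEMP_HIGH", "LINK_DOWN"], ["FAN_SLOW"]]
print(co_occurrence(reports)[("LINK_DOWN", "TEMP_HIGH")])  # 2
```

Counts like these are raw inputs; relating them to repair outcomes (for example with the correlation or classification steps listed above) is what would turn them into a diagnostic heuristic.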
  • Figure 2 illustrates process steps in a method 100, conducted in a hardware fault management unit 8, for managing hardware fault identification in a network.
  • the network may be a telecommunication network and comprises a plurality of hardware devices 2, each hardware device 2 associated with a local hardware diagnostic unit 6.
  • each hardware device 2 may be associated with a dedicated local hardware diagnostic unit 6.
  • alternatively, a plurality of hardware devices 2 may be associated with a single local hardware diagnostic unit 6.
  • at step 110, the hardware fault management unit 8 collects operational data for the hardware devices.
  • the operational data collected may comprise a wide range of different types of operational data, examples of which are discussed below.
  • a first example of operational data comprises operational configuration data.
  • This may include, for example, information on operational software and firmware installed on a device for a particular network configuration.
  • the information may include a name and version number of the software and firmware, developer identification and supported platforms or interfaces.
  • Operational configuration data may also include device operational purpose.
  • the operational purpose of a device may vary according to a particular network configuration, as for example different network services are provided over the device. Details of such services may be included in information on the operational purpose of a device.
  • Operational configuration data may also include neighbouring device connectivity. Thus for each device, an indication of connectivity to other devices may be included.
  • Neighbouring device connectivity may be combined with other types of operational configuration data and other types of operational data, such that for example, the manufacturer and operational details of a device, and each device to which it is connected, may be provided.
  • by taking operational configuration data into account in generating diagnostic heuristics, examples of the method 100 permit diagnostic heuristics to be generated on the basis of a whole system view. Diagnosing hardware faults at least in part on the basis of operational configuration data, as opposed to device health data alone as in known systems, represents a broader perspective that allows for more accurate fault identification.
  • the usefulness of operational configuration data may be appreciated by considering the example of a hardware device X in communication with a hardware device Y and a hardware device Z.
  • the hardware device X is configured to receive data from device Y and forward that data to device Z.
  • the network management system for devices X, Y and Z observes that data is being passed to X but is not being received from X at device Z. According to existing fault identification systems, the network management system thus concludes that there is a problem with the hardware device X and initiates device diagnostics and health check.
  • a fault in hardware device X is, however, only one of many possible causes for the failure of data transfer from X to Z. There may for example be an incompatibility in the software operating on devices X and Z.
  • a simple disparity in version number of the software operating on the two devices may render X unable to transmit its data to Z, for example if a later version of the software is not backwards compatible.
  • different devices in the same network will frequently be sourced from different manufacturers, which may lead to compatibility problems between devices. Configuration issues such as these may be identified through consideration of operational configuration data when generating fault diagnostic tools.
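The X, Y and Z example can be made concrete with a short sketch of a configuration-aware check that looks for configuration explanations before concluding that device X is faulty. The record fields, the example values and especially the rule that a difference in major software version signals incompatibility are hypothetical assumptions for illustration, not rules stated in the source.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceConfig:
    """Hypothetical slice of operational configuration data for one device."""
    device_id: str
    manufacturer: str
    software: str
    version: tuple  # e.g. (2, 1) for version 2.1
    neighbours: list = field(default_factory=list)


def diagnose_forwarding_failure(sender: DeviceConfig, receiver: DeviceConfig) -> str:
    """Suggest a cause when data reaches `sender` but never arrives at `receiver`."""
    if receiver.device_id not in sender.neighbours:
        return "not connected: check topology"
    if sender.software == receiver.software and sender.version[0] != receiver.version[0]:
        # Assumed rule: different major versions may not be backwards compatible.
        return "possible software incompatibility: align versions"
    if sender.manufacturer != receiver.manufacturer:
        return "possible multi-vendor interoperability issue: check configuration"
    return "no configuration explanation: run hardware diagnostics on sender"


x = DeviceConfig("X", "VendorA", "fwd-sw", (3, 0), ["Y", "Z"])
z = DeviceConfig("Z", "VendorA", "fwd-sw", (2, 4), ["X"])
print(diagnose_forwarding_failure(x, z))
```

Only when no configuration explanation is found does the sketch fall back to the hardware diagnostics that existing systems would have triggered immediately.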
  • a second example of operational data comprises results data for offsite hardware fault analysis. The fault analysis may be for hardware devices within the network, for example devices which have previously been returned to a manufacturer or sent to a repair centre and have been re-installed in the network following completed fault analysis and any resulting repairs. Such devices may have been found to be faulty and repaired, or may have been falsely identified as faulty and discovered during the offsite analysis to be functioning correctly.
  • the fault analysis may be for related hardware devices, such as devices of the same make and model, or having the same operational configuration as devices now installed in the network.
  • the offsite hardware fault analysis may be conducted in a location different from the deployment location for the device.
  • locations may include for example repair and analysis centres, device manufacturing sites or network locations different from the deployment location, for example a network location at which a hardware expert is located.
  • Results data for hardware fault analysis may comprise details of tests conducted and raw test result data.
  • the results data may also include conclusions drawn concerning the device as well as device operational data available for the analysed hardware device prior to hardware fault analysis.
  • Another example of operational data comprises device configuration data. This may include for example an identification of what a device is, a device identifier, device manufacturer, device software/firmware installed at manufacture etc. Further examples of operational data include device health data, device performance data, and service performance data for services provided over the device.
  • Device health data may include results of self diagnostic tests and logs conducted by the device.
  • Device performance data may for example include performance information for the device and messages generated by device management units such as an OSS or EMS.
  • Service performance data for services provided over the device may for example include service performance records or customer/engineer feedback.
  • the hardware fault management unit 8 proceeds, at step 120, to analyse the collected data and, at step 130, to generate a heuristic for hardware device fault identification based on the analysis.
  • Analysis of the collected data may include pre-processing of the data, for example to place the data into suitable form for analysis.
  • Analysing the data comprises applying one or more data mining, machine learning or analytics algorithms to the data, in order to identify correlations and patterns within the data which relate device data, device monitored and reported behaviour and device operational configuration to actual device hardware malfunction.
  • one or more heuristics are generated enabling hardware device fault identification.
  • the heuristic may be any experience based technique including for example descriptors, models, patterns, classifiers etc.
  • at step 150, the generated heuristic is communicated to at least one of the local hardware diagnostic units 6, where it may be used to assist with hardware fault identification and diagnosis, as discussed in further detail below.
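The steps of method 100 can be sketched end to end under strong simplifying assumptions: the collected data is reduced to (alarm type, confirmed fault) pairs, the analysis is a per-alarm confirmed-fault rate, and the heuristic is the set of alarm types whose rate exceeds a threshold. The function name, threshold and the use of plain dictionaries to stand in for local hardware diagnostic units are all hypothetical.

```python
from collections import Counter


def manage_fault_identification(records, local_units, threshold=0.5):
    """Sketch of method 100 under simplifying assumptions.

    records: (alarm_type, confirmed_faulty) pairs standing in for the collected
        operational data, e.g. repair-centre verdicts joined with network alarms.
    local_units: dicts standing in for local hardware diagnostic units.
    """
    # Step 110: collect the operational data.
    records = list(records)
    # Step 120: analyse how often each alarm type preceded a confirmed fault.
    raised = Counter(alarm for alarm, _ in records)
    confirmed = Counter(alarm for alarm, faulty in records if faulty)
    rates = {alarm: confirmed[alarm] / raised[alarm] for alarm in raised}
    # Step 130: generate the heuristic: alarm types whose fault rate exceeds the threshold.
    heuristic = {alarm: rate for alarm, rate in rates.items() if rate > threshold}
    # Step 150: communicate a copy of the heuristic to each local unit.
    for unit in local_units:
        unit["heuristic"] = dict(heuristic)
    return heuristic
```

A real implementation would replace the rate calculation with the data mining, machine learning or analytics algorithms described above; the sketch only shows how the four steps fit together.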
  • FIG. 3 illustrates one example method 200, in which the steps of the method 100 may be implemented.
  • the hardware fault management unit 8 collects operational data for hardware devices 2 in the network. As discussed above, this data may be received from various different sources and may include, inter alia, operational configuration data, results data for offsite hardware fault analysis, device configuration data, device health data, device performance data and/or service performance data. Following collection of the data, the hardware fault management unit 8 applies analysis algorithms to the collected data in step 220, and, in step 230a, generates a first version heuristic for hardware fault identification based on the analysis.
  • the generated first version heuristic is then refined in step 230b, for example incorporating input from a human expert operator, as discussed in further detail below.
  • the hardware fault management unit 8 then proceeds, in step 240, to provide an extraction tool for extracting input data for the generated heuristic.
  • An extraction tool may be an executable file or program (an executable), a filter, or any other tool enabling the extraction of data to serve as input data for the generated heuristic.
  • the amount of data available to a local hardware diagnostic unit 6 may be vast.
  • a unit associated with a plurality of individual hardware devices may receive a wide range of data from the devices themselves as well as from management elements such as an OSS or EMS managing the devices. Only some of this data may be relevant to the generated heuristic, and the generated heuristic may require the data to be input in a form that is different to that in which it is received at the local hardware diagnostic unit.
  • Complex parsing tasks and/or information extraction methods may be necessary to extract the relevant information from the hardware log, alarm log or other data available at the local hardware diagnostic units 6 before any generated heuristic such as pattern matching may be applied.
  • the extraction tool may achieve these tasks, enabling the local hardware diagnostic unit 6 to extract, from the wide range of data available to it, that data which is required as input data for the generated heuristic.
  • the extraction tool may comprise a combination of filter and executable file or program.
  • the filter may enable the isolation of only that data required by the generated heuristic and the executable may transform the filtered data into a form in which it may be processed by the generated heuristic.
  • one or more extraction tools may be necessary to enable the insights gained from central analysis of operational data to be implemented in local hardware diagnostic units.
  • Providing the extraction tool may comprise the steps of identifying input data required by the generated heuristic, identifying what data is available to the local hardware diagnostic unit or units, and selecting one or more extraction tools to generate the identified input data from the identified available data.
  • Selection of a suitable extraction tool or tools may involve consideration of templates or libraries of existing extraction tools and identifying a tool or a combination of tools that are suitable for extracting the identified input data.
  • for example, an event filter may isolate reports of a particular event required for the generated heuristic from all event log data, and an executable file may convert the filtered event data to a form suitable for input to the generated heuristic.
  • providing the extraction tool may be an automated process, or it may include both automated processing and input from a human expert.
  • templates and libraries of extraction tools may be consulted to allow selection of suitable extraction tools as part of an automated generation of a first version extraction tool or tools, and these first version tools may be refined by a human expert.
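The filter-plus-executable combination described above can be sketched as two small functions. The log line format (timestamp, device identifier, upper-case event name, then numeric key=value attributes), the event name and the shape of the output feature dictionary are hypothetical assumptions made for this example.

```python
import re

# Assumed log line format: "<timestamp> <device_id> <EVENT_NAME> key=value ..."
LOG_LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<device>\S+)\s+(?P<event>[A-Z_]+)(?P<rest>.*)$")
NUMERIC_ATTR = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)")


def event_filter(lines, wanted_events):
    """Filter: keep only parsed log lines whose event the heuristic needs."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("event") in wanted_events:
            yield match


def to_heuristic_input(matches):
    """Executable step: turn filtered events into per-device feature dicts."""
    features = {}
    for m in matches:
        device = features.setdefault(m.group("device"), {})
        # Count occurrences of each wanted event per device.
        device[m.group("event")] = device.get(m.group("event"), 0) + 1
        # Keep the latest numeric attribute values, e.g. rpm=1200 -> {"rpm": 1200.0}.
        for key, value in NUMERIC_ATTR.findall(m.group("rest")):
            device[key] = float(value)
    return features


lines = [
    "2014-08-18T10:00:00 node17 FAN_SPEED_LOW rpm=1200",
    "2014-08-18T10:00:05 node17 USER_LOGIN user=ops",
    "2014-08-18T10:01:00 node17 FAN_SPEED_LOW rpm=900",
]
print(to_heuristic_input(event_filter(lines, {"FAN_SPEED_LOW"})))
```

Keeping the filter separate from the transformation mirrors the split described above: the filter limits the volume of data handled locally, while the executable adapts its format to what the heuristic expects.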
  • the hardware fault management unit 8 proceeds to communicate the generated heuristic and provided extraction tool to at least one of the local hardware diagnostic units 6 in step 250.
  • the hardware fault management unit 8 may also extract performance statistics at step 260 for network feedback messages from the data analysed in step 220. These performance statistics may be supplied to a network feedback tool in step 270.
  • the network feedback messages may be alarms, log messages or any other type of message generated in the network for example as part of network or device health or performance data.
  • the performance statistics for the network feedback messages may be accuracy statistics for alarms, log messages etc. and/or may be other performance information, for example demonstrating under what combinations of circumstances certain alarms or messages have greater or lesser accuracy.
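One possible accuracy statistic is the fraction of times an alarm corresponded to a confirmed hardware fault, broken down by a circumstance such as the installed software release. The sketch below assumes records with hypothetical field names ("alarm", "confirmed_fault" and a context attribute such as "sw_version"); it illustrates the idea and is not the method prescribed by the source.

```python
from collections import defaultdict


def alarm_accuracy(records, context_key):
    """Fraction of alarms that matched a confirmed fault, split by one context attribute.

    records: dicts with "alarm", "confirmed_fault" (bool) and context attributes
        such as "sw_version" (hypothetical field names).
    Returns {(alarm, context_value): (accuracy, alarms_raised)}.
    """
    raised = defaultdict(int)
    confirmed = defaultdict(int)
    for r in records:
        key = (r["alarm"], r.get(context_key))
        raised[key] += 1
        confirmed[key] += int(r["confirmed_fault"])
    return {key: (confirmed[key] / raised[key], raised[key]) for key in raised}


records = [
    {"alarm": "BOARD_FAIL", "sw_version": "R1", "confirmed_fault": True},
    {"alarm": "BOARD_FAIL", "sw_version": "R1", "confirmed_fault": True},
    {"alarm": "BOARD_FAIL", "sw_version": "R2", "confirmed_fault": False},
    {"alarm": "BOARD_FAIL", "sw_version": "R2", "confirmed_fault": False},
    {"alarm": "BOARD_FAIL", "sw_version": "R2", "confirmed_fault": True},
]
print(alarm_accuracy(records, "sw_version"))
```

In this made-up data the same alarm is reliable under one release and mostly false under another, which is the kind of insight that could be passed to a network feedback tool.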
  • FIG. 4 is a block diagram illustrating functionality and data flow, including inputs and outputs, for an example hardware fault management unit 8.
  • the hardware fault management unit 8 receives operational data OD as input from various different sources. These sources may include:
  • Hardware device log files containing lower level device health related information may be received from suspected faulty devices and non-faulty devices for the purposes of comparison.
  • Information relevant to services that depend on the hardware devices for example services delivered over the hardware devices. Such information may be obtained from a Service Performance Monitoring system.
  • Repair centre and manufacturer fault analysis information including relevant device operating information, fault analysis conducted, results obtained and conclusions drawn.
  • the received operational data OD is analysed by a core hardware data analytics function A. This analysis involves the application of algorithms which may include for example text mining algorithms for various text based data.
  • Such data may include hardware log files, alarm and event logs and human descriptions relating to hardware health status such as customer, field engineer and support engineer reports.
  • the text mining algorithms may create term and term-combination frequency measures and association rules, as well as identifying correlations, patterns and tools for classifying a data set, for example a data set corresponding to a suspected faulty hardware device, with a certain probability as either a true faulty hardware situation or a false faulty hardware situation.
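To illustrate the association rules mentioned above, the sketch below mines rules of the form "a report containing all terms in T indicates a true hardware fault" from labelled reports, keeping rules that meet minimum support and confidence. It uses brute-force enumeration of small term sets rather than an optimised algorithm such as Apriori; the thresholds, function name and sample data are hypothetical.

```python
from itertools import combinations


def fault_rules(labelled_reports, min_support=2, min_confidence=0.8, max_size=2):
    """Mine rules 'report contains terms T => true hardware fault'.

    labelled_reports: list of (set_of_terms, is_true_fault) pairs.
    support: number of reports containing every term in T.
    confidence: fraction of those reports that were true faults.
    """
    rules = []
    vocabulary = sorted(set().union(*(terms for terms, _ in labelled_reports)))
    for size in range(1, max_size + 1):
        for combo in combinations(vocabulary, size):
            labels = [label for terms, label in labelled_reports if set(combo) <= terms]
            support = len(labels)
            if support >= min_support:
                confidence = sum(labels) / support
                if confidence >= min_confidence:
                    rules.append((combo, support, confidence))
    return rules


reports = [
    ({"fan", "noise", "alarm"}, True),
    ({"fan", "noise"}, True),
    ({"fan", "upgrade"}, False),
    ({"link", "upgrade"}, False),
    ({"noise", "alarm"}, True),
]
for rule in fault_rules(reports):
    print(rule)
```

Here "fan" alone does not qualify (two of its three reports were true faults), whereas "fan" together with "noise" does, showing why term combinations can discriminate better than single terms.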
  • the results of the core hardware data analytics A, that is, the generated heuristics in the form of descriptors, models, patterns, etc., may be placed into interim storage C, where they may be subject to human enrichment input B.
  • the human enrichment input may comprise a combination of automated and human expert refinement during which the generated heuristics are improved and enriched by a combination of automated processing and expert input.
  • Extraction tools such as filters and executables may also be provided to ensure that the correct input data in an acceptable format can be extracted at the local hardware diagnostic units 6 to enable these units 6 to apply the generated heuristics.
  • These filters and executables may be transferred along with the corresponding heuristics to the local hardware diagnostic units 6.
  • the enriched heuristics and extraction tool(s) are returned to interim storage C and passed to final storage D to be downloaded to the local hardware diagnostic units 6.
  • the heuristics and extraction tools may be updated periodically as new data becomes available.
  • the enriched heuristics and extraction tool(s) are passed to dissemination service E, the purpose of which is to facilitate the distribution of the heuristics and extraction tools to the local hardware diagnostic units 6 in the field.
  • the dissemination service E may ensure appropriate authentications, licensing, scalability aspects etc.
  • the core hardware data analytics A also provides performance statistics and insight for network feedback messages, which are transmitted at F to appropriate network feedback units FB, including design, development and bug fixing units. This feedback may enable refinement of the software and logic underlying these messages to correct errors and improve accuracy.
  • Figure 5 illustrates process steps in a method 300 for identifying hardware faults in a network, the method conducted in a local hardware diagnostic unit 6.
  • the network may be a telecommunication network and comprises a plurality of hardware devices 2, each hardware device 2 associated with a local hardware diagnostic unit 6.
  • each hardware device 2 may be associated with a dedicated local hardware diagnostic unit 6.
  • alternatively, a plurality of hardware devices 2 may be associated with a single local hardware diagnostic unit 6.
  • at step 310, the local hardware diagnostic unit 6 identifies hardware devices associated with it. This may be achieved for example by requesting a node list from configuration data held in a management element such as an OSS.
  • the local hardware diagnostic unit 6 then receives, from a hardware fault management unit 8, a heuristic for hardware fault identification corresponding to the identified devices at step 330.
  • the local hardware diagnostic unit 6 receives, at step 340, at least one of device configuration data, device health data or device performance data for the identified devices. Such information is received in the normal course of functioning for example from the devices themselves or from management elements such as an OSS or EMS.
  • the local hardware diagnostic unit 6 then applies the received heuristic to the received data at step 360 and outputs a result of the applied heuristic at step 370.
  • the method 300 thus complements the method 100, applying the heuristic generated according to the method 100 in order to identify hardware faults.
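A minimal sketch of method 300 at a local hardware diagnostic unit is shown below. It assumes the heuristic arrives as a callable, that a hypothetical `fetch_heuristic` function stands in for the exchange with the hardware fault management unit, and that received device data is a dictionary keyed by device identifier; none of these interfaces come from the source.

```python
def identify_hardware_faults(node_list, fetch_heuristic, device_data):
    """Sketch of method 300 with hypothetical stand-ins for the network.

    node_list: device ids associated with this unit (step 310).
    fetch_heuristic: callable(device_ids) -> heuristic function, standing in for
        the hardware fault management unit (step 330).
    device_data: {device_id: data} received from devices or an OSS/EMS (step 340).
    """
    heuristic = fetch_heuristic(node_list)
    results = {}
    for device_id in node_list:
        data = device_data.get(device_id)
        if data is not None:
            # Step 360: apply the received heuristic to the received data.
            results[device_id] = heuristic(data)
    # Step 370: output the results.
    return results


def fetch(ids):
    # Hypothetical heuristic: many CRC errors suggest a failed interface board.
    return lambda d: "potentially failed" if d.get("crc_errors", 0) > 100 else "potentially not failed"


print(identify_hardware_faults(["n1", "n2", "n3"], fetch,
                               {"n1": {"crc_errors": 500}, "n2": {"crc_errors": 3}}))
```

Devices without received data (n3 above) are simply skipped; a fuller implementation might instead report them as undecidable.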
  • FIG. 6 illustrates one example method 400, in which the steps of the method 300 may be implemented.
  • the local hardware diagnostic unit 6 identifies hardware devices associated with it.
  • the local hardware diagnostic unit 6 requests a heuristic corresponding to the identified devices from a hardware fault management unit 8 in step 420.
  • the requested heuristic is received in step 430 and at least one of device configuration data, device health data or device performance data for the identified devices is received at step 440.
  • the local hardware diagnostic unit 6 checks, at step 450, whether a trigger is present in the received data.
  • the trigger may comprise any data item that indicates a suspected or perceived hardware fault or failure. This may include a network alarm or log message or a suspect result from a periodic device health check or any other data element which may raise a question over the operational status or performance of a hardware device. If the trigger is present in the received data, the local hardware diagnostic unit 6 proceeds, at step 460 to apply the received heuristic to the received data for the device corresponding to the trigger and outputs a result of the received heuristic at step 470.
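The trigger check of steps 450 to 470 can be sketched as follows. The trigger definitions (a named set of alarms and any failed health check), the shape of the data items and the function names are hypothetical; the source only states that a trigger is any data item raising a question over a device's operational status or performance.

```python
# Hypothetical trigger alarms; a real unit would take these from its configuration.
TRIGGER_ALARMS = {"HW_FAULT", "BOARD_FAIL"}


def is_trigger(item):
    """True if a received data item suggests a suspected hardware fault."""
    if item.get("type") == "alarm" and item.get("name") in TRIGGER_ALARMS:
        return True
    if item.get("type") == "health_check" and not item.get("passed", True):
        return True
    return False


def run_triggered_diagnosis(items, heuristic):
    """Steps 450-470: apply the heuristic only to devices with a trigger."""
    by_device = {}
    for item in items:
        by_device.setdefault(item["device"], []).append(item)
    results = {}
    for device, device_items in by_device.items():
        # Step 450: check for a trigger; step 460: apply the heuristic.
        if any(is_trigger(i) for i in device_items):
            results[device] = heuristic(device_items)
    # Step 470: output the results for triggered devices only.
    return results
```

Gating on a trigger keeps the heuristic from running on every device continuously, which matters when one unit serves many devices.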
  • Figure 7 is a block diagram illustrating functionality in an example local hardware diagnostic unit 6, including inputs to and outputs from the local hardware diagnostic unit 6.
  • the local hardware diagnostic unit 6 requests a node list from configuration data, for example from a CM database in an OSS. This list may be requested periodically to reflect changes in network configuration and newly installed hardware devices.
  • the local hardware diagnostic unit 6 requests a heuristic corresponding to the hardware devices in the list from a hardware fault management unit 8.
  • the local hardware diagnostic unit 6 receives from the hardware fault management unit 8 one or more heuristics 702, which may include descriptors, models, classifiers etc., together with one or more extraction tools, which may include executables and filters, that correspond to the type of devices listed in the node list.
  • the heuristics and extraction tools may be communicated to the local hardware diagnostic unit 6 over the Internet, for example with approval from a network administrator, or may be loaded manually with a specific software update file.
  • the received heuristics and extraction tools serve as a knowledge base for the local hardware diagnostic unit 6 to identify hardware faults amongst the devices with which it is associated.
  • the heuristics and extraction tools are used in conjunction with data for the individual hardware devices to build a set of recommendations regarding the probable hardware health of the devices.
  • the local hardware diagnostic unit 6 applies filters 704, 706, 708 received from the hardware fault management unit 8 to select appropriate events, counters, alarms and other data from all the events, counters, alarms and other data received from the OSS/EMS and optionally from the individual hardware devices. By filtering out data not required for the received heuristics, the local hardware diagnostic unit 6 ensures that it is not overloaded.
  • Received executables are then applied to obtain data in a form suitable for processing by the received heuristics.
  • the results generated may be presented to a network administrator, engineer, technician or other user of the local hardware diagnostic unit 6.
  • the results may fall largely into three categories: Hardware potentially failed; Hardware potentially not failed; Unable to distinguish whether hardware is faulty or non-faulty.
  • the basis for the results together with recommended actions may also be presented.
  • alternative causes of the perceived problem triggering the analysis may be presented together with appropriate actions if available.
  • These actions may be extracted from the intelligence obtained through data analysis at the hardware fault management unit. For example, a combination of data may suggest that a hardware device does not have a hardware problem but a perceived problem is more likely to be caused by a software incompatibility.
  • the associated recommendation may therefore be updating or verifying software versions on the device and potentially other connected devices.
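The three result categories and an accompanying recommendation can be sketched as a mapping from a heuristic's estimated fault probability. The two thresholds, the probability input and the recommendation texts are hypothetical choices for illustration; the source does not specify how a heuristic's output is mapped to the categories.

```python
def classify(p_fault, upper=0.8, lower=0.2):
    """Map a fault probability to the three result categories.

    upper and lower are hypothetical thresholds, not values from the source.
    """
    if not 0.0 <= p_fault <= 1.0:
        raise ValueError("p_fault must be a probability")
    if p_fault >= upper:
        return "Hardware potentially failed"
    if p_fault <= lower:
        return "Hardware potentially not failed"
    return "Unable to distinguish whether hardware is faulty or non-faulty"


def recommend(p_fault, suspected_cause=None):
    """Pair the category with an illustrative recommended action."""
    category = classify(p_fault)
    if category == "Hardware potentially failed":
        action = "Schedule replacement and return to repair centre"
    elif suspected_cause == "software incompatibility":
        action = "Update or verify software versions on the device and connected devices"
    else:
        action = "Collect further data before deciding"
    return category, action


print(recommend(0.05, "software incompatibility"))
```

The middle band deliberately yields "unable to distinguish" rather than forcing a decision, so that uncertain cases are not sent for costly offsite analysis by default.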
  • the local hardware diagnostic unit 6 may be implemented in various different ways, and the complexity of the communication between the local hardware diagnostic unit 6 and the individual hardware devices may vary with the practical implementation of the unit.
  • Figure 8 is a block diagram illustrating alternative implementations of local hardware diagnostic unit 6.
  • the local hardware diagnostic unit 6 may be fully implemented within a monitoring element such as an EMS/OSS system 4. In such examples, all logic pertaining to recommendations for particular devices may be contained in the EMS/OSS 4. Communication between the local hardware diagnostic unit 6 and the hardware devices 2 may be conducted via the EMS/OSS.
  • the local hardware diagnostic unit 6 may be shared between the EMS/OSS 4 and the hardware device.
  • One or more sub-modules of the local hardware diagnostic unit 6 may be included as part of the EMS/OSS 4, while another sub-module or sub-modules reside on the individual hardware device 2. This configuration may be appropriate if not all data is available through the interface to the EMS/OSS, some data being necessarily obtained directly from the device. Alternatively this configuration may be adopted for reasons of efficiency.
  • incoming communication to local hardware diagnostic unit sub-module(s) in the EMS/OSS 4 may include data 802 arriving directly from the hardware device and data arriving from the local hardware diagnostic unit sub-module(s) in the hardware device.
  • the local hardware diagnostic unit 6 may be fully implemented within a hardware device. This option may be more appropriate for example if the EMS/OSS execution environment is very limited.
  • the EMS/OSS 4 has the role of forwarding any received heuristics 802 to the hardware device, where the actual recommendations are generated within the local hardware diagnostic unit 6.
  • the local hardware diagnostic unit 6 may be implemented as a part of the firmware of the hardware device.
  • Figure 9 is a block diagram illustrating in greater detail connectivity of the first example configuration discussed above for a local hardware diagnostic unit 6.
  • the local hardware diagnostic unit 6 is fully implemented within the EMS/OSS 4 and communicates via the EMS/OSS 4 with the hardware fault management unit 8 and the individual hardware devices 2.
  • Figures 10 and 11 illustrate functional units in a hardware fault management unit 500 and local hardware diagnostic unit 600, which units may execute the steps of the methods 100, 200, or 300, 400 respectively, for example according to computer readable instructions received from a computer program. It will be understood that the units illustrated in Figures 10 and 11 are functional units, and may be realised in any appropriate combination of hardware and/or software.
  • the hardware fault management unit 500 comprises a data unit 510, an analysis unit 520, a heuristic unit 530, which may comprise an automated sub-unit 530a and a refining sub-unit 530b, and a communication unit 550.
  • the hardware fault management unit 500 may also comprise an extraction unit 540, a performance unit 560 and a feedback unit 570.
  • the data unit 510 is configured to collect operational data for the hardware devices.
  • the analysis unit 520 is configured to analyse the collected data, for example by applying at least one of a data mining, machine learning or analytics algorithm to the data.
  • the heuristic unit 530 is configured to generate, based on analysis conducted by the analysis unit 520, a heuristic for hardware device fault identification.
  • the automated sub-unit 530a is configured to generate a first version of a heuristic for hardware device fault identification and the refining sub-unit 530b is configured to refine the generated first version heuristic, for example incorporating input from a human expert.
  • the communication unit 550 is configured to communicate the generated heuristic to at least one of the local hardware diagnostic units.
  • the extraction unit 540 is configured to provide an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit.
  • the performance unit 560 is configured to extract, from the analysed data of the analysis unit 520, performance statistics for network feedback messages and the feedback unit 570 is configured to supply the extracted performance statistics to a network feedback tool for refinement of the network feedback messages.
  • the local hardware diagnostic unit 600 comprises an identifying unit 610, a heuristic unit 630, a data unit 640, an analysis unit 660, which may comprise a trigger sub-unit 650, and an output unit 670.
  • the identifying unit 610 is configured to identify hardware devices associated with the local hardware diagnostic unit 600.
  • the heuristic unit 630 is configured to receive, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices.
  • the data unit 640 is configured to receive at least one of device configuration data, device health data or device performance data for the identified devices.
  • the analysis unit 660 is configured to apply the received heuristic to the received data and the output unit 670 is configured to output a result of the applied heuristic.
  • the trigger sub-unit 650 is configured to check the data received by the data unit 640 for a trigger, and the analysis unit 660 may be configured to apply the received heuristic to the received data on detection of the trigger by the trigger sub-unit 650.
  • Figures 12 and 13 illustrate alternative examples of hardware fault management unit 700 and local hardware diagnostic unit 800, which may implement the above discussed functionality for example on receipt of suitable instructions from a computer program.
  • the hardware fault management unit 700 comprises a processor 701 and a memory 702.
  • the memory 702 contains instructions executable by the processor 701 such that the hardware fault management unit 700 is operative to conduct the steps of the method 100 or the method 200.
  • the local hardware diagnostic unit 800 also comprises a processor 801 and a memory 802.
  • the memory 802 contains instructions executable by the processor 801 such that the local hardware diagnostic unit 800 is operative to conduct the steps of the method 300 or the method 400.
  • aspects of the present invention thus facilitate the conducting, at a central location, of advanced analytics functions and algorithms on a wide variety of data sources related to hardware faults and suspected faults.
  • This analysis enables the creation of heuristics such as descriptors, models, patterns, classifiers etc that model, describe or otherwise distinguish true hardware faults from false hardware faults in a network.
  • the heuristics may be associated with one or more extraction tools enabling the generation of suitable machine readable input data for the heuristics from data available to a local hardware diagnostic apparatus.
  • the heuristics, and extraction tools if appropriate, are then disseminated to local hardware diagnostic units distributed throughout the network, enabling such units to perform, on a local level, fault diagnosis whose logic is rooted in the big data analysis conducted in a central location.
  • the local hardware diagnostic units may be deployed in hardware devices such as a radio base station in a telecommunication network, or may be deployed in management systems such as an EMS/OSS in a telecommunication network.
  • aspects of the present invention provide considerable advantages over known methods for hardware fault identification and management.
  • aspects of the present invention provide improvements in the decision making process for removing hardware from field deployment, distinguishing with greater accuracy actual hardware failures from perceived hardware failures whose root cause is not related to the hardware itself.
  • true hardware faults can be distinguished with greater accuracy, and faults arising from network configurations or other operational aspects which could not be envisaged at the time of hardware design and manufacture can be accurately identified.
  • Aspects of the present invention thus enable fault identification on a local level to be conducted on the basis of a global system view, taking a holistic approach to fault identification and diagnosis. More informed decisions can be taken on the basis of this analysis as to whether to remove a hardware device which signals a problem for return to the manufacturer or a repair centre, or to perform some other type of local action, for example updating software or firmware.
  • the process of replacing faulty hardware is long, involving the receipt of alarms, taking a decision to replace equipment, a site visit to remove the equipment, shipping to a repair centre or manufacturer, offsite analysis and return. In the case of a falsely identified faulty device, this long process is also unnecessary.
  • false hardware faults can be correctly identified, and the underlying problem addressed, usually by taking much simpler, quicker and cheaper actions such as software upgrades etc.
  • the central analysis of a wide range of data also enables the extraction of performance statistics and other performance insights regarding network alarms, error messages and other log messages. On the basis of such insight, the logic underlying such alarms and messages may be refined to provide improved accuracy.
  • the methods of the present invention may be implemented in hardware, or as software modules running on one or more processors. The methods may also be carried out according to the instructions of a computer program, and the present invention also provides a computer readable medium having stored thereon a program for carrying out any of the methods described herein.
  • a computer program embodying the invention may be stored on a computer-readable medium, or it could, for example, be in the form of a signal such as a downloadable data signal provided from an Internet website, or it could be in any other form.

Abstract

A method (100) for managing hardware fault identification in a network is disclosed, the network comprising a plurality of hardware devices (2), each hardware device associated with a local hardware diagnostic unit (6). The method comprises collecting operational data for the hardware devices (110), analysing the collected data (120), generating, based on the analysis, a heuristic for hardware device fault identification (130), and communicating the generated heuristic to at least one of the local hardware diagnostic units (150). Also disclosed is a method (300) for identifying hardware faults in a network, the method comprising identifying hardware devices associated with a local hardware diagnostic unit (310), receiving, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices (330) and receiving at least one of device configuration data, device health data or device performance data for the identified devices (340). The method further comprises applying the received heuristic to the received data (360), and outputting a result of the applied heuristic (370). A hardware fault management unit (8, 500, 700), a local hardware diagnostic unit (6, 600, 800) and a computer program product are also disclosed.

Description

Hardware Fault Identification Management In A Network
Technical Field
The present invention relates to a method and apparatus for managing hardware fault identification in a network. The present invention also relates to a computer program product configured, when run on a computer, to carry out a method for managing hardware fault identification in a network.
Background
Networks, including for example telecommunication networks, comprise a range of different hardware devices distributed throughout the coverage area of the network. Faults or failures of such hardware devices have a significant impact on the health and operation of the network and on the provision of network services. The identification and resolution of hardware faults in a network is therefore important for the smooth operation of the network.
A common approach to fault identification is the use of real-time fault monitoring systems, including for example the fault management function of network management systems such as Operations Support Systems (OSSs). Fault monitoring systems use a range of different mechanisms for identifying hardware faults, including receiving Simple Network Management Protocol (SNMP) trap messages from devices, receiving self-diagnostic information from devices, remote monitoring of thresholds, receiving syslog messages etc. The fault monitoring system or a network management system notifies a network operator when the fault monitoring system detects a fault.
Most telecommunication networks contain diagnostic components which may assist both in fault identification and in diagnosing faults detected by other means. Diagnostic components are able to run hardware checks to verify the health of individual hardware devices, including for example power-on self-tests, out-of-service tests, in-service monitoring functions etc. Diagnostic components may be built into a hardware board or implemented as a separate component. When a fault is identified in the network, diagnostic components are used to help determine the cause of the fault, allowing appropriate remedial action to be taken. Most often, remedial action includes removing and replacing the faulty hardware. The faulty hardware may then be sent to a repair centre or returned to the manufacturer for additional diagnostics and repair. Data associated with the faulty hardware, including alarm and hardware logs, may be sent with the faulty hardware to assist in fault analysis and diagnostics.
As telecommunication and other networks become ever larger and more complex, identifying the root cause of a network problem becomes more difficult. A perceived fault identified within the network may be caused by a hardware component fault, a software component fault, a system configuration fault or a combination of different faults and incompatibilities of the interconnected hardware components. Existing fault monitoring systems are unable to account for this vast range of possible fault causes.
Most network diagnostic components focus on individual hardware devices, the health of which can be monitored and assessed using known procedures that are relatively simple to implement. The scale and complexity of the other potentially interconnected issues which could be at the root of a perceived network fault is beyond the compass of such dedicated diagnostic components. Even network management systems are unable to account for the range of possible issues which may occur during the lifetime of a component. The logic used by network management systems to generate hardware related fault indications is designed and implemented at the design and development phase of hardware components. During such design and development, and in the testing phases for the component and the fault identification logic, it is not possible to envisage all potential usage scenarios, operational configurations and interconnections for the device. Increasing pressure to reduce the time-to-market for new components, and the rapid and constant evolution of network architecture and configurations render the task of anticipating hardware issues and designing suitable fault identification logic extremely difficult.
The result of the above discussed limitations in fault identification processes is that a considerable amount of correctly functioning hardware is erroneously identified as faulty; in the absence of systems which can assess the other potential fault causes, a hardware fault is assumed. Such hardware will be replaced and sent for diagnosis and repair by network operators, only to be returned when no hardware fault can be identified. This replacement process may be subject to costly delays for the network operator, in addition to the up-front costs of sending hardware for diagnostics and repair. Further, in the case of falsely identified faulty hardware, the root cause of the network problem may remain unidentified, meaning the fault will occur repeatedly despite the costly replacement procedures.
Summary
It is an aim of the present invention to provide a method, apparatus and computer readable medium which at least partially address one or more of the challenges discussed above. According to a first aspect of the present invention, there is provided a method, in a hardware fault management unit, for managing hardware fault identification in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit, the method comprising collecting operational data for the hardware devices, analysing the collected data, generating, based on the analysis, a heuristic for hardware device fault identification, and communicating the generated heuristic to at least one of the local hardware diagnostic units.
The hardware devices may be distributed throughout the network. In some examples, the hardware devices may comprise any network hardware device which may be installed or positioned in any location over the geographic coverage area of the network. Local hardware diagnostic units may also be distributed throughout the network, installed or positioned in any location over the geographic coverage area of the network. A local hardware diagnostic unit may be associated with a single hardware device, for example if the local hardware diagnostic unit is implemented in the device. Alternatively, a single local hardware diagnostic unit may be associated with multiple hardware devices, for example if the local hardware diagnostic unit is implemented in a device management unit such as an OSS or Element Management System (EMS).
According to aspects of the present invention, operational data is collected for a plurality of hardware devices. According to examples of the invention, such data may be collected from the devices themselves and/or from management systems such as an OSS or EMS. A centralised analysis is then conducted in a hardware fault management unit permitting the generation of a heuristic for hardware device fault identification. This heuristic is then disseminated to at least one local hardware diagnostic unit. Aspects of the present invention thus facilitate fault diagnosis in local hardware diagnostic units on the basis of centralised analysis of data collected for a plurality of hardware devices. According to the present specification, a heuristic comprises an experience based technique, and may for example include behaviour patterns, configuration patterns, device models, behaviour descriptors, classifiers etc.
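By way of illustration only, the four steps of the first aspect (collect, analyse, generate, communicate) can be pictured as a small Python pipeline. This is a minimal sketch, assuming a deliberately simple configuration-pattern heuristic: for each operational configuration, whether the share of returned devices with a confirmed hardware fault reaches a threshold. All names (`RepairRecord`, `LocalUnit`, `analyse`, `generate_heuristic`, `manage`) and the 0.5 threshold are hypothetical and are not taken from the specification.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RepairRecord:
    """Hypothetical collected record: a device configuration and its repair outcome."""
    config_key: str      # e.g. model and firmware version combined into one string
    fault_found: bool    # True if offsite analysis confirmed a hardware fault


@dataclass
class LocalUnit:
    """Hypothetical stand-in for a local hardware diagnostic unit."""
    name: str
    heuristics: dict = field(default_factory=dict)

    def receive(self, heuristic: dict) -> None:
        self.heuristics.update(heuristic)


def analyse(records):
    """Count confirmed faults and total returns per configuration."""
    counts = defaultdict(lambda: [0, 0])  # config_key -> [faults, returns]
    for r in records:
        counts[r.config_key][0] += int(r.fault_found)
        counts[r.config_key][1] += 1
    return counts


def generate_heuristic(counts, threshold=0.5):
    """Map each configuration to True if perceived faults are likely genuine hardware faults."""
    return {key: faults / total >= threshold for key, (faults, total) in counts.items()}


def manage(records, local_units):
    """Analyse the collected records, generate the heuristic and communicate it."""
    heuristic = generate_heuristic(analyse(records))
    for unit in local_units:
        unit.receive(heuristic)
    return heuristic
```

For example, two confirmed faults out of two returns for "A-fw1" yield `True`, while one out of three for "A-fw2" yields `False`, suggesting that perceived faults on the second configuration merit investigation of non-hardware causes first.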
In some examples of the invention, the operational data collected for the hardware devices may be pre-processed before it is analysed.
According to some examples, analysing the collected data may comprise applying at least one of a data mining, machine learning or analytics algorithm to the collected data. Such algorithms may for example include text mining algorithms, clustering algorithms, pattern recognition algorithms, classification algorithms etc.
According to some examples, operational data may comprise at least one of operational configuration data for the hardware devices, or results data for offsite hardware fault analysis conducted on the hardware devices or on related hardware devices. The hardware devices or related hardware devices on which fault analysis has been conducted may include devices that were returned to a manufacturer or sent to a repair centre. Such devices may include devices which were found to have a hardware fault and devices suspected of having a hardware fault but in which no hardware fault was identified.
In some examples, offsite hardware fault analysis may include any hardware fault analysis conducted in a location different from the deployment location for the device. Such locations may include for example repair and analysis centres, device manufacturing sites or network locations different from the deployment location, for example a network location at which a hardware expert is located. Results data for hardware fault analysis may further comprise device operational data available for the analysed hardware device prior to hardware fault analysis, and may also comprise conclusions drawn from the hardware fault analysis. In some examples, related hardware devices may comprise devices of the same make and/or model or may comprise devices having the same operational configuration. According to some examples, operational configuration data may comprise at least one of neighbouring device connectivity, operational software information, operational firmware information, or device operational purpose. Software and firmware information may for example include developer information, software identification and version number. Device operational purpose may for example include services delivered over the device according to a particular operational configuration.
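The notion of "related hardware devices" (same make and/or model, or same operational configuration) can be made concrete with a small data model. The following Python sketch is illustrative only; the classes `OperationalConfiguration` and `AnalysedDevice` and the function `related_devices` are hypothetical names, and the chosen fields simply mirror the kinds of operational configuration data listed above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class OperationalConfiguration:
    """Hypothetical container for operational configuration data."""
    neighbours: FrozenSet[str]   # identifiers of connected neighbouring devices
    software: str                # operational software, e.g. developer, name and version
    firmware: str                # operational firmware version
    purpose: str                 # service delivered over the device


@dataclass(frozen=True)
class AnalysedDevice:
    """Hypothetical device record with make/model, configuration and analysis outcome."""
    device_id: str
    model: str
    config: OperationalConfiguration
    fault_found: bool = False


def related_devices(target: AnalysedDevice, analysed: List[AnalysedDevice]) -> List[AnalysedDevice]:
    """Return other analysed devices of the same model or with the same configuration."""
    return [
        d for d in analysed
        if d.device_id != target.device_id
        and (d.model == target.model or d.config == target.config)
    ]
```

Results data for the devices returned by `related_devices` could then inform the assessment of a device that has not itself been sent for offsite analysis.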
According to some examples, operational data may comprise at least one of device configuration data, device health data, device performance data, or service performance data for services provided over the device.
Device configuration data may for example include, for each device, what the device is, a device identifier, device manufacturer, device software/firmware installed at manufacture etc. Device health data may for example include results of self-diagnostic tests conducted by the device, and device logs. Device performance data may for example include performance information for the device and messages generated by device management units such as an OSS or EMS. Service performance data for services provided over the device may for example include service performance records or customer/engineer feedback.
In some examples, the method may further comprise providing an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit, and communicating the extraction tool to the at least one of the local hardware diagnostic units with the heuristic for hardware device fault identification.
In some examples, the extraction tool may comprise a software component and may comprise an executable file or executable program (an executable) or a filter. The extraction tool may enable a local hardware diagnostic unit to extract from all the data available to it, that data which is required as an input to the received heuristic. Alternatively or in addition, the extraction tool may enable the transformation of data from a form in which it is received at the local hardware diagnostic unit into a form suitable for processing by the received heuristic. According to some examples, providing the extraction tool may comprise a combination of automated processes and input from a human expert. Templates and/or libraries of extraction tools may be tested against the generated heuristic and refined to provide an extraction tool that outputs data of a kind and in a format suitable for input to the generated heuristic.
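An extraction tool of the filter kind described above both selects the data a heuristic needs and transforms it into the form the heuristic expects. The Python sketch below is a minimal illustration, assuming a hypothetical raw log format of `timestamp|device_id|code|free text` at the local hardware diagnostic unit; `make_extraction_tool`, `FIELDS` and the message code used in the example are all assumptions, not details from the specification.

```python
from typing import Dict, Iterable, List

# Hypothetical raw format seen by a local unit: "timestamp|device_id|code|free text"
FIELDS = ("timestamp", "device_id", "code", "text")


def make_extraction_tool(required_fields, relevant_codes):
    """Build a filter-and-transform function tailored to one heuristic's inputs."""
    required = tuple(required_fields)
    codes = frozenset(relevant_codes)

    def extract(lines: Iterable[str]) -> List[Dict[str, str]]:
        out = []
        for line in lines:
            # Split into at most four parts so that '|' inside the free text survives.
            parts = line.rstrip("\n").split("|", len(FIELDS) - 1)
            if len(parts) != len(FIELDS):
                continue  # skip malformed lines rather than guessing their content
            record = dict(zip(FIELDS, parts))
            if record["code"] in codes:                        # filter: keep relevant messages
                out.append({f: record[f] for f in required})   # transform: keep needed fields
        return out

    return extract
```

Building the tool from a small specification (required fields, relevant codes) is one way a library of extraction templates could be tested against a generated heuristic and refined.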
According to some examples, the method may further comprise extracting, from the analysed data, performance statistics for network feedback messages and supplying the extracted performance statistics to a network feedback tool for refinement of the network feedback messages.
Network feedback messages may in some examples comprise network management or other alarms, hardware messages or log messages which may be sent and received as part of network or device health or performance data. A network feedback tool may comprise any entity or element that manages the underlying logic for such messages and alarms, which logic, according to examples of the present invention, may be refined and improved on the basis of the extracted performance statistics for the network feedback messages. According to some examples, performance statistics may be accuracy statistics for alarms, log messages etc. and/or may be other performance information, for example demonstrating under what combinations of circumstances certain alarms or messages have greater or lesser accuracy.
According to some examples, generating a heuristic for hardware device fault identification may comprise generating a first version of a heuristic for hardware device fault identification and refining the generated first version heuristic. In some examples, the first version heuristic may be generated in an entirely automated manner on the basis of the analysed collected data. In some examples, the first version heuristic may be refined in an automated manner and/or on the basis of input from a human expert.
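One plausible form of the accuracy statistics for alarms mentioned above is per-alarm precision over returned devices: of the devices that raised a given alarm before removal, the fraction in which offsite analysis confirmed a hardware fault. The Python sketch below assumes exactly that definition; `alarm_precision` and the alarm names in the example are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def alarm_precision(returns: Iterable[Tuple[Set[str], bool]]) -> Dict[str, float]:
    """Per-alarm precision over returned devices.

    Each item is (alarms raised before removal, fault confirmed by offsite analysis).
    Precision = confirmed faults among devices raising the alarm / devices raising it.
    """
    raised = Counter()
    confirmed = Counter()
    for alarms, fault_found in returns:
        for alarm in alarms:
            raised[alarm] += 1
            if fault_found:
                confirmed[alarm] += 1
    return {alarm: confirmed[alarm] / raised[alarm] for alarm in raised}
```

With returns `({"PSU_FAIL"}, True)`, `({"PSU_FAIL", "LINK_DOWN"}, False)`, `({"LINK_DOWN"}, False)` and `({"PSU_FAIL"}, True)`, the precision is 2/3 for `PSU_FAIL` and 0.0 for `LINK_DOWN`; an alarm with such low precision would be a candidate for refinement by the network feedback tool.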
According to some examples, the at least one local hardware diagnostic unit may be at least partially implemented in a device management unit. A device management unit may for example comprise an EMS or an OSS. According to some examples, the at least one local hardware diagnostic unit may be at least partially implemented in a hardware device. According to examples of the invention a local hardware diagnostic unit may thus be implemented entirely within a device management unit such as an EMS or OSS, entirely within a hardware device or may be shared between the device management unit and the hardware device.
According to some examples, the network comprises a telecommunication network. According to another aspect of the present invention, there is provided a method, in a local hardware diagnostic unit, for identifying hardware faults in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit. The method comprises identifying hardware devices associated with the local hardware diagnostic unit, receiving, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices, receiving at least one of device configuration data, device health data or device performance data for the identified devices, applying the received heuristic to the received data, and outputting a result of the applied heuristic.
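The steps of the method in the local hardware diagnostic unit (identify devices, receive heuristics, receive device data, apply, output) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a heuristic is modelled as a plain function from a device's data to a verdict string, heuristics are looked up by device model, and `Device`, `FaultManagementStub` and `diagnose` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a heuristic is modelled as a function from device data to a verdict string.
Heuristic = Callable[[dict], str]


@dataclass
class Device:
    device_id: str
    model: str


class FaultManagementStub:
    """Hypothetical stand-in for the central hardware fault management unit."""

    def __init__(self, by_model: Dict[str, Heuristic]):
        self.by_model = by_model

    def heuristic_for(self, model: str) -> Heuristic:
        return self.by_model[model]


def diagnose(managed: List[Device], central: FaultManagementStub,
             device_data: Dict[str, dict]) -> Dict[str, str]:
    """Receive heuristics for the identified devices, apply them and output results."""
    # Receive one heuristic per device model among the identified devices.
    heuristics = {d.model: central.heuristic_for(d.model) for d in managed}
    results = {}
    for d in managed:
        data = device_data.get(d.device_id, {})           # configuration/health/performance data
        results[d.device_id] = heuristics[d.model](data)  # apply the received heuristic
    return results                                        # output the results
```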
In some examples, the result of the applied heuristic may be an indication of the likely hardware health status of the device and may be accompanied by an explanation of the basis for the indication and a recommendation as to appropriate action to be taken in light of the indicated likely health status. For example, in the case of an indication of likely hardware failure, the recommendation may be device replacement. In the case of an indication of likely correct hardware functioning, and/or if replacement is not necessary, the recommendation may include additional diagnostic or onsite repair options. In some examples, the method may further comprise requesting the heuristic corresponding to the identified devices from the hardware fault management unit.
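A result of this kind (likely health status, explanation of its basis, recommended action) maps naturally onto a small record type. The Python sketch below is illustrative only; it assumes the heuristic produces a fault probability, and `DiagnosticResult`, `build_result` and the 0.7 threshold are hypothetical choices rather than details from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticResult:
    """Hypothetical output record: likely status, its basis and a recommended action."""
    likely_hardware_fault: bool
    explanation: str
    recommendation: str


def build_result(fault_probability: float, basis: str, threshold: float = 0.7) -> DiagnosticResult:
    """Turn a heuristic's fault probability into an actionable result (threshold assumed)."""
    if fault_probability >= threshold:
        return DiagnosticResult(True, basis, "replace device")
    return DiagnosticResult(False, basis, "run additional diagnostics or attempt onsite repair")
```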
According to some examples, the method may further comprise checking the received at least one of device configuration data, device health data or device performance data for a trigger and applying the received heuristic to the received data on detection of the trigger. Thus, according to some examples, the received heuristic may be applied only when indicated by certain triggers within the received device data. For example, on receipt of an error message or other message indicating a perceived hardware fault, the local hardware diagnostic unit may run the received heuristic to determine whether the perceived hardware fault is likely to be caused by an actual hardware fault or is likely to be caused by some other issue resulting from the operational configuration of the network.
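Trigger-based application can be expressed as a guard around the heuristic: the received data is scanned for an item indicating a perceived hardware fault, and only then is the heuristic run. In the Python sketch below, the message codes in `TRIGGER_CODES` and the function `run_on_trigger` are hypothetical.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical message codes indicating a perceived hardware fault.
TRIGGER_CODES = frozenset({"HW_FAULT", "BOARD_FAIL"})


def run_on_trigger(messages: Iterable[dict],
                   heuristic: Callable[[List[dict]], str]) -> Optional[str]:
    """Apply the heuristic to the received messages only if one of them is a trigger."""
    batch = list(messages)
    if any(m.get("code") in TRIGGER_CODES for m in batch):
        return heuristic(batch)
    return None  # no perceived hardware fault, so the heuristic is not run
```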
According to some examples, the trigger may comprise an item of received data indicating a perceived hardware fault. According to some examples, the local hardware diagnostic unit may be at least partially implemented in a device management unit.
According to some examples, the local hardware diagnostic unit may be at least partially implemented in a hardware device.
According to some examples, the network may comprise a telecommunication network.
According to another aspect of the present invention, there is provided a computer program product configured, when run on a computer, to carry out a method according to any one of the preceding aspects of the invention.
According to another aspect of the present invention, there is provided a hardware fault management unit configured for managing hardware fault identification in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit. The hardware fault management unit comprises a data unit configured to collect operational data for the hardware devices, an analysis unit configured to analyse the collected data, a heuristic unit configured to generate, based on analysis conducted by the analysis unit, a heuristic for hardware device fault identification, and a communication unit configured to communicate the generated heuristic to at least one of the local hardware diagnostic units.
According to some examples, the analysis unit may be configured to apply at least one of a data mining, machine learning or analytics algorithm to the data collected by the data unit.
According to some examples, the hardware fault management unit may further comprise an extraction unit configured to provide an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit, and the communication unit may be further configured to communicate the extraction tool to the at least one of the local hardware diagnostic units with the heuristic for hardware device fault identification.
According to some examples, the hardware fault management unit may further comprise a performance unit configured to extract, from the analysed data of the analysis unit, performance statistics for network feedback messages, and a feedback unit configured to supply the extracted performance statistics to a network feedback tool for refinement of the network feedback messages.
According to some examples, the heuristic unit may comprise an automated sub-unit, configured to generate a first version of a heuristic for hardware device fault identification; and a refining sub-unit configured to refine the generated first version heuristic. In some examples, the refining sub-unit may be managed by a human expert operator, whose input may be incorporated into the refined heuristic.
According to another aspect of the present invention, there is provided a local hardware diagnostic unit configured for identifying hardware faults in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit. The local hardware diagnostic unit comprises an identifying unit configured to identify hardware devices associated with the local hardware diagnostic unit, a heuristic unit configured to receive, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices, a data unit configured to receive at least one of device configuration data, device health data or device performance data for the identified devices, an analysis unit configured to apply the received heuristic to the received data, and an output unit configured to output a result of the applied heuristic.
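The split between the automated sub-unit and the refining sub-unit of the heuristic unit in the hardware fault management unit can be illustrated with two functions. The Python sketch below assumes a heuristic that maps a configuration key to an estimated fault probability, and assumes that refinement means discarding entries supported by too few returns and then applying expert overrides; `automated_first_version`, `refine` and `min_returns` are hypothetical names.

```python
from typing import Dict, Mapping, Tuple

# Assumption: configuration key -> estimated probability of a genuine hardware fault.
Heuristic = Dict[str, float]


def automated_first_version(counts: Mapping[str, Tuple[int, int]]) -> Heuristic:
    """Automated sub-unit: fault probability per configuration from (faults, returns)."""
    return {key: faults / total for key, (faults, total) in counts.items() if total > 0}


def refine(first: Heuristic, expert_overrides: Mapping[str, float],
           min_returns: int, counts: Mapping[str, Tuple[int, int]]) -> Heuristic:
    """Refining sub-unit: drop poorly supported entries, then apply expert overrides."""
    refined = {k: p for k, p in first.items() if counts[k][1] >= min_returns}
    refined.update(expert_overrides)  # expert input takes precedence over automated estimates
    return refined
```

In this sketch an expert's value replaces the automated estimate outright; a weighted combination of the two would be an equally plausible design.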
According to some examples, the analysis unit may further comprise a trigger sub-unit configured to check the data received by the data unit for a trigger, and the analysis unit may be configured to apply the received heuristic to the received data on detection of the trigger by the trigger sub-unit.
According to some examples, the local hardware diagnostic unit may be at least partially implemented in a device management unit. According to some examples, the local hardware diagnostic unit may be at least partially implemented in a hardware device.
According to another aspect of the present invention, there is provided a hardware fault management unit configured for managing hardware fault identification in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit. The hardware fault management unit comprises a processor and a memory, the memory containing instructions executable by the processor whereby the hardware fault management unit is operative to collect operational data for the hardware devices, analyse the collected data, generate, based on the analysis, a heuristic for hardware device fault identification, and communicate the generated heuristic to at least one of the local hardware diagnostic units.
According to some examples, the hardware fault management unit may be further operative to apply at least one of a data mining, machine learning or analytic algorithm to the collected data.
According to some examples, operational data may comprise at least one of operational configuration data for the hardware devices, or results data for offsite hardware fault analysis conducted on the hardware devices or on related hardware devices.
According to some examples, operational configuration data may comprise at least one of neighbouring device connectivity, operational software information, operational firmware information, or device operational purpose. In some examples, device operational purpose may include service provided over the device.
According to some examples, operational data may comprise at least one of device configuration data, device health data, device performance data, or service performance data for services provided over the device.
According to some examples, the hardware fault management unit may be further operative to provide an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit, and communicate the extraction tool to at least one of the local hardware diagnostic units with the heuristic for hardware device fault identification.
According to some examples, the hardware fault management unit may be further operative to extract, from the analysed data, performance statistics for network feedback messages, and supply the extracted performance statistics to a network feedback tool for refinement of the network feedback messages. According to some examples, the hardware fault management unit may be further operative to generate a first version of a heuristic for hardware device fault identification, and refine the generated first version heuristic. According to some examples, the network comprises a telecommunication network.
According to another aspect of the present invention, there is provided a local hardware diagnostic unit configured for identifying hardware faults in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit. The local hardware diagnostic unit comprises a processor and a memory, the memory containing instructions executable by the processor whereby the local hardware diagnostic unit is operative to identify hardware devices associated with the local hardware diagnostic unit, receive, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices, receive at least one of device configuration data, device health data or device performance data for the identified devices, apply the received heuristic to the received data, and output a result of the applied heuristic.
According to some examples, the local hardware diagnostic unit may be further operative to check the received at least one of device configuration data, device health data or device performance data for a trigger, and apply the received heuristic to the received data on detection of the trigger.
According to some examples, the trigger may comprise an item of received data indicating a perceived hardware fault.
According to some examples, the local hardware diagnostic unit may be at least partially implemented in a device management unit. According to some examples, the local hardware diagnostic unit may be at least partially implemented in a hardware device.
According to some examples, the network comprises a telecommunication network.
Brief description of the drawings
For a better understanding of the present invention, and to show more clearly how it may be carried into effect, reference will now be made, by way of example, to the following drawings in which:
Figure 1 illustrates elements in a network facilitating fault identification management;
Figure 2 is a flow chart illustrating process steps in a method for managing fault identification in a network;
Figure 3 is a flow chart illustrating process steps in another example of a method for managing fault identification in a network;
Figure 4 is a block diagram illustrating functionality in a hardware fault management unit;
Figure 5 is a flow chart illustrating process steps in a method for identifying hardware faults in a network;
Figure 6 is a flow chart illustrating process steps in another example of a method for identifying hardware faults in a network;
Figure 7 is a block diagram illustrating functionality in a local hardware diagnostic unit;
Figure 8 is a block diagram illustrating alternative implementations of a local hardware diagnostic unit;
Figure 9 is a block diagram illustrating connectivity of an example local hardware diagnostic unit;
Figure 10 is a block diagram illustrating functional units in a hardware fault management unit;
Figure 11 is a block diagram illustrating functional units in a local hardware diagnostic unit;
Figure 12 is a block diagram illustrating functional units in another example of a hardware fault management unit;
Figure 13 is a block diagram illustrating functional units in another example of a local hardware diagnostic unit.
Detailed Description
Aspects of the present invention propose a distributed, analytics driven solution to the challenge of hardware fault identification. Methods according to the present invention process a wide range of information which may be collected over the lifetime of hardware devices to develop diagnostic tools permitting accurate identification of hardware faults. Examples of information which may be taken into account in developing hardware diagnostic tools include perceived hardware faults, hardware node configurations, hardware and software versions, details of device interconnection, hardware device return data and test results for analysis of returned hardware devices suspected of being faulty. A hardware fault management unit applies a portfolio of analytics algorithms to identify key patterns and advanced diagnostic information relevant to hardware failure analytics. The output of the hardware fault management unit is communicated in the form of a diagnostic heuristic to a plurality of local hardware diagnostic units, each of which may then use the received heuristic to analyse suspected hardware faults. Being developed on the basis of a wide range of operational data, the received heuristic offers greater accuracy in diagnosing hardware faults than existing fault management systems.
Aspects of the present invention are implemented via a centralised hardware fault management unit and a plurality of local hardware diagnostic units. The hardware fault management unit applies a portfolio of analytic algorithms to the wide range of operational data collected for the plurality of hardware devices distributed throughout the network. The algorithms may include for example N-gram analysis (text-analytics based), clustering, classification and regression analysis. The hardware fault management unit may also include functionality enabling the receipt of input from a human expert in the form for example of additional analysis code or execution modules. The result of the analysis conducted in the hardware fault management unit is one or more heuristics including evidence based hardware diagnostic descriptors and models. The hardware fault management unit is coupled with a dissemination system, over which such heuristics may be communicated to the plurality of local hardware diagnostic units. The local hardware diagnostic units may then apply the received heuristics to distinguish between perceived hardware faults arising from a genuine hardware problem, and perceived hardware faults that are in fact caused by an issue unrelated to the hardware itself. The local hardware diagnostic units may be distributed throughout the network and may perform diagnostic procedures on hardware devices and network systems suspected of having a fault. When a new suspected fault is detected in the network, a local hardware diagnostic unit may use its received heuristics to determine the likelihood of the perceived fault being linked to the implicated hardware. This information may help operators and technicians to make informed decisions about how to resolve the perceived fault, including when to replace a hardware device and when to attempt to resolve the perceived fault via alternative actions.
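As a concrete illustration of the text-analytics based N-gram analysis mentioned above, the Python sketch below counts word bigrams in log messages and lists those seen in logs of devices with no hardware fault found but never in logs of devices with a confirmed fault. Whitespace tokenisation, lower-casing and the "never seen in fault logs" criterion are simplifying assumptions; `ngrams`, `ngram_counts` and `distinguishing` are hypothetical names.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-token sequences, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(messages: Iterable[str], n: int = 2) -> Counter:
    """Count word n-grams across lower-cased, whitespace-tokenised log messages."""
    counts = Counter()
    for msg in messages:
        counts.update(ngrams(msg.lower().split(), n))
    return counts


def distinguishing(fault_logs: Iterable[str], no_fault_logs: Iterable[str], n: int = 2):
    """N-grams present in no-fault-found logs but absent from confirmed-fault logs."""
    fault = ngram_counts(fault_logs, n)
    no_fault = ngram_counts(no_fault_logs, n)
    return sorted(g for g in no_fault if g not in fault)
```

A recurring bigram such as ("after", "upgrade") in no-fault-found logs could, for instance, point a human expert towards a software rather than a hardware cause.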
In addition to facilitating accurate fault diagnosis, the analysis of operational data by the hardware fault management unit may yield accuracy statistics and performance insights for network alarms, hardware messages and other log messages generated for example by network management elements such as an OSS or EMS. This accuracy and performance information may be fed back to network field engineers and also to design and development, testing and bug fixing units enabling such units to refine and improve the underlying logic and code that create the network alarms and messages.
An example overview of information flow and elements which may contribute to hardware fault identification and management according to aspects of the present invention is illustrated in Figure 1. An example telecommunication network comprises a plurality of hardware devices 2 distributed throughout the network in a range of geographic locations. The hardware devices 2 are managed via EMS/OSS elements 4. Each EMS/OSS 4 is associated with a local hardware diagnostic unit 6. A centralised hardware fault management unit 8 is in communication with the local hardware diagnostic units 6. A data warehouse 10, which may be within or outside the network, receives hardware device data from the EMS/OSS elements 4. The data warehouse 10 also receives information from a repair centre 12 treating returned hardware suspected of being faulty. A human expert at the repair centre may augment the data provided by the repair centre to the data warehouse. Data from the data warehouse 10 is accessible to the hardware fault management unit 8. As discussed above, the hardware fault management unit 8 applies analytics algorithms to data stored in the data warehouse 10, received from the EMS/OSS elements 4 and the repair centre 12. In alternative configurations, the hardware fault management unit 8 may receive certain data directly from the EMS/OSS elements 4, from the hardware devices 2 and/or from the repair centre 12, without the data passing through the intermediary of the data warehouse 10. On the basis of the received data, the hardware fault management unit 8 generates heuristics in the form of hardware diagnostic descriptors and models which are disseminated to the local hardware diagnostic units 6. The generated heuristics may be continuously updated and refined on the basis of newly arriving data.
For example, as hardware devices are sent to the repair centre 12, new data concerning perceived hardware faults is generated at the repair centre 12 and made available to the hardware fault management unit 8 via the data warehouse 10. This new data may be taken into account by the hardware fault management unit in refining the generated heuristics sent to the local hardware diagnostic units.
Referring to Figure 1, the repair centre or centres 12 collect data from returned hardware devices, including for example the nature of the hardware device, device performance while in the network, symptoms of unexpected behaviour, conclusions as to whether or not the device presented a hardware fault and, if a hardware fault was identified, the type of repair action taken. This data is passed to the data warehouse 10, which also receives network alarms and messages and relevant configuration details for hardware devices, including network topology and device interconnections, which may be obtained from the relevant EMS/OSS.
The hardware fault management unit 8 extracts value from the data available in the data warehouse 10 by conducting data analysis. Examples of data analysis which may be conducted on the data include term and term-combination frequency counts, clustering, regression, classification algorithms, identification of correlations and extraction of heuristics and patterns. This analysis enables the identification of relationships between the different types of data and records in the data warehouse, and hence the generation of one or more heuristics, or experience based techniques, for diagnosing perceived hardware faults. These heuristics may be enhanced by a human operator with knowledge in the domain. The heuristics generated by the hardware fault management unit 8 are then communicated to the local hardware diagnostic units 6, enabling the local hardware diagnostic units 6 to make recommendations based on the data present in the field and the received heuristics. Actions at the hardware fault management unit 8 and local hardware diagnostic units 6 are discussed in detail below.
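By way of illustration only, the term and term-combination frequency counts mentioned above might be computed over alarm or log text as in the following minimal Python sketch. The whitespace tokenisation and the function name `term_frequencies` are assumptions made for the example; a deployed analysis would use domain-specific parsing of the actual log formats.

```python
from collections import Counter
from itertools import combinations


def term_frequencies(records: list[str]) -> tuple[Counter, Counter]:
    """Count single terms and unordered term pairs across log/alarm records.

    Each record is one free-text line (e.g. an alarm or hardware log message).
    A term, or a term pair, is counted at most once per record.
    """
    term_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for record in records:
        # Naive tokenisation (an assumption): lowercase and split on whitespace.
        terms = sorted(set(record.lower().split()))
        term_counts.update(terms)
        # Sorted terms give a canonical order, so ("a", "b") == ("b", "a") cases merge.
        pair_counts.update(combinations(terms, 2))
    return term_counts, pair_counts
```

Pair counts of this kind could then feed the correlation and association-rule steps, for example by comparing how often a term pair occurs in records for devices later confirmed faulty against records for devices found to be functioning correctly.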
Figure 2 illustrates process steps in a method 100, conducted in a hardware fault management unit 8, for managing hardware fault identification in a network. As discussed above, the network may be a telecommunication network and comprises a plurality of hardware devices 2, each hardware device 2 associated with a local hardware diagnostic unit 6. In some examples, each hardware device 2 may be associated with a dedicated local hardware diagnostic unit 6. In other examples, a plurality of hardware devices 2 may be associated with a single local hardware diagnostic unit 6. In a first step 110, the hardware fault management unit 8 collects operational data for the hardware devices. The operational data collected may comprise a wide range of different types of operational data, examples of which are discussed below.
A first example of operational data comprises operational configuration data. This may include, for example, information on operational software and firmware installed on a device for a particular network configuration. The information may include a name and version number of the software and firmware, developer identification and supported platforms or interfaces. Operational configuration data may also include device operational purpose. The operational purpose of a device may vary according to a particular network configuration, for example as different network services are provided over the device. Details of such services may be included in information on the operational purpose of a device. Operational configuration data may also include neighbouring device connectivity. Thus for each device, an indication of connectivity to other devices may be included. Neighbouring device connectivity may be combined with other types of operational configuration data and other types of operational data, such that, for example, the manufacturer and operational details of a device, and of each device to which it is connected, may be provided. By taking operational configuration data into account in generating diagnostic heuristics, examples of the method 100 permit diagnostic heuristics to be generated on the basis of a whole system view. Diagnosing hardware faults at least in part on the basis of operational configuration data, rather than on device health data alone as in known systems, provides a broader perspective that allows for more accurate fault identification.
The usefulness of operational configuration data may be appreciated by considering the example of a hardware device X in communication with a hardware device Y and a hardware device Z. The hardware device X is configured to receive data from device Y and forward that data to device Z. The network management system for devices X, Y and Z observes that data is being passed to X but is not being received from X at device Z. According to existing fault identification systems, the network management system thus concludes that there is a problem with the hardware device X and initiates device diagnostics and a health check. A fault in hardware device X is, however, only one of many possible causes for the failure of data transfer from X to Z. There may for example be an incompatibility in the software operating on devices X and Z. A simple disparity in version number of the software operating on the two devices may render X unable to transmit its data to Z, for example if a later version of the software is not backwards compatible. In addition, different devices in the same network will frequently be sourced from different manufacturers, which may lead to compatibility problems between devices. Configuration issues such as these may be identified through consideration of operational configuration data when generating fault diagnostic tools.
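As an illustrative sketch only, the configuration check suggested by the X, Y, Z example might be expressed as below. The `Device` record, the (major, minor) version encoding and the rule that software is compatible only within the same major version are all hypothetical assumptions, not details taken from the description.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    vendor: str
    sw_version: tuple[int, int]  # (major, minor); a hypothetical encoding


def configuration_suspects(devices: dict[str, Device],
                           links: list[tuple[str, str]]) -> list[str]:
    """List non-hardware explanations for a failed transfer on each link.

    Assumes (hypothetically) that software is only guaranteed compatible within
    the same major version, and flags multi-vendor links as a compatibility risk.
    """
    findings = []
    for src, dst in links:
        a, b = devices[src], devices[dst]
        if a.sw_version[0] != b.sw_version[0]:
            findings.append(f"{src}->{dst}: software major version mismatch "
                            f"{a.sw_version} vs {b.sw_version}")
        if a.vendor != b.vendor:
            findings.append(f"{src}->{dst}: multi-vendor link ({a.vendor}/{b.vendor})")
    return findings
```

For the Y to X to Z chain, a mismatch reported on the X to Z link would point towards a configuration cause rather than a fault in the hardware of device X.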
Another example of operational data comprises results data for offsite hardware fault analysis. The fault analysis may be for hardware devices within the network, for example which have previously been returned to a manufacturer or sent to a repair centre and have been re-installed in the network following completed fault analysis and any resulting repairs. Such devices may have been found to be faulty and repaired or may have been falsely identified as faulty and discovered during the offsite analysis to be functioning correctly. Alternatively, the fault analysis may be for related hardware devices, such as devices of the same make and model, or having the same operational configuration as devices now installed in the network.
The offsite hardware fault analysis may be conducted in a location different from the deployment location for the device. Such locations may include for example repair and analysis centres, device manufacturing sites or network locations different from the deployment location, for example a network location at which a hardware expert is located. Results data for hardware fault analysis may comprise details of tests conducted and raw test result data. The results data may also include conclusions drawn concerning the device as well as device operational data available for the analysed hardware device prior to hardware fault analysis. Another example of operational data comprises device configuration data. This may include for example an identification of what a device is, a device identifier, device manufacturer, device software/firmware installed at manufacture etc. Further examples of operational data include device health data, device performance data, and service performance data for services provided over the device. Device health data may include results of self-diagnostic tests conducted by the device and logs generated by the device. Device performance data may for example include performance information for the device and messages generated by device management units such as an OSS or EMS. Service performance data for services provided over the device may for example include service performance records or customer/engineer feedback.
Following collection of operational data for hardware devices in the network, the hardware fault management unit 8 proceeds, at step 120, to analyse the collected data and, at step 130, to generate a heuristic for hardware device fault identification based on the analysis. Analysis of the collected data may include pre-processing of the data, for example to place the data into a suitable form for analysis. Analysing the data comprises applying one or more data mining, machine learning or analytics algorithms to the data, in order to identify correlations and patterns within the data which relate device data, device monitored and reported behaviour and device operational configuration to actual device hardware malfunction. On the basis of this analysis, one or more heuristics are generated enabling hardware device fault identification. As discussed above, the heuristic may be any experience based technique including for example descriptors, models, patterns, classifiers etc.
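One very simple form of heuristic, offered here purely as an example, is a descriptor that records, for each observed feature, the fraction of returned devices exhibiting that feature which offsite analysis confirmed as faulty. The sketch below assumes labelled cases of the form (features, confirmed_faulty); `fault_rate_descriptor` and `min_support` are illustrative names, not terms from the description.

```python
from collections import defaultdict


def fault_rate_descriptor(cases, min_support=2):
    """Map each feature to the observed rate of confirmed hardware faults.

    cases: iterable of (features, confirmed_faulty) pairs, where features is a
    set of strings (e.g. alarm types, firmware versions) recorded for a returned
    device and confirmed_faulty is the offsite analysis verdict.
    Features seen fewer than min_support times are dropped as unreliable.
    """
    seen = defaultdict(int)
    faulty = defaultdict(int)
    for features, confirmed_faulty in cases:
        for feature in features:
            seen[feature] += 1
            if confirmed_faulty:
                faulty[feature] += 1
    return {f: faulty[f] / seen[f] for f in seen if seen[f] >= min_support}
```

A feature whose rate is close to zero, such as an alarm that almost never corresponds to a confirmed fault, is exactly the kind of pattern that could help a local unit avoid an unnecessary hardware return.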
Finally, in step 150, the generated heuristic is communicated to at least one of the local hardware diagnostic units 6, where it may be used to assist with hardware fault identification and diagnosis, as discussed in further detail below.
The steps of the method 100 for managing hardware fault identification in a network may be further subdivided and augmented to provide and support the functionality discussed above. Figure 3 illustrates one example method 200, in which the steps of the method 100 may be implemented. Referring to Figure 3, in a first step 210, the hardware fault management unit 8 collects operational data for hardware devices 2 in the network. As discussed above, this data may be received from various different sources and may include, inter alia, operational configuration data, results data for offsite hardware fault analysis, device configuration data, device health data, device performance data and/or service performance data. Following collection of the data, the hardware fault management unit 8 applies analysis algorithms to the collected data in step 220, and, in step 230a, generates a first version heuristic for hardware fault identification based on the analysis. The generated first version heuristic is then refined in step 230b, for example incorporating input from a human expert operator, as discussed in further detail below. Having generated and refined the heuristic in steps 230a and 230b, the hardware fault management unit 8 then proceeds, in step 240, to provide an extraction tool for extracting input data for the generated heuristic. An extraction tool may be an executable file or program (an executable), a filter, or any other tool enabling the extraction of data to serve as input data for the generated heuristic.
The amount of data available to a local hardware diagnostic unit 6 may be vast. A unit associated with a plurality of individual hardware devices may receive a wide range of data from the devices themselves as well as from management elements such as an OSS or EMS managing the devices. Only some of this data may be relevant to the generated heuristic, and the generated heuristic may require the data to be input in a form that is different to that in which it is received at the local hardware diagnostic unit. Complex parsing tasks and/or information extraction methods may be necessary to extract the relevant information from the hardware log, alarm log or other data available at the local hardware diagnostic units 6 before any generated heuristic such as pattern matching may be applied. The extraction tool may achieve these tasks, enabling the local hardware diagnostic unit 6 to extract, from the wide range of data available to it, that data which is required as input data for the generated heuristic. In one example, the extraction tool may comprise a combination of filter and executable file or program. The filter may enable the isolation of only that data required by the generated heuristic and the executable may transform the filtered data into a form in which it may be processed by the generated heuristic. In complex and highly heterogeneous networks such as telecommunication networks, one or more extraction tools may be necessary to enable the insights gained from central analysis of operational data to be implemented in local hardware diagnostic units. Providing the extraction tool may comprise the steps of identifying input data required by the generated heuristic, identifying what data is available to the local hardware diagnostic unit or units, and selecting one or more extraction tools to generate the identified input data from the identified available data. 
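A filter and an executable combined into an extraction tool might, under the assumption of a simple whitespace-delimited event log, look like the following Python sketch. The line format, the event names and both function names are hypothetical; real OSS/EMS logs would need their own parsers.

```python
import re

# Hypothetical event-log line format: "<timestamp> <device_id> <event_type> <detail>"
LINE = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s*(.*)$")


def make_filter(required_events):
    """Filter: keep only well-formed log lines whose event type the heuristic needs."""
    def keep(line):
        m = LINE.match(line)
        return bool(m) and m.group(3) in required_events
    return keep


def to_features(lines, device_id):
    """'Executable' step: turn filtered lines into per-event counts for one device."""
    counts = {}
    for line in lines:
        _, dev, event, _ = LINE.match(line).groups()
        if dev == device_id:
            counts[event] = counts.get(event, 0) + 1
    return counts
```

Keeping the filter separate from the transformation step mirrors the split described above: the filter can be varied per heuristic, while the transformation produces the fixed input form the heuristic expects.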
Selection of a suitable extraction tool or tools may involve consideration of templates or libraries of existing extraction tools and identifying a tool or a combination of tools that are suitable for extracting the identified input data. For example, an event filter may isolate reports of a particular event required for the generated heuristic from all event log data, and an executable file may convert the filtered event data to a form suitable for input to the generated heuristic. In some examples, providing the extraction tool may be an automated process, or it may include both automated processing and input from a human expert. For example, templates and libraries of extraction tools may be consulted to allow selection of suitable extraction tools as part of an automated generation of a first version extraction tool or tools, and these first version tools may be refined by a human expert.
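The automated part of tool selection might be approximated by a greedy cover over a library of tool descriptions, as in this sketch. `ExtractionTool`, its `needs`/`provides` fields and the greedy strategy are assumptions for illustration; returning `None` stands for the case where no combination of library tools suffices and a human expert would need to be involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionTool:
    name: str
    needs: frozenset     # data sources the tool reads (hypothetical labels)
    provides: frozenset  # heuristic inputs the tool produces


def select_tools(required, available, library):
    """Greedily pick library tools that can run on the available data and
    together cover the heuristic's required inputs; None if that is impossible."""
    usable = [t for t in library if t.needs <= available]
    chosen, missing = [], set(required)
    while missing:
        # Pick the usable tool covering the most still-missing inputs.
        best = max(usable, key=lambda t: len(t.provides & missing), default=None)
        if best is None or not best.provides & missing:
            return None
        chosen.append(best)
        missing -= best.provides
    return chosen
```

A greedy cover is not guaranteed to find the smallest set of tools, which is one reason the first version selection might be reviewed by an expert.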
Having generated both the heuristic for hardware fault identification and an associated extraction tool or tools, the hardware fault management unit 8 proceeds to communicate the generated heuristic and provided extraction tool to at least one of the local hardware diagnostic units 6 in step 250.
In addition to the generation of a heuristic and extraction tool, the hardware fault management unit 8 may also extract, at step 260, performance statistics for network feedback messages from the data analysed in step 220. These performance statistics may be supplied to a network feedback tool in step 270. The network feedback messages may be alarms, log messages or any other type of message generated in the network, for example as part of network or device health or performance data. The performance statistics for the network feedback messages may be accuracy statistics for alarms, log messages etc. and/or may be other performance information, for example demonstrating under what combinations of circumstances certain alarms or messages have greater or lesser accuracy. These statistics are supplied to a network feedback tool, which may be any entity or element that manages the underlying logic for such messages and alarms. According to examples of the present invention, this logic may be refined and improved on the basis of the extracted performance statistics for the network feedback messages. Network feedback tools may include for example hardware, firmware and network management system design, development and bug fixing units.
Figure 4 is a block diagram illustrating functionality and data flow, including inputs and outputs, for an example hardware fault management unit 8. Referring to Figure 4, the hardware fault management unit 8 receives operational data OD as input from various different sources. These sources may include:
- Hardware device log files containing lower level device health related information. Such log files may be received from suspected faulty devices and non-faulty devices for the purposes of comparison.
- Alarms and events generated by OSS/EMS that are relevant to hardware performance and failures.
- Information relevant to services that depend on the hardware devices, for example services delivered over the hardware devices. Such information may be obtained from a Service Performance Monitoring system.
- Fault descriptions from customers, field engineers, support engineers and hardware device experts.
- Repair centre and manufacturer fault analysis information including relevant device operating information, fault analysis conducted, results obtained and conclusions drawn.
- Operational and device configuration information for the hardware devices.
All data input to the hardware fault management unit 8, of which the above are merely examples, is then analysed in core hardware data analytics A. This analysis involves the application of algorithms which may include for example text mining algorithms for various text based data. Such data may include hardware log files, alarm and event logs and human descriptions relating to hardware health status such as customer, field engineer and support engineer reports. The text mining algorithms may create term and term-combination frequency measures and association rules, and may identify correlations, patterns and tools for classifying a data set, for example one corresponding to a suspected faulty hardware device, with a certain probability as either a true faulty hardware situation or a false faulty hardware situation.
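As one example of a classifier that assigns a probability of a true hardware fault, a multinomial naive Bayes model over log terms could be trained on repair centre verdicts. This is a substitute technique chosen for illustration, not necessarily the algorithm contemplated here; the class name, the whitespace tokenisation and the Laplace smoothing are assumptions, and the training data must contain examples of both outcomes.

```python
import math
from collections import Counter


class NaiveBayesFaultClassifier:
    """Multinomial naive Bayes over log terms, with Laplace (add-one) smoothing.

    Labels: True = true hardware fault, False = no hardware fault found, as
    recorded by the repair centre for previously returned devices. Training
    data must include both labels, otherwise a class prior would be zero.
    """

    def fit(self, docs, labels):
        self.term_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc.lower().split())
        self.vocab = set(self.term_counts[True]) | set(self.term_counts[False])
        return self

    def predict_proba(self, doc):
        """Return P(true hardware fault | doc)."""
        total_docs = sum(self.doc_counts.values())
        log_post = {}
        for label in (True, False):
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            lp = math.log(self.doc_counts[label] / total_docs)
            for term in doc.lower().split():
                if term in self.vocab:  # ignore terms never seen in training
                    lp += math.log((counts[term] + 1) / denom)
            log_post[label] = lp
        # Normalise the two log-posteriors into a probability, avoiding underflow.
        m = max(log_post.values())
        p_true = math.exp(log_post[True] - m)
        p_false = math.exp(log_post[False] - m)
        return p_true / (p_true + p_false)
```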
In addition to text-mining algorithms, other algorithms such as clustering, pattern recognition and other machine learning techniques may be used to identify further correlations and patterns enabling accurate hardware fault identification. The results of the core hardware data analytics, that is, the generated heuristics in the form of descriptors, models, patterns, etc., may be placed into interim storage C, from which they may be subject to human enrichment input B. The human enrichment input may comprise a refinement stage during which the generated heuristics are improved and enriched by a combination of automated processing and human expert input. Extraction tools such as filters and executables may also be provided to ensure that the correct input data in an acceptable format can be extracted at the local hardware diagnostic units 6 to enable these units 6 to apply the generated heuristics. These filters and executables may be transferred along with the corresponding heuristics to the local hardware diagnostic units 6. The enriched heuristics and extraction tool(s) are returned to interim storage C and passed to final storage D to be downloaded to the local hardware diagnostic units 6. The heuristics and extraction tools may be updated periodically as new data becomes available. From final storage D, the enriched heuristics and extraction tool(s) are passed to dissemination service E, the purpose of which is to facilitate the distribution of the heuristics and extraction tools to the local hardware diagnostic units 6 in the field. The dissemination service E may ensure appropriate authentication, licensing, scalability aspects etc. As discussed above, the core hardware data analytics A also provides performance statistics and insight for network feedback messages, which are transmitted at F to appropriate network feedback units FB, including design, development and bug fixing units.
This feedback may enable refinement of the software and logic underlying these messages to correct errors and improve accuracy.
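The per-circumstance accuracy statistics described for network feedback messages might, for example, be computed as alarm precision grouped by an operating context. In this sketch the context key (for instance a firmware version), the input tuple layout and the function name are assumptions.

```python
from collections import defaultdict


def alarm_accuracy_by_context(returns):
    """Precision of each alarm, broken down by an operating context.

    returns: iterable of (alarm, context, confirmed_faulty) tuples, one per
    returned device on which the alarm was raised; context is any grouping
    key (hypothetically a firmware version). Precision is the fraction of
    alarm occurrences that offsite analysis confirmed as a true hardware fault.
    """
    stats = defaultdict(lambda: [0, 0])  # (alarm, context) -> [raised, confirmed]
    for alarm, context, faulty in returns:
        entry = stats[(alarm, context)]
        entry[0] += 1
        entry[1] += int(faulty)
    return {key: confirmed / raised for key, (raised, confirmed) in stats.items()}
```

A large difference in precision between contexts for the same alarm is the kind of insight that could be fed back to the units maintaining the alarm logic.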
Figure 5 illustrates process steps in a method 300 for identifying hardware faults in a network, the method conducted in a local hardware diagnostic unit 6. As in the case of methods 100, 200 above conducted in hardware fault management unit 8, the network may be a telecommunication network and comprises a plurality of hardware devices 2, each hardware device 2 associated with a local hardware diagnostic unit 6. In some examples, each hardware device 2 may be associated with a dedicated local hardware diagnostic unit 6. In other examples, a plurality of hardware devices 2 may be associated with a single local hardware diagnostic unit 6. Referring to Figure 5, in a first step 310, the local hardware diagnostic unit 6 identifies hardware devices associated with it. This may be achieved for example by requesting a node list from configuration data held in a management element such as an OSS. The local hardware diagnostic unit 6 then receives, from a hardware fault management unit 8, a heuristic for hardware fault identification corresponding to the identified devices at step 330. The local hardware diagnostic unit 6 receives, at step 340, at least one of device configuration data, device health data or device performance data for the identified devices. Such information is received in the normal course of functioning, for example from the devices themselves or from management elements such as an OSS or EMS. The local hardware diagnostic unit 6 then applies the received heuristic to the received data at step 360 and outputs a result of the applied heuristic at step 370. The method 300 thus complements the method 100, applying the heuristic generated according to the method 100 in order to identify hardware faults.
As in the case of the method 100 conducted in the hardware fault management unit 8, the steps of the method 300 in the local hardware diagnostic unit 6 may be further subdivided and augmented to provide and support the functionality discussed above. Figure 6 illustrates one example method 400, in which the steps of the method 300 may be implemented. Referring to Figure 6, in a first step 410, the local hardware diagnostic unit 6 identifies hardware devices associated with it. The local hardware diagnostic unit 6 then requests a heuristic corresponding to the identified devices from a hardware fault management unit 8 in step 420. The requested heuristic is received in step 430 and at least one of device configuration data, device health data or device performance data for the identified devices is received at step 440. The local hardware diagnostic unit 6 then checks, at step 450, whether a trigger is present in the received data. The trigger may comprise any data item that indicates a suspected or perceived hardware fault or failure. This may include a network alarm or log message, a suspect result from a periodic device health check or any other data element which may raise a question over the operational status or performance of a hardware device. If the trigger is present in the received data, the local hardware diagnostic unit 6 proceeds, at step 460, to apply the received heuristic to the received data for the device corresponding to the trigger and outputs a result of the received heuristic at step 470.
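The control flow of method 400 can be sketched as follows, purely by way of illustration. The callable `fetch_heuristic`, the representation of received data as sets of event names, and the heuristic's interface are hypothetical stand-ins for the communication with the hardware fault management unit and for the received heuristic itself.

```python
def run_local_diagnosis(device_ids, fetch_heuristic, device_data, trigger_events):
    """Sketch of method 400 at a local hardware diagnostic unit.

    device_ids:      devices identified as associated with this unit (step 410)
    fetch_heuristic: hypothetical callable standing in for the request to, and
                     reply from, the hardware fault management unit (steps 420/430)
    device_data:     dict device_id -> set of received event/alarm names (step 440)
    trigger_events:  event names treated as indicating a suspected fault
    """
    heuristic = fetch_heuristic(device_ids)
    results = {}
    for dev in device_ids:
        data = device_data.get(dev, set())
        if data & trigger_events:             # step 450: is a trigger present?
            results[dev] = heuristic(data)    # steps 460/470: apply and output
    return results
```

Devices without a trigger are not analysed at all, which keeps the local processing load proportional to the number of suspected faults rather than to the number of devices.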
Figure 7 is a block diagram illustrating functionality in an example local hardware diagnostic unit 6, including inputs to and outputs from the local hardware diagnostic unit 6. Referring to Figure 7, the local hardware diagnostic unit 6 requests a node list from configuration data, for example from a CM database in an OSS. This list may be requested periodically to reflect changes in network configuration and newly installed hardware devices. On the basis of the received node list, the local hardware diagnostic unit 6 requests a heuristic corresponding to the hardware devices in the list from a hardware fault management unit 8. In answer to this request, the local hardware diagnostic unit 6 receives from the hardware fault management unit 8 one or more heuristics 702, which may include descriptors, models, classifiers etc., together with one or more extraction tools, which may include executables and filters, that correspond to the type of devices listed in the node list. The heuristics and extraction tools may be communicated to the local hardware diagnostic unit 6 over the Internet, for example with approval from a network administrator, or may be loaded manually with a specific software update file.
The received heuristics and extraction tools serve as a knowledge base for the local hardware diagnostic unit 6 to identify hardware faults amongst the devices with which it is associated. The heuristics and extraction tools are used in conjunction with data for the individual hardware devices to build a set of recommendations regarding the probable hardware health of the devices. On receipt of data, the local hardware diagnostic unit 6 applies filters 704, 706, 708 received from the hardware fault management unit 8 to select appropriate events, counters, alarms and other data from all the events, counters, alarms and other data received from the OSS/EMS and, optionally, from the individual hardware devices. By filtering out data not required for the received heuristics, the local hardware diagnostic unit 6 ensures that it is not overloaded. Received executables are then applied to obtain data in a form suitable for processing by the received heuristics. After application of the received heuristics, the results generated may be presented to a network administrator, engineer, technician or other user of the local hardware diagnostic unit 6. The results may fall largely into three categories: hardware potentially failed; hardware potentially not failed; and unable to distinguish whether the hardware is faulty or non-faulty. For each category of results, the basis for the results together with recommended actions may also be presented. In addition, for hardware judged probably non-faulty, alternative causes of the perceived problem triggering the analysis may be presented together with appropriate actions, if available. These actions may be extracted from the intelligence obtained through data analysis at the hardware fault management unit. For example, a combination of data may suggest that a hardware device does not have a hardware problem and that a perceived problem is more likely to be caused by a software incompatibility.
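Mapping a heuristic's output onto the three result categories could be done with two thresholds, as sketched below for a heuristic that yields a fault probability. The 0.7 and 0.3 thresholds and the `categorise` function are illustrative assumptions only.

```python
from enum import Enum


class Verdict(Enum):
    FAILED = "Hardware potentially failed"
    NOT_FAILED = "Hardware potentially not failed"
    UNDETERMINED = "Unable to distinguish whether hardware is faulty or non-faulty"


def categorise(p_fault, upper=0.7, lower=0.3):
    """Map a heuristic's fault probability to one of the three result categories.

    The default thresholds are illustrative assumptions, not values from the text.
    """
    if p_fault >= upper:
        return Verdict.FAILED
    if p_fault <= lower:
        return Verdict.NOT_FAILED
    return Verdict.UNDETERMINED
```

The gap between the two thresholds sets how often the unit admits uncertainty instead of forcing a decision on whether to remove the device.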
The associated recommendation may therefore be updating or verifying software versions on the device and potentially other connected devices.
The local hardware diagnostic unit 6 may be implemented in various different ways, and the complexity of the communication between the local hardware diagnostic unit 6 and the individual hardware devices may vary with the practical implementation of the unit. Figure 8 is a block diagram illustrating alternative implementations of local hardware diagnostic unit 6. Referring to Figure 8, in a first example, the local hardware diagnostic unit 6 may be fully implemented within a monitoring element such as an EMS/OSS system 4. In such examples, all logic pertaining to recommendations for particular devices may be contained in the EMS/OSS 4. Communication between the local hardware diagnostic unit 6 and the hardware devices 2 may be conducted via the EMS/OSS. In a second example, the local hardware diagnostic unit 6 may be shared between the EMS/OSS 4 and the hardware device. One or more sub-modules of the local hardware diagnostic unit 6 may be included as part of the EMS/OSS 4, while another sub-module or sub-modules reside on the individual hardware device 2. This configuration may be appropriate if not all data is available through the interface to the EMS/OSS, some data necessarily being obtained directly from the device. Alternatively, this configuration may be adopted for reasons of efficiency. In such examples, incoming communication to local hardware diagnostic unit sub-module(s) in the EMS/OSS 4 may include data 802 arriving directly from the hardware device and data arriving from the local hardware diagnostic unit sub-module(s) in the hardware device.
In a further example, the local hardware diagnostic unit 6 may be fully implemented within a hardware device. This option may be more appropriate for example if the EMS/OSS execution environment is very limited. The EMS/OSS 4 has the role of forwarding any received heuristics 802 to the hardware device, where the actual recommendations are generated within the local hardware diagnostic unit 6. In such examples, the local hardware diagnostic unit 6 may be implemented as a part of the firmware of the hardware device.
Figure 9 is a block diagram illustrating in greater detail the connectivity of the first example configuration discussed above for a local hardware diagnostic unit 6. The local hardware diagnostic unit 6 is fully implemented within the EMS/OSS 4 and communicates via the EMS/OSS 4 with the hardware fault management unit 8 and the individual hardware devices 2.
The functionality discussed above with respect to hardware fault management unit 8 and local hardware diagnostic unit 6 may be implemented on receipt of suitable computer readable instructions, which may be embodied within a computer program running on a network element embodying the hardware fault management unit 8 and local hardware diagnostic unit 6. Figures 10 and 11 illustrate functional units in a hardware fault management unit 500 and local hardware diagnostic unit 600, which units may execute the steps of the methods 100, 200, or 300, 400 respectively, for example according to computer readable instructions received from a computer program. It will be understood that the units illustrated in Figures 10 and 11 are functional units, and may be realised in any appropriate combination of hardware and/or software.
Referring first to Figure 10, hardware fault management unit 500 comprises a data unit 510, an analysis unit 520, a heuristic unit 530, which may comprise an automated sub-unit 530a and a refining sub-unit 530b, and a communication unit 550. The hardware fault management unit 500 may also comprise an extraction unit 540, a performance unit 560 and a feedback unit 570. The data unit 510 is configured to collect operational data for the hardware devices. The analysis unit 520 is configured to analyse the collected data, for example by applying at least one of a data mining, machine learning or analytics algorithm to the data, and the heuristic unit 530 is configured to generate, based on analysis conducted by the analysis unit 520, a heuristic for hardware device fault identification. Within the heuristic unit 530, the automated sub-unit 530a is configured to generate a first version of a heuristic for hardware device fault identification and the refining sub-unit 530b is configured to refine the generated first version heuristic, for example incorporating input from a human expert. The communication unit 550 is configured to communicate the generated heuristic to at least one of the local hardware diagnostic units.
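The cooperation of the core functional units of Figure 10 might be sketched as a class whose units are injected callables, as below. The class name and the callable interfaces are hypothetical; the sketch shows only the order in which the units act, not how any of them is implemented.

```python
class HardwareFaultManagementUnit:
    """Functional units of Figure 10 modelled as injected callables (hypothetical)."""

    def __init__(self, collect, analyse, generate, refine, send):
        self.collect = collect    # data unit 510: gathers operational data
        self.analyse = analyse    # analysis unit 520: mining / learning / analytics
        self.generate = generate  # automated sub-unit 530a: first version heuristic
        self.refine = refine      # refining sub-unit 530b: e.g. expert review
        self.send = send          # communication unit 550: delivery to local units

    def run(self, local_units):
        data = self.collect()
        analysis = self.analyse(data)
        heuristic = self.refine(self.generate(analysis))
        for unit in local_units:
            self.send(unit, heuristic)
        return heuristic
```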
If present in the hardware fault management unit 500, the extraction unit 540 is configured to provide an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit. The performance unit 560 is configured to extract, from the analysed data of the analysis unit 520, performance statistics for network feedback messages and the feedback unit 570 is configured to supply the extracted performance statistics to a network feedback tool for refinement of the network feedback messages.
Referring to Figure 11, the local hardware diagnostic unit 600 comprises an identifying unit 610, a heuristic unit 630, a data unit 640, an analysis unit 660, which may comprise a trigger sub-unit 650, and an output unit 670. The identifying unit 610 is configured to identify hardware devices associated with the local hardware diagnostic unit 600. The heuristic unit 630 is configured to receive, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices. The data unit 640 is configured to receive at least one of device configuration data, device health data or device performance data for the identified devices. The analysis unit 660 is configured to apply the received heuristic to the received data and the output unit 670 is configured to output a result of the applied heuristic. If present, the trigger sub-unit 650 is configured to check the data received by the data unit 640 for a trigger, and the analysis unit 660 may be configured to apply the received heuristic to the received data on detection of the trigger by the trigger sub-unit 650.
Figures 12 and 13 illustrate alternative examples of hardware fault management unit 700 and local hardware diagnostic unit 800, which may implement the above discussed functionality for example on receipt of suitable instructions from a computer program. Referring first to Figure 12, the hardware fault management unit 700 comprises a processor 701 and a memory 702. The memory 702 contains instructions executable by the processor 701 such that the hardware fault management unit 700 is operative to conduct the steps of the method 100 or the method 200.
Referring to Figure 13, the local hardware diagnostic unit 800 also comprises a processor 801 and a memory 802. The memory 802 contains instructions executable by the processor 801 such that the local hardware diagnostic unit 800 is operative to conduct the steps of the method 300 or the method 400.
Aspects of the present invention thus make it possible to run advanced analytics functions and algorithms, at a central location, on a wide variety of data sources related to hardware faults and suspected faults. This analysis enables the creation of heuristics such as descriptors, models, patterns or classifiers that model, describe or otherwise distinguish true hardware faults from false hardware faults in a network. The heuristics may be associated with one or more extraction tools that generate suitable machine-readable input data for the heuristics from data available to a local hardware diagnostic apparatus. The heuristics, and extraction tools if appropriate, are then disseminated to local hardware diagnostic units distributed throughout the network, enabling such units to perform fault diagnosis at a local level using logic rooted in the big data analysis conducted at the central location. The local hardware diagnostic units may be deployed in hardware devices such as a radio base station in a telecommunication network, or in management systems such as an EMS/OSS in a telecommunication network.
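The sketch below shows one possible instance of such central analysis. It learns a decision stump (a one-feature threshold classifier) from collected operational data labelled with offsite fault analysis results, then serialises the result as JSON for dissemination to local units. The description leaves both the analytics algorithm and the heuristic representation open. The decision stump, the record format, the feature names and the JSON encoding are therefore all assumptions made for illustration.

```python
import json
from typing import Dict, List, Tuple

# One collected record: operational features of a returned device plus the
# offsite analysis verdict (True = hardware fault confirmed). Hypothetical format.
Record = Tuple[Dict[str, float], bool]


def learn_threshold_heuristic(records: List[Record]) -> Dict[str, object]:
    """Fit a decision stump: the single feature and threshold that best separate
    confirmed hardware faults from no-fault-found returns on the training data."""
    if not records:
        raise ValueError("no records to learn from")
    best_feature, best_threshold, best_accuracy = None, None, -1.0
    for name in sorted({n for feats, _ in records for n in feats}):
        for threshold in sorted({feats.get(name, 0.0) for feats, _ in records}):
            # Rule: predict a true hardware fault when the feature exceeds the threshold.
            correct = sum(
                (feats.get(name, 0.0) > threshold) == is_fault
                for feats, is_fault in records
            )
            accuracy = correct / len(records)
            if accuracy > best_accuracy:
                best_feature, best_threshold, best_accuracy = name, threshold, accuracy
    return {"feature": best_feature, "threshold": best_threshold, "accuracy": best_accuracy}


def apply_heuristic(heuristic: Dict[str, object], features: Dict[str, float]) -> bool:
    """Evaluate a received heuristic locally: True means 'likely a true hardware fault'."""
    return features.get(heuristic["feature"], 0.0) > heuristic["threshold"]


if __name__ == "__main__":
    # Toy records (hypothetical): CRC error counts and temperatures of returned units.
    records = [
        ({"crc_errors": 40, "temp_c": 60}, True),
        ({"crc_errors": 35, "temp_c": 75}, True),
        ({"crc_errors": 2, "temp_c": 80}, False),
        ({"crc_errors": 1, "temp_c": 62}, False),
    ]
    payload = json.dumps(learn_threshold_heuristic(records))  # sent to local units
    print(payload)  # {"feature": "crc_errors", "threshold": 2, "accuracy": 1.0}
```

A stump is used here only because it is small and its output is easy to inspect. A practical deployment would probably use richer models and evaluate them on held-out data rather than on training accuracy.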
Aspects of the present invention provide considerable advantages over known methods for hardware fault identification and management. In a first example, aspects of the present invention improve the decision-making process for removing hardware from field deployment. They do so by distinguishing more accurately between actual hardware failures and perceived hardware failures whose root cause is unrelated to the hardware itself. Because the analytics are performed centrally over a much wider range of data than is conventionally considered, true hardware faults can be distinguished with greater accuracy. Faults arising from network configurations or other operational aspects that could not be envisaged at the time of hardware design and manufacture can also be accurately identified. Aspects of the present invention thus enable fault identification at a local level to be conducted on the basis of a global system view, taking a holistic approach to fault identification and diagnosis. This analysis supports more informed decisions as to whether to remove a hardware device that signals a problem and return it to the manufacturer or a repair centre, or to perform some other local action, for example updating software or firmware.
It is not feasible for engineers and technicians to be familiar with the functionality and possible issues associated with each type of hardware device and with each possible configuration of hardware device within the network. Aspects of the present invention address this difficulty by analysing existing data from a wide range of sources in a central location, and then disseminating experience-based fault identification and diagnosis techniques derived from this analysis to local diagnostic units. The results of these techniques greatly enhance the information available to an engineer, technician or network administrator and so improve the decision-making process. As a consequence of the improved decision-making regarding identification and treatment of suspected faulty hardware, the number of falsely identified faulty hardware devices may be reduced. This reduces the number of devices unnecessarily sent to a manufacturer or repair centre for fault analysis and repair, reducing cost and cutting the time delay associated with hardware failure analysis and returns. The process of replacing faulty hardware is long. It involves the receipt of alarms, a decision to replace equipment, a site visit to remove the equipment, shipping to a repair centre or manufacturer, offsite analysis and return. For a falsely identified faulty device, this entire process is unnecessary. Through the improved identification of hardware faults provided by aspects of the invention, false hardware faults can be correctly identified and the underlying problem addressed, usually by much simpler, quicker and cheaper actions such as software upgrades. The central analysis of a wide range of data also enables the extraction of performance statistics and other performance insights regarding network alarms, error messages and other log messages. On the basis of such insight, the logic underlying such alarms and messages may be refined to improve their accuracy.
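One simple form such performance statistics could take is a per-alarm-type fraction: of the alarms raised, how many led to a hardware fault that offsite analysis actually confirmed. A minimal sketch follows. The event format, the alarm names and the 0.5 cut-off in `alarms_to_refine` are hypothetical choices, not values taken from the description.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def alarm_precision(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """For each alarm type, the fraction of raised alarms whose hardware was
    confirmed faulty by offsite analysis (events are hypothetical tuples)."""
    raised: Dict[str, int] = defaultdict(int)
    confirmed: Dict[str, int] = defaultdict(int)
    for alarm_type, hw_confirmed in events:
        raised[alarm_type] += 1
        confirmed[alarm_type] += int(hw_confirmed)
    return {alarm: confirmed[alarm] / raised[alarm] for alarm in raised}


def alarms_to_refine(precision: Dict[str, float], min_precision: float = 0.5) -> List[str]:
    """Alarm types whose logic may need refinement because most of them were false."""
    return sorted(alarm for alarm, p in precision.items() if p < min_precision)


if __name__ == "__main__":
    # Hypothetical history: (alarm type, did offsite analysis confirm a hardware fault?)
    history = [("LINK_DOWN", False), ("LINK_DOWN", False), ("LINK_DOWN", True),
               ("PSU_FAIL", True), ("PSU_FAIL", True)]
    stats = alarm_precision(history)
    print(alarms_to_refine(stats))  # ['LINK_DOWN']
```

A network feedback tool could use such a list as a starting point for reviewing the conditions under which the flagged alarms are raised.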
The methods of the present invention may be implemented in hardware, or as software modules running on one or more processors. The methods may also be carried out according to the instructions of a computer program, and the present invention also provides a computer readable medium having stored thereon a program for carrying out any of the methods described herein. A computer program embodying the invention may be stored on a computer-readable medium, or it could, for example, be in the form of a signal such as a downloadable data signal provided from an Internet website, or it could be in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims. Any reference signs in the claims shall not be construed so as to limit their scope.

Claims

1. A method, in a hardware fault management unit, for managing hardware fault identification in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit, the method comprising:
collecting operational data for the hardware devices;
analysing the collected data;
generating, based on the analysis, a heuristic for hardware device fault identification; and
communicating the generated heuristic to at least one of the local hardware diagnostic units.
2. A method as claimed in claim 1, wherein analysing the collected data comprises applying at least one of a data mining, machine learning or analytics algorithm to the collected data.
3. A method as claimed in claim 1 or 2, wherein operational data comprises at least one of:
operational configuration data for the hardware devices, or
results data for offsite hardware fault analysis conducted on the hardware devices or on related hardware devices.
4. A method as claimed in claim 3, wherein operational configuration data comprises at least one of:
neighbouring device connectivity;
operational software information;
operational firmware information; or
device operational purpose.
5. A method as claimed in any one of the preceding claims, wherein operational data comprises at least one of:
device configuration data;
device health data;
device performance data; or
service performance data for services provided over the device.
6. A method as claimed in any one of the preceding claims, further comprising: providing an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit; and
communicating the extraction tool to the at least one of the local hardware diagnostic units with the heuristic for hardware device fault identification.
7. A method as claimed in any one of the preceding claims, further comprising: extracting, from the analysed data, performance statistics for network feedback messages; and
supplying the extracted performance statistics to a network feedback tool for refinement of the network feedback messages.
8. A method as claimed in any one of the preceding claims, wherein generating a heuristic for hardware device fault identification comprises:
generating a first version of a heuristic for hardware device fault identification; and
refining the generated first version heuristic.
9. A method as claimed in any one of the preceding claims, wherein the at least one local hardware diagnostic unit is at least partially implemented in a device management unit.
10. A method as claimed in any one of the preceding claims, wherein the at least one local hardware diagnostic unit is at least partially implemented in a hardware device.
11. A method as claimed in any one of the preceding claims, wherein the network comprises a telecommunication network.
12. A method, in a local hardware diagnostic unit, for identifying hardware faults in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit, the method comprising:
identifying hardware devices associated with the local hardware diagnostic unit;
receiving, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices;
receiving at least one of device configuration data, device health data or device performance data for the identified devices;
applying the received heuristic to the received data; and
outputting a result of the applied heuristic.
13. A method as claimed in claim 12, further comprising:
checking the received at least one of device configuration data, device health data or device performance data for a trigger; and
applying the received heuristic to the received data on detection of the trigger.
14. A method as claimed in claim 13, wherein the trigger comprises an item of received data indicating a perceived hardware fault.
15. A method as claimed in any one of the preceding claims, wherein the local hardware diagnostic unit is at least partially implemented in a device management unit.
16. A method as claimed in any one of the preceding claims, wherein the local hardware diagnostic unit is at least partially implemented in a hardware device.
17. A method as claimed in any one of the preceding claims, wherein the network comprises a telecommunication network.
18. A computer program product configured, when run on a computer, to carry out a method according to any one of the preceding claims.
19. A hardware fault management unit configured for managing hardware fault identification in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit, the hardware fault management unit comprising:
a data unit configured to collect operational data for the hardware devices;
an analysis unit configured to analyse the collected data;
a heuristic unit configured to generate, based on analysis conducted by the analysis unit, a heuristic for hardware device fault identification; and
a communication unit configured to communicate the generated heuristic to at least one of the local hardware diagnostic units.
20. A hardware fault management unit as claimed in claim 19, wherein the analysis unit is configured to apply at least one of a data mining, machine learning or analytics algorithm to the data collected by the data unit.
21. A hardware fault management unit as claimed in claim 19 or 20, further comprising:
an extraction unit configured to provide an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit;
wherein the communication unit is further configured to communicate the extraction tool to the at least one of the local hardware diagnostic units with the heuristic for hardware device fault identification.
22. A hardware fault management unit as claimed in any one of claims 19 to 21, further comprising:
a performance unit configured to extract, from the analysed data of the analysis unit, performance statistics for network feedback messages; and
a feedback unit configured to supply the extracted performance statistics to a network feedback tool for refinement of the network feedback messages.
23. A hardware fault management unit as claimed in any one of the preceding claims, wherein the heuristic unit comprises:
an automated sub-unit, configured to generate a first version of a heuristic for hardware device fault identification; and
a refining sub-unit configured to refine the generated first version heuristic.
24. A local hardware diagnostic unit configured for identifying hardware faults in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit, the local hardware diagnostic unit comprising:
an identifying unit configured to identify hardware devices associated with the local hardware diagnostic unit;
a heuristic unit configured to receive, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices;
a data unit configured to receive at least one of device configuration data, device health data or device performance data for the identified devices;
an analysis unit configured to apply the received heuristic to the received data; and
an output unit configured to output a result of the applied heuristic.
25. A local hardware diagnostic unit as claimed in claim 24, wherein the analysis unit further comprises:
a trigger sub-unit configured to check the data received by the data unit for a trigger; and wherein the analysis unit is configured to apply the received heuristic to the received data on detection of the trigger by the trigger sub-unit.
26. A local hardware diagnostic unit as claimed in claim 24 or 25, wherein the local hardware diagnostic unit is at least partially implemented in a device management unit.
27. A local hardware diagnostic unit as claimed in any one of the claims 24 to 26, wherein the local hardware diagnostic unit is at least partially implemented in a hardware device.
28. A hardware fault management unit configured for managing hardware fault identification in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit, the hardware fault management unit comprising a processor and a memory, the memory containing instructions executable by the processor whereby the hardware fault management unit is operative to:
collect operational data for the hardware devices;
analyse the collected data;
generate, based on the analysis, a heuristic for hardware device fault identification; and
communicate the generated heuristic to at least one of the local hardware diagnostic units.
29. A hardware fault management unit as claimed in claim 28, wherein the hardware fault management unit is further operative to apply at least one of a data mining, machine learning or analytic algorithm to the collected data.
30. A hardware fault management unit as claimed in claim 28 or 29, wherein operational data comprises at least one of:
operational configuration data for the hardware devices, or
results data for offsite hardware fault analysis conducted on the hardware devices or on related hardware devices.
31. A hardware fault management unit as claimed in claim 30, wherein operational configuration data comprises at least one of:
neighbouring device connectivity;
operational software information;
operational firmware information; or
device operational purpose.
32. A hardware fault management unit as claimed in any one of claims 28 to 31, wherein operational data comprises at least one of:
device configuration data;
device health data;
device performance data; or
service performance data for services provided over the device.
33. A hardware fault management unit as claimed in any one of claims 28 to 32, wherein the hardware fault management unit is further operative to:
provide an extraction tool for extracting input data for the generated heuristic from device information available to a local hardware diagnostic unit; and
communicate the extraction tool to at least one of the local hardware diagnostic units with the heuristic for hardware device fault identification.
34. A hardware fault management unit as claimed in any one of claims 28 to 33, wherein the hardware fault management unit is further operative to:
extract, from the analysed data, performance statistics for network feedback messages; and
supply the extracted performance statistics to a network feedback tool for refinement of the network feedback messages.
35. A hardware fault management unit as claimed in any one of claims 28 to 34, wherein the hardware fault management unit is further operative to:
generate a first version of a heuristic for hardware device fault identification; and
refine the generated first version heuristic.
36. A hardware fault management unit as claimed in any one of claims 28 to 35, wherein the network comprises a telecommunication network.
37. A local hardware diagnostic unit configured for identifying hardware faults in a network, the network comprising a plurality of hardware devices, each hardware device associated with a local hardware diagnostic unit, the local hardware diagnostic unit comprising a processor and a memory, the memory containing instructions executable by the processor whereby the local hardware diagnostic unit is operative to:
identify hardware devices associated with the local hardware diagnostic unit;
receive, from a hardware fault management unit, a heuristic for hardware fault identification corresponding to the identified devices;
receive at least one of device configuration data, device health data or device performance data for the identified devices;
apply the received heuristic to the received data; and
output a result of the applied heuristic.
38. A local hardware diagnostic unit as claimed in claim 37, wherein the local hardware diagnostic unit is further operative to:
check the received at least one of device configuration data, device health data or device performance data for a trigger; and
apply the received heuristic to the received data on detection of the trigger.
39. A local hardware diagnostic unit as claimed in claim 38, wherein the trigger comprises an item of received data indicating a perceived hardware fault.
40. A local hardware diagnostic unit as claimed in any one of claims 37 to 39, wherein the local hardware diagnostic unit is at least partially implemented in a device management unit.
41. A local hardware diagnostic unit as claimed in any one of claims 37 to 40, wherein the local hardware diagnostic unit is at least partially implemented in a hardware device.
42. A local hardware diagnostic unit as claimed in any one of claims 37 to 41, wherein the network comprises a telecommunication network.
Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2014/067569 WO2016026510A1 (en) 2014-08-18 2014-08-18 Hardware fault identification management in a network

Publications (1)

Publication Number Publication Date
WO2016026510A1 true WO2016026510A1 (en) 2016-02-25

Family

ID=51429260

Country Status (1)

Country Link
WO (1) WO2016026510A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6199018B1 (en) * 1998-03-04 2001-03-06 Emerson Electric Co. Distributed diagnostic system
US20040117153A1 (en) * 2002-12-17 2004-06-17 Xerox Corporation Automated self-learning diagnostic system
US20050182834A1 (en) * 2004-01-20 2005-08-18 Black Chuck A. Network and network device health monitoring
GB2421656A (en) * 2004-12-23 2006-06-28 Nortel Networks Ltd Distributed network fault analysis

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223765A (en) * 2018-03-01 2019-09-10 西门子医疗有限公司 The method of fault management is executed in an electronic
CN110223765B (en) * 2018-03-01 2023-09-05 西门子医疗有限公司 Method for performing fault management in electronic device
WO2021242237A1 (en) * 2020-05-28 2021-12-02 Siemens Canada Limited Artificial intelligence based device identification
US11770315B2 (en) 2020-05-28 2023-09-26 Siemens Canada Limited Artificial intelligence based device identification
CN113435307A (en) * 2021-06-23 2021-09-24 国网天津市电力公司 Operation and maintenance method, system and storage medium based on visual identification technology
CN114095339A (en) * 2021-10-29 2022-02-25 北京百度网讯科技有限公司 Alarm processing method, device, equipment and storage medium
CN114095339B (en) * 2021-10-29 2023-08-08 北京百度网讯科技有限公司 Alarm processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14757872

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14757872

Country of ref document: EP

Kind code of ref document: A1