WO2003090081A1 - A hierarchical system for analysing data streams

Info

Publication number: WO2003090081A1
Application number: PCT/AU2003/000460
Authority: WO
Inventors: George Bolt; John Manslow
Original assignee: Neural Technologies Ltd; Toms, Alvin, David
Priority date: 2002-04-16
Filing date: 2003-04-16
Publication date: 2003-10-30
Also published as: GB0208711D0; EP1499969A1; AU2003218899A1

Abstract

A method for analysing data streams comprises receiving a data stream (12), conducting a first analysis of the data stream (14) for a possible target activity, and if a possible target activity is indicated generating a first alert (28). If the first alert (28) is generated, a second analysis (16) for the possible target activity is conducted to determine whether the target activity is indicated in the data stream with a high degree of certainty. If a possible target activity is indicated by the second analysis, a second alert (34) is generated and provided to an external system for action.

Description

A HIERARCHICAL SYSTEM FOR ANALYSING DATA STREAMS

FIELD OF THE INVENTION

[0001] The present invention relates to a hierarchical system for analysing data streams. In particular, the present invention relates to analysing data streams to identify target events. A target event may be an instance of fraud on a telephone system; however, the present invention also has applications in other high data volume environments, where it may be used to identify other target events or activities.

BACKGROUND OF THE INVENTION

[0002] Fraud is a serious problem in modern telecommunication systems, and can result in revenue loss by the telecommunications service provider, reduced operational efficiency, and an increased risk of subscribers moving to other providers that are perceived to offer better security. In the highly competitive telecommunications sector, any provider that can reduce revenue loss resulting from fraud - either by its prevention or early detection - has a significant advantage over its competitors.

[0003] Telecommunications networks support many hundreds or thousands of transactions per second, and one of the challenges in developing effective fraud detection systems is to achieve the high throughput necessary to analyse all network traffic in detail and in real time. In practice, fraud detection systems frequently ignore services that are considered to be low risk (e.g. low cost calls), or limit the sophistication of the fraud detection algorithms in order to achieve the required throughput.

[0004] Each of these approaches has critical disadvantages. Ignoring services automatically precludes the detection of fraud on those services, which is particularly hazardous because fraudsters actively search for unprotected services. Similarly, the use of fast but inaccurate algorithms increases the range of frauds that cannot be detected without increasing the number of false alerts. Telecommunications service providers are therefore often forced to accept higher false alert rates in order to maintain sensitivity at high throughput, and hence incur additional costs resulting from the enlarged fraud investigation team that is required to process the extra alerts.

SUMMARY OF THE PRESENT INVENTION

[0005] The present invention provides a system of hierarchical data analysis that seeks to provide high throughput and sensitivity with fewer false positive alerts of possible target activity.

[0006] According to a first aspect of the present invention there is provided a method for analysing data streams comprising at least the steps of: receiving a data stream; conducting a first analysis of the data stream for a possible target activity, and if a possible target activity is indicated generating a first alert; if the first alert is generated, conducting a second analysis for the possible target activity to determine whether the target activity is indicated in the data stream with a high degree of certainty, if a possible target activity is indicated by the second analysis, generating a second alert; and providing the second alert to an external system for action.
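By way of illustration only, the method of the first aspect can be sketched in Python as follows. The function names, the record fields and the numeric limits are hypothetical and are not taken from the specification; the sketch merely shows that the second analysis runs only for records that produced a first alert, and that only second alerts reach the external system.

```python
from typing import Callable, Iterable, Optional


def analyse_stream(
    stream: Iterable[dict],
    first_analysis: Callable[[dict], Optional[str]],
    second_analysis: Callable[[dict, str], Optional[str]],
    external_system: Callable[[str], None],
) -> None:
    """Run the second analysis only for records that raised a first alert."""
    for record in stream:
        first_alert = first_analysis(record)
        if first_alert is None:
            continue  # nothing suspicious: the second analysis is never run
        second_alert = second_analysis(record, first_alert)
        if second_alert is not None:
            external_system(second_alert)  # only second alerts leave the system


# Hypothetical analyses: a cheap cost screen, then a stricter combined check.
def cheap_screen(record: dict) -> Optional[str]:
    return "high cost" if record["cost"] > 50 else None


def strict_check(record: dict, first_alert: str) -> Optional[str]:
    if record["cost"] > 50 and record["duration_s"] > 3600:
        return f"{record['a_number']}: {first_alert}, long duration"
    return None


received = []  # stands in for the external system
calls = [
    {"a_number": "A1", "cost": 2, "duration_s": 60},
    {"a_number": "A2", "cost": 80, "duration_s": 120},
    {"a_number": "A3", "cost": 90, "duration_s": 7200},
]
analyse_stream(calls, cheap_screen, strict_check, received.append)
print(received)
```

In this sketch the first record never reaches the second analysis, the second record raises a first alert that the second analysis does not confirm, and only the third record results in a second alert.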

[0007] Preferably the first analysis step comprises at least: conducting a first sub-analysis of the data stream for the possible target activity to determine whether the target activity is indicated in the data stream, if the possible target activity is indicated by the first sub-analysis then a first sub-alert is generated; and conducting a second sub-analysis of the data stream for the possible target activity to determine whether the target activity is indicated in the data stream with a higher degree of certainty than in the first sub-analysis, if the possible target activity is indicated by the second sub-analysis then the first alert is generated.

[0008] Preferably the second sub-analysis provides an indication of the target activity with a higher degree of certainty than in the first sub-analysis. Preferably the second analysis provides an indication of the target activity with a higher degree of certainty than in the second sub-analysis.

[0009] Preferably the method further comprises propagating data from the data stream relevant to the second sub-analysis for conducting the second sub-analysis.

[0010] Preferably the method further comprises the step of propagating data from the data stream relevant to the second analysis for conducting the second analysis.

[0011] Preferably the second sub-analysis is conducted on additional data to the propagated data. Preferably the second analysis is conducted using additional data to the data propagated for the second analysis.

[0012] Preferably one or more additional levels of sub-analysis are conducted between the first sub-analysis and the second sub-analysis wherein an alert is generated by one of the additional levels and passed to a next of the additional levels. Preferably a subsequent analysis is conducted while determining whether the target activity is indicated to a higher degree of certainty than the previous level. Preferably the first sub-alert triggers the first of one or more additional levels of sub-analysis and the alert generated by the final level of additional sub-analysis triggers the second sub-analysis.

[0013] Preferably data is propagated from one additional level of sub-analysis to the next and includes data necessary in the subsequent levels of additional sub-analysis.

[0014] Preferably each additional level of sub-analysis is conducted on additional data specific to the type of analysis conducted in addition to the propagated data.

[0015] Preferably each level of the sub-analysis creates a third alert if a fraudulent activity is indicated with a relatively high degree of certainty, any one of the second alerts and third alerts triggering an action in the external system.

[0016] Preferably the first analysis may conduct one or more types of analysis in parallel.

[0017] Preferably one or more of the additional levels of sub-analysis may conduct one or more types of analysis in parallel.

[0018] Preferably the target activity is fraudulent activity.

[0019] According to a second aspect of the present invention there is provided a system for analysing data streams comprising at least: a first analyser arranged to analyse a data stream for possible target activity and if a possible target activity is indicated to generate a first alert; a second analyser arranged to conduct an analysis for possible target activity if the first alert is generated, and if a possible target activity is indicated with a relatively high probability by the second analysis to generate a second alert for an external system to act on.

[0020] According to a third aspect of the present invention there is provided a system for analysing data streams comprising at least: one or more sequential analysers arranged to conduct an analysis for possible target activity, a first analyser of the sequence of analysers analysing a data stream, each subsequent analyser of the sequence of analysers only conducting its analysis if the previous analyser indicates a possible target activity, and if a possible target activity is indicated by each analysis generating a subsequent alert for the next analyser; and a final analyser arranged to conduct an analysis for possible target activity if the last analyser of the sequence of analysers generates an alert, and if a possible target activity is indicated with a relatively high probability by the analysis of the final analyser, the final analyser generates an alert for an external system to act on.

[0021] According to another aspect of the present invention there is provided a method of analysing data streams comprising at least: conducting one or more sequential analyses of a data stream for possible target activity, the first of the analyses being conducted directly on the data stream, subsequent analyses after the first only being conducted if the previous analysis indicated a possible target activity; conducting a final analysis for possible target activity if the last of the sequential analyses indicated a possible target activity; and if the final analysis indicates a possible target activity with a relatively high degree of certainty generating an alert to an external system for action.

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] In order to facilitate a better understanding of the nature of the invention, preferred embodiments will now be described in greater detail, by way of example only, with reference to the accompanying drawings in which:

Figure 1 is a schematic representation of a preferred embodiment of a system for analysing data streams in accordance with the present invention; and

Figure 2 is a schematic representation exemplifying data analysis using the system of Figure 1.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0023] Referring to Figure 1 there is shown a system 10 that receives a data stream 12 (that may include one or more sub-streams) and outputs a data stream of alerts 34 for use by an external system. The system 10 includes a plurality of data analysis modules, of which three are shown in this case: 14, 16 and 18. Each of the analysis modules 14, 16 and 18 receives respective additional data 20, 22 and 24 used in the analysis of the data stream 12 provided to the first data module 14. Each data module 14, 16 and 18 propagates data to the next data module, as indicated by propagated data 26 and 30. Each data module provides internal alerts 28 and 32 to the subsequent data module.

[0024] In the present example the system 10 is configured to identify suspicious telephone activity that may indicate fraud. Due to the high volume of telephone call data required to be processed, each data analysis module can provide a different analysis technique to progressively increase the certainty that the data indicates the presence of fraudulent telephone activity.

[0025] The system 10 may be implemented in the form of a computer or a network of computers programmed to perform the analysis of each of the modules. For example, a single computer can be programmed to run the whole system, or a dedicated computer may be programmed to conduct the analysis of each module, with communication being provided between the computers of the whole system 10.

[0026] Each of the data analysis modules 14, 16 and 18 cascades data initially provided by data stream 12 to the subsequent module. The data stream 12 could, for example, include call data records (CDRs), which contain details of the calls made on a telecommunication network. For example, a portion of a CDR produced from a real call is given in Table 1. The fields contained in the CDR are (from top to bottom) the A-number (the number of the phone from which the call was made), the B-number (the number to which the call was made), the B-number type (whether it was local, national, international etc., encoded as a number), the call's cost, its duration, and the date and time at which it started. Note that the four rightmost digits of the A- and B-numbers have been masked to conceal the identities of the parties to the call. The data stream 12 can also include several substreams from different sources. For example, one substream could be a CDR stream, while another could provide customer information such as postcodes and payment histories.
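As an illustrative sketch only, the CDR fields listed above could be represented by a small Python record type. The field names, the units (seconds for duration, an unspecified currency for cost), the mask character and the example values are all assumptions made for this sketch; they do not reproduce the contents of Table 1.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallDataRecord:
    """One CDR with the six fields named in the description."""
    a_number: str        # number of the phone from which the call was made
    b_number: str        # number to which the call was made
    b_number_type: int   # local, national, international etc., encoded as a number
    cost: float          # assumed unit: operator's billing currency
    duration_s: int      # assumed unit: seconds
    started_at: datetime


def mask_number(number: str, digits: int = 4, mask_char: str = "X") -> str:
    """Replace the rightmost `digits` characters with `mask_char`."""
    if digits <= 0:
        return number
    keep = max(len(number) - digits, 0)
    return number[:keep] + mask_char * (len(number) - keep)


# Hypothetical values, not taken from Table 1.
cdr = CallDataRecord(
    a_number=mask_number("0123456789"),
    b_number=mask_number("0987654321"),
    b_number_type=2,
    cost=1.25,
    duration_s=180,
    started_at=datetime(2002, 4, 16, 9, 30),
)
print(cdr.a_number, cdr.b_number)
```

Masking is applied before the record is built so that the unmasked numbers are never stored in the record in this sketch.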

TABLE 1

[0027] Each of the data analysis modules 14, 16 and 18 contains one or more fraud detection engines that analyse their input data for signs of fraudulent activity, in response to which they generate alerts. Each fraud detection engine can process a different subset of the module's input data. Each data analysis module after the first receives propagated data that is passed from the analysis module immediately preceding it in the hierarchy. The additional data available to each data analysis module may be specific to the type of analysis conducted by that particular data analysis module. The propagated data may contain low level data from the original data stream 12 or additional data used by data analysis modules lower in the hierarchy, depending on the configuration of the system 10.

[0028] The distinction between propagated data and additional data is important for the efficiency of the system because the analyses performed within particular analysis modules may require particular access to potentially large quantities of data that are not required elsewhere within the system. Propagating data that is not required in other analysis modules is a waste of resources and is likely to reduce the rate at which the system can process incoming data. Propagated data consists of information that is used in more than one data analysis module. For example, the A-number field is used to identify the calling party, is provided within the CDR stream that usually forms part of the system's input 12, and is usually required throughout the system. It is hence usually propagated through the system rather than forming part of the additional data inputs.
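The distinction can be sketched as follows, purely as an illustration. The payment history store, the function names and the cost limit are hypothetical. The lower module propagates only the fields that later modules also need, while the higher module reads its additional data on request.

```python
from typing import Callable, Dict, List

# Hypothetical large data source needed by one module only.
PAYMENT_HISTORY: Dict[str, List[str]] = {
    "A1": ["paid", "paid", "late"],
    "A2": ["late", "late", "unpaid"],
}
lookups: List[str] = []  # records every read of the additional data


def fetch_payment_history(a_number: str) -> List[str]:
    """Additional data: read on request by the module that needs it."""
    lookups.append(a_number)
    return PAYMENT_HISTORY.get(a_number, [])


def low_level_module(cdr: dict) -> dict:
    """Uses only the CDR and propagates the fields later modules also need."""
    return {"a_number": cdr["a_number"], "cost": cdr["cost"]}


def high_level_module(propagated: dict,
                      fetch: Callable[[str], List[str]]) -> bool:
    """Combines propagated data with additional data fetched on demand."""
    history = fetch(propagated["a_number"])
    return propagated["cost"] > 50 and "unpaid" in history


cdr = {"a_number": "A2", "b_number": "B9", "cost": 75.0, "duration_s": 600}
propagated = low_level_module(cdr)
suspicious = high_level_module(propagated, fetch_payment_history)
print(propagated, suspicious, lookups)
```

Only the A-number and cost travel between the modules here; the payment history is read once, by the single module that uses it.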

[0029] Each of the data analysis modules 14, 16 and 18 can generate internal and external alerts. External alerts 34 are combined from all of the modules 14, 16 and 18 to form the output 34 of the system. Combining the outputs may be the equivalent of providing a logical OR to each of the alerts, so that if any of the modules generates an external alert, the system as a whole generates the alert. External alerts are only produced by the modules when the calculated probability of a target activity (fraud) is sufficiently high to reasonably conclude that fraud has occurred. What is considered a high probability depends on the particular application, its expected throughput, and the desired degree of certainty. When individual calls are analysed for fraud within telecommunication networks, a probability as large as 0.99995 to 0.99999 may be required to keep the number of alerts to a manageable level (since large networks can experience as many as 100 million calls per day).
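A minimal sketch of this combination, assuming each module reports a calculated probability of the target activity, is given below. The threshold is set to the lower end of the range quoted above; the function names are hypothetical.

```python
from typing import Iterable

# Lower end of the 0.99995 to 0.99999 range quoted in the description.
EXTERNAL_THRESHOLD = 0.99995


def external_alert(probability: float,
                   threshold: float = EXTERNAL_THRESHOLD) -> bool:
    """A module raises an external alert only at very high probability."""
    return probability >= threshold


def system_output(module_probabilities: Iterable[float]) -> bool:
    """Logical OR over the external alerts of all modules."""
    return any(external_alert(p) for p in module_probabilities)


print(system_output([0.20, 0.90, 0.99996]))  # one module is certain enough
print(system_output([0.20, 0.9999]))         # no module is certain enough
```

Because the outputs are combined with a logical OR, a single module exceeding the threshold is enough for the system as a whole to raise the alert.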

[0030] Each of the data analysis modules 14, 16 and 18 can generate internal alerts if its analysis reveals something unusual, but does not provide sufficiently high probability that target activity is indicated to warrant an external alert. Internal alerts are important for regulating the activity of subsequent data analysis modules within the hierarchy of the system, because subsequent data analysis modules may only be activated if an internal alert is received, indicating that further analysis of the data is required to obtain the sufficient degree of certainty to generate an external alert. Subsequent data analysis modules 16 and 18 may only be activated if they receive an internal alert 28 or 32 from a preceding analysis module or if any of their input data is updated. Preferably the additional data is only provided in response to a request made by a lower module and the input additional data is not configured to activate an analysis module.

[0031] For example, an analysis module 14, 16 or 18 may identify a short term increase in the total cost of calls made by a particular subscriber, which may not be severe enough to conclude that fraud has occurred and hence to generate an external alert. A subsystem may therefore generate an internal alert that causes the next module in the system to perform its analysis. This cascaded activation of analysis modules within the system means that lower level subsystems are activated most frequently and that the throughput of the system can be maximised by designing the lower level subsystems to require a minimum amount of processing. Higher level analysis, which is activated less frequently, can thus use more expensive processes (such as nonlinear or iterative functions) and can perform expensive operations (such as database reads and writes) or make use of human intervention, with minimal effect on the throughput of the entire system. For example, in an analysis module that becomes active only once a lower level system has generated an alert, a neural network could be trained to estimate the probability that a particular telephone call was fraudulent based on its characteristics (cost, duration, etc.), or Fourier analysis could be used to see if a short term fluctuation in the calling activity was part of a cycle of a subscriber's normal behaviour.
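The effect of cascaded activation on processing cost can be illustrated with the following sketch. It does not use a neural network or Fourier analysis; as a stand-in, the higher level check computes a simple deviation score over a subscriber's cost history. All names, limits and data are hypothetical.

```python
import math
from typing import List, Tuple

stats = {"expensive_calls": 0}  # counts activations of the higher level check


def cheap_check(latest: float, previous: float) -> bool:
    """Lower level: a single comparison, run for every subscriber update."""
    return latest > 2 * previous


def expensive_check(history: List[float], latest: float) -> bool:
    """Higher level stand-in: deviation of the latest cost from the history."""
    stats["expensive_calls"] += 1
    mean = sum(history) / len(history)
    variance = sum((x - mean) ** 2 for x in history) / len(history)
    std = math.sqrt(variance)
    return std > 0 and (latest - mean) / std > 3.0


def cascade(subscribers: List[Tuple[str, List[float], float]]) -> List[str]:
    """Run the expensive check only where the cheap check raised an alert."""
    alerts = []
    for name, history, latest in subscribers:
        if not cheap_check(latest, history[-1]):
            continue  # no internal alert: the higher level is not activated
        if expensive_check(history, latest):
            alerts.append(name)
    return alerts


subscribers = [
    ("A1", [10.0, 11.0, 9.0, 10.0], 12.0),   # nothing unusual
    ("A2", [10.0, 30.0, 10.0, 12.0], 30.0),  # large, but seen before
    ("A3", [10.0, 11.0, 9.0, 10.0], 40.0),   # large and unprecedented
]
print(cascade(subscribers), stats["expensive_calls"])
```

The cheap comparison runs for all three subscribers, while the more expensive computation runs only for the two that raised an internal alert, and only one of those is confirmed.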

[0032] Dividing the system into a series of stages of different (and in particular, increasing) complexity also simplifies the problem of targeting different resources at different subsystems. For example, the lower level subsystems may need some level of parallelism in order to achieve the required throughput and thus can be distributed across several computers. Later stages may require so few resources that several can be run simultaneously on a single computer, while others may require user interaction or database access, placing specific requirements on their geographic location. By building a fraud detection system from a hierarchy of subsystems of increasing sophistication it is possible to produce a superior trade off between fraud detection accuracy and throughput.

[0033] Each of the data analysis modules should be designed to generate many more internal false positives (that is, internal alerts for events that are not actually fraudulent) than internal false negatives (where an internal alert was not generated when fraud did in fact occur). This is because the higher level subsystems that are activated by the internal alerts may be able to provide a higher degree of certainty to confirm or refute the internal alert based on different analysis techniques and/or the inclusion of additional data in the analysis to clarify whether, with the required degree of certainty, the data indicates that a fraud is actually present. If the system is not designed in this way, then when false negatives occur the higher level subsystems are never activated and thus are not able to correct an error made by the lower level subsystem.

[0034] Conversely, the analysis modules 14, 16 and 18 are designed to generate a small number of external false positives (external alerts generated for events that are not actually fraudulent) and a large number of external false negatives (resulting in no external alert being generated when in fact a fraud did occur). This is because, provided that an internal alert was generated, the external false negative can be corrected by higher level analysis modules generating their own external alerts. Where a false positive external alert is generated, however, the system as a whole generates an alert that cannot be prevented by analysis conducted at subsequent level modules, even if those modules were activated.
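The asymmetry between internal and external alerts can be sketched with two thresholds per module, as below. The internal threshold of 0.05 is a hypothetical value chosen for illustration; the external threshold is the lower end of the range given earlier in the description. The event names and probabilities are invented.

```python
from typing import List, NamedTuple, Tuple


class Decision(NamedTuple):
    internal: bool
    external: bool


def decide(probability: float,
           internal_threshold: float = 0.05,      # hypothetical, deliberately low
           external_threshold: float = 0.99995) -> Decision:
    """Low bar for internal alerts, very high bar for external alerts."""
    external = probability >= external_threshold
    internal = (not external) and probability >= internal_threshold
    return Decision(internal, external)


def run_two_levels(events: List[Tuple[str, float, float]]) -> List[str]:
    """events: (name, level 1 probability, level 2 probability)."""
    external = []
    for name, p1, p2 in events:
        first = decide(p1)
        if first.external:
            external.append(name)
        elif first.internal and decide(p2).external:
            external.append(name)  # level 2 corrects an external false negative
        # with no internal alert, level 2 never sees the event
    return external


events = [
    ("quiet fraud", 0.30, 0.99999),
    ("odd but legitimate", 0.30, 0.10),
    ("missed fraud", 0.01, 0.99999),
]
print(run_two_levels(events))
```

The first event is an external false negative at level 1 that level 2 corrects; the second is an internal false positive that level 2 filters out; the third is an internal false negative, which no later level can recover because none is activated.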

[0035] Figure 2 shows an example of a real telecommunications fraud detection system based on the system 10. The input data stream 12 includes a CDR stream that provides details of each call made on the telecommunications network shortly after the call is terminated. The CDR stream is passed to the lowest level data analysis module 14, which is configured as a candidate fraud detector (CFD). The CFD contains two separate fraud detection algorithms: a set of rules 36 that search directly for common fraud indicators (such as more than 8 hours of calls to the Caribbean in any 24 hour period), and a change detection algorithm 38 that searches for unusual changes in the pattern of behaviour associated with individual subscribers (which can indicate that a line has been taken over by fraudsters). These two components 36 and 38 of the lowest level data analysis module 14 operate independently. An internal alert 28 is generated when either of the components 36 and 38 indicates that a particular telephone call is a fraud candidate. The rules 36 and change detector 38 are designed to be fast and simple because the CDR stream 12 can present the data analysis module with as many as 100 million CDRs per day. The internal alerts 28 are passed to the next level data analysis module, which operates as an intelligent alarm analyser (IAA) and is only activated when an internal alert is generated by the CFD.
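A possible form of the two CFD components is sketched below for a single subscriber. The specification does not describe how the rules 36 or the change detector 38 are implemented, so everything here is an assumption: the rule uses a sliding window keyed on call start times, calls are assumed to arrive in time order, and the change detector is a simple moving average comparison with hypothetical parameters.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Optional, Tuple

WINDOW = timedelta(hours=24)
LIMIT = timedelta(hours=8)


class CaribbeanRule:
    """Rule: more than 8 hours of calls to the Caribbean in any 24 hour period.

    Assumes only Caribbean calls are passed in, in order of start time.
    """

    def __init__(self) -> None:
        self._calls: Deque[Tuple[datetime, timedelta]] = deque()
        self._total = timedelta()

    def update(self, start: datetime, duration: timedelta) -> bool:
        self._calls.append((start, duration))
        self._total += duration
        # Drop calls that started 24 hours or more before this one.
        while self._calls[0][0] <= start - WINDOW:
            _, old = self._calls.popleft()
            self._total -= old
        return self._total > LIMIT


class ChangeDetector:
    """Flags a call whose cost far exceeds the subscriber's running mean."""

    def __init__(self, alpha: float = 0.1, factor: float = 5.0) -> None:
        self.alpha, self.factor = alpha, factor
        self.mean: Optional[float] = None

    def update(self, cost: float) -> bool:
        if self.mean is None:
            self.mean = cost
            return False
        changed = cost > self.factor * self.mean
        self.mean = (1 - self.alpha) * self.mean + self.alpha * cost
        return changed


def internal_alert(rule_hit: bool, change_hit: bool) -> bool:
    """Either component alone is enough to flag a fraud candidate."""
    return rule_hit or change_hit


t0 = datetime(2003, 4, 16)
rule = CaribbeanRule()
three_hours = timedelta(hours=3)
rule_hits = [rule.update(t0 + timedelta(hours=h), three_hours) for h in (0, 4, 10)]
detector = ChangeDetector()
change_hits = [detector.update(c) for c in (1.0, 1.2, 0.9, 9.0)]
print(rule_hits, change_hits)
```

Both components do a constant amount of work per call on average, in keeping with the stated requirement that the lowest level be fast and simple.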

[0036] With a typical fraud detection configuration, the ratio of the number of CDRs to the number of internal alerts 28 is about 1000:1 meaning that statistically the IAA is activated only once for every 1000 times the CFD is activated. The IAA is a rule based system that removes some of the false alerts generated by the CFD by performing complex analysis on the distributions of the alerts themselves. These complex analyses are possible due to the low level of activity demanded of the IAA compared to the CFD. The analyses also require time information (real world, date and time) which is provided to the IAA as additional data 22. When the IAA considers the distribution of alerts to be sufficiently suspicious, it generates an internal alert 32 which is passed to the next level data analysis module 18. The ratio of the numbers of alerts generated by the CFD compared to those generated by the IAA is usually around 500:1, meaning that statistically the third level of data analysis is activated once every 500 times the IAA is activated.

[0037] The third level data analysis module operates as a case manager. The case manager may be a team employed by the telecommunications operator for the purpose of investigating the events that caused internal alerts to be generated by the IAA. Because the case manager is a higher level subsystem, it is activated only once every 500,000 or so CDRs and hence can use much slower and more expensive processing methods than either the CFD or IAA, such as manual investigations of potential frauds, without being overwhelmed.
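The figure of 500,000 follows from multiplying the two typical ratios given above. The short calculation below also applies them to the peak load of 100 million CDRs per day mentioned earlier; the resulting daily counts are illustrative arithmetic, not figures stated in the specification.

```python
CDRS_PER_DAY = 100_000_000   # peak load quoted for the CDR stream
CDRS_PER_CFD_ALERT = 1000    # typical ratio of CDRs to CFD internal alerts
CFD_ALERTS_PER_IAA_ALERT = 500  # typical ratio of CFD alerts to IAA alerts

cdrs_per_case = CDRS_PER_CFD_ALERT * CFD_ALERTS_PER_IAA_ALERT
iaa_activations_per_day = CDRS_PER_DAY // CDRS_PER_CFD_ALERT
case_manager_activations_per_day = (
    iaa_activations_per_day // CFD_ALERTS_PER_IAA_ALERT
)

print(cdrs_per_case)                      # 500000
print(iaa_activations_per_day)            # 100000
print(case_manager_activations_per_day)   # 200
```

Under these assumptions the case manager would handle on the order of a few hundred activations per day, against 100 million CDRs seen by the CFD.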

[0038] The case manager uses customer information (names, addresses, payment histories, etc.) as further additional data 24, and frequently a wide variety of additional data sources (such as a six month history of calls made by a particular customer), to investigate internal alerts 32 generated by the IAA to determine whether they are likely to be cases of actual fraud. If it is determined that they are, the case manager subsystem generates an external alert 34 which is passed out of the system. The alert could be used for a variety of purposes, such as to inform billing services within the network operator to remove fraudulent calls from a customer's bill, or to inform law enforcement agencies.

[0039] In this example, neither the CFD nor the IAA generates external alerts because of the technical difficulties in guaranteeing the extremely low false alert rates that are required for the purposes for which the external alerts are intended. However it will be appreciated that in other configurations, these modules may be suited to generating external alerts. It is also noted that in this example, no additional data 20 is provided to the CFD. Furthermore, it is also noted that no data is propagated from the CFD to the IAA or from the IAA to the case manager. It is further noted that in an alternative configuration, additional data may be provided to the CFD or data may be propagated from the CFD to the IAA and possibly then from the IAA to the case manager.

[0040] It will be appreciated by the person skilled in the art that the hierarchical system and method of the present invention may be applied to data streams that originate from a variety of sources to identify target events. The above example of fraud detection on a telecommunications network is not intended to be limiting.

[0041] It will be appreciated that modifications may be made to the preferred forms of the present invention without departing from the basic inventive concept. Such modifications are intended to fall within the scope of the present invention, the nature of which is to be determined from the foregoing description and appended claims.

Claims

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:

1. A method for analysing data streams comprising at least the steps of: receiving a data stream; conducting a first analysis of the data stream for a possible target activity, and if a possible target activity is indicated generating a first alert; if the first alert is generated, conducting a second analysis for the possible target activity to determine whether the target activity is indicated in the data stream with a high degree of certainty, if a possible target activity is indicated by the second analysis, generating a second alert; and providing the second alert to an external system for action.

2. A method according to claim 1, wherein the first analysis step comprises at least: conducting a first sub-analysis of the data stream for the possible target activity to determine whether the target activity is indicated in the data stream, if the possible target activity is indicated by the first sub-analysis then a first sub-alert is generated; and conducting a second sub-analysis of the data stream for the possible target activity to determine whether the target activity is indicated in the data stream with a higher degree of certainty than in the first sub-analysis, if the possible target activity is indicated by the second sub-analysis then the first alert is generated.

3. A method according to claim 2, wherein the second sub-analysis provides an indication of the target activity with a higher degree of certainty than in the first sub-analysis.

4. A method according to claim 3, wherein the second analysis provides an indication of the target activity with a higher degree of certainty than in the second sub-analysis.

5. A method according to claim 2, wherein the method further comprises propagating data from the data stream relevant to the second sub-analysis for conducting the second sub-analysis.

6. A method according to claim 2, wherein the method further comprises the step of propagating data from the data stream relevant to the second analysis for conducting the second analysis.

7. A method according to claim 2, wherein the second sub-analysis is conducted on additional data to the propagated data.

8. A method according to claim 7, wherein the second analysis is conducted using additional data to the data propagated for the second analysis.

9. A method according to claim 2, wherein one or more additional levels of sub-analysis are conducted between the first sub-analysis and the second sub-analysis wherein an alert is generated by one of the additional levels and passed to a next of the additional levels.

10. A method according to claim 9, wherein a subsequent analysis is conducted while determining whether the target activity is indicated to a higher degree of certainty than the previous level.

11. A method according to claim 10, wherein the first sub-alert triggers the first of one or more additional levels of sub-analysis and the alert generated by the final level of additional sub-analysis triggers the second sub-analysis.

12. A method according to claim 11, wherein data is propagated from one additional level of sub-analysis to the next and includes data necessary in the subsequent levels of additional sub-analysis.

13. A method according to claim 12, wherein each additional level of sub-analysis is conducted on additional data specific to the type of analysis conducted in addition to the propagated data.

14. A method according to claim 13, wherein each level of the sub-analysis creates a third alert if a fraudulent activity is indicated with a relatively high degree of certainty, any one of the second alerts and third alerts triggering an action in the external system.

15. A method according to claim 1, wherein the first analysis may conduct one or more types of analysis in parallel.

16. A method according to claim 2, wherein one or more of the additional levels of sub-analysis may conduct one or more types of analysis in parallel.

17. A system for analysing data streams comprising at least: a first analyser arranged to analyse a data stream for possible target activity and if a possible target activity is indicated to generate a first alert; a second analyser arranged to conduct an analysis for possible target activity if the first alert is generated, and if a possible target activity is indicated with a relatively high probability by the second analysis to generate a second alert for an external system to act on.

18. A system for analysing data streams comprising at least: one or more sequential analysers arranged to conduct an analysis for possible target activity, a first analyser of the sequence of analysers analysing a data stream, each subsequent analyser of the sequence of analysers only conducting its analysis if the previous analyser indicates a possible target activity, and if a possible target activity is indicated by each analysis generating a subsequent alert for the next analyser; and a final analyser arranged to conduct an analysis for possible target activity if the last analyser of the sequence of analysers generates an alert, and if a possible target activity is indicated with a relatively high probability by the analysis of the final analyser, the final analyser generates an alert for an external system to act on.

19. A system for analysing data streams comprising at least: conducting one or more sequential analyses of a data stream for possible target activity, the first of the analyses being conducted directly on the data stream, subsequent analyses after the first only being conducted if the previous analysis indicated a possible target activity; conducting a final analysis for possible target activity if the last of the sequential analyses indicated a possible target activity; and if the final analysis indicates a possible target activity with a relatively high degree of certainty generating an alert to an external system for action.