US20040215656A1

US20040215656A1 - Automated data mining runs

Info

Publication number: US20040215656A1
Application number: US10/423,011
Authority: US
Inventors: Marcus Dill; Harish Mahabal; Lakshmi Shankar; Jens Weidner; Bernd Ecker
Original assignee: Individual
Current assignee: SAP SE
Priority date: 2003-04-25
Filing date: 2003-04-25
Publication date: 2004-10-28
Also published as: US7571191B2; WO2004097667A2; US20050027683A1; EP1623343A2; WO2004097667A3; US20040267751A1

Abstract

A data mining run that includes special analyses is triggered directly after having loaded new data in a data warehouse environment, to enrich the newly loaded data by new attributes. The process automates replicating transaction data from a source system into a data warehouse, triggering a data mining procedure (such as a training or a prediction procedure) that enriches the data with new attributes, and triggering the upload of the enriched data back into the data warehouse.

Description

TECHNICAL FIELD

This description relates to loading and using data in a data warehouse on a computer system.

BACKGROUND

Computer systems often are used to manage and process business data. To do so, a business enterprise may use various application programs running on one or more computer systems. Application programs may be used to process business transactions, such as taking and fulfilling customer orders, providing supply chain and inventory management, performing human resource management functions, and performing financial management functions. Data used in business transactions may be referred to as transaction data or operational data. Often, transaction processing systems provide real-time access to data, and such systems may be referred to as on-line transaction processing (OLTP) systems.

Application programs also may be used for analyzing data, including analyzing data obtained through transaction processing systems. In many cases, the data needed for analysis may have been produced by various transaction processing systems and may be located in many different data management systems. A large volume of data may be available to a business enterprise for analysis.

When data used for analysis is produced in a different computer system than the computer system used for analysis or when a large volume of data is used for analysis, the use of an analysis data repository separate from the transaction computer system may be helpful. An analysis data repository may store data obtained from transaction processing systems and used for analytical processing. The analysis data repository may be referred to as a data warehouse or a data mart. The term data mart typically is used when an analysis data repository stores data for a portion of a business enterprise or stores a subset of data stored in another, larger analysis data repository, which typically is referred to as a data warehouse. For example, a business enterprise may use a sales data mart for sales data and a financial data mart for financial data.

Analytical processing may be used to analyze data stored in a data warehouse or other type of analytical data repository. When an analytical processing tool accesses the data warehouse on a real-time basis, the analytical processing tool may be referred to as an OLAP system. An OLAP system may support complex analyses using a large volume of data. An OLAP system may produce an information model using a three-dimensional presentation, which may be referred to as an information cube or a data cube.

One type of analytical processing identifies relationships in data stored in a data warehouse or another type of data repository. The process of identifying data relationships by means of an automated computer process may be referred to as data mining. Sometimes a data mining mart may be used to store a subset of data extracted from a data warehouse. A data mining process may be performed on data in the data mining mart, rather than the data mining process being performed on data in the data warehouse. The results of the data mining process then are stored in the data warehouse. The use of a data mining mart that is separate from a data warehouse may help decrease the impact on the data warehouse of a data mining process that requires significant system resources, such as processing capacity or input/output capacity. Also, data mining marts may be optimized for access by data mining analyses that provide faster and more flexible access.

One type of data relationship that may be identified by a data mining process is an associative relationship in which one data value is associated or otherwise occurs in conjunction with another data value or event. For example, an association between two or more products that are purchased by a customer at the same time may be identified by analyzing sales receipts or sales orders. This may be referred to as a sales basket analysis or a cross-selling analysis. The association of products purchases may be based on a pairing of two products, such as when a customer purchases product A, the customer also purchases product B. The analysis may also reveal relationships between three products, such as when a customer purchases product A and product B, the customer also typically purchases product C. The results of a cross-selling analysis may be used to promote associated products, such as through a marketing campaign that promotes the associated products or by locating the associated products near one another in a retail store, such as by locating the products in the same aisle or shelf.

Customers that are at risk of not renewing a sales contract or not purchasing products in the future also may be identified by data mining. Such an analysis may be referred to as a churn analysis in which the likelihood of churn refers to the likelihood that a customer will not purchase products or services in the future. A customer at risk of churning may be identified based on having similar characteristics to customers that have already churned. The ability to identify a customer at risk of churning may be advantageous, particularly when steps may be taken to reduce the number of customers who do churn. A churn analysis may also be referred to as a customer loyalty analysis.

For example, in the telecommunications industry a customer may be able to switch from one telecommunication provider to another telecommunications provider relatively easily. A telecommunications provider may be able to identify, using data mining techniques, particular customers that are likely to switch to a different telecommunications provider. The telecommunications provider may be able to provide an incentive to at-risk customers to decrease the number of customers who switch.

In general, using data for special data analysis, such as the application of data mining techniques, involves a fixed sequence of processes, in which each process occurs only after the completion of a predecessor process. For example, in a data warehouse that uses a separate data mining mart for the performance of a data mining process, three processes may need to be performed in order. First, data must be loaded to a data warehouse from a transaction data management system. Second, data from the data warehouse must be copied to a data mining mart and the data mining process must be performed. Third, the enriched or new data that results from the data mining process must be loaded to the data warehouse. Each of those processes may be triggered separately, often by different users. As a result, the data mining process is performed separately from the loading of the new data to the data warehouse. In some cases, performing the data mining process may occur days, or even weeks, after the data has been loaded from the transaction processing system and is available for analysis.

A delay in performing data mining analysis may be problematic when the results of the analysis are most useful at a particular time. For example, the value of a churn prediction for a particular customer or group of customers may be time-sensitive. After a customer purchases a service or product elsewhere, the opportunity of the business enterprise to influence the behavior of the customer is lost. When the identification of a high likelihood of churning occurs after the customer has been lost, the data mining result is wasted.

Some aspects of creating or using a data warehouse may be automated, that is initiated without user manipulation. For example, an automated software agent may be employed to collect data from various distributed databases to collect data for a data warehouse. Using an OLAP system, a report or other type of output may be automatically generated and sent to various receiving devices, such as a personal digital assistant, a printer, or a pager. When transaction data is input to a transaction processing system, the online transaction data may be automatically summarized and stored as summary data.

SUMMARY

Generally, the invention automates the triggering of special analyses directly after having loaded new data in a data warehouse environment to enrich the newly loaded data with new attributes. The invention automates, without requiring user manipulation, copying data from a data warehouse to a data mining mart, the triggering of a data mining procedure (such as a training or a prediction procedure) that enriches the data with new attributes, and the triggering of the upload of the enriched data to the data warehouse. The invention also may automate, without requiring user manipulation, the loading of transaction data from a source system into a data warehouse before triggering the data mining process. One area where the invention may find specific applicability is in performing a data mining procedure on a regular, predetermined basis. For example, sales receipts for a particular month may be automatically loaded into a data warehouse and analyzed for associative sales relationships. Another example is performing periodic analysis of customer activity to identify customers that are at risk of churning for the purpose of influencing customer behavior.

In one general aspect, a data mining process may be automatically triggered. An analytical process is triggered based on the presence of data in a data source that is used for analytical processing. The analytical process is performed on data from the data source after the analytical process has been triggered. The analytical process uses a procedure that also is usable in a data extraction process. The created data attribute is stored in the data source.

Implementations may include one or more of the features noted above and one or more of the following features. For example, the analytical process may be triggered based on the completion of a computer program for loading data to the data source that is used for analytical processing.

Also, data may be extracted from a data source used for transaction processing and loaded to the data source that is used for analytical processing. A person may initiate at most the step of extracting data from the data source used for transaction processing or the step of loading the extracted data. The occurrence of a predetermined date and time may trigger extracting the data or may trigger loading the data.

In addition to extracting data from a transaction data source, data also may be extracted from the data source that is used for analytical processing and loaded to temporary data storage. The analytical process may be performed on the data stored in the temporary data storage.

The types of analytical processes that may be triggered include an analytical process to determine a relationship between two data values in the data source, or determine a relationship between two data values that predict a likelihood of whether a particular customer will fail to purchase a service or product in the future. The analytical process also may apply a relationship that has been previously-determined to data values in the data source. The analytical process also may identify products or services that are purchased in the same transaction. For example, the analytical process may determine the likelihood of whether a particular customer will fail to purchase a service or a product in the future. The likelihood may be based on characteristics associated with customers who have been identified as failing to purchase a service or a product.

Implementations of the techniques discussed above may include a method or process, a system or apparatus, or computer software on a computer-accessible medium. The details of one or more implementations of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram of a system incorporating various aspects of the invention. [0020]
FIG. 2 is a block diagram illustrating the enrichment of data stored in the data warehouse based on an automated data mining run. [0021]
FIGS. 3 and 4 are flow charts of processes to automate a data mining process. [0022]
FIG. 5 is a block diagram of the components of a software architecture for automating a data mining run. [0023]
FIG. 6 is a block diagram of a process to use a data mining workbench to design an automated data mining process.[0024]

DETAILED DESCRIPTION

FIG. 1 shows a block diagram of a [0025] system 100 of networked computers, including a computer system 110 for a data warehouse and transaction computer systems 120 and 130. The loading of new data to the data warehouse 110 from the transaction computer systems 120 and 130 triggers a special analysis to enrich the newly loaded data with new attributes.
The [0026] system 100 includes a computer system 110 for a data warehouse, a client computer 115 used to administer the data warehouse, and transaction computer systems 120 and 130, all of which are capable of executing instructions on data. As is conventional, each computer system 110, 120 or 130 includes a server 140, 142 or 144 and a data storage device 145, 146 or 148 associated with each server. Each of the data storage devices 145, 146 and 148 includes data 150, 152 or 154 and executable instructions 155, 156 or 158. A particular portion of data, here referred to as business objects 162 or 164, is stored in computer systems 120 and 130, respectively. Each of business objects 162 or 164 includes multiple business objects. Each business object in business objects 162 or 164 is a collection of data attribute values, and typically is associated with a principal entity represented in a computing device or a computing system. Examples of a business object include information about a customer, an employee, a product, a business partner, a product, a sales invoice, and a sales order. A business object may be stored as a row in a relational database table, an object instance in an object-oriented database, data in an extensible mark-up language (XML) file, or a record in a data file. Attributes are associated with a business object. In one example, a customer business object may be associated with a series of attributes including a customer number uniquely identifying the customer, a first name, a last name, an electronic mail address, a mailing address, a daytime telephone number, an evening telephone number, date of first purchase by the customer, date of the most recent purchase by the customer, birth date or age of customer, and the income level of customer. In another example, a sales order business object may include a customer number of the purchaser, the date on which the sales order was placed, and a list of products, services, or both products and services purchased.
The data [0027] warehouse computer system 110 stores a particular portion of data, here referred to as data warehouse 165. The data warehouse 165 is a central repository of data, extracted from transaction computer system 120 or 130 such as business objects 162 or 164. The data in the data warehouse 165 is used for special analyses, such as data mining analyses used to identify relationships among data. The results of the data mining analysis also are stored in the data warehouse 165.
The data [0028] warehouse computer system 110 includes an automated data mining process 168 having a data warehouse upload process 170 and a data mining analysis process 172. The data warehouse upload process 170 includes executable instructions for automatically extracting, transmitting and loading data from the transaction computer systems 120 and 130 to the data warehouse computer system 110. The data mining analysis process 172 includes executable instructions for triggering a data mining analysis in the data warehouse computer system 110, and enriching the data in the data warehouse 165 with new attributes determined by the data mining analysis, as described more fully below.
In some implementations, the data [0029] warehouse computer system 110 also may include a data mining mart 174 that temporarily stores data from the data warehouse 165 for use in data mining. In such a case, the data mining analysis process 172 also may extract data from the data warehouse 165, store the extracted data to the data mining mart 174, trigger a data mining analysis that operates on the data from the data mining mart 174, and enrich the data in the data warehouse 165 with the new attributes determined by the data mining analysis.
The data [0030] warehouse computer system 110 is capable of delivering and exchanging data with the transaction computer systems 120 and 130 through a wired or wireless communication pathway 176 and 178, respectively. The data warehouse computer system 110 also is able to communicate with the on-line client 115 that is connected to the computer system 110 through a communication pathway 176.
The data [0031] warehouse computer system 110, the transaction computer systems 120 and 130, and the on-line client 115 may be arranged to operate within or in concert with one or more other systems, such as, for example, one or more LANs (“Local Area Networks”) and/or one or more WANs (“Wide Area Networks”). The on-line client 115 may be a general-purpose computer that is capable of operating as a client of the application program (e.g., a desktop personal computer, a workstation, or a laptop computer running an application program), or a more special-purpose computer (e.g., a device specifically programmed to operate as a client of a particular application program). The on-line client 115 uses communication pathway 182 to communicate with the data warehouse computer system 110. For brevity, FIG. 1 illustrates only a single on-line client 115 for system 100.
At predetermined times, the data [0032] warehouse computer system 110 initiates an automated data mining process. This may be accomplished, for example, through the use of a task scheduler (not shown) that initiates the automated data mining process at a particular day and time. In general, the automated data mining process 1) uses the data warehouse upload process 170 to initiate the extraction, transformation and loading of data to the data warehouse 165 from the source systems 120 and 130, and 2) uses the data mining analysis process 172 to initiate a data mining run that creates new attributes by performing a special analysis of the data and loads the new attributes to the data warehouse 165 without user manipulation. A particular automated data mining run may be scheduled as a recurring event based on the occurrence of a predetermined time or date (such as the first day of a month, every Saturday at one o'clock a.m., or the first day of a quarter). Examples of automated data mining processes are described more fully in FIGS. 3-5.
More specifically, the data [0033] warehouse computer system 110 uses the automated data mining process 168 to initiate the data warehouse upload process 170. The data warehouse upload process 170 extracts or copies a portion of data, such as all or some of business objects 162, from the data storage 146 of the transaction computer system 120. The extracted data is transmitted over the connection 176 to the data warehouse computer system 110, where the extracted data are stored in data warehouse 165. The data warehouse computer system 110 also may transform the extracted data from a format suitable to computer system 110 into a different format that is suitable for the data warehouse computer system 110. Similarly, the data warehouse computer system 110 may extract a portion of data from data storage 154 of the computer system 130, such as all or some of business objects 164, transmit the extracted data over connection 178, store the extracted data in the data warehouse 165, and optionally transform the extracted data.
After the data have been extracted from the source computer systems (here, [0034] transaction computer systems 120 and 130), the automated data mining process 168 initiates the data mining analysis 172. The data mining analysis 172 performs a particular data mining procedure to analyze data from the data warehouse 165, enrich the data with new attributes, and store the enriched data in the data warehouse 165. A particular data mining procedure also may be referred to as a data mining run. There are different types of data mining runs. A data mining run may be a training run in which data relationships are determined, a prediction run that applies a determined relationship to a collection of data relevant to a future event, such as a customer failing to renew a service contract or make another purchase, or both a training run and a prediction run. The prediction run results in the creation of a new attribute for each business object in the data warehouse 165. The creation of a new attribute may be referred to as data enrichment. For example, when the data mining run predicts the likelihood that each customer will churn, an attribute for the likelihood of churn for each customer is stored in the data warehouse 165. That is, the data warehouse 165 is enriched with the new attribute.
The combination of the data warehouse upload [0035] process 170 and the data mining analysis 172 in the automated data mining process 168 may increase the coupling of the data mining with the upload of new data to the data warehouse, which, in turn, may reduce the time until the results of new data mining analyses are available. The combination of the data warehouse upload process 170 and the data mining analysis 172 in the automated data mining process 168 also enable the use of the same monitoring process to monitor both the data warehouse load process 170 and the data mining analysis process 172, which, in turn, may help simplify the monitoring of the automated data mining process 168.
The data [0036] warehouse computer system 110 also includes a data warehouse monitor 180 that reports on the administration of the automated data mining process 168. For example, an end user of online client 115 is able to view when an automated data mining process is scheduled to next occur, the frequency or other basis on which the automated data mining process is scheduled, and the status of the automated data mining process. For example, the end user may be able to determine that the automated data mining process 168 is executing. When the automated data mining process 168 is executing, the end user may be able to view the progress and status of each of the steps within the data mining method. For example, the end user may be able to view the time that the data warehouse upload process 168 was initiated. The ability to monitor the execution of the automated data mining process may be useful to ensure that the automated data mining process 168 is operating as desired. In some implementations, when a problem is detected in the automation of a data mining process, a notification of the problem may be sent to an administrator for the data warehouse or other type of end user. The use of the data warehouse monitor 180 with both the data upload process 170 and the data mining analysis 172 may be advantageous. For example, a system administrator or another type of user need only access a single monitoring process (here, data warehouse monitor 180) to monitor both sub-processes (here, the data upload process 170 and the data mining analysis 172). The use of the same monitoring process for different sub-processes may result in consistent process behavior across the different sub-processes. The use of the same monitoring process also may reduce the amount of training required for system administrators to be able to use the data warehouse monitor 180.
The ability to trigger special analyses directly after having loaded new data in a data warehouse environment to enrich the newly loaded data by new attributes may be useful. Multiple users, often geographically or organizationally distributed, are typically responsible for performing different aspects of the process, all aspects of which must be completed before the newly loaded data is enriched by the special analyses. This may result in a delay from the time when the transaction data is available for analysis to the time when the results of the analysis are available. The delay may be significant or may negatively impact the business enterprise. For example, a business enterprise may be harmed by lost sales by the delay of product arrangements in a retail store based on a cross-selling analysis or by the delay of a promotional marketing campaign to target at-risk customers. [0037]
FIG. 2 shows the [0038] results 200 of enriching the data stored in the data warehouse based on an automated data mining process. The results 200 are stored in a relational database system that logically organizes data into a database table. The database table arranges data associated with an entity (here, a customer) in a series of columns 210-216 and rows 220-223. Each column 210, 211, 212, 213, 214, 215, or 216 describes an attribute of the customer for which data is being stored. Each row 220, 221, 222 or 223 represents a collection of attribute values for a particular customer number by a customer identifier 210. The attributes 210-215 were extracted from a source system, such as a customer relationship management system, and loaded into the data warehouse. The attribute 216 represents the likelihood of churn for each customer 220, 221, 222 and 223. The likelihood-of-churn attribute 216 was created and loaded into the data warehouse by an automated data mining process, such as the automated data mining process described in FIGS. 1, 3 and 4.
FIG. 3 illustrates an automated [0039] data mining process 300. The automated data mining process 300 may be performed by a processor on a computing system, such as data warehouse computer system 110 of FIG. 1. The automated data mining processor is directed by a method, script, or other type of computer program that includes executable instructions for performing the automated data mining process 300. An example of such a collection of executable instructions is the automated data mining process 168 of FIG. 1.
The automated [0040] data mining process 300 includes an extract, transform and load (ETL) sub-process 310, a data mining sub-process 320, and a data enrichment sub-process 330. The automated data mining process 300 begins at a predetermined time and date, typically a recurring predetermined time and date. In some implementations, a system administrator or another type of user may manually initiate the automated data mining process 300. In such a case, the automated data mining process 300, once initiated, automatically triggers sub-processes 310, 320 and 330 without requiring further user manipulation.
For example, a churn management automated data mining process may be associated with a script that includes a remote procedure call to extract data from one or more source systems in [0041] step 340, a computer program to transform the extracted data, a database script for loading the data warehouse with the transformed data, and a computer program to perform a churn analysis on the customer data in the data warehouse. Thus, once the script for the churn management automated data mining process has been initiated, by a task scheduler or other type of computer program, the tasks are then automatically triggered based on the completion of the previous script component.
The data warehouse processor extracts from a source system appropriate data and transmits the extracted data to the data warehouse (step [0042] 340). For example, the data warehouse processor may execute a remote procedure call on the source system to trigger the extraction and transmission of data from the source system to the computer system on which the data warehouse resides. Alternatively, the data warehouse processor may connect to a web service on the source system to request the extraction and transmission of the data. Typically, the data to be extracted is data from a transaction system, such as an OLTP system. The data extracted may be a complete set of the appropriate data (such as all sales orders or all customers) from the source system, or may be only the data that has been changed since the last extraction. The processor may extract and transmit the data from the source system in a series of data groups, such as data blocks. The extraction may be performed either as a background process or an on-line process, as may the transmission. The ability to extract and transmit data in groups, extract and transmit only changed data, and extract and transmit as a background process may collectively or individually be useful, particularly when a large volume of data is to be extracted and transmitted.
In some implementations, the extracted data also may be transformed from the format used by the source system to a different format used by the data warehouse (step [0043] 345). The data transformation may include transforming data values representing a particular attribute to a different field type and length that is used by the data warehouse. The data transformation also may include translating a data code used by the source system to a corresponding but different data code used by the data warehouse. For example, the source system may store a country value using a numeric code (for example, a “1” for the United States and a “2” for the United Kingdom) whereas the data warehouse may store a country value as a textual abbreviation (for example, “U.S.” for the United States and “U.K.” for the United Kingdom). The data transformation also may include translating a proprietary key numbering system in which primary keys are created by sequentially allocating numbers within an allocated number range to a corresponding GUID (“globally unique identifier”) key that is produced from a well-known algorithm and is able to be processed by any computer system using the well-known algorithm. The processor may use a translation table or other software engineering or programming techniques to perform the transformations required. For example, the processor may use a translation table that translates the various possible values from one system to another system for a particular data attribute (for example, translating a country code of “1” to “U.S.” and “2” to “U.K.” or translating a particular proprietary key to a corresponding GUID key).
Other types of data transformation also may be performed by the data warehouse processor. For example, the processor may aggregate data or generate additional data values based on the extracted data. For example, the processor may determine a geographic region for a customer based on the customer's mailing address or may determine the total amount of sales to a particular customer that is associated with multiple sales orders. [0044]
The data warehouse processor loads the extracted data into data storage associated with the data warehouse, such as the [0045] data warehouse 165 of FIG. 1 (step 350). The data warehouse processor may execute a computer program having executable instructions for loading the extracted data into the data storage and identified by the automated data mining method directing the process 300. For example, a database script may be executed that includes database commands to load the data to the data warehouse. The use of a separate computer program for loading the data may increase the modularity of the data mining method, which, in turn, may improve the efficiency of modifying the automated data mining process 300. Steps 340-350 may be referred to as the ETL sub-process 310.
After completing the [0046] ETL sub-process 310, the data warehouse processor automatically triggers a data mining process (step 360). This may be accomplished, for example, by using a script or other type of computer program to control the execution of multiple programs.
The data warehouse processor performs a data mining run (step [0047] 365). To do so, the data warehouse processor may apply a data mining model or another type of collection of data mining rules that defines the type of analysis to be performed. The data mining model may be applied to all or a portion of the data in the data warehouse. In some implementations, the data warehouse processor may store the data to be used in the data mining run in transient or persistent storage peripheral to the data warehouse processor where the data is accessed during the data mining run. This may be particularly advantageous when the data warehouse includes a very large volume of data and/or the data warehouse also is used for OLAP processing. In some cases, the storage of the data to transient or persistent storage may be referred to as extracting or staging the data to a data mart for data mining purposes.
The data mining run may be a training run or a prediction run. In some implementations, both a training run and a prediction run may be performed during [0048] process 300. The results of the data mining run are stored in temporary storage. To do so, the data warehouse process may copy the results stored in the temporary data structure to the data warehouse. For example, in a customer churn analysis data mining process, the likelihood of churn for each customer may be assessed and stored in a temporary results data structure. Steps 360-365 may be referred to as a data mining sub-process 320.
When the [0049] data mining sub-process 320 is completed, the data warehouse processor stores the data mining results in the data warehouse (step 370). For example, a new column for the data mining results may be added to a table in a relational data management system being used for the date warehouse. In a customer churn analysis data mining process, the likelihood of churn for each customer may be added as a new attribute in the data warehouse and appropriately populated with the likelihood data generated when the data mining run was performed in step 365. The process of storing the data created by the data mining run in the data warehouse may be referred to as a data enrichment sub-process 330.
In one example, the [0050] process 300 may be used for an automated customer-churn data mining process. A system administrator develops computer programs, each of which are executed to accomplish a portion of the automated customer-churn data mining process. The system administrator also develops a script that identifies each of the computer programs to be executed and the order in which the computer programs are to be executed to accomplish the automated customer-churn data mining process. The system administrator, using a task scheduling program schedules the automated customer-churn data mining script to be triggered on a monthly basis, such as on the first Saturday of each month and beginning at one o'clock a.m.
At the scheduled time, the task scheduling program triggers the data warehouse processor to execute the automated customer-churn data mining script. The data warehouse processor executes a remote procedure call in a customer relationship management system to extract customer data and transmit the data to the data warehouse computer system. The data warehouse computer system receives and stores the extracted customer data. The data warehouse processor executes a computer program, as directed by the executing automated customer-churn data mining process script, to transform the customer data to a format usable by the data warehouse. [0051]
The data warehouse processor continues to execute the automated customer-churn data mining process script, which then triggers a data mining training run to identify hidden relationships within the customer data. Specifically, the characteristics of customers who have not renewed a service contract in the last eighteen months are identified. The characteristics identified may include, for example, an income above or below a particular level, a geographic region in which the non-returning customer resides, the types of service contract that were not renewed, and the median age of a non-renewing customer. [0052]
The data warehouse processor then, under the continued direction of the automated customer-churn data mining process script, triggers a data mining prediction run to identify particular customers who are at risk of not renewing a service contract, the prediction is made based on the customer characteristics identified in the data mining training run. The data warehouse processor determines a likelihood-of-churn for each customer. The data warehouse is enriched with the likelihood-of-churn for each customer such that a likelihood-of-churn attribute is added to the customer data in the data warehouse and the likelihood-of-churn value for each value is stored in the new attribute. [0053]
In some implementations, when a subsequent likelihood-of-churn value for a customer is determined, such as a likelihood-of-churn value for a customer that is determined in the following month, the likelihood-of-churn value from the previous data mining prediction run may be replaced so that a customer has only one likelihood-of-churn value at any time. In contrast, some implementations may store the new likelihood-of-churn value each month, in addition to a previous value for the likelihood-of-churn, to develop a time-dependent prediction—that is, a new prediction for the same type of prediction is stored each time a prediction run is performed for a customer. The time-dependent prediction may help improve the accuracy of the data mining training runs because the predicted values may be monitored over time and compared with actual customer behavior. [0054]
FIG. 4 illustrates another example of an automated data mining process. In contrast to the automated [0055] data mining process 300 of FIG. 3, automated data mining process 400 replicates data from a data warehouse, such as data warehouse 165 in FIG. 1, to a data mining mart, such as data mining mart 174 of FIG. 1. The data mining process 400 then performs the data mining analysis on data in the data mart, and stores the data mining results as enriched data in the data warehouse.
The automated [0056] data mining process 400 may be performed by a processor on a computing system, such as data warehouse computer system 110 of FIG. 1. The automated data mining processor is directed by a method, script, or other type of computer program that includes executable instructions for performing the automated data mining process 400. An example of such a collection of executable instructions is the automated data mining process 168 of FIG. 1.
The automated [0057] data mining process 400 includes an extract, transform and load (ETL) sub-process 410, a data mining sub-process 420 that uses a data mart, and a data enrichment sub-process 430. The automated mining process 400 begins at a predetermined time and date, typically a recurring predetermined time and date. The ETL sub-process 410 extracts data from a transactional processing or other type of source system and loads the data to a data warehouse, as described previously with respect to ETL sub-process 310 of FIG. 3.
After completing the [0058] ETL sub-process 410, the data warehouse processor automatically triggers a data mining run, as described with respect to step 360 in FIG. 3 (step 440). The data warehouse processor copies data from the data warehouse to the data mining mart for use in a data mining run (step 450). For example, when the data warehouse and the data mining mart are located on the same computer system, the data warehouse processor may insert into database tables of a data mining mart a copy of some of the data rows stored in the data warehouse. Alternatively, when the data warehouse is located on a different computer system than the computer system on which the data mart is located, the data warehouse processor may extract data from the data warehouse on a computer system and transmit the data to the data mart located on a different computer system. The data warehouse processor then may execute a remote procedure call or other collection of executable instructions to load data into the data mart. In some implementations, the data warehouse processor may replicate data from the data warehouse to the data mining mart—that is, the data warehouse processor copies the data to the data mining mart and synchronizes the data mining mart with the data warehouse such that changes made to one of the data warehouse or the data mining mart are reflected in all other of the data warehouse or the data mining mart. In some implementations, the data warehouse processor may transform the data from the data warehouse before storing the data in the data mining mart.
The data warehouse processor then performs a data mining run, as described in [0059] step 365 in FIG. 3, using data in the data mining mart (step 460). The steps 440-460 may be referred to as a data mining sub-process 420. When the data mining sub-process 420 is completed, the data warehouse processor stores the data mining results in the data warehouse (step 470), as described in step 370 and sub-process 330 in FIG. 3.
FIG. 5 depicts the components of a [0060] software architecture 500 for an automated data mining process. The software architecture 500 may be used to implement the automated data mining process 300 described in FIG. 3 or the automated data mining process 400 described in FIG. 4. The software architecture 500 may be implemented, for example, on computer system 110 of FIG. 1. FIG. 5 also illustrates a data flow and a process flow using the components of the software architecture to implement the automated data mining process 400 in FIG. 4.
The [0061] software architecture 500 includes an automated data mining task scheduler 510, a transaction data extractor 515, and a data mining extractor 520. The software architecture also includes a transaction processing data management system 525 for a transaction processing system, such as transaction computer system 120 or transaction computer system 130 in FIG. 1. The software architecture also includes a data warehouse 530, such as the data warehouse 165 in FIG. 1, and a data mart 535, such as the optional data mart 174 in FIG. 1.
One example of the automated data [0062] mining task scheduler 510 is a process chain for triggering the transaction data extractor 515 and the data mining extractor 520 at a predetermined date and time. In general, a process chain is a computer program that defines particular tasks that are to occur in a particular order at a predetermined date and time. For example, a system administrator or another type of user may schedule the process chain to occur at regular intervals, such as at one o-clock a.m. the first Saturday of a month, every Sunday at eight o'clock a.m., or at two o'clock a.m. on the first day and the fifteenth day of each month. A process chain may include dependencies between the defined tasks in the process chain such that a subsequent task is not triggered until a previous task has been successfully completed. In this example, the automated data mining task scheduler 510 is a process chain that calls two extractor processes: the transaction data extractor 515 and the data mining extractor 520. The data mining extractor 520 is only initiated after the successful completion of the transaction data extractor 515.
The automated data [0063] mining task scheduler 510 starts the transaction data extractor 515 at a predetermined date and time, as illustrated by process flow 542. In general, an extractor is a computer program that performs the extraction of data from a data source using a set of predefined settings. Typical settings for an extractor include data selection settings that identify the particular data attributes and data filter settings that identify the criteria that identifies the particular records to be extracted. For example, an extractor may identify three attributes—customer number, last purchase date, and amount of last purchase—that are to be extracted for all customers that are located in a particular geographic region. The extractor then reads the attribute values for the records that meet the filter condition from the data source, maps the data to the attributes included in the data warehouse, and loads the data to the data warehouse. An extractor also may be referred to as an upload process.
The [0064] transaction data extractor 515 extracts, using predefined settings, data from the transaction processing data management system, as indicated by data flow line 544, and transforms the data as necessary to prepare the data to be loaded to the data warehouse 530. The transaction data extractor 515 then loads the extracted data to the data warehouse 530, as indicated by data flow 546. After the extracted data has been loaded, the transaction data extractor 515 returns processing control to the automated data mining task scheduler 510, as indicated by process flow 548. When returning processing control, the transaction data extractor 515 also reports the successful completion of the extraction.
Based on the successful completion of the [0065] transaction data extractor 515, the automated data mining task scheduler 510 starts the data mining extractor 520, as illustrated by process flow 552. In general, the data mining extractor initiates a data mining process using the newly loaded transaction data in the data warehouse 530. The data mining process analyzes the data and writes the results back to the data warehouse.
First, the [0066] data mining extractor 520 extracts data from the data warehouse 530 (function 555), as illustrated by data flow 556, and loads the extracted data to the data mart 535, as illustrated by data flow 558, for use by the data mining analysis. The data mining extractor 520 then performs a data mining training analysis (function 560) using the data from the data mart 535, as illustrated by data flow 562. The data mining extractor 520 updates the appropriate data mining model in data mining model 565 with the results of the data mining training analysis, as illustrated by data flow 564.
The [0067] data mining extractor 520 uses the results of the data mining training analysis from a data mining model 564, as illustrated by data flow 566, to perform a data mining prediction analysis (function 568). The data mining extractor 520 stores the results of the data mining prediction analysis in the data mart 535, as illustrated by data flow 569.
The [0068] data mining extractor 520 then performs a data enrichment function (function 570) using the results from the data mart 535, as illustrated by data flow 572, to load the data mining results into the data warehouse 530, as illustrated by data flow 574. After enriching the data warehouse 530 with the data mining analysis results, the data mining extractor 520 returns processing control to the automated data mining task scheduler 510, as depicted by process flow 576. When returning processing control, the data mining extractor 520 also reports to the automated data mining task scheduler 510 the successful completion of the data mining analyses and enrichment of the data warehouse. To do so, the data mining extractor 520 may report a return code that is consistent with a successful process.
The use of a task scheduler, here in the form of a process chain, to link the task of extracting the transaction data from a source system with the task of performing the data mining process may be useful. For example, the process for loading transaction data to the data warehouse is combined with an immediate data mining analysis and enrichment of the data warehouse data with the results of the analysis. The linkage of the transactional data availability with the automatic performance of the data mining analysis may reduce, perhaps even substantially reduce, the lag between the time at which the transaction data first becomes available in the data warehouse and the time at which the data enriched with data mining analysis results becomes available in the data warehouse. [0069]
There also may be advantages in a type of data loading computer program (here, an extractor) for both (1) the load of the transaction data to the data warehouse and (2) the performance of the data mining analysis and the enrichment of the data warehouse data with the data mining analysis results. This may be particularly true when a data mart is used for temporary storage of data from the data warehouse in which an extraction is to be performed. For example, in some data warehousing systems, a task scheduler may be available only for use with a data loading process and may not be available for general use with a data mining process. In such a case, wrapping the data mining process within a data loading process allows a data mining process to be automatically triggered at a predetermined time on a scheduled basis (such as daily, weekly or monthly at a particular time). [0070]
More generally, the use of the same types of techniques, procedures and processes for both a data extraction process and an analytical process of data mining run may be useful. For example, it may enable the use of a common software tool for administering a data warehouse and a data mining run, particularly when data is extracted from a data warehouse for use by a data mining run. The use of the same techniques, procedures and processes for both a data extraction process and an analytical process also may make a function available to both processes when the function was previously available only to one of the analytical process or the data extraction process. It also may encourage consistent behavior from a data warehouse process and a data mining analysis, which may, in turn, reduce the amount of training required by a system administrator. [0071]
FIG. 6 depicts a [0072] process 600 supported by a data mining workbench for defining an automated data mining process. The data mining workbench presents a user interface to guide a user to define a particular type of automated data mining analysis. In general, the data mining workbench uses a generic template for a particular type of data mining analysis, receives user-entered information applicable to the generic template, receives scheduling information from the user, and generates a particular automated data mining process.
The [0073] process 600 to define an automated data mining process begins when the data mining workbench presents a user interface for the user to enter identifying information for the data mining analysis process being defined (step 610). For example, the user may enter a name or another type of identifier and a description of the data mining analysis.
The data mining workbench then presents an interface that allows a user to identify the data mining analysis template to be used (step [0074] 620). For example, the data mining workbench may present a list of data mining analysis templates, such as a template for particular type of a customer loyalty analysis or a template for a particular type of cross-selling analysis, from which the user selects.
Based on the data mining analysis template selected, the data mining workbench presents an appropriate interface to guide the user through the process of entering the user-configuration data mining information to configure the template for the particular analysis being defined (step [0075] 630). In one example of defining an automated data mining analysis for determining the effect of a particular marketing campaign, the user enters an identifier for the particular marketing campaign to be analyzed, the particular customer attributes to be analyzed, the attributes to be measured to determine the effect of the marketing campaign (such as sales attribute), and the filter criteria for selecting the records to be analyzed. The data mining analysis template includes a portion for the transaction data extraction, such as transaction data extractor 515 in FIG. 5, and a portion for data mining extraction, such as data mining extractor 520 in FIG. 5.
The user then schedules when the data mining analysis process should be automatically triggered (step [0076] 640). For example, the user may identify a recurring pattern of dates and times for triggering the data mining analysis. This may be accomplished through the presentation of a calendar or the presentation of a set of schedule options from which the user selects.
The data mining workbench then stores a version of the generic data mining analysis template with the user-entered information (step [0077] 650). To do so, for example, the data mining workbench may use the name or identifier entered by the user as the name of the stored automated data mining process. The automated data mining process may be added to a task scheduler and scheduled based on the information the user entered.
Although the techniques and concepts described above refer to a single data mining process, the applicability of the techniques and concepts is not limited to a single data mining process. For example, a particular data warehouse may be used for, and typically is used for, many different data mining processes, many of which may benefit from being automated as described herein. [0078]
A number of implementations of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other implementations are within the scope of the following claims. [0079]

Claims

What is claimed is:

1. A method for conducting a data mining process, the method comprising:

triggering the data mining process wherein:

the data mining process includes an analytical process,

the triggering is based on the presence of data in a data source that is used for analytical processing, and

the analytical process uses a procedure that also is usable in a data extraction process;

creating a data attribute by performing the analytical process on data from the data source after the analytical process has been triggered; and

storing the created data attribute in the data source.

2. The method of claim 1 further comprising extracting data from a data source used for transaction processing.

3. The method of claim 2 wherein a person initiates at most the step of extracting data from the data source used for transaction processing.

4. The method of claim 2 further comprising loading the extracted data to the data source that is used for analytical processing.

5. The method of claim 4 wherein a person initiates at most the step of loading the extracted data.

6. The method of claim 1 further comprising:

extracting data from the data source that is used for analytical processing; and

loading the data extracted from the data source that is used for analytical processing to temporary data storage,

wherein performing the analytical process comprises performing the analytical process using data stored in the temporary data storage.

7. The method of claim 1 wherein triggering the analytical process based presence of data in the data source that is used for analytical processing comprises triggering an analytical process based on the completion of a computer program for loading data to the data source that is used for analytical processing.

8. The method of claim 1 further comprising triggering, based on an occurrence of a predetermined date and time, loading the data extracted from a data source used for transaction processing to the data source that is used for analytical processing.

9. The method of claim 1 further comprising triggering, based on an occurrence of a predetermined date and time, extracting data from a data source used for transaction processing and loading the extracted data to the data source that is used for analytical processing

10. The method of claim 1 wherein performing the analytical process comprises determining a relationship between two data values in the data source.

11. The method of claim 10 wherein performing the analytical process comprises determining a relationship between two data values that predict a likelihood of whether a particular customer will fail to purchase a service or product in the future.

12. The method of claim 10 wherein performing the analytical process comprises identifying products that are purchased in the same transaction.

13. The method of claim 10 wherein performing the analytical process comprises identifying services that are purchased in the same transaction.

14. The method of claim 10 wherein performing the analytical process comprises applying a previously-determined relationship between two data values to data in the data source.

15. The method of claim 14 wherein applying a previously-determined relationship comprises determining a likelihood of whether a particular customer will fail to purchase a service or a product in the future based on characteristics associated with customers who have been identified as failing to purchase a service or a product.

16. The method of claim 1 performing the analytical process comprises determining a relationship between two data values in the data source and applying the determined relationship to data in the data source.

17. The method of claim 16 wherein:

determining a relationship between two data values in the data source comprises determining a relationship between two data values that predict a likelihood of whether a particular customer will fail to purchase a service or product in the future, and

applying the determined relationship to data in the data source comprises determining a likelihood of whether a particular customer will fail to purchase a service or a product in the future based on the relationship determined between two data values that predict a likelihood of whether a particular customer will fail to purchase a service or product in the future.

18. A method for conducting a data mining process, the method comprising:

triggering the data mining process wherein:

the data mining process includes an analytical process,

extracting data from the data source that is used for analytical processing;

creating a data attribute by performing the analytical process on data in the temporary data storage; and

storing the created data attribute in the data source that is used for analytical processing.

19. The method of claim 18 further comprising:

extracting data from a data source used for transaction processing, and

comprising loading the extracted data to the data source that is used for analytical processing.

20. The method of claim 19 wherein a person initiates at most the step of extracting data from the data source used for transaction processing.

21. The method of claim 18 wherein performing the analytical process comprises determining a relationship between two data values in the data source.

22. A computer-readable medium or propagated signal having embodied thereon a computer program configured to conduct a data mining process, the medium or signal comprising one or more code segments configured to:

trigger the data mining process wherein:

the data mining process includes an analytical process,

create a data attribute by performing the analytical process on data from the data source after the analytical process has been triggered; and

store the created data attribute in the data source.

23. The medium or signal of claim 22 wherein the one or more code segments are further configured to:

extract data from the data source that is used for analytical processing; and

load the data extracted from the data source that is used for analytical processing to temporary data storage,

wherein the one or more code segments configured to perform the analytical process comprise one or more code segments configured to perform the analytical process using data stored in the temporary data storage.

24. The medium or signal of claim 22 wherein the one or more code segments configured to trigger the analytical process comprise one or more code segments configured to trigger an analytical process based on the completion of a computer program for loading data to the data source that is used for analytical processing.

25. The medium or signal of claim 22 wherein the one or more code segments are further configured to trigger, based on an occurrence of a predetermined date and time, loading the data extracted from a data source used for transaction processing to the data source that is used for analytical processing.

26. The medium or signal of claim 22 wherein the one or more code segments are further configured to trigger, based on an occurrence of a predetermined date and time, extracting data from a data source used for transaction processing and loading the extracted data to the data source that is used for analytical processing.

27. The medium or signal of claim 22 wherein the one or more code segments configured to perform the analytical process comprise one or more code segments configured to determine a relationship between two data values in the data source.

28. A system for conducting a data mining process, the system comprising a processor connected to a storage device and one or more input/output devices, wherein the processor is configured to:

trigger the data mining process wherein:

the data mining process includes an analytical process,

store the created data attribute in the data source.

29. The system of claim 28 wherein the processor is further configured to:

extract data from the data source that is used for analytical processing;

load the data extracted from the data source that is used for analytical processing to temporary data storage; and

perform the analytical process using data stored in the temporary data storage.

30. The system of claim 28 wherein the processor is configured to trigger an analytical process based on the completion of a computer program for loading data to the data source that is used for analytical processing.

31. The system of claim 28 wherein the processor is further configured to trigger, based on an occurrence of a predetermined date and time, loading the data extracted from a data source used for transaction processing to the data source that is used for analytical processing.

32. The system of claim 28 wherein the processor is further configured to trigger, based on an occurrence of a predetermined date and time, extracting data from a data source used for transaction processing and loading the extracted data to the data source that is used for analytical processing.

33. The system of claim 28 wherein the processor is configured to determine a relationship between two data values in the data source.

34. A method for defining an automated data mining process, the method comprising:

presenting a user interface for:

identifying a template for a type of automated data mining process for triggering an analytical process, the analytical process using a procedure that also is usable in a data extraction process, based on the presence of data in a data source that is used for analytical processing, creating a data attribute by performing the analytical process on data from the data source after the analytical process has been triggered, and storing the created data attribute in the data source; and

entering information for defining the automated data mining process;

associating the entered information with the identified template; and

storing the associated entered information with the identified template as a computer program configured to perform the automated data mining process.