WO2017030535A1 - Dataset partitioning - Google Patents

Dataset partitioning

Info

Publication number
WO2017030535A1
Authority
WO
WIPO (PCT)
Prior art keywords
items
groups
dataset
classifier
classifiers
Prior art date
Application number
PCT/US2015/045307
Other languages
French (fr)
Inventor
George Forman
Renato Keshet
Hila Nachlieli
Original Assignee
Hewlett-Packard Development Company, L. P.
Priority date
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L. P.
Priority to PCT/US2015/045307
Publication of WO2017030535A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • Clustering or partitioning of data is typically the task of grouping a set of items or objects in such a way that objects in the same group (e.g., cluster) are more similar to each other than to those in other groups (e.g., clusters).
  • a user provides a clustering application with a plurality of objects that are to be clustered.
  • the clustering application typically generates clusters from the plurality of objects in an unsupervised manner, where the clusters may be of interest to the user.
  • Figure 1 is a schematic illustration of an example system for generating groups of items in a dataset by using a classifier selected based on a user's input in accordance with an implementation of the present disclosure.
  • Figure 2 illustrates a flowchart showing an example of a method for generating groups of items in a dataset by using a classifier selected based on a user's input in accordance with an implementation of the present disclosure.
  • Figure 3 illustrates a flowchart showing an example of a method for scoring a plurality of classifiers in accordance with an implementation of the present disclosure.
  • Figure 4 is an example block diagram illustrating a computer-readable medium in accordance with an implementation of the present disclosure.
  • clustering or partitioning of data has become increasingly popular in recent years. Many organizations use various data clustering methods and techniques to help them analyze and cluster different types of data (e.g., customer surveys, customer support logs, engineer repair notes, system logs, etc.).
  • group and “cluster” are to be used interchangeably and refer to a set of items that are grouped in a way such that items in the same group are more similar to each other than to those in other groups.
  • item and “object” are to be used interchangeably and refer to an element in a dataset (e.g., document, word, numerical values, etc.).
  • There are various important characteristics that a user (e.g., a domain expert) may be looking for in a clustering technique. First, the purity of the returned clusters matters greatly to the domain expert. It is easier to recognize a topic if the cluster has high purity, ideally just a single topic. For typical, complex text domain data, determining the meaning and worth of a proposed cluster can take the user a while examining its cases. Thus, it is best to provide a manageable list of cases that are most typical or central to the cluster, rather than return a much larger set of cases that may include some other topics mixed in.
  • the size of the cluster topic matters to the user. Although the returned cluster may be a small list of cases, the underlying topic that it informs the user about may be large. Users ordinarily prefer to discover the larger topics first, ideally working down the tail in order.
  • the proposed techniques leverage a user-provided guidance set, which is a set of items labeled or tagged into distinct categories or groups that are given or selected by a user.
  • the guidance set is used to evaluate the relevance of each of a set of trained classifiers. In other words, the guidance set is not used to train a classifier, but is only used to evaluate an already trained classifier on this guidance set.
  • the best classifier selected from the evaluation is then applied to the residual items in the entire dataset (i.e., items for which relevant subgroups are sought, e.g., all items that are not in the known guidance set of items), and the output is used to deliver a relevant clustering of the dataset into a plurality of groups, which may be presented to the user.
  • the proposed techniques determine the largest of these clusters as the best to offer the user.
  • a processor may score a plurality of classifiers on a guidance set of example items in a dataset of items to select the best scoring classifier.
  • the processor may further determine a group of residual items in the dataset and may apply the selected classifier to the residual items in the dataset.
  • the processor may partition the residual items in the dataset into a plurality of groups according to the outputs of the selected classifier, and may output at least one group from the plurality of groups.
  • FIG. 1 is a schematic illustration of an example system 10 for generating clusters or groups of items in a dataset by using a classifier selected based on a user's input.
  • the illustrated system 10 is capable of carrying out the techniques described below.
  • the system 10 is depicted as including at least one computing device 100.
  • computing device 100 includes a processor 102, an interface 106, and a machine-readable storage medium 110.
  • the computing device 100 may receive a dataset 150 that is to be clustered and may communicate with at least one interactive user interface 160 (e.g., graphical user interface, etc.).
  • the dataset 150 may include categorical data, numerical data, structured data, unstructured data, or any other type of data.
  • the data in the dataset 150 may be in structured form.
  • the data may be represented as a linked database, a tabular array, an excel worksheet, a graph, a tree, and so forth.
  • the data in the dataset 150 may be unstructured.
  • the data may be a collection of log messages, snippets from text messages, messages from social networking platforms, and so forth.
  • the data may be in semi-structured form.
  • the data in the dataset 150 may be represented as an array.
  • columns may represent features of the data
  • rows may represent data elements.
  • rows may represent a traffic incident
  • columns may represent features associated with each traffic incident, including weather conditions, road conditions, time of day, date, a number of casualties, types of injuries, victims' ages, and so forth.
  • the computing device 100 may be any type of a computing device and may include at least engines 120-140. In one implementation, the computing device 100 may be an independent computing device. Engines 120-140 may or may not be part of the machine-readable storage medium 110. In another alternative example, engines 120-140 may be distributed between the computing device 100 and other computing devices.
  • the computing device 100 may include additional components and some of the components depicted therein may be removed and/or modified without departing from a scope of the system that allows for carrying out the functionality described herein.
  • Processor 102 may be central processing unit(s) (CPUs), microprocessor(s), and/or other hardware device(s) suitable for retrieval and execution of instructions (not shown) stored in machine-readable storage medium 110.
  • Processor 102 may fetch, decode, and execute instructions to identify different groups in a dataset.
  • processor 102 may include electronic circuits comprising a number of electronic components for performing the functionality of instructions.
  • Interface 106 may include a number of electronic components for communicating with various devices.
  • interface 106 may be an Ethernet interface, a Universal Serial Bus (USB) interface, an IEEE 1394 (Firewire) interface, an external Serial Advanced Technology Attachment (eSATA) interface, or any other physical connection interface suitable for communication with the computing device.
  • interface 106 may be a wireless interface, such as a wireless local area network (WLAN) interface or a near-field communication (NFC) interface that is used to connect with other devices/systems and/or to a network.
  • the user interface 160 and the computing device 100 may be connected via a network.
  • the network may be a mesh sensor network (not shown).
  • the network may include any suitable type or configuration of network to allow for communication between the computing device 100, the user interface 160, and any other devices/systems (e.g., other computing devices, displays, etc.), for example, to send and receive data to and from a corresponding interface of another device.
  • Each of the engines 120-140 may include, for example, at least one hardware device including electronic circuitry for implementing the functionality described below, such as control logic and/or memory.
  • the engines 120-140 may be implemented as any combination of hardware and software to implement the functionalities of the engines.
  • the hardware may be a processor and the software may be a series of instructions or microcode encoded on a machine-readable storage medium and executable by the processor. Therefore, as used herein, an engine may include program code (e.g., computer executable instructions), hardware, firmware, and/or logic, or combination thereof to perform particular actions, tasks, and functions described in more detail herein in reference to Figures 2-4.
  • the scoring engine 120 may score a plurality of classifiers on a guidance set of example items in a dataset of items (e.g., the dataset 150) and may select the best scoring classifier.
  • the scoring engine 120 may apply each of the plurality of classifiers to the example items in the guidance set, may determine an accuracy of each of the plurality of classifiers to separate the example items in the guidance set to groups based on user's labels of the example items, and may determine the best scoring classifiers based on the accuracy of separation.
  • the scoring engine 120 may select the best scoring classifier from the plurality of classifiers.
  • the guidance set of example items in the dataset identifies groups of interest to a user. Various techniques may be used to score the plurality of classifiers.
  • the scoring engine 120 may select the best scoring classifier from the plurality of classifiers.
  • the analysis engine 130 may determine a group of residual items in the dataset (i.e., a set of unlabeled items for which relevant groupings or clusters are sought).
  • the residual items may include all items that are not already labeled (i.e., eliminating those items that are in the guidance set).
  • the residual items may include all unlabeled items that do not belong to the categories defined by the labels of the guidance set.
  • Various techniques may be used to determine a group of residual items in the dataset.
  • the classification engine 140 may apply the selected classifier (i.e., the classifier selected by the scoring engine 120) to the residual items in the dataset in order to partition the residual items into a plurality of groups corresponding to the categories output by the classifier.
  • the engine 140 may remove any group from the plurality of groups that corresponds to a group from the guidance set of example items.
  • the selected classifier may be applied to the guidance set, and then the classification engine 140 may remove output groups that receive any (or many) of the guidance set items.
  • the classification engine 140 may partition the residual items in the dataset into a plurality of groups according to the outputs of the selected classifier, and may output at least one group from the plurality of groups (e.g., on the interface 160).
  • Figure 2 illustrates a flowchart showing an example of a method 200 for generating clusters or groups of items in a dataset by using a classifier selected based on a user's input.
  • execution of the method 200 is described below with reference to the system 10, the components for executing the method 200 may be spread among multiple devices/systems.
  • the method 200 may be implemented in the form of executable instructions stored on a machine-readable storage medium, and/or in the form of electronic circuitry.
  • the method 200 can be executed by at least one processor of a computing device (e.g., processor 102 of device 100). In other examples, the method may be executed by another processor in communication with the system 10.
  • Various elements or blocks described herein with respect to the method 200 are capable of being executed simultaneously, in parallel, or in an order that differs from the illustrated serial manner of execution.
  • the method 200 is also capable of being executed using additional or fewer elements than are shown in the illustrated examples.
  • the method 200 begins at 210, where at least one processor may score a plurality of classifiers on a guidance set of example items in a dataset (e.g., 150) of items.
  • the processor may use a guidance set of example items in the dataset 150 to score the plurality of classifiers.
  • a guidance set of example items in a dataset of items may include a portion of the entire dataset 150 and may identify groups of interest to a user (i.e., by using user labels on the guidance set).
  • a user may wish to divide a dataset of items in a customer support log based on the type of support issues that are received.
  • the guidance set of example items in the dataset 150 identifies groups (e.g., A, B, C, etc.) of interest to the user.
  • a user may label example items (e.g., log entries, documents, etc.) in the guidance set by providing labels on some of the example items.
  • the user may provide examples for some of the topics/groups that he or she is interested in, in the guidance set of items.
  • this information may be used to score the plurality of classifiers.
  • An example technique for scoring the classifiers is described below in relation to Figure 3.
  • the plurality of classifiers may include a library (not shown) of multi-class classifiers (i.e., ready-to-use classifiers, not classifier methodologies such as Naive Bayes).
  • Each such classifier may be at least one of a previously-trained machine-learning classifier and a simple multi-class classifier formed from a list of related terms.
  • the processor 102 may form a simple multi-class classifier using a list of related terms (i.e., keywords or phrases).
  • list 1 may include the terms: hard disk drive, mouse, keyboard, screen adapter, USB, etc.
  • list 2 may include the terms: jam, ink, cartridge, streaks, offline, etc.
  • the processor may form a trivial classifier from each list, where that classifier may simply use each term as a category and may use the term to search for matching items. The item may then be placed into any groups or categories in which the term appears.
  • the processor 102 may select the best scoring classifier after scoring the plurality of classifiers. Various techniques may be used to select the best classifier. Several example techniques are described below in relation to Figure 3, but other techniques could also be used.
  • Next, the processor may determine a group of residual items in the dataset 150 (at 230). As noted above, the residual items in the dataset may include items that are not in the known guidance set of items. Various techniques may be used to determine a group of residual items in the dataset 150.
  • the system may train at least one multiclass classifier on the guidance set of example items and may then apply the classifier to all unlabeled items (e.g., all items except those in the guidance set) in order to select the low confidence classifications (i.e., the residual set of items) as those that probably do not belong to the guidance set groups and therefore place these items in the residual.
  • a processor may train a classifier on groups of example items A vs. B and C, and do the same for groups B and C (i.e., B vs. A and C, and C vs. A and B). Then, all items in the guidance set may be scored with respect to all (e.g., three) binary classifiers. Any items that score low with respect to all three classifiers may be determined to be part of the residual.
  • a processor may place all items in the dataset 150 that have any user labels into a positive class and all unlabeled items into a negative class. Then, a binary classifier may be trained on this two-class dataset. That classifier may separate dataset items in the positive class from items in the negative class and may be used to determine the residual data in the dataset 150 (e.g. selecting items that are classified as negative, or sorting all the items by the output score of the binary classifier and selecting, for example, the 30% most negative for use as the residual).
  • the processor 102 may then apply the selected classifier to the residual items in the dataset (at 240).
  • the processor may partition the residual items in the dataset into a plurality of groups or clusters according to the outputs of the selected classifier. Because the classifier is selected to be relevant to the user's clustering interest, as indicated by the guidance set, the generated clusters from the selected classifier will tend to be relevant to the user's vision of clustering the dataset 150.
  • the processor 102 may remove any group from the plurality of groups (generated at 250) that corresponds to a known group from the guidance set of example items.
  • the processor finds a group in the residual items of the dataset 150 that is already known (e.g., one of groups A, B, C, identified by the user), the processor may remove that group so that it is not proposed to the user.
  • the processor may retain any group from the plurality of groups (generated at 250) that corresponds to a known group from the guidance set of example items. Thus, all generated groups may be analyzed before they are outputted.
  • the processor may output at least one group from the plurality of groups (e.g., on the interface 160).
  • the at least one outputted group may not include any group from the plurality of groups that corresponds to a known group from the guidance set of example items.
  • the processor may identify the largest group (e.g., the group having the most residual items) and may output at least the largest group from the plurality of groups.
  • the processor may sort the plurality of groups and may output a sequence of groups (e.g., a number of groups selected by the user, with the largest group first).
  • sorting the plurality of groups may be performed based on a size of the groups determined based on a count of residual items in each group that are designated by the selected classifier.
  • the items in the dataset may include a cost or importance number that reflects the relative importance of the item (e.g., cost to fix an item, time to travel, etc.).
  • the processor may calculate the total cost or importance of each generated group, as measured by this numerical field. For example, group A1 may not have very many residual items that are designated by the selected classifier, but the total cost of that group may be higher than group A2.
  • the processor may alternatively output the group having the highest total cost from the plurality of groups.
  • the selected classifier may cover just a narrow range of groups (e.g., monitor issues like cracked displays, dimmed displays, flickering displays, etc.), where the dataset 150 (including the guidance set) has many items about other types of customer support logs (e.g., keyboard issues, email issues, etc.). Thus, such a classifier may mistakenly classify items from the dataset 150 as other types.
  • the processor may first separate from the group of residual items those cases that are outside the range of the selected classifier. This set of items may be identified as residual to the residual (i.e., residual items to the selected classifier that are not related to the classifier). These items are unlike the items in the guidance set of example items and are also unlike the classes or groups of the selected classifier.
  • the processor determines two types of residual - a regular group of residual items R (i.e., residual that the classifier can treat) and a residual of the residual RR (residual that the classifier can't treat, or items that are unlike the classifier's topics).
  • the processor may subtract RR from R, and may apply the selected classifier only to these items from the dataset (R-RR) to identify clusters (as described in block 240). This may generate a first cluster X1.
  • the processor may then apply any known clustering by intent classification method to the items of RR (i.e., the residual of the residual) to generate a second proposed cluster X2.
  • both X1 and X2 may be shown to the user (block 260), and a user may make a choice which to select.
  • only the larger cluster may be offered to the user.
  • Figure 3 illustrates a flowchart showing an example of a method 300 for scoring a plurality of classifiers.
  • execution of the method 300 is described below with reference to the system 10, the components for executing the method 300 may be spread among multiple devices/systems.
  • the method 300 may be implemented in the form of executable instructions stored on a machine-readable storage medium, and/or in the form of electronic circuitry.
  • the method 300 can be executed by at least one processor of a computing device (e.g., processor 102 of device 100).
  • the method 300 begins at 310, where at least one processor may apply each of the plurality of classifiers to the example items in the guidance set (e.g., the groups A, B, C, identified by the user).
  • the processor only applies the plurality of classifiers to the guidance set, without training the classifiers on the data in the guidance set (which could be slow).
  • the processor may measure an accuracy of each of the plurality of classifiers to separate the example items in the guidance set to groups based on user's labels of the example items. In other words, the processor may determine how each of the classifiers separates the items in the guidance set to different groups by using the examples for each topic/group of items that are identified by the user.
  • the processor may not have an established, mapped correspondence between the user topics or groups (e.g., A, B, C, etc.) in the guidance set and the classes of the different classifiers.
  • the processor may perform accuracy measurement by treating each output category of the classifier as a cluster.
  • the processor may determine an average purity of the user's labels for the groups of example items to measure the accuracy of each of the plurality of classifiers. In other words, each group or cluster may be assessed for its purity.
  • the processor may determine the most common label of the example items in the cluster (e.g., C - monitor issues), and then determine the percent of example items in the cluster that are of the same label as that most common label. Then, an average purity over all identified groups for the guidance set may be determined.
  • the processor may apply a scoring penalty to any classifier from the plurality of classifiers that does not include more classes than the number of labels of example items in the guidance set.
  • the scoring penalty may be based on a number of additional classes provided by a classifier that exceed the number of labels or groups of example items in the guidance set. In other words, any classifier that does not offer more classes than the labels in the guidance set (e.g., A, B, C) may receive a scoring penalty.
  • the scoring penalty may be a value assigned to that classifier or may eliminate the classifier from further consideration.
  • the processor may apply at least one of the plurality of classifiers to at least a portion of the residual items in the dataset while scoring the plurality of classifiers.
  • the processor may also apply a scoring penalty to at least one of the plurality of classifiers when the classifier does not partition the residual items in the dataset to at least a predetermined number of groups.
  • the processor may apply it to the residual (or portion of the residual) in advance, in order to see how this classifier will treat the residual. The goal is to find the best classifiers to subdivide the residual.
  • the processor may apply a scoring penalty to at least one of the plurality of classifiers when the classifier separates the example items in the guidance set to a number of groups that is larger than a predetermined number, where the predetermined number is at least larger than the number of labels of example items in the guidance set.
  • the processor may apply a penalty to a classifier when it creates many more clusters out of the guidance set than there are known topics/classes of the classifier.
  • a classifier that subdivides a dataset very finely into many small clusters would inherently achieve higher purity, possibly by chance.
  • the processor may penalize such a classifier as not aligning well with the number of topics/groups the user has labeled so far.
  • the processor may determine the best scoring classifiers based on the accuracy of separation (at 330). In other words, after applying the classifier and any scoring penalties, the processor may identify the best scoring classifier to be used by the system (as described in Figure 3).
  • Figure 4 illustrates a computer 401 and a non-transitory machine-readable medium 405 according to an example.
  • the computer 401 may be similar to the computing device 100 of the system 10 or may include a plurality of computers.
  • the computer may be a server computer, a workstation computer, a desktop computer, a laptop, a mobile device, or the like, and may be part of a distributed system.
  • the computer may include one or more processors and one or more machine-readable storage media.
  • the computer may include a user interface (e.g., touch interface, mouse, keyboard, gesture input device, etc.).
  • Computer 401 may perform methods 200-300 and variations thereof. Additionally, the functionality implemented by computer 401 may be part of a larger software platform, system, application, or the like. Computer 401 may be connected to a database (not shown) via a network.
  • the network may be any type of communications network, including, but not limited to, wire-based networks (e.g., cable), wireless networks (e.g., cellular, satellite), cellular telecommunications network(s), and IP-based telecommunications network(s) (e.g., Voice over Internet Protocol networks).
  • the network may also include traditional landline or a public switched telephone network (PSTN), or combinations of the foregoing.
  • PSTN public switched telephone network
  • the computer 401 may include a processor 403 and non-transitory machine-readable storage medium 405.
  • the processor 403 (e.g., a central processing unit, a group of distributed processors, a microprocessor, a microcontroller, an application-specific integrated circuit (ASIC), a graphics processor, a multiprocessor, a virtual processor, a cloud processing system, or another suitable controller or programmable device)
  • the storage medium 405 may be operatively coupled to a bus.
  • Processor 403 can include single or multiple cores on a chip, multiple cores across multiple chips, multiple cores across multiple devices, or combinations thereof.
  • the storage medium 405 may include any suitable type, number, and configuration of volatile or non-volatile machine-readable storage media to store instructions and data.
  • machine-readable storage media include read-only memory (“ROM”), random access memory (“RAM”) (e.g., dynamic RAM (“DRAM”), synchronous DRAM (“SDRAM”), etc.), electrically erasable programmable read-only memory (“EEPROM”), magnetoresistive random access memory (MRAM), memristor, flash memory, SD card, floppy disk, compact disc read only memory (CD-ROM), digital video disc read only memory (DVD-ROM), and other suitable magnetic, optical, physical, or electronic memory on which software may be stored.
  • Software stored on the non-transitory machine-readable storage media 405 and executed by the processor 403 includes, for example, firmware, applications, program data, filters, rules, program modules, and other executable instructions.
  • the processor 403 retrieves from the machine-readable storage media 405 and executes, among other things, instructions related to the control processes and methods described herein.
  • the processor 403 may fetch, decode, and execute instructions 307-313 among others, to implement various processing.
  • processor 403 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 307-313. Accordingly, processor 403 may be implemented across multiple processing units and instructions 307-313 may be implemented by different processing units in different areas of computer 401.
  • the instructions 307-313 when executed by processor 403 can cause processor 403 to perform processes, for example, methods 200-400, and/or variations and portions thereof. In other examples, the execution of these and other methods may be distributed between the processor 403 and other processors in communication with the processor 403.
  • Scoring instructions 307 may cause processor 403 to rank a plurality of classifiers on a guidance set of example items in a dataset of items, where the guidance set of example items in the dataset identifies groups of interest to a user.
  • Scoring instructions 307 may cause processor 403 to select the best scoring classifier from the plurality of classifiers.
  • Analysis instructions 311 may cause the processor 403 to determine a group of residual items in the dataset. These instructions may function similarly to the techniques described in block 230 of method 200.
  • Classification instructions 315 may cause the processor 403 to apply the selected classifier to the residual items in the dataset and to partition the residual items in the dataset into a plurality of groups according to the outputs of the selected classifier. Further, classification instructions 315 may cause the processor 403 to output at least one group from the plurality of groups. These instructions may function similarly to the techniques described in blocks 240-260 of method 200.

Abstract

An example method is provided in accordance with one implementation of the present disclosure. The method comprises scoring a plurality of classifiers on a guidance set of example items in a dataset of items, and selecting the best scoring classifier. The method also comprises determining a group of residual items in the dataset, and applying the selected classifier to the residual items in the dataset. The method further comprises partitioning the residual items in the dataset into a plurality of groups according to the outputs of the selected classifier, and outputting at least one group from the plurality of groups.

Description

DATASET PARTITIONING
[0001] Clustering or partitioning of data is typically the task of grouping a set of items or objects in such a way that objects in the same group (e.g., cluster) are more similar to each other than to those in other groups (e.g., clusters). In a typical scenario, a user provides a clustering application with a plurality of objects that are to be clustered. The clustering application typically generates clusters from the plurality of objects in an unsupervised manner, where the clusters may be of interest to the user.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] Figure 1 is a schematic illustration of an example system for generating groups of items in a dataset by using a classifier selected based on a user's input in accordance with an implementation of the present disclosure.
[0003] Figure 2 illustrates a flowchart showing an example of a method for generating groups of items in a dataset by using a classifier selected based on a user's input in accordance with an implementation of the present disclosure.
[0004] Figure 3 illustrates a flowchart showing an example of a method for scoring a plurality of classifiers in accordance with an implementation of the present disclosure.
[0005] Figure 4 is an example block diagram illustrating a computer-readable medium in accordance with an implementation of the present disclosure.
DETAILED DESCRIPTION OF SPECIFIC EXAMPLES
[0006] As mentioned above, clustering or partitioning of data has become increasingly popular in recent years. Many organizations use various data clustering methods and techniques to help them analyze and cluster different types of data (e.g., customer surveys, customer support logs, engineer repair notes, system logs, etc.). As used herein, the terms "group" and "cluster" are to be used interchangeably and refer to a set of items that are grouped in a way such that items in the same group are more similar to each other than to those in other groups. As used herein, the terms "item" and "object" are to be used interchangeably and refer to an element in a dataset (e.g., document, word, numerical values, etc.).
[0007] However, in a clustering application that generates clusters in an unsupervised manner, the resulting clusters may not be useful to a user or only a small fraction of the proposed clusters may be relevant to a user. In addition to generating frequently poor results, existing clustering techniques tend to be slow. In fact, it is often found that such traditional clustering techniques fail to produce useful clusters even with repeated attempts at adjusting the various parameters by data mining experts. Furthermore, once some initial large clusters are recognized and dealt with, the remaining data tends to produce decreasingly useful clusters. The described issues and trends may be expected by data mining practitioners, but can prove somewhat disappointing to business users.
[0008] There are various important characteristics that a user (e.g., a domain expert) may be looking for in a clustering technique. First, the purity of the returned clusters matters greatly to the domain expert. It is easier to recognize a topic if the cluster has high purity, ideally just a single topic. For typical, complex text domain data, determining the meaning and worth of a proposed cluster can take the user a while examining its cases. Thus, it is best to provide a manageable list of cases that are most typical or central to the cluster, rather than return a much larger set of cases that may include some other topics mixed in. Second, the size of the cluster topic matters to the user. Although the returned cluster may be a small list of cases, the underlying topic that it informs the user about may be large. Users ordinarily prefer to discover the larger topics first, ideally working down the tail in order.
[0009] In this regard, according to examples, techniques for generating clusters or groups of items in a dataset by using a classifier selected based on a user's input are disclosed herein. The generated groups or clusters align with a user's expectations of the way that data should be organized. In one example, the proposed techniques leverage a user-provided guidance set, which is a set of items labeled or tagged into distinct categories or groups that are given or selected by a user. The guidance set is used to evaluate the relevance of each of a set of trained classifiers. In other words, the guidance set is not used to train a classifier, but is only used to evaluate an already trained classifier on this guidance set. The best classifier selected from the evaluation is then applied to the residual items in the entire dataset (i.e., items for which relevant subgroups are sought, e.g., all items that are not in the known guidance set of items), and the output is used to deliver a relevant clustering of the dataset into a plurality of groups, which may be presented to the user. In one example, the proposed techniques determine the largest of these clusters as the best to offer the user.
[0010] Therefore, beyond clustering the data in the dataset, the proposed techniques disclosed herein also deliver an interactive process to provide for customization of clustering results to a user's particular needs and intentions. In one example, a processor may score a plurality of classifiers on a guidance set of example items in a dataset of items to select the best scoring classifier. The processor may further determine a group of residual items in the dataset and may apply the selected classifier to the residual items in the dataset. Finally, the processor may partition the residual items in the dataset into a plurality of groups according to the outputs of the selected classifier, and may output at least one group from the plurality of groups.
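As a non-limiting illustration of the flow just described, the following minimal Python sketch strings the steps together: score ready-to-use classifiers against the guidance set, select the best one, collect the residual items, partition them by the selected classifier's output, and return the groups largest-first. The representation used here (items as hashable values, classifiers as callables, a pluggable score_fn) is an assumption made purely for illustration and is not prescribed by this disclosure; one candidate score_fn, average purity, is sketched later in connection with Figure 3.

```python
from collections import defaultdict

def partition_dataset(dataset, guidance_set, classifiers, score_fn):
    """High-level flow: evaluate ready-to-use classifiers on the guidance
    set, select the best one, apply it to the residual items, and return
    the resulting groups, largest first.

    dataset      -- iterable of hashable items
    guidance_set -- dict mapping labeled example items to user group labels
    classifiers  -- iterable of callables, each mapping an item to a category
    score_fn     -- callable (classifier, guidance_set) -> score
    """
    # 1. Evaluate (do not train) each classifier on the guidance set only.
    best = max(classifiers, key=lambda clf: score_fn(clf, guidance_set))

    # 2. Residual items: everything outside the known guidance set.
    residual = [item for item in dataset if item not in guidance_set]

    # 3. Partition the residual according to the selected classifier's output.
    groups = defaultdict(list)
    for item in residual:
        groups[best(item)].append(item)

    # 4. Offer the largest group(s) to the user first.
    return sorted(groups.values(), key=len, reverse=True)
```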
[0011] In the following detailed description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific examples in which the disclosed subject matter may be practiced. It is to be understood that other examples may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising" or "having" and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. Furthermore, the term "based on," as used herein, means "based at least in part on." It should also be noted that a plurality of hardware and software based devices, as well as a plurality of different structural components may be used to implement the disclosed methods and devices.
[0012] Referring now to the figures, Figure 1 is a schematic illustration of an example system 10 for generating clusters or groups of items in a dataset by using a classifier selected based on a user's input. The illustrated system 10 is capable of carrying out the techniques described below. As shown in Figure 1, the system 10 is depicted as including at least one computing device 100. In the embodiment of Figure 1, computing device 100 includes a processor 102, an interface 106, and a machine-readable storage medium 110. Although only computing device 100 is described in detail below, the techniques described herein may be performed with several computing devices or by engines distributed on different devices.
[0013] In one example, the computing device 100 (or another computing device) may receive a dataset 150 that is to be clustered and may communicate with at least one interactive user interface 160 (e.g., graphical user interface, etc.). The dataset 150 may include categorical data, numerical data, structured data, unstructured data, or any other type of data.
[0014] In one example, the data in the dataset 150 may be in structured form. For example, the data may be represented as a linked database, a tabular array, an excel worksheet, a graph, a tree, and so forth. In some examples, the data in the dataset 150 may be unstructured. For example, the data may be a collection of log messages, snippets from text messages, messages from social networking platforms, and so forth. In some examples, the data may be in semi-structured form.
[0015] In another example, the data in the dataset 150 may be represented as an array. For example, columns may represent features of the data, whereas rows may represent data elements. For example, rows may represent a traffic incident, whereas columns may represent features associated with each traffic incident, including weather conditions, road conditions, time of day, date, a number of casualties, types of injuries, victims' ages, and so forth.
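For example, a minimal sketch of such a row/column representation in Python is shown below; the specific field names and values are invented for illustration only.

```python
# Each row is one traffic incident; each column position is one feature.
columns = ["weather", "road_condition", "time_of_day", "date", "casualties"]
rows = [
    ["rain",  "wet", "07:45", "2015-03-02", 0],
    ["clear", "dry", "17:30", "2015-06-14", 2],
    ["snow",  "icy", "23:10", "2015-12-01", 1],
]
```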
[0016] The computing device 100 may be any type of a computing device and may include at least engines 120-140. In one implementation, the computing device 100 may be an independent computing device. Engines 120-140 may or may not be part of the machine-readable storage medium 110. In another alternative example, engines 120-140 may be distributed between the computing device 100 and other computing devices. The computing device 100 may include additional components and some of the components depicted therein may be removed and/or modified without departing from a scope of the system that allows for carrying out the functionality described herein. It is to be understood that the operations described as being performed by the engines 120-140 of the computing device 100 that are related to this description may, in some implementations, be performed by external engines (not shown) or distributed between the engines of the computing device 100 and other electronic/computing devices.
[0017] Processor 102 may be central processing unit(s) (CPUs), microprocessor(s), and/or other hardware device(s) suitable for retrieval and execution of instructions (not shown) stored in machine-readable storage medium 110. Processor 102 may fetch, decode, and execute instructions to identify different groups in a dataset. As an alternative or in addition to retrieving and executing instructions, processor 102 may include electronic circuits comprising a number of electronic components for performing the functionality of instructions.
[0018] Interface 106 may include a number of electronic components for communicating with various devices. For example, interface 106 may be an Ethernet interface, a Universal Serial Bus (USB) interface, an IEEE 1394 (Firewire) interface, an external Serial Advanced Technology Attachment (eSATA) interface, or any other physical connection interface suitable for communication with the computing device. Alternatively, interface 106 may be a wireless interface, such as a wireless local area network (WLAN) interface or a near-field communication (NFC) interface that is used to connect with other devices/systems and/or to a network. The user interface 160 and the computing device 100 may be connected via a network. In one example, the network may be a mesh sensor network (not shown). The network may include any suitable type or configuration of network to allow for communication between the computing device 100, the user interface 160, and any other devices/systems (e.g., other computing devices, displays, etc.), for example, to send and receive data to and from a corresponding interface of another device.
[0019] Each of the engines 120-140 may include, for example, at least one hardware device including electronic circuitry for implementing the functionality described below, such as control logic and/or memory. In addition or as an alternative, the engines 120-140 may be implemented as any combination of hardware and software to implement the functionalities of the engines. For example, the hardware may be a processor and the software may be a series of instructions or microcode encoded on a machine-readable storage medium and executable by the processor. Therefore, as used herein, an engine may include program code (e.g., computer executable instructions), hardware, firmware, and/or logic, or combination thereof to perform particular actions, tasks, and functions described in more detail herein in reference to Figures 2-4.
[0020] In one example, the scoring engine 120 may score a plurality of classifiers on a guidance set of example items in a dataset of items (e.g., the dataset 150) and may select the best scoring classifier. In one implementation, the scoring engine 120 may apply each of the plurality of classifiers to the example items in the guidance set, may determine an accuracy of each of the plurality of classifiers to separate the example items in the guidance set to groups based on user's labels of the example items, and may determine the best scoring classifiers based on the accuracy of separation. In addition, the scoring engine 120 may select the best scoring classifier from the plurality of classifiers. As explained in additional details below, the guidance set of example items in the dataset identifies groups of interest to a user. Various techniques may be used to score the plurality of classifiers. In addition, the scoring engine 120 may select the best scoring classifier from the plurality of classifiers.
[0021] The analysis engine 130 may determine a group of residual items in the dataset (i.e., a set of unlabeled items for which relevant groupings or clusters are sought). In one example, the residual items may include all items that are not already labeled (i.e., eliminating those items that are in the guidance set). In another example, the residual items may include all unlabeled items that do not belong to the categories defined by the labels of the guidance set. Various techniques may be used to determine a group of residual items in the dataset.
[0022] The classification engine 140 may apply the selected classifier (i.e., the classifier selected by the scoring engine 120) to the residual items in the dataset in order to partition the residual items into a plurality of groups corresponding to the categories output by the classifier. The engine 140 may remove any group from the plurality of groups that corresponds to a group from the guidance set of example items. In one example, the selected classifier may be applied to the guidance set, and then the classification engine 140 may remove output groups that receive any (or many) of the guidance set items. In addition, the classification engine 140 may partition the residual items in the dataset into a plurality of groups according to the outputs of the selected classifier, and may output at least one group from the plurality of groups (e.g., on the interface 160).
[0023] Figure 2 illustrates a flowchart showing an example of a method 200 for generating clusters or groups of items in a dataset by using a classifier selected based on a user's input. Although execution of the method 200 is described below with reference to the system 10, the components for executing the method 200 may be spread among multiple devices/systems. The method 200 may be implemented in the form of executable instructions stored on a machine-readable storage medium, and/or in the form of electronic circuitry.
[0024] In one example, the method 200 can be executed by at least one processor of a computing device (e.g., processor 102 of device 100). In other examples, the method may be executed by another processor in communication with the system 10. Various elements or blocks described herein with respect to the method 200 are capable of being executed simultaneously, in parallel, or in an order that differs from the illustrated serial manner of execution. The method 200 is also capable of being executed using additional or fewer elements than are shown in the illustrated examples.
[0025] The method 200 begins at 210, where at least one processor may score a plurality of classifiers on a guidance set of example items in a dataset (e.g., 150) of items. In some examples, the processor may use a guidance set of example items in the dataset 150 to score the plurality of classifiers. As noted above, a guidance set of example items in a dataset of items may include a portion of the entire dataset 150 and may identify groups of interest to a user (i.e., by using user labels on the guidance set). In one implementation, a user may wish to divide a dataset of items in a customer support log based on the type of support issues that are received. In that case, a user may identify topics or groups of example items that are of interest to him or her: A = server issues, B = laptop issues, and C = monitor issues. These topics or groups of example items form the guidance set of items. Thus, the guidance set of example items in the dataset 150 identifies groups (e.g., A, B, C, etc.) of interest to the user.
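One possible in-memory representation of such a guidance set is sketched below; the example log snippets and the mapping-style representation are assumptions for illustration only and are not part of the described method.

```python
# The guidance set maps a handful of user-labeled example items (here,
# customer-support log snippets) to the user's own topic labels.
guidance_set = {
    "server rebooted unexpectedly during nightly backup": "A",  # A = server issues
    "laptop battery drains from full within an hour":     "B",  # B = laptop issues
    "external monitor flickers when the lid is opened":   "C",  # C = monitor issues
}
```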
[0026] Further, a user may label example items (e.g., log entries, documents, etc.) in the guidance set by providing labels on some of the example items. In other words, the user may provide examples for some of the topics/groups that he or she is interested in, in the guidance set of items. As explained in additional details below, this information may be used to score the plurality of classifiers. An example technique for scoring the classifiers is described below in relation to Figure 3. In some examples, the plurality of classifiers may include a library (not shown) of multi-class classifiers (i.e., ready-to-use classifiers, not classifier methodologies such as Naive Bayes). Each such classifier may be at least one of a previously-trained machine-learning classifier and a simple multi-class classifier formed from a list of related terms. The processor 102 may form a simple multi-class classifier using a list of related terms (i.e., keywords or phrases). For example, list 1 may include the terms: hard disk drive, mouse, keyboard, screen adapter, USB, etc., and list 2 may include the terms: jam, ink, cartridge, streaks, offline, etc. Thus, the processor may form a trivial classifier from each list, where that classifier may simply use each term as a category and may use the term to search for matching items. The item may then be placed into any groups or categories in which the term appears.
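A minimal sketch of such a trivial term-list classifier is shown below. The helper name make_term_list_classifier and the substring-matching rule are illustrative assumptions consistent with the description (each term acts as a category, and an item may fall into every category whose term it contains).

```python
def make_term_list_classifier(terms):
    """Trivial multi-class classifier built from a list of related terms:
    each term is its own category, and an item is placed into every
    category whose term appears in the item's text."""
    def classify(item_text):
        text = item_text.lower()
        return [term for term in terms if term.lower() in text]
    return classify

# Example term lists from the description above.
list_1 = ["hard disk drive", "mouse", "keyboard", "screen adapter", "USB"]
list_2 = ["jam", "ink", "cartridge", "streaks", "offline"]

hardware_classifier = make_term_list_classifier(list_1)
printer_classifier = make_term_list_classifier(list_2)

print(hardware_classifier("replaced the keyboard and the USB hub"))  # ['keyboard', 'USB']
print(printer_classifier("paper jam and streaks on every page"))     # ['jam', 'streaks']
```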
[0027] At 220, the processor 102 may select the best scoring classifier after scoring the plurality of classifiers. Various techniques may be used to select the best classifier. Several example techniques are described below in relation to Figure 3, but other techniques could also be used.
[0028] Next, the processor may determine a group of residual items in the dataset 150 (at 230). As noted above, the residual items in the dataset may include items that are not in the known guidance set of items. Various techniques may be used to determine a group of residual items in the dataset 150. In one example, the system may train at least one multiclass classifier on the guidance set of example items and may then apply the classifier to all unlabeled items (e.g., all items except those in the guidance set) in order to select the low confidence classifications (i.e., the residual set of items) as those that probably do not belong to the guidance set groups and therefore place these items in the residual. For instance, a processor may train a classifier on groups of example items A vs. B and C, and do the same for groups B and C (i.e., B vs. A and C, and C vs. A and B). Then, all items in the guidance set may be scored with respect to all (e.g., three) binary classifiers. Any items that score low with respect to all three classifiers may be determined to be part of the residual.
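A sketch of this first residual-finding variant is shown below, using scikit-learn as an assumed (not required) off-the-shelf library: one binary classifier is trained per guidance-set group, the unlabeled items are scored by all of them, and items that score low everywhere are treated as residual. The threshold and the bag-of-words feature representation are illustrative choices, not part of the described method.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def find_residual(guidance_texts, guidance_labels, unlabeled_texts, threshold=0.0):
    """Items whose score is below `threshold` for every one-vs-rest
    classifier trained on the guidance set are treated as residual."""
    vec = CountVectorizer()
    X_guid = vec.fit_transform(guidance_texts)
    X_unlab = vec.transform(unlabeled_texts)

    scores = []
    for label in sorted(set(guidance_labels)):
        # Binary classifier: this guidance group vs. all other groups.
        y = [1 if l == label else 0 for l in guidance_labels]
        clf = LogisticRegression().fit(X_guid, y)
        scores.append(clf.decision_function(X_unlab))

    # An item is residual if its best score is still low, i.e., it probably
    # belongs to none of the guidance-set topics.
    best_score = np.max(np.vstack(scores), axis=0)
    return [t for t, s in zip(unlabeled_texts, best_score) if s < threshold]
```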
[0029] In another example, a processor may place all items in the dataset 150 that have any user labels into a positive class and all unlabeled items into a negative class. Then, a binary classifier may be trained on this two-class dataset. That classifier may separate dataset items in the positive class from items in the negative class and may be used to determine the residual data in the dataset 150 (e.g. selecting items that are classified as negative, or sorting all the items by the output score of the binary classifier and selecting, for example, the 30% most negative for use as the residual).
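The second variant could be sketched as follows, again assuming scikit-learn and a bag-of-words representation purely for illustration: labeled items form the positive class, unlabeled items the negative class, and the most negative-scoring fraction of the unlabeled items (30% in this sketch) is kept as the residual.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def residual_by_negative_fraction(labeled_texts, unlabeled_texts, fraction=0.30):
    """Labeled items form the positive class, unlabeled items the negative
    class; the most negative-scoring `fraction` of the unlabeled items is
    returned as the residual."""
    texts = list(labeled_texts) + list(unlabeled_texts)
    y = [1] * len(labeled_texts) + [0] * len(unlabeled_texts)

    X = CountVectorizer().fit_transform(texts)
    clf = LogisticRegression().fit(X, y)

    # Score only the unlabeled portion and keep the lowest-scoring fraction.
    scores = clf.decision_function(X[len(labeled_texts):])
    order = np.argsort(scores)                       # most negative first
    k = int(fraction * len(unlabeled_texts))
    return [unlabeled_texts[i] for i in order[:k]]
```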
[0030] With continued reference to Figure 1 , the processor 102 may then apply the selected classifier to the residual items in the dataset (at 240). At 250, the processor may partition the residual items in the dataset into a plurality of groups or clusters according to the outputs of the selected classifier. Because the classifier is selected to be relevant to the user's clustering interest, as indicated by the guidance set, the generated clusters from the selected classifier will tend to be relevant to the user's vision of clustering the dataset 150.
[0031] In one example, the processor 102 may remove any group from the plurality of groups (generated at 250) that corresponds to a known group from the guidance set of example items. In other words, when the processor finds a group in the residual items of the dataset 150 that is already known (e.g., one of groups A, B, C, identified by the user), the processor may remove that group so that it is not proposed to the user. In another example, the processor may retain any group from the plurality of groups (generated at 250) that corresponds to a known group from the guidance set of example items. Thus, all generated groups may be analyzed before they are outputted.
[0032] At 260, the processor may output at least one group from the plurality of groups (e.g., on the interface 160). As noted above, the at least one outputted group may not include any group from the plurality of groups that corresponds to a known group from the guidance set of example items. In one example, the processor may identify the largest group (e.g., the group having the most residual items) and may output at least the largest group from the plurality of groups. In another example, the processor may sort the plurality of groups and may output a sequence of groups (e.g., a number of groups selected by the user, with the largest group first). In one implementation, sorting the plurality of groups may be performed based on a size of the groups determined based on a count of residual items in each group that are designated by the selected classifier. In another implementation, the items in the dataset may include a cost or importance number that reflects the relative importance of the item (e.g., cost to fix an item, time to travel, etc.). Thus, instead of counting the total number of items in each group, the processor may calculate the total cost or importance of each generated group, as measured by this numerical field. For example, group A1 may not have very many residual items that are designated by the selected classifier, but the total cost of that group may be higher than group A2. Thus, the processor may alternatively output the group having the highest total cost from the plurality of groups.
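A minimal sketch of this ranking step follows; the assignments and cost data structures are assumptions for illustration. Groups are ordered either by how many residual items the selected classifier placed in them or, when a per-item cost or importance value is available, by the group's total cost (as in the A1 vs. A2 example above).

```python
from collections import defaultdict

def rank_groups(assignments, cost=None):
    """Rank generated groups by item count or, if a per-item cost/importance
    value is supplied, by total cost.

    assignments -- dict mapping each residual item to the group it was placed in
    cost        -- optional dict mapping each item to its cost/importance value
    """
    groups = defaultdict(list)
    for item, group in assignments.items():
        groups[group].append(item)

    if cost is None:
        weight = len                                          # rank by item count
    else:
        weight = lambda items: sum(cost[i] for i in items)    # rank by total cost
    return sorted(groups.items(), key=lambda kv: weight(kv[1]), reverse=True)
```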
[0033] In some examples, the selected classifier may cover just a narrow range of groups (e.g., monitor issues such as cracked displays, dimmed displays, flickering displays, etc.), where the dataset 150 (including the guidance set) has many items about other types of customer support logs (e.g., keyboard issues, email issues, etc.). Thus, such a classifier may mistakenly classify items from the dataset 150 as other types. To avoid this situation, the processor may first separate from the group of residual items those cases that are outside the range of the selected classifier. This set of items may be identified as residual to the residual (i.e., residual items that are not related to the selected classifier). These items are unlike the items in the guidance set of example items and are also unlike the classes or groups of the selected classifier. In other words, the processor determines two types of residual: the regular group of residual items R (i.e., residual that the classifier can treat) and the residual of the residual RR (i.e., residual that the classifier cannot treat, or items that are unlike the classifier's topics).
[0034] In one implementation, the processor may subtract RR from R, and may apply the selected classifier only to these items from the dataset (R-RR) to identify clusters (as described in block 240). This may generate a first cluster X1. The processor may then apply any known clustering by intent classification method to the items of RR (i.e., the residual of the residual) to generate a second proposed cluster X2. In one example, both X1 and X2 may be shown to the user (block 260), and the user may choose which to select. In another example, only the larger cluster may be offered to the user.
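One hedged way to realize the R / RR split and the two proposed clusters X1 and X2 is sketched below; the confidence threshold and the use of k-means as a stand-in for the clustering method applied to RR are assumptions, not requirements of the disclosure.

```python
# Sketch: split residual R into (R - RR) and RR, then propose clusters from each part.
import numpy as np
from sklearn.cluster import KMeans

def split_and_cluster(selected_clf, X_residual, rr_threshold=0.3, n_rr_clusters=5):
    confidence = np.max(selected_clf.predict_proba(X_residual), axis=1)
    rr_mask = confidence < rr_threshold                       # RR: items unlike the classifier's classes
    x1_labels = selected_clf.predict(X_residual[~rr_mask])    # proposed clusters X1 from R - RR
    # Stand-in for the clustering method applied to RR (assumes RR has at least n_rr_clusters items).
    x2_labels = KMeans(n_clusters=n_rr_clusters, n_init=10).fit_predict(X_residual[rr_mask])
    return x1_labels, x2_labels
```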
[0035] Figure 3 illustrates a flowchart showing an example of a method 300 for scoring a plurality of classifiers. Although execution of the method 300 is described below with reference to the system 10, the components for executing the method 300 may be spread among multiple devices/systems. The method 300 may be implemented in the form of executable instructions stored on a machine-readable storage medium, and/or in the form of electronic circuitry. In one example, the method 300 can be executed by at least one processor of a computing device (e.g., processor 102 of device 100).
[0036] The method 300 begins at 310, where at least one processor may apply each of the plurality of classifiers to the example items in the guidance set (e.g., the groups A, B, C, identified by the user). In one example, the processor only applies the plurality of classifiers to the guidance set, without training the classifiers on the data in the guidance set (which could be slow).
[0037] At 320, the processor may measure an accuracy of each of the plurality of classifiers to separate the example items in the guidance set to groups based on the user's labels of the example items. In other words, the processor may determine how each of the classifiers separates the items in the guidance set into different groups by using the examples for each topic/group of items that are identified by the user. The processor may not have an established, mapped correspondence between the user topics or groups (e.g., A, B, C, etc.) in the guidance set and the classes of the different classifiers. Thus, in one example, the processor may perform the accuracy measurement by treating each output category of the classifier as a cluster. In one implementation, the processor may determine an average purity of the user's labels for the groups of example items to measure the accuracy of each of the plurality of classifiers. In other words, each group or cluster may be assessed for its purity. The processor may determine the most common label of the example items in the cluster (e.g., C - monitor issues), and then determine the percent of example items in the cluster that are of the same label as that most common label. Then, an average purity over all identified groups for the guidance set may be determined.
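The average-purity measurement could be computed as in the following sketch, where each cluster's purity is the fraction of its guidance items carrying the cluster's most common user label; the unweighted average over clusters is an illustrative choice.

```python
# Sketch: average purity of a classifier's output clusters against the user's labels.
from collections import Counter, defaultdict

def average_purity(cluster_ids, user_labels):
    # cluster_ids[i]: the classifier's output class for guidance item i
    # user_labels[i]: the user's label (e.g., "A", "B", "C") for the same item
    clusters = defaultdict(list)
    for cid, label in zip(cluster_ids, user_labels):
        clusters[cid].append(label)
    purities = []
    for labels in clusters.values():
        majority_count = Counter(labels).most_common(1)[0][1]
        purities.append(majority_count / len(labels))   # fraction matching the majority label
    return sum(purities) / len(purities)
```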
[0038] In another example, the processor may apply a scoring penalty to any classifier from the plurality of classifiers that does not include more classes than the number of labels of example items in the guidance set. The scoring penalty may be based on a number of additional classes provided by a classifier that exceed the number of labels or groups of example items in the guidance set. In other words, any classifier that does not offer more classes than the labels in the guidance set (e.g., A, B, C) may receive a scoring penalty. The scoring penalty may be a value assigned to that classifier or may eliminate the classifier from further consideration.
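One possible form of this penalty is sketched below; the linear weighting is an assumption, since the disclosure only requires that classifiers lacking extra classes be penalized or eliminated.

```python
# Sketch: penalize classifiers that do not offer more classes than the guidance labels.
def class_count_penalty(n_classifier_classes, n_guidance_labels, weight=0.1):
    extra_classes = n_classifier_classes - n_guidance_labels
    if extra_classes > 0:
        return 0.0                        # offers classes beyond the guidance labels: no penalty
    return weight * (1 - extra_classes)   # penalty grows as the shortfall grows
```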
[0039] In a further example, the processor may apply at least one of the plurality of classifiers to at least a portion of the residual items in the dataset while scoring the plurality of classifiers. The processor may also apply a scoring penalty to at least one of the plurality of classifiers when the classifier does not partition the residual items in the dataset to at least a predetermined number of groups. Thus, while scoring any classifier, the processor may apply it to the residual (or a portion of the residual) in advance, in order to see how this classifier will treat the residual. The goal is to find the best classifiers to subdivide the residual.

[0040] In yet another example, the processor may apply a scoring penalty to at least one of the plurality of classifiers when the classifier separates the example items in the guidance set to a number of groups that is larger than a predetermined number, where the predetermined number is at least larger than the number of labels of example items in the guidance set. In other words, the processor may apply a penalty to a classifier when it creates many more clusters out of the guidance set than there are known topics/groups labeled in the guidance set. A classifier that subdivides a dataset very finely into many small clusters would inherently achieve higher purity, possibly by chance. Thus, the processor may penalize such a classifier as not aligning well with the number of topics/groups the user has labeled so far.
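The two penalties just described might be realized as in the sketch below; the thresholds (min_groups, slack) and the penalty magnitudes are hypothetical values chosen only for illustration.

```python
# Sketch: penalties tied to how a candidate classifier treats the residual and the guidance set.
def residual_spread_penalty(clf, X_residual_sample, min_groups=2, penalty=0.2):
    n_groups = len(set(clf.predict(X_residual_sample)))
    return penalty if n_groups < min_groups else 0.0   # residual not subdivided into enough groups

def over_splitting_penalty(clf, X_guidance, n_guidance_labels, slack=2, penalty=0.2):
    n_groups = len(set(clf.predict(X_guidance)))
    limit = n_guidance_labels + slack                   # "predetermined number" (assumed form)
    return penalty if n_groups > limit else 0.0         # guidance set split too finely
```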
[0041] With continued reference to Figure 3, the processor may determine the best scoring classifiers based on the accuracy of separation (at 330). In other words, after applying the classifiers and any scoring penalties, the processor may identify the best scoring classifier to be used by the system (as described above in relation to Figure 2).
[0042] Figure 4 illustrates a computer 401 and a non-transitory machine-readable medium 405 according to an example. In one example, the computer 401 may be similar to the computing device 100 of the system 10 or may include a plurality of computers. For example, the computer may be a server computer, a workstation computer, a desktop computer, a laptop, a mobile device, or the like, and may be part of a distributed system. The computer may include one or more processors and one or more machine-readable storage media. In one example, the computer may include a user interface (e.g., touch interface, mouse, keyboard, gesture input device, etc.).
[0043] Computer 401 may perform methods 200-300 and variations thereof. Additionally, the functionality implemented by computer 401 may be part of a larger software platform, system, application, or the like. Computer 401 may be connected to a database (not shown) via a network. The network may be any type of communications network, including, but not limited to, wire-based networks (e.g., cable), wireless networks (e.g., cellular, satellite), cellular telecommunications network(s), and IP-based telecommunications network(s) (e.g., Voice over Internet Protocol networks). The network may also include a traditional landline or public switched telephone network (PSTN), or combinations of the foregoing.
[0044] The computer 401 may include a processor 403 and non-transitory machine-readable storage medium 405. The processor 403 (e.g., a central processing unit, a group of distributed processors, a microprocessor, a microcontroller, an application-specific integrated circuit (ASIC), a graphics processor, a multiprocessor, a virtual processor, a cloud processing system, or another suitable controller or programmable device) and the storage medium 405 may be operatively coupled to a bus. Processor 403 can include single or multiple cores on a chip, multiple cores across multiple chips, multiple cores across multiple devices, or combinations thereof.
[0045] The storage medium 405 may include any suitable type, number, and configuration of volatile or non-volatile machine-readable storage media to store instructions and data. Examples of machine-readable storage media include read-only memory ("ROM"), random access memory ("RAM") (e.g., dynamic RAM ["DRAM"], synchronous DRAM ["SDRAM"], etc.), electrically erasable programmable read-only memory ("EEPROM"), magnetoresistive random access memory (MRAM), memristor, flash memory, SD card, floppy disk, compact disc read only memory (CD-ROM), digital video disc read only memory (DVD-ROM), and other suitable magnetic, optical, physical, or electronic memory on which software may be stored.
[0046] Software stored on the non-transitory machine-readable storage media 405 and executed by the processor 403 includes, for example, firmware, applications, program data, filters, rules, program modules, and other executable instructions. The processor 403 retrieves from the machine-readable storage media 405 and executes, among other things, instructions related to the control processes and methods described herein.
[0047] The processor 403 may fetch, decode, and execute instructions 307-313 among others, to implement various processing. As an alternative or in addition to retrieving and executing instructions, processor 403 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 307-313. Accordingly, processor 403 may be implemented across multiple processing units and instructions 307-313 may be implemented by different processing units in different areas of computer 401.
[0048] The instructions 307-313 when executed by processor 403 (e.g., via one processing element or multiple processing elements of the processor) can cause processor 403 to perform processes, for example, methods 200-300, and/or variations and portions thereof. In other examples, the execution of these and other methods may be distributed between the processor 403 and other processors in communication with the processor 403.
[0049] For example, storing instructions 307 may cause processor 403 to rank a plurality of classifiers on a guidance set of example items in a dataset of items, where the guidance set of example items in the dataset identifies groups of interest to a user. Further, scoring instructions 307 may cause processor 403 to select the best scoring classifier from the plurality of classifiers. These instructions may function similarly to the techniques described in blocks 210-220 of method 200 and to the techniques described in method 300.
[0050] Analysis instructions 311 may cause the processor 403 to determine a group of residual items in the dataset. These instructions may function similarly to the techniques described in block 230 of method 200.
[0051] Classification instructions 315 may cause the processor 403 to apply the selected classifier to the residual items in the dataset and to partition the residual items in the dataset into a plurality of groups according to the outputs of the selected classifier. Further, classification instructions 315 may cause the processor 403 to output at least one group from the plurality of groups. These instructions may function similarly to the techniques described in blocks 240-260 of method 200.
[0052] In the foregoing description, numerous details are set forth to provide an understanding of the subject matter disclosed herein. However, implementations may be practiced without some or all of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.

Claims

What is claimed is:
1. A method comprising, by at least one processor:
scoring a plurality of classifiers on a guidance set of example items in a dataset of items;
selecting a best scoring classifier;
determining a group of residual items in the entire dataset;
applying the selected classifier to the residual items in the dataset;
partitioning the residual items in the dataset into a plurality of groups according to the outputs of the selected classifier; and
outputting at least one group from the plurality of groups.
2. The method of claim 1, further comprising, by at least one processor: removing any group from the plurality of groups that corresponds to a group from the guidance set of example items.
3. The method of claim 1, further comprising, by at least one processor: outputting at least one of:
the largest group from the plurality of groups, wherein the largest group is determined based on a size of the groups which is based on a count of residual items in each group that are designated by the selected classifier, and
the group having the highest total cost from the plurality of groups.
4. The method of claim 1, wherein the guidance set of example items in the dataset identifies groups of interest to a user.
5. The method of claim 1, wherein the plurality of classifiers include a library of multi-class classifiers, and wherein each classifier may be at least one of a previously-trained machine-learning classifier and a multi-class classifier formed from a list of related terms.
6. The method of claim 1, further comprising, by at least one processor:
applying each of the plurality of classifiers to the example items in the guidance set;
measuring an accuracy of each of the plurality of classifiers to separate the example items in the guidance set to groups based on user's labels of the example items; and
determining the best scoring classifiers based on the accuracy of separation.
7. The method of claim 6, further comprising, by at least one processor: determining an average purity of the user's labels for the groups of example items to measure the accuracy of each of the plurality of classifiers.
8. The method of claim 6, further comprising, by at least one processor: applying a scoring penalty to any classifier from the plurality of classifiers that does not include more classes than the number of labels of example items in the guidance set, wherein the scoring penalty is based on a number of additional classes provided by a classifier that exceed the number of labels of example items in the guidance set.
9. The method of claim 6, further comprising, by at least one processor:
applying at least one of the plurality of classifiers to at least a portion of the residual items in the dataset while scoring the plurality of classifiers; and
applying a scoring penalty to at least one of the plurality of classifiers when the classifier does not partition the residual items in the dataset to at least a predetermined number of groups.
10. The method of claim 6, further comprising, by at least one processor: applying a scoring penalty to at least one of the plurality of classifiers when the classifier separates the example items in the guidance set to a number of groups that is larger than a predetermined number, wherein the predetermined number is at least larger than the number of labels of example items in the guidance set.
11. A system comprising:
a scoring engine to:
score a plurality of classifiers on a guidance set of example items in a dataset of items, wherein the guidance set of example items in the dataset identifies groups of interest to a user;
select a best scoring classifier;
an analysis engine to determine a group of residual items in the entire dataset; and
a classification engine to:
apply the selected classifier to the residual items in the dataset, remove any group from the plurality of groups that corresponds to a group from the guidance set of example items,
partition the residual items in the dataset into a plurality of groups according to the outputs of the selected classifier, and
output at least one group from the plurality of groups.
12. The system of claim 11, wherein the scoring engine is further to:
apply each of the plurality of classifiers to the example items in the guidance set;
determine an accuracy of each of the plurality of classifiers to separate the example items in the guidance set to groups based on user's labels of the example items; and
determine the best scoring classifiers based on the accuracy of separation.
13. The system of claim 12, wherein the scoring engine is further to: apply at least one of the plurality of classifiers to at least a portion of the residual items in the dataset while scoring the plurality of classifiers; and apply a scoring penalty to at least one of the plurality of classifiers when the classifier does not partition the residual items in the dataset to at least a predetermined number of groups.
14. A non-transitory machine-readable storage medium encoded with instructions executable by at least one processor, the machine-readable storage medium comprising instructions to:
rank a plurality of classifiers on a guidance set of example items in a dataset of items,
wherein the guidance set of example items in the dataset identifies groups of interest to a user;
select a best scoring classifier;
determine a group of residual items in the entire dataset;
apply the selected classifier to the residual items in the dataset;
partition the residual items in the dataset into a plurality of groups according to the outputs of the selected classifier; and
output at least one group from the plurality of groups.
15. The non-transitory machine-readable storage medium of claim 14, further comprising instructions to:
remove any group from the plurality of groups that corresponds to a group from the guidance set of example items; and
sort the plurality of groups.