WO2010117889A3 - Scalable clustering - Google Patents

Scalable clustering

Info

Publication number
WO2010117889A3
Authority
WO
WIPO (PCT)
Prior art keywords
features
clustering system
keywords
millions
items
Prior art date
Application number
PCT/US2010/029715
Other languages
French (fr)
Other versions
WO2010117889A2 (en)
Inventor
Anton Schwaighofer
Joaquin Quinonero Candela
Thomas Borchert
Thore Graepel
Ralf Herbrich
Original Assignee
Microsoft Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2009-04-10
Filing date
2010-04-01
Publication date
2011-01-20
Application filed by Microsoft Corporation
Priority to CN201080016627.6A (CN102388382B)
Priority to EP10762238.3A (EP2417538A4)
Priority to JP2012504721A (JP5442846B2)
Priority to CA2757703A (CA2757703C)
Publication of WO2010117889A2
Publication of WO2010117889A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 Visual data mining; Browsing structured data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/40 Data acquisition and logging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

A scalable clustering system is described. In an embodiment, the clustering system is operable for extremely large-scale applications in which millions of items having tens of millions of features are clustered. In an embodiment, the clustering system uses a probabilistic cluster model which models uncertainty in the data set, where the data set may be, for example, advertisements which are subscribed to keywords, text documents containing text keywords, images having associated features, or other items. In an embodiment, the clustering system is used to generate additional features for associating with a given item. For example, additional keywords are suggested to which an advertiser may like to subscribe. The additional features that are generated have associated probability values which, in some embodiments, may be used to rank those features. In some examples, user feedback about the generated features is received and used to revise the feature generation process.
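
The feature-generation idea in the abstract can be illustrated with a small sketch. The abstract does not specify the form of the probabilistic cluster model, so the code below assumes a simple mixture of independent on/off (Bernoulli) features per cluster; this is an illustrative stand-in, not necessarily the model claimed in the application. The names cluster_posterior, suggest_features, PRIORS and PROFILES, and all numbers, are hypothetical.

```python
import math

# Hypothetical toy model: two clusters with made-up keyword probabilities.
# PROFILES[j][keyword] is the assumed probability that an item belonging to
# cluster j is subscribed to that keyword.
PRIORS = [0.5, 0.5]
PROFILES = [
    {"flights": 0.9, "hotels": 0.8, "car hire": 0.6, "laptops": 0.05, "phones": 0.05},
    {"flights": 0.05, "hotels": 0.05, "car hire": 0.05, "laptops": 0.9, "phones": 0.8},
]


def cluster_posterior(item, priors, profiles):
    """Return P(cluster | item) for an item given as a set of keywords.

    Assumption: features are independent within a cluster, and a keyword
    that is absent from the item counts as evidence against clusters that
    expect it.
    """
    log_scores = []
    for prior, profile in zip(priors, profiles):
        score = math.log(prior)
        for feature, p_on in profile.items():
            score += math.log(p_on if feature in item else 1.0 - p_on)
        log_scores.append(score)
    # Normalise in log space to avoid underflow when there are many features.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def suggest_features(item, priors, profiles, top_k=3):
    """Rank keywords the item lacks by their probability under the model."""
    posterior = cluster_posterior(item, priors, profiles)
    candidates = set(profiles[0]) - set(item)
    scored = [
        (feature, sum(r * profile[feature] for r, profile in zip(posterior, profiles)))
        for feature in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# An advertisement subscribed only to "flights": which keywords might fit too?
suggestions = suggest_features({"flights"}, PRIORS, PROFILES)
for keyword, probability in suggestions:
    print(f"{keyword}: {probability:.3f}")
```

Each suggested keyword carries a probability, so the suggestions can be ranked directly, as the abstract describes. Learning the cluster profiles from data and revising them from user feedback are outside this sketch.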
PCT/US2010/029715 2009-04-10 2010-04-01 Scalable clustering WO2010117889A2 (en)

Priority Applications (4)

Application Number Publication Number Priority Date Filing Date Title
CN201080016627.6A CN102388382B (en) 2009-04-10 2010-04-01 Scalable clustered approach and system
EP10762238.3A EP2417538A4 (en) 2009-04-10 2010-04-01 Scalable clustering
JP2012504721A JP5442846B2 (en) 2009-04-10 2010-04-01 Scalable clustering
CA2757703A CA2757703C (en) 2009-04-10 2010-04-01 Scalable clustering

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US12/421,853 2009-04-10
US12/421,853 US8204838B2 (en) 2009-04-10 2009-04-10 Scalable clustering

Publications (2)

Publication Number Publication Date
WO2010117889A2 (en) 2010-10-14
WO2010117889A3 (en) 2011-01-20

Family

ID=42935152

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2010/029715 WO2010117889A2 (en) 2009-04-10 2010-04-01 Scalable clustering

Country Status (7)

Country Link
US (1) US8204838B2 (en)
EP (1) EP2417538A4 (en)
JP (1) JP5442846B2 (en)
KR (1) KR101644667B1 (en)
CN (1) CN102388382B (en)
CA (1) CA2757703C (en)
WO (1) WO2010117889A2 (en)

Families Citing this family (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9065727B1 (en) 2012-08-31 2015-06-23 Google Inc. Device identifier similarity models derived from online event signals
US8458154B2 (en) * 2009-08-14 2013-06-04 Buzzmetrics, Ltd. Methods and apparatus to classify text communications
US8990105B1 (en) * 2010-01-07 2015-03-24 Magnetic Media Online, Inc. Systems, methods, and media for targeting advertisements based on user search information
WO2012064893A2 (en) * 2010-11-10 2012-05-18 Google Inc. Automated product attribute selection
US8473437B2 (en) * 2010-12-17 2013-06-25 Microsoft Corporation Information propagation probability for a social network
US9104765B2 (en) * 2011-06-17 2015-08-11 Robert Osann, Jr. Automatic webpage characterization and search results annotation
JP5153925B2 (en) * 2011-07-12 2013-02-27 ヤフー株式会社 Bid object recommendation device, system and method
EP2771806A4 (en) * 2011-10-28 2015-07-22 Blackberry Ltd Electronic device management using interdomain profile-based inferences
US8909581B2 (en) 2011-10-28 2014-12-09 Blackberry Limited Factor-graph based matching systems and methods
US8914262B2 (en) 2011-11-08 2014-12-16 The Mathworks, Inc. Visualization of data dependency in graphical models
US9354846B2 (en) 2011-11-08 2016-05-31 The Mathworks, Inc. Bidomain simulator
US20130116988A1 (en) 2011-11-08 2013-05-09 Fu Zhang Automatic solver selection
US8935137B1 (en) * 2011-11-08 2015-01-13 The Mathworks, Inc. Graphic theoretic linearization of sensitivity analysis
US9377998B2 (en) 2011-11-08 2016-06-28 The Mathworks, Inc. Code generation for control design
US20130116989A1 (en) 2011-11-08 2013-05-09 Fu Zhang Parameter tuning
US20130159254A1 (en) * 2011-12-14 2013-06-20 Yahoo! Inc. System and methods for providing content via the internet
JP5485311B2 (en) * 2012-02-07 2014-05-07 ヤフー株式会社 Advertisement evaluation apparatus, advertisement evaluation method and program
JP5425941B2 (en) * 2012-02-07 2014-02-26 ヤフー株式会社 Advertisement evaluation apparatus, advertisement evaluation method and program
US8880438B1 (en) 2012-02-15 2014-11-04 Google Inc. Determining content relevance
US9053185B1 (en) 2012-04-30 2015-06-09 Google Inc. Generating a representative model for a plurality of models identified by similar feature data
US20150242906A1 (en) * 2012-05-02 2015-08-27 Google Inc. Generating a set of recommended network user identifiers from a first set of network user identifiers and advertiser bid data
US8914500B1 (en) 2012-05-21 2014-12-16 Google Inc. Creating a classifier model to determine whether a network user should be added to a list
EP2973095B1 (en) 2013-03-15 2018-05-09 Animas Corporation Insulin time-action model
US10319035B2 (en) 2013-10-11 2019-06-11 Ccc Information Services Image capturing and automatic labeling system
US9697475B1 (en) 2013-12-12 2017-07-04 Google Inc. Additive context model for entity resolution
US10452992B2 (en) 2014-06-30 2019-10-22 Amazon Technologies, Inc. Interactive interfaces for machine learning model evaluations
US9672474B2 (en) * 2014-06-30 2017-06-06 Amazon Technologies, Inc. Concurrent binning of machine learning data
US9886670B2 (en) 2014-06-30 2018-02-06 Amazon Technologies, Inc. Feature processing recipes for machine learning
US11100420B2 (en) 2014-06-30 2021-08-24 Amazon Technologies, Inc. Input processing for machine learning
US10169715B2 (en) * 2014-06-30 2019-01-01 Amazon Technologies, Inc. Feature processing tradeoff management
US10318882B2 (en) 2014-09-11 2019-06-11 Amazon Technologies, Inc. Optimized training of linear machine learning models
US10963810B2 (en) 2014-06-30 2021-03-30 Amazon Technologies, Inc. Efficient duplicate detection for machine learning data sets
US10540606B2 (en) 2014-06-30 2020-01-21 Amazon Technologies, Inc. Consistent filtering of machine learning data
US10102480B2 (en) 2014-06-30 2018-10-16 Amazon Technologies, Inc. Machine learning service
US10339465B2 (en) 2014-06-30 2019-07-02 Amazon Technologies, Inc. Optimized decision tree based models
US9984159B1 (en) 2014-08-12 2018-05-29 Google Llc Providing information about content distribution
US11182691B1 (en) 2014-08-14 2021-11-23 Amazon Technologies, Inc. Category-based sampling of machine learning data
US20160055495A1 (en) * 2014-08-22 2016-02-25 Wal-Mart Stores, Inc. Systems and methods for estimating demand
US9971683B1 (en) * 2014-10-20 2018-05-15 Sprint Communications Company L.P. Automatic computer memory management coordination across a group of servers
US10846589B2 (en) * 2015-03-12 2020-11-24 William Marsh Rice University Automated compilation of probabilistic task description into executable neural network specification
US10257275B1 (en) 2015-10-26 2019-04-09 Amazon Technologies, Inc. Tuning software execution environments using Bayesian models
CN109983480B (en) * 2016-11-15 2023-05-26 谷歌有限责任公司 Training neural networks using cluster loss
KR102005420B1 (en) * 2018-01-11 2019-07-30 국방과학연구소 Method and apparatus for providing e-mail authorship classification
US10929110B2 (en) 2019-06-15 2021-02-23 International Business Machines Corporation AI-assisted UX design evaluation
CN111813910A (en) * 2020-06-24 2020-10-23 平安科技(深圳)有限公司 Method, system, terminal device and computer storage medium for updating customer service problem
CN116127346A (en) * 2021-01-20 2023-05-16 国义招标股份有限公司 Density clustering processing method, device and medium independent of history information

Citations (4)

Publication number Priority date Publication date Assignee Title
US6345265B1 (en) * 1997-12-04 2002-02-05 Bo Thiesson Clustering with mixtures of bayesian networks
US20020169730A1 (en) * 2001-08-29 2002-11-14 Emmanuel Lazaridis Methods for classifying objects and identifying latent classes
US20030055614A1 (en) * 2001-01-18 2003-03-20 The Board Of Trustees Of The University Of Illinois Method for optimizing a solution set
US20100034102A1 (en) * 2008-08-05 2010-02-11 At&T Intellectual Property I, Lp Measurement-Based Validation of a Simple Model for Panoramic Profiling of Subnet-Level Network Data Traffic

Family Cites Families (16)

Publication number Priority date Publication date Assignee Title
US6581058B1 (en) 1998-05-22 2003-06-17 Microsoft Corporation Scalable system for clustering of large databases having mixed data attributes
US6564197B2 (en) 1999-05-03 2003-05-13 E.Piphany, Inc. Method and apparatus for scalable probabilistic clustering using decision trees
GB9922221D0 (en) * 1999-09-20 1999-11-17 Ncr Int Inc Classifying data in a database
IL159332A0 (en) * 1999-10-31 2004-06-01 Insyst Ltd A knowledge-engineering protocol-suite
JP3615451B2 (en) * 2000-03-16 2005-02-02 日本電信電話株式会社 Document classification method and recording medium recording a program describing the classification method
US6952700B2 (en) * 2001-03-22 2005-10-04 International Business Machines Corporation Feature weighting in κ-means clustering
US7246125B2 (en) * 2001-06-21 2007-07-17 Microsoft Corporation Clustering of databases having mixed data attributes
US7231393B1 (en) 2003-09-30 2007-06-12 Google, Inc. Method and apparatus for learning a probabilistic generative model for text
AU2002360442A1 (en) * 2002-10-24 2004-05-13 Duke University Binary prediction tree modeling with many predictors
US7480640B1 (en) * 2003-12-16 2009-01-20 Quantum Leap Research, Inc. Automated method and system for generating models from data
US8117203B2 (en) 2005-07-15 2012-02-14 Fetch Technologies, Inc. Method and system for automatically extracting data from web sites
US7739314B2 (en) 2005-08-15 2010-06-15 Google Inc. Scalable user clustering based on set similarity
US8341158B2 (en) 2005-11-21 2012-12-25 Sony Corporation User's preference prediction from collective rating data
US7647289B2 (en) 2006-06-02 2010-01-12 Microsoft Corporation Learning belief distributions for game moves
US7788264B2 (en) * 2006-11-29 2010-08-31 Nec Laboratories America, Inc. Systems and methods for classifying content using matrix factorization
CN100578500C (en) * 2006-12-20 2010-01-06 腾讯科技(深圳)有限公司 Web page classification method and device

Also Published As

Publication number Publication date
KR20110138229A (en) 2011-12-26
CN102388382A (en) 2012-03-21
WO2010117889A2 (en) 2010-10-14
CA2757703C (en) 2017-01-17
CA2757703A1 (en) 2010-10-14
EP2417538A4 (en) 2016-08-31
JP2012523621A (en) 2012-10-04
US8204838B2 (en) 2012-06-19
JP5442846B2 (en) 2014-03-12
CN102388382B (en) 2015-11-25
EP2417538A2 (en) 2012-02-15
KR101644667B1 (en) 2016-08-01
US20100262568A1 (en) 2010-10-14

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201080016627.6

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10762238

Country of ref document: EP

Kind code of ref document: A2

WWE Wipo information: entry into national phase

Ref document number: 2010762238

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2757703

Country of ref document: CA

ENP Entry into the national phase

Ref document number: 20117023622

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 7775/DELNP/2011

Country of ref document: IN

WWE Wipo information: entry into national phase

Ref document number: 2012504721

Country of ref document: JP