WO2010117581A1 - Similarity-based feature set supplementation for classification - Google Patents

Similarity-based feature set supplementation for classification

Info

Publication number
WO2010117581A1
Authority
WO
WIPO (PCT)
Prior art keywords
items
item
media content
feature
neighbor
Prior art date
Application number
PCT/US2010/027709
Other languages
French (fr)
Inventor
Yu He
David Petrie Stoutamire
Original Assignee
Google Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google Inc.
Priority to EP10762077.5A (EP2417544A4)
Priority to CN2010800220637A (CN102428467A)
Priority to CA2757771A (CA2757771A1)
Publication of WO2010117581A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Definitions

  • the invention relates generally to classifying items that are associated with sparse or unknown data.
  • embodiments of the invention are directed toward classifying media content items associated with sparse or unknown feature datasets using feature datasets associated with related media content items.
  • Media hosting services allow users to upload media content where it can be shared with others for public viewing.
  • Media content provided by the users may include, for example, textual content (e.g. blogs), video content, audio content and image content.
  • Media hosting services can host millions of media content items.
  • users uploading content provide labels or tags to describe the media content by associating the media content with one or more categories.
  • Other users may browse or search for media content by providing keywords to search the information describing the media content such as the title, a summary of the media content, as well as the labels and tags.
  • information provided by the users to describe the media content is often sparse, inconsistent and/or inaccurate.
  • user-provided labels are often inconsistent as they are provided by different users and are subject to user opinion on what the media content is about. For instance, one user may provide a label indicating that a news video discussing the rising cost of gasoline relates to the "Environment," while another user may provide a label indicating that the same news video relates to "Politics."
  • the use of statistical classification techniques provides one method of standardizing assignment of labels indicating classes.
  • a statistical model or "classifier” is computationally generated.
  • the classifier specifies a set of features and their associated relevance in determining whether an item belongs to a class of items.
  • This classifier is applied to feature data associated with an item to determine whether the item has a correspondence to the class of items.
  • While statistical classification provides an efficient and standardized method of assigning class labels to items, this technique is most effective in instances when the items are associated with large amounts of accurate feature data.
  • user provided media content is often associated with sparse or inconsistent feature data.
  • existing statistical classification methods do not provide an effective means for classifying content based on its labels and descriptive information.
  • Embodiments of the present invention enable the generation of a set of class labels for labeling a media content item.
  • An embodiment of a method according to the present invention comprises a computer-implemented method for generating a set of class labels for labeling an item.
  • a set of neighbor items associated with a first item is identified based, in part, on a first feature set associated with the first item, wherein each neighbor item of the set of neighbor items is associated with a feature set.
  • a supplemented feature set is generated for the first item based on the identified set of neighbor items responsive to combining the first feature set and the feature sets associated with the set of neighbor items.
  • a set of classification scores associated with the first item is generated based on the supplemented feature set, each classification score of the set of classification scores indicating a likelihood that the first item belongs to a class of items.
  • FIG. 1 is a high-level block diagram of a system environment according to one embodiment.
  • FIG. 2 is a screenshot illustrating an interface for browsing media content associated with categories according to one embodiment.
  • FIG. 3 is a high-level block diagram illustrating a detailed view of the media host server according to one embodiment.
  • FIG. 4 is a flow-chart illustrating steps performed by the media host server to generate a similarity graph according to one embodiment.
  • FIG. 5 is a flow-chart illustrating steps performed by the media host server to generate a set of class labels for a media content item according to one embodiment.
  • FIG. 6 is a flow-chart illustrating steps performed by the media host server to refine a set of class labels associated with a media content item according to one embodiment.
  • FIG. 1 illustrates a system environment 100 comprising a media host service 104, a plurality of content providers 102 and a plurality of content viewers 106 connected by a network 114. Only three content viewers 106 are shown in FIG. 1 in order to simplify and clarify the description. Embodiments of the system environment 100 can have thousands or millions of content viewers 106 and/or content providers 102 connected to the network 114.
  • the media host service 104 communicates with content viewers 106 over the network 114.
  • the media host service 104 receives uploaded media content from content providers 102 and allows content to be viewed by content viewers 106.
  • Media content may be uploaded to the media host service 104 via the Internet from a personal computer, through a cellular network from a telephone or PDA, or by other means for transferring data over the network 114.
  • Media content may be downloaded from the media host service 104 in a similar manner; in one embodiment media content is provided as a file download to a content viewer 106; in an alternative embodiment, media content is streamed to the content viewer.
  • the means by which media content is received by the media host service 104 need not match the means by which it is delivered to a content viewer 106.
  • the content provider 102 may upload a video via a browser on a personal computer, whereas the content viewer 106 may view that video as a stream sent to a PDA.
  • a content provider 102 can also provide media content to the media host service 104. Examples of media content include audio, video, image and text content; other forms of content available for consumption may also be provided. The media content may have been created by content provider 102, but need not have been.
  • Content viewers 106 view media content provided by the media host service 104 via a user interface.
  • a content viewer 106 runs a web browser such as Microsoft Internet Explorer or Mozilla Firefox.
  • a media host service 104 includes a web server such as Microsoft Internet Information Services.
  • a content viewer 106 browses and searches for content provided by the media host service 104 and views content of interest, including video content.
  • the content viewer 106 uses other types of software applications to view, browse and search media content from the media host service 104.
  • the content viewer 106 also provides viewing metrics to the media host service 104.
  • the media host service 104 further functions to generate class labels for media content items. For a given media content item (a "target item") the media host service 104 identifies other media content items, herein referred to as “neighbor media content items,” that are similar to the target item, based on feature data associated with the target and neighbor media content items. Feature data associated with the media content items may include: user provided data associated with the media content items, viewing information associated with the media content items, and content data generated from the media content items. The media host service 104 generates a supplemented feature set for the target item by combining the feature data associated with the target media content item and the feature data associated with the neighbor media content items.
  • the media host service 104 classifies the target media content item based on the supplemented feature set to generate a set of class labels for the target media content item.
  • the media host service 104 further refines the set of class labels for the target media content item based on class labels associated with the neighbor media content items.
  • the media host service 104 leverages existing similarities in media content items to supplement feature data sets that are sparse, inconsistent and/or uncertain using feature data associated with neighbor media content items. This improves the classification of media content items that would not otherwise have a large enough feature set for classification.
  • the media host service 104 further leverages feature data from multiple independent sources to identify neighbor media content items and classify media content items. By integrating data from different sources, the media host service 104 compensates for sparseness and/or uncertainty associated with one source of feature data using feature data from an independent source of data. For example, sparseness associated with user-provided data (e.g. a lack of subject, title, or tags in the user-provided data) may be compensated for based on viewing information such as viewing statistics (e.g. the number of times media content items are requested by a same user) to identify similarity between two media content items. Likewise, uncertainty in similarity between two media content items based on viewing statistics (i.e.
  • FIG. 2 illustrates a screenshot of a graphical user interface 200 for browsing media content items provided by the media host server 104 according to an embodiment.
  • the media content is video content.
  • the media host service 104 may provide a graphical user interface for browsing other types of media content including songs, images, and text content.
  • the graphical user interface 200 includes a display window 215 for displaying a video and an information window 240 for displaying information describing the video.
  • the information window 240 displays a set of class labels 244 that describe a set of categories or classes the video is associated with.
  • a user may browse other videos that belong to the same categories or classes. For instance, a user may select the class label 'debate' to retrieve a set of videos associated with the label 'debate'.
  • the graphical user interface 200 further includes a related videos window 230 containing a set of videos that are related to the video. In the embodiment illustrated, the related videos are based in part on the class labels associated with the related videos.
  • the related videos are displayed with associated class labels 235 that overlap with the set of class labels 244 associated with the displayed video.
  • FIG. 3 is a high-level block diagram illustrating a detailed view of the media host service 104 according to one embodiment.
  • the media host service 104 includes several modules and servers. Those of skill in the art will recognize that other embodiments can have different modules and/or servers than the ones described here, and that the functionalities can be distributed among the modules and/or servers in a different manner. In addition, the functions ascribed to the media host service 104 can be performed by multiple servers.
  • the media upload server 306, the media content database 330 and/or the feature set database 350 may be hosted at one or more separate servers by different entities with the media host service 104 acting as a third party server to generate class labels for media received by the media upload server 306 and stored in the media content database 330.
  • the media upload server 306 receives media content uploaded by the content providers 102.
  • the media upload server 306 stores uploaded media content in the media content database 330.
  • the media upload server 306 further receives information derived from providing the media content to the content viewers 106 such as ratings associated with the media content and uploaded comments about the media content.
  • the media content database 330 stores received media content in association with unique identifiers for that media content.
  • the media content database 330 further stores user-provided information describing the media content such as an author of the media content, the date the media content was received by the media host server 104, the subject of the media content, tags or labels associated with the media content and comments provided by an author of the media content.
  • the media content database 330 further stores viewing information derived from providing the media content to content viewers 106 such as ratings of the media content provided by users, comments provided by the users, and the frequency at which the media content is viewed by the users.
  • the media content database 330 also stores viewing information that is specific to a media content item such as a set of media content items that are commonly viewed in association with the media content item.
  • the media content server 310 provides information and media content to the users.
  • the media content server 310 retrieves media content from the media content database 330.
  • the media content server 310 provides the retrieved media content to the content viewers 106.
  • the media content server 310 further functions to retrieve and provide information and media content responsive to search queries received from the content viewers 106.
  • the search queries may include criteria including search terms, class labels etc.
  • the media content server 310 further retrieves items of related media content based in part on the class labels associated with a selected media content item and provides the related media content items to the content viewer 106.
  • the media content server 310 further monitors viewing statistics and other viewing information associated with the media content, such as the frequency at which the media content is viewed, and stores the viewing information to the media content database 330.
  • the content feature engine 312 generates content features based on the media content.
  • Content features are metadata generated from the media content that can be used to characterize the media content.
  • the content feature engine 312 generates content features specific to the media type of the media content.
  • content features may include: pixel intensity, luminosity, data derived from shape detection algorithms and other data derived from still images.
  • content features may include: pitch, tone, mel-frequency cepstral coefficients (MFCCs), and other data derived from audio content.
  • content features may include data derived from shot detection algorithms, face detection algorithms, edge detection algorithms, and other data derived from video content, such as color, luminosity, texture and other features.
  • the content feature engine 312 stores the generated content features in the feature set database 350.
  • the text feature engine 308 generates text features based on the user-provided information describing the media content.
  • the text feature engine 308 generates text features which comprise one or more tokens and a numeric value associated with the token, such as a frequency value.
  • the text feature engine 308 generates the text features by tokenizing the user-provided information and determining the frequency of the tokens contained therein.
  • the text feature engine 308 may also stem the tokens or use lexicons to identify synonymous tokens prior to enumerating the frequency of the tokens.
  • the text feature engine 308 generates text features comprised of phrases such as noun phrases or verb phrases.
  • the frequency information for the tokens can be raw frequency information, or normalized, such as TF-IDF or similar frequency measures.
  • the text feature engine 308 generates the text features based on information describing the media content such as the title and summary associated with the media content. In other embodiments, the text feature engine 308 generates the text features from comments associated with the media content (e.g., as provided by users who view the media content item) and/or other sources of textual data referenced by the information describing the media content (e.g. web pages referenced in the summary associated with the media content). The text feature engine 308 further generates text features from video or image content using techniques such as speech recognition as applied to an audio track of a media content item, and optical character recognition (OCR) as applied to the images contained in a media content item.
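  • The token-frequency text features described above can be illustrated with a minimal Python sketch. The function names, the regular-expression tokenizer, and the idf = log(N / df) weighting are assumptions chosen for illustration; the description does not specify a tokenizer or a particular TF-IDF variant, and stemming and synonym lexicons are omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(text):
    """Raw token counts for one item's user-provided text."""
    return Counter(tokenize(text))


def tf_idf(docs):
    """TF-IDF weights per item; docs maps an item id to its descriptive text.

    Uses idf = log(N / df), one common variant (an assumption here).
    """
    counts = {item: term_frequencies(text) for item, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for tf in counts.values():
        doc_freq.update(tf.keys())  # each item counts once per distinct token
    return {
        item: {tok: freq * math.log(n_docs / doc_freq[tok]) for tok, freq in tf.items()}
        for item, tf in counts.items()
    }


# Hypothetical titles for two media content items.
docs = {
    "v1": "Soccer goal highlights: best soccer goals",
    "v2": "Gas prices debate",
}
weights = tf_idf(docs)
```

  • In this sketch a token that appears in every item receives a weight of zero, which is the usual effect of normalizing raw frequencies by document frequency.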
  • the feature set database 350 stores feature sets for media content items in association with unique identifiers for the media content items.
  • the feature sets include the text features generated by the text feature engine 308 and the content features generated by the content feature engine 312.
  • the feature sets further include viewing statistics and other viewing information stored in the media content database 330 such as the frequency a media content item is viewed by users and a set of frequencies specifying the number of times other media items are viewed in association with the media content item, herein referred to as "co-watch metrics.” These frequencies can be raw or normalized, as determined by the system administrator.
  • the similarity graph module 309 identifies neighbor media content items for media content items based on the feature sets.
  • the similarity graph module 309 first generates a set of distance metrics which specify a measure of similarity between two media content items. Based on the distance metrics, the similarity graph module 309 identifies neighbor media content items.
  • the similarity graph module 309 generates a set of distance metrics based on the feature sets stored in association with the media content items in the feature set database 350. For each pair of feature sets in the feature set database 350 associated with a respective first and second content item, the similarity graph module 309 generates a distance metric which indicates the similarity between the pair of feature sets for the items.
  • the distance metric can be a Euclidean distance metric generated based on corresponding features in the two feature sets. In other embodiments, the distance metric may be a correlation coefficient between the corresponding features.
  • the similarity graph module 309 may generate the distance metrics based on all of the features in the feature sets or a sub-portion of the features in the feature sets.
  • the similarity graph module 309 may generate the distance metric based on a specific type of feature in the feature sets. For instance, the similarity graph module 309 may generate the distance metric based only on viewing information such as co-watch metrics. The similarity graph module 309 stores the distance metrics in association with the feature sets and media content items in the media content database 330.
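  • The two distance metrics named above can be sketched as follows. Representing a feature set as a dictionary from feature name to numeric value, treating a missing feature as 0 for the Euclidean distance, and restricting the correlation to shared features are all assumptions made for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance over the union of features; a missing feature counts as 0."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def correlation(a, b):
    """Pearson correlation coefficient over the features present in both sets."""
    keys = sorted(set(a) & set(b))
    if len(keys) < 2:
        return 0.0  # not enough shared features to correlate
    xs = [a[k] for k in keys]
    ys = [b[k] for k in keys]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)
```

  • A smaller Euclidean distance indicates greater similarity, whereas a larger correlation coefficient does, so code that ranks neighbors needs to know which convention is in use.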
  • the similarity graph module 309 filters the media content items in the feature set database 350, for example by removing the items from the feature set database 350 or flagging the media content items in the feature set database 350, prior to generating the set of distance metrics.
  • the similarity graph module 309 filters the media content items according to a set of specified features that indicate a media content item is to be filtered.
  • the set of specified features are features that indicate that the media content item is an item of undesirable content.
  • features that indicate that the media content item is an item of undesirable content are specified by an administrator of the media host server 104 and can include features that indicate that the media content item comprises spam, adult content, or hate speech.
  • the similarity graph module 309 identifies neighbor media content items based on the distance metrics associated with the media content items. For each target item, the similarity graph module 309 selects a set of neighbor items based on the distance metrics which have some measure of similarity to the target item. Suitable methods for selecting a set of neighbor media content items may include clustering the distance metrics.
  • In one embodiment, the similarity graph module 309 selects a set of neighbor media content items by generating a similarity graph of the content items based on the distance metrics. The similarity graph module 309 generates a similarity graph containing a set of nodes, each of which represents a media content item in the feature set database 350.
  • the similarity graph module 309 selects a node representing a media content item as the target node.
  • the similarity graph module 309 connects the target node to the five (5) most similar media content items, based on their respective distance metrics.
  • the similarity graph module repeats this process by selecting each node as the target node and assigning edges between the target node and a set of nodes.
  • the similarity graph module 309 connects the target node to the maximum number of media items with distance metrics indicating at least a minimum degree of similarity to the media content item represented by the target node. If the similarity graph module is unable to find any media content items with distance metrics indicating the minimum degree of similarity to the media content item represented by the target node, the similarity graph module 309 connects the target node representing the media content item to a media item with a distance metric indicating a maximum similarity of all the distance metrics associated with the target node.
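  • The two edge-assignment strategies described above (a fixed number of most similar items, or every item meeting a minimum similarity with a fallback to the single most similar item) can be sketched as follows. The nested-dictionary layout `dist[a][b]`, the function names, and the use of a maximum distance as the "minimum degree of similarity" are assumptions for illustration.

```python
def k_nearest_edges(dist, k=5):
    """Connect every node to its k closest items (smaller distance = more similar)."""
    graph = {}
    for target, row in dist.items():
        ranked = sorted((d, other) for other, d in row.items() if other != target)
        graph[target] = [other for _, other in ranked[:k]]
    return graph


def threshold_edges(dist, max_distance):
    """Connect every node to all items within max_distance.

    If no item is close enough, fall back to the single most similar item.
    """
    graph = {}
    for target, row in dist.items():
        ranked = sorted((d, other) for other, d in row.items() if other != target)
        close = [other for d, other in ranked if d <= max_distance]
        graph[target] = close or [other for _, other in ranked[:1]]
    return graph


# Hypothetical pairwise distances between three items.
dist = {
    "a": {"b": 0.1, "c": 0.9},
    "b": {"a": 0.1, "c": 0.5},
    "c": {"a": 0.9, "b": 0.5},
}
graph = threshold_edges(dist, max_distance=0.2)
```

  • Here item "c" has no neighbor within the threshold, so the fallback connects it to "b", its closest item.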
  • the similarity graph module 309 prunes the similarity graph after the similarity graph is constructed using one or more pruning criteria.
  • the similarity graph module 309 may remove media content items based on viewing statistics that indicates which media content items in the graph are not actively viewed by content viewers 106.
  • the viewing statistics that indicate that the media content item is not actively viewed by content viewers 106 are specified by an administrator of the media host service 104 and may include statistics such as the views, ratings, or comments associated with the media content items; these statistics can include raw or normalized counts (e.g.
  • items can be pruned based on their co-watch metrics. For a given target item, the neighbor content items with the least significant (e.g., lowest valued) co-watch metrics can be pruned.
  • the above pruning criteria are applied to the neighbor content items for each target node until the nodes have been examined according to the criteria. These pruning criteria can be applied in any order desired by the system administrator.
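  • A minimal sketch of the two pruning criteria above, assuming the graph is a dictionary from node to neighbor list, view counts are kept per item, and co-watch metrics are keyed by (target, neighbor) pairs. The parameter names `min_views` and `keep` are hypothetical; the description leaves the actual cut-offs to the administrator.

```python
def prune_graph(graph, views, cowatch, min_views=1, keep=3):
    """Drop inactive items, then keep only each target's strongest co-watch neighbors."""

    def is_active(item):
        return views.get(item, 0) >= min_views

    pruned = {}
    for target, neighbors in graph.items():
        if not is_active(target):
            continue  # items that are not actively viewed leave the graph
        active = [n for n in neighbors if is_active(n)]
        # Highest co-watch metric first; the lowest-valued neighbors are pruned.
        active.sort(key=lambda n: cowatch.get((target, n), 0), reverse=True)
        pruned[target] = active[:keep]
    return pruned


# Hypothetical graph, view counts, and co-watch metrics.
graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
views = {"a": 100, "b": 50, "c": 0, "d": 20}
cowatch = {("a", "b"): 2, ("a", "d"): 9}
pruned = prune_graph(graph, views, cowatch, keep=1)
```

  • In this example item "c" is removed for having no views, and "a" keeps only "d", the neighbor with the highest co-watch metric.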
  • M number of media content items is equal to the N number of edges connecting each target node to items of media content.
  • the similarity graph module 309 traverses the similarity graph to identify neighbor media content items.
  • the similarity graph module 309 traverses the similarity graph by selecting a neighbor node representing a neighbor media content item with a distance metric indicating the highest similarity to the target media content item.
  • the similarity graph module 309 selects a node connected to the neighbor node with a distance metric that indicates a highest similarity to the neighbor node as a neighbor media content item.
  • the similarity graph module 309 continues this process until the number of neighbor media content items equals the specified number of neighbor media content items.
  • the foregoing processes of identifying neighbor content items, generating a similarity graph, and pruning the similarity graph provide a robust set of content items that have a very high likelihood of being meaningfully related to each other, based both on their intrinsic features as well as the extrinsic behaviors of the users who view the items.
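  • The traversal described above can be read as a greedy walk: from the target, step to the most similar connected node, then to that node's most similar connected node, and so on until the specified number of neighbors is collected. The sketch below follows that reading; skipping already visited nodes and stopping at a dead end are assumptions, since the description does not say how repeats are handled.

```python
def collect_neighbors(graph, dist, target, count):
    """Greedy walk over the similarity graph, collecting up to `count` neighbors."""
    visited = {target}
    neighbors = []
    current = target
    while len(neighbors) < count:
        candidates = [n for n in graph.get(current, []) if n not in visited]
        if not candidates:
            break  # dead end: every connected node has already been visited
        # Smallest distance = highest similarity to the current node.
        nearest = min(candidates, key=lambda n: dist[current][n])
        neighbors.append(nearest)
        visited.add(nearest)
        current = nearest
    return neighbors


# Hypothetical graph and distances.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
dist = {
    "a": {"b": 0.1, "c": 0.4},
    "b": {"a": 0.1, "d": 0.3},
    "c": {"a": 0.4},
    "d": {"b": 0.3},
}
found = collect_neighbors(graph, dist, "a", count=2)
```

  • Starting from "a", the walk selects "b" and then "d", so neighbors need not be directly connected to the target.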
  • Using the pruning criteria to prune the similarity graph leverages behavioral information from the user community as to which content items are of sufficient interest and are sufficiently related to each other.
  • the data aggregation module 314 generates supplemented feature sets for each media content item based on the set of neighbor media content items associated with the media content item.
  • the data aggregation module 314 combines the feature sets to generate a supplemented feature set.
  • the data aggregation module 314 generates the supplemented feature sets based on all (or a portion of) the data in the feature sets of all (or a selected subset of) the neighbor items. In one embodiment, the supplemented feature sets are generated based only on text features associated with the media content items.
  • the data aggregation module 314 combines the feature sets associated with the target media content item and neighbor media content items to generate a supplemented feature set.
  • the data aggregation module 314 combines the feature sets by simply merging the feature sets to generate a combined, unordered, and un-weighted feature set containing all of the features of all of the neighbor items.
  • the data aggregation module 314 merges features which occur in both data sets by adding, averaging, or otherwise mathematically combining the values associated with the features, such that for each type of feature, there is a single value (or set of values) as appropriate for the feature.
  • the data aggregation module 314 may produce an average color histogram from the color histograms of the neighbor content items (hence a set of frequency counts for color bins), whereas for a ratings feature, the data aggregation module 314 may produce a single average rating from the neighbor content items. In other embodiments, the data aggregation module 314 combines the feature sets associated with the neighbor media content items by weighting the feature sets based on the similarity values of the neighbor media content items.
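  • The averaging and similarity-weighted combinations described above can be sketched as follows for scalar features. Averaging only over the feature sets that contain a given feature, and giving the target item a weight of 1.0 in the weighted variant, are assumptions made for illustration; vector-valued features such as color histograms would be averaged bin by bin in the same way.

```python
def merge_average(feature_sets):
    """Average each feature over the feature sets that contain it."""
    totals, counts = {}, {}
    for features in feature_sets:
        for name, value in features.items():
            totals[name] = totals.get(name, 0.0) + value
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


def merge_weighted(target, neighbors):
    """Similarity-weighted average; neighbors is a list of (feature_set, similarity).

    The target's own features get weight 1.0 (an assumption).
    """
    totals, weights = {}, {}
    for features, weight in [(target, 1.0)] + list(neighbors):
        for name, value in features.items():
            totals[name] = totals.get(name, 0.0) + weight * value
            weights[name] = weights.get(name, 0.0) + weight
    return {name: totals[name] / weights[name] for name in totals if weights[name] > 0}


# Hypothetical sparse target and one neighbor with similarity 0.5.
target = {"rating": 4.0}
neighbor = {"rating": 2.0, "soccer": 3.0}
averaged = merge_average([target, neighbor])
weighted = merge_weighted(target, [(neighbor, 0.5)])
```

  • In both variants the sparse target gains the "soccer" feature it lacked, which is the point of supplementation; the weighted variant keeps the combined rating closer to the target's own value.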
  • the data aggregation module 314 uses consensus methods to identify features in the feature sets associated with the neighbor media content items to add to the supplemented feature set associated with the target media content item. In these embodiments, the data aggregation module 314 identifies features which have a range of values in the majority or percentage of the feature sets associated with the neighbor media content items. For example, the data aggregation module 314 can identify that a feature 'average volume' has a narrow range of values (e.g. each value is 9 or 10 on a scale of 1-10) in more than 80% of the feature sets associated with neighbor items of media content. The data aggregation module 314 may determine to add an average value of the feature 'average volume' (i.e.
  • the classification engine 312 classifies each media content item based on the supplemented feature set associated with the media content item.
  • the classification engine 312 applies one or more classifiers 322 to the supplemented feature sets to generate a set of classification scores indicating the likelihood that the media content item belongs to a class or category of media content items.
  • the classification engine 312 assigns a set of one or more class labels to the media content item based on the classification score, indicating the likelihood that the media content item belongs to a class or category of media content items, exceeding a defined threshold value. For example, a media content item may be assigned a label of "soccer" responsive to a classification score that the media content item belongs to the class "soccer" being greater than 90%.
  • the classification engine 312 stores the set of class labels in association with the media content item in the classified media corpus 380.
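  • The thresholding step above can be sketched in a few lines, assuming the classifier output is a dictionary from class label to a score between 0 and 1. The function name and the single shared threshold are assumptions; the description only requires that a score exceed a defined threshold value.

```python
def assign_labels(scores, threshold=0.9):
    """Return the class labels whose classification score exceeds the threshold."""
    return {label for label, score in scores.items() if score > threshold}


# Hypothetical classification scores for one media content item.
scores = {"soccer": 0.95, "politics": 0.40, "sports": 0.90}
labels = assign_labels(scores)
```

  • Because the comparison is strict, a score of exactly 90% does not earn the label in this sketch, matching the "greater than 90%" wording of the example.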
  • the classifier 322 may be generated by the classification engine 312 or received from another source.
  • the classifier 322 is a single multi-class classifier which is trained on a corpus of content items which are classified according to a hierarchical system of classification.
  • the single multi-class classifier is trained on a corpus of content items which are classified according to the hierarchical system of classifications used by the Open Directory Project (ODP).
  • the training set of media content items is manually classified, so that each training media content item has one or more labels from the ODP.
  • the training set of media content items is then processed for their features and viewing statistics and the classifier 322 is constructed and validated using the training set of media content items and a corresponding validation set.
  • the classifiers 322 may be binary classifiers and the classes may be non-hierarchical.
  • the classifier 322 is re-trained using the classified media corpus 380.
  • a second stage of classification is provided by the class label engine 315. More specifically, the class label engine 315 refines the set of class labels associated with a target media content item based on the class labels of its neighbor media content items. The class label engine 315 obtains the set of class labels associated with the target media content item in the classified media corpus 380. For each class label associated with the media content item, the class label engine 315 determines a class consensus value specifying the number or percentage of the neighbor media content items that are also associated with the class label in the classified media corpus 380. If the class consensus value is below a threshold value, the class label engine 315 removes the class label from the set of labels associated with the target media content item in the classified media corpus 380.
  • the class label engine 315 can identify that the class label "soccer" is associated with 5 out of 6 neighbor media content items of the target media content item in the classified media corpus 380. The class label engine 315 can then determine that the class consensus value of, for example 83%, associated with the label is greater than a threshold value of 33% and retain the class label "soccer" in association with the target media content item.
  • the class label engine 315 can determine that the corresponding class consensus value of 0% is less than a threshold value of 33% and remove the class label "religious" from the set of class labels associated with the target media content item.
  • the threshold consensus value may be specified, for example by an administrator of the media host service 104, or determined by the class label engine 315.
  • the class label engine 315 may determine the threshold value based on many factors.
  • the threshold value is dependent on a level of specificity associated with the class label defined by the hierarchical classification scheme. For instance, the threshold value for a label which specifies "soccer" may be a smaller value than the threshold value for a label which specifies "sports".
  • the threshold value is dependent on the relative frequency with which a label occurs in a corpus.
  • the threshold value for a label which specifies "soccer" and the threshold value for a label which specifies "lawn bowling" may be proportional to their frequency in the corpus with the threshold value for "soccer" being 5 times greater than the threshold value for "lawn bowling".
  • FIG. 4 is a flowchart illustrating steps performed by the media host service 104 to identify a set of neighbor media content items for a media content item in accordance with an embodiment of the present invention. Other embodiments perform the illustrated steps in different orders, and/or perform different or additional steps. Moreover, some of the steps can be performed by engines or modules other than the media host service 104.
  • the media host service 104 identifies 404 a set of feature sets associated with a set of media items.
  • the media host service 104 filters 406 the set of feature sets associated with the set of media items.
  • the media host service 104 generates 408 a set of distance metrics, each distance metric specifying the similarity of the feature sets associated with pairs of media items.
  • the media host service 104 generates 410 a similarity graph based on the set of distance metrics.
  • the media host service 104 prunes 412 the similarity graph to remove media content items.
  • the media host service 104 identifies 414 a set of neighbor media content items for each media content item in the similarity graph.
  • FIG. 5 is a flowchart illustrating steps performed by the media host service 104 to classify a media content item in accordance with an embodiment of the present invention. Other embodiments perform the illustrated steps in different orders, and/or perform different or additional steps. Moreover, some of the steps can be performed by engines or modules other than the media host service 104.
  • the media host service 104 identifies 512 a set of neighbor media content items for the target media content item.
  • the media host service 104 generates 514 a supplemented feature set for the target media content item based on the sets of feature data associated with the set of neighbor media content items.
  • the media host service 104 generates 516 a set of class labels associated with the target media content item.
  • FIG. 6 is a flowchart illustrating steps performed by the media host service 104 to refine a set of class labels associated with a target node 322 in accordance with an embodiment of the present invention. Other embodiments perform the illustrated steps in different orders, and/or perform different or additional steps. Moreover, some of the steps can be performed by engines or modules other than the media host service 104.
  • the media host service 104 identifies 610 a set of class labels associated with a target node.
  • the media host service 104 determines 612 a class consensus value for each class label based on the class labels associated with the neighbor media content items for the target node.
  • the media host service 104 removes 614 the class labels having a class consensus value beneath a threshold value from the set of class labels associated with the target node.
  • Certain aspects of the present invention include process steps and instructions described herein in the form of an algorithm. All such process steps, instructions or algorithms are executed by computing devices that include some form of processing unit (e.g., a microprocessor, microcontroller, dedicated logic circuit or the like) as well as a memory (RAM, ROM, or the like), and input/output devices as appropriate for receiving or providing data.
  • the present invention also relates to an apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer, in which event the general-purpose computer is structurally and functionally equivalent to a specific computer dedicated to performing the functions and operations described herein.
  • a computer program that embodies computer executable data (e.g., program code and data) is stored in a tangible computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for persistently storing electronically coded instructions.
  • Such computer programs by nature of their existence as data stored in a physical medium by alterations of such medium, such as alterations or variations in the physical structure and/or properties (e.g., electrical, optical, mechanical, magnetic, chemical properties) of the medium, are not abstract ideas or concepts or representations per se, but instead are physical artifacts produced by physical processes that transform a physical medium from one state to another state (e.g., a change in the electrical charge, or a change in magnetic polarity) in order to persistently store the computer program in the medium.
  • the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
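By way of illustration only, the second stage of classification performed by the class label engine 315 (computing a class consensus value per label and removing labels whose value falls below a threshold) could be sketched in Python as follows. The function name refine_class_labels, the per-label thresholds dictionary, the default threshold of 0.33 and the handling of an empty neighbor set are assumptions introduced for this sketch; they are not specified in the description.

```python
from typing import Dict, List, Set


def refine_class_labels(
    target_labels: Set[str],
    neighbor_labels: List[Set[str]],
    thresholds: Dict[str, float],
    default_threshold: float = 0.33,
) -> Set[str]:
    """Keep only the target labels whose class consensus value meets the threshold."""
    if not neighbor_labels:
        # Assumption: with no neighbors there is no consensus to measure,
        # so no label is removed.
        return set(target_labels)
    kept = set()
    for label in target_labels:
        # Class consensus value: fraction of neighbors that also carry the label.
        votes = sum(1 for labels in neighbor_labels if label in labels)
        consensus = votes / len(neighbor_labels)
        # Per-label thresholds are hypothetical; fall back to the default.
        if consensus >= thresholds.get(label, default_threshold):
            kept.add(label)
    return kept


# Five of six neighbors carry "soccer" (about 83%); none carry "religious" (0%).
neighbors = [{"soccer"}] * 5 + [{"news"}]
print(refine_class_labels({"soccer", "religious"}, neighbors, thresholds={}))
# prints {'soccer'}
```

Passing a per-label entry such as {"soccer": 0.2} in thresholds is one way a label-specific threshold, for example one that depends on label specificity or corpus frequency, might be supplied.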

Abstract

A set of neighbor items associated with a first item is identified based, in part, on a first feature set associated with the first item, wherein each neighbor item of the set of neighbor items is associated with a feature set. A supplemented feature set is generated for the first item based on the identified set of neighbor items responsive to combining the first feature set and the feature sets associated with the set of neighbor items. A set of classification scores associated with the first item is generated based on the supplemented feature set, each classification score of the set of classification scores indicating a likelihood that the first item belongs to a class of items.

Description

SIMILARITY-BASED FEATURE SET SUPPLEMENTATION FOR CLASSIFICATION
BACKGROUND
Field of the Invention
[0001] The invention relates generally to classifying items that are associated with sparse or unknown data. In particular, embodiments of the invention are directed toward classifying media content items associated with sparse or unknown feature datasets using feature datasets associated with related media content items.
Description of Background Art
[0002] Media hosting services allow users to upload media content where it can be shared with others for public viewing. Media content provided by the users may include, for example, textual content (e.g. blogs), video content, audio content and image content. Media hosting services can host millions of media content items. Typically, users uploading content provide labels or tags to describe the media content by associating the media content with one or more categories. Other users may browse or search for media content by providing keywords to search the information describing the media content such as the title, a summary of the media content, as well as the labels and tags. However, information provided by the users to describe the media content is often sparse, inconsistent and/or inaccurate. In particular, user-provided labels are often inconsistent as they are provided by different users and are subject to user opinion on what the media content is about. For instance, one user may provide a label indicating a news video discussing the rising cost of gasoline relates to the "Environment," while another user may provide a label indicating the same news video relates to "Politics."
[0003] The use of statistical classification techniques provides one method of standardizing assignment of labels indicating classes. In statistical classification techniques, a statistical model or "classifier" is computationally generated. The classifier specifies a set of features and their associated relevance in determining whether an item belongs to a class of items. This classifier is applied to feature data associated with an item to determine whether the item has a correspondence to the class of items. Although statistical classification provides an efficient and standardized method of assigning class labels to items, this technique is most effective in instances when the items are associated with large amounts of accurate feature data. As described above, user provided media content is often associated with sparse or inconsistent feature data. As a result, existing statistical classification methods do not provide an effective means for classifying content based on its labels and descriptive information.
SUMMARY OF THE INVENTION
[0004] Embodiments of the present invention enable the generation of a set of class labels for labeling a media content item.
[0005] An embodiment of a method according to the present invention comprises a computer-implemented method for generating a set of class labels for labeling an item. A set of neighbor items associated with a first item is identified based, in part, on a first feature set associated with the first item, wherein each neighbor item of the set of neighbor items is associated with a feature set. A supplemented feature set is generated for the first item based on the identified set of neighbor items responsive to combining the first feature set and the feature sets associated with the set of neighbor items. A set of classification scores associated with the first item is generated based on the supplemented feature set, each classification score of the set of classification scores indicating a likelihood that the first item belongs to a class of items.
[0006] The features and advantages described in this summary and the following detailed description are not all-inclusive. Many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims hereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] Fig. 1 is a high-level block diagram of a system environment according to one embodiment.
[0008] Fig. 2 is a screenshot illustrating an interface for browsing media content associated with categories according to one embodiment.
[0009] Fig. 3 is a high-level block diagram illustrating a detailed view for the media host server according to one embodiment.
[0010] Fig. 4 is a flow-chart illustrating steps performed by the media host server to generate a similarity graph according to one embodiment.
[0011] Fig. 5 is a flow-chart illustrating steps performed by the media host server to generate a set of class labels for a media content item according to one embodiment.
[0012] Fig. 6 is a flow-chart illustrating steps performed by the media host server to refine a set of class labels associated with a media content item according to one embodiment.
[0013] The figures depict preferred embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
DETAILED DESCRIPTION
[0014] FIG. 1 illustrates a system environment 100 comprising a media host service 104, a plurality of content providers 102 and a plurality of content viewers 106 connected by a network 114. Only three content viewers 106 are shown in FIG. 1 in order to simplify and clarify the description. Embodiments of the system environment 100 can have thousands or millions of content viewers 106 and/or content providers 102 connected to the network 114. The media host service 104 communicates with content viewers 106 over the network 114. The media host service 104 receives uploaded media content from content providers 102 and allows content to be viewed by content viewers 106. Media content may be uploaded to the media host service 104 via the Internet from a personal computer, through a cellular network from a telephone or PDA, or by other means for transferring data over the network 114. Media content may be downloaded from the media host service 104 in a similar manner; in one embodiment media content is provided as a file download to a content viewer 106; in an alternative embodiment, media content is streamed to the content viewer. The means by which media content is received by the media host service 104 need not match the means by which it is delivered to a content viewer 106. For example, the content provider 102 may upload a video via a browser on a personal computer, whereas the content viewer 106 may view that video as a stream sent to a PDA. Note also that the media host service 104 may itself serve as the content provider 102.
[0015] A content provider 102 can also provide media content to the media host service 104. Examples of media content include audio, video, image and text content; other forms of content available for consumption may also be provided. The media content may have been created by content provider 102, but need not have been.
[0016] Content viewers 106 view media content provided by the media host service 104 via a user interface. Typically, a content viewer 106 runs a web browser such as Microsoft Internet Explorer or Mozilla Firefox. A media host service 104 includes a web server such as Microsoft Internet Information Services. Using a browser, a content viewer 106 browses and searches for content provided by the media host service 104 and views content of interest, including video content. In some embodiments, the content viewer 106 uses other types of software applications to view, browse and search media content from the media host service 104. As described further below, the content viewer 106 also provides viewing metrics to the media host service 104.
[0017] The media host service 104 further functions to generate class labels for media content items. For a given media content item (a "target item") the media host service 104 identifies other media content items, herein referred to as "neighbor media content items," that are similar to the target item, based on feature data associated with the target and neighbor media content items. Feature data associated with the media content items may include: user provided data associated with the media content items, viewing information associated with the media content items, and content data generated from the media content items. The media host service 104 generates a supplemented feature set for the target item by combining the feature data associated with the target media content item and the feature data associated with the neighbor media content items. The media host service 104 classifies the target media content item based on the supplemented feature set to generate a set of class labels for the target media content item. The media host service 104 further refines the set of class labels for the target media content item based on class labels associated with the neighbor media content items. By identifying neighbor media content items and generating supplemented feature sets therefrom, the media host service 104 leverages existing similarities in media content items to supplement feature data sets that are sparse, inconsistent and/or uncertain using feature data associated with neighbor media content items. This improves the classification of media content items that would not otherwise have a large enough feature set for classification.
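By way of illustration only, the generation of a supplemented feature set from the feature data of a target item and its neighbor media content items could be sketched in Python as follows. The representation of a feature set as a dictionary of named numeric values, the function name supplement_features, the weighted-sum rule and the neighbor_weight parameter are all assumptions introduced for this sketch; the description does not fix a particular combination rule.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical representation: feature name -> numeric value.
FeatureSet = Dict[str, float]


def supplement_features(
    target: FeatureSet,
    neighbors: Iterable[FeatureSet],
    neighbor_weight: float = 0.5,
) -> FeatureSet:
    """Combine a target feature set with the feature sets of its neighbors.

    Assumption: a weighted sum in which every neighbor contributes
    neighbor_weight times its feature values.
    """
    combined = defaultdict(float)
    for name, value in target.items():
        combined[name] += value
    for neighbor in neighbors:
        for name, value in neighbor.items():
            # Features absent from the target are filled in from neighbors.
            combined[name] += neighbor_weight * value
    return dict(combined)


sparse_target = {"goal": 1.0}
neighbor_sets = [{"goal": 2.0, "soccer": 4.0}, {"soccer": 2.0}]
print(supplement_features(sparse_target, neighbor_sets))
# prints {'goal': 2.0, 'soccer': 3.0}
```

In this sketch the target item gains a "soccer" feature it did not originally have, which is the effect the supplementation is intended to achieve for sparse feature sets.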
[0018] The media host service 104 further leverages feature data from multiple independent sources to identify neighbor media content items and classify media content items. By integrating data from different sources, the media host service 104 compensates for sparseness and/or uncertainty associated with one source of feature data using feature data from an independent source of data. For example, sparseness associated with user-provided data (e.g. a lack of subject, title, or tags in the user-provided data) may be compensated for based on viewing information such as viewing statistics (e.g. the number of times media content items are requested by a same user) to identify similarity between two media content items. Likewise, uncertainty in similarity between two media content items based on viewing statistics (i.e. uncertainty that videos that are watched by same viewers have a same subject or category) may be compensated for using content data generated for the video (e.g. data generated from face recognition that indicates the same individual is shown in two videos provided from separate sources).
[0019] Combining feature data associated with a target media content item and neighbor media content items to generate a supplemented data set provides a preprocessing step in classification. The addition of this step allows for "fine tuning" of the classifier accuracy by the manipulation of the different ways that feature data can be combined. An administrator of the media host service 104 may optimize classifier accuracy by choosing different algorithms and weighting techniques to combine feature data based on the types of features, the accuracy of the classifier, the degree of sparseness in the feature data sets, the degree of inaccuracy in the feature data sets and other factors.
[0020] FIG. 2 illustrates a screenshot of a graphical user interface 200 for browsing media content items provided by the media host server 104 according to an embodiment. In the embodiment illustrated, the media content is video content. In other embodiments, the media host service 104 may provide a graphical user interface for browsing other types of media content including songs, images, and text content.
[0021] The graphical user interface 200 includes a display window 215 for displaying a video and an information window 240 for displaying information describing the video. The information window 240 displays a set of class labels 244 that describe a set of categories or classes the video is associated with. Using the set of class labels, a user may browse other videos that belong to the same categories or classes. For instance, a user may select the class label 'debate' to retrieve a set of videos associated with the label 'debate'.
[0022] The graphical user interface 200 further includes a related videos window 230 containing a set of videos that are related to the video. In the embodiment illustrated, the related videos are based in part on the class labels associated with the related videos. The related videos are displayed with associated class labels 235 that overlap with the set of class labels 244 associated with the displayed video.
[0023] FIG. 3 is a high-level block diagram illustrating a detailed view of the media host service 104 according to one embodiment. As shown in FIG. 3, the media host service 104 includes several modules and servers. Those of skill in the art will recognize that other embodiments can have different modules and/or servers than the ones described here, and that the functionalities can be distributed among the modules and/or servers in a different manner. In addition, the functions ascribed to the media host service 104 can be performed by multiple servers.
[0024] In alternate embodiments, the media upload server 306, the media content database 330 and/or the feature set database 350 may be hosted at one or more separate servers by different entities with the media host service 104 acting as a third party server to generate class labels for media received by the media upload server 306 and stored in the media content database 330.
[0025] The media upload server 306 receives media content uploaded by the content providers 102. The media upload server 306 stores uploaded media content in the media content database 330. The media upload server 306 further receives information derived from providing the media content to the content viewers 106 such as ratings associated with the media content and uploaded comments about the media content.
[0026] The media content database 330 stores received media content in association with unique identifiers for that media content. The media content database 330 further stores user-provided information describing the media content such as an author of the media content, the date the media content was received by the media host server 104, the subject of the media content, tags or labels associated with the media content and comments provided by an author of the media content. The media content database 330 further stores viewing information derived from providing the media content to content viewers 106 such as ratings of the media content provided by users, comments provided by the users, and the frequency at which the media content is viewed by the users. The media content database 330 also stores viewing information that is specific to a media content item such as a set of media content items that are commonly viewed in association with the media content item.
[0027] The media content server 310 provides information and media content to the users. The media content server 310 retrieves media content from the media content database 330. The media content server 310 provides the retrieved media content to the content viewers 106. The media content server 310 further functions to retrieve and provide information and media content responsive to search queries received from the content viewers 106. The search queries may include criteria including search terms, class labels etc. The media content server 310 further retrieves items of related media content based in part on the class labels associated with a selected media content item and provides the related media content items to the content viewer 106. The media content server 310 further monitors viewing statistics and other viewing information associated with the media content, such as the frequency at which the media content is viewed, and stores the viewing information to the media content database 330.
[0028] The content feature engine 312 generates content features based on the media content. Content features are metadata generated from the media content that can be used to characterize the media content. The content feature engine 312 generates content features specific to the media type of the media content. For still image content, content features may include: pixel intensity, luminosity, data derived from shape detection algorithms and other data derived from still images. For audio content, content features may include: pitch, tone, mel-frequency cepstral coefficients (MFCCs), and other data derived from audio content. For video content, content features may include data derived from shot detection algorithms, face detection algorithms, edge detection algorithms, and other data derived from video content, such as color, luminosity, texture and other features. The content feature engine 312 stores the generated content features in the feature set database 350.
[0029] The text feature engine 308 generates text features based on the user-provided information describing the media content. The text feature engine 308 generates text features which comprise one or more tokens and a numeric value associated with each token, such as a frequency value. In one embodiment, the text feature engine 308 generates the text features by tokenizing the user-provided information and determining the frequency of the tokens contained therein. According to the embodiment, the text feature engine 308 may also stem the tokens or use lexicons to identify synonymous tokens prior to enumerating the frequency of the tokens. In some embodiments, the text feature engine 308 generates text features comprised of phrases such as noun phrases or verb phrases. The frequency information for the tokens can be raw frequency information, or normalized, such as TF-IDF or similar frequency measures.
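By way of illustration only, the tokenizing and frequency counting attributed to the text feature engine 308 could be sketched in Python as follows. The function name text_features, the tokenization rule (lowercase alphanumeric runs) and the normalization by total token count are assumptions introduced for this sketch; stemming, lexicon lookup and TF-IDF weighting are omitted.

```python
import re
from collections import Counter
from typing import Dict


def text_features(text: str) -> Dict[str, float]:
    """Tokenize user-provided text and return normalized token frequencies."""
    # Assumption: a token is a run of lowercase letters or digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # No tokens, hence no text features.
        return {}
    # Normalized frequency: occurrences of the token over all tokens.
    return {token: count / total for token, count in counts.items()}


print(text_features("Goal goal, MATCH match!"))
# prints {'goal': 0.5, 'match': 0.5}
```

Returning raw counts instead of normalized values would only require returning dict(counts); the description permits either form.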
[0030] In most embodiments, the text feature engine 308 generates the text features based on information describing the media content such as the title and summary associated with the media content. In other embodiments, the text feature engine 308 generates the text features from comments associated with the media content (e.g., as provided by users who view the media content item) and/or other sources of textual data referenced by the information describing the media content (e.g. web pages referenced in the summary associated with the media content). The text feature engine 308 further generates text features from video or image content using techniques such as speech recognition as applied to an audio track of a media content item, and optical character recognition (OCR) as applied to the images contained in a media content item.
[0031] The feature set database 350 stores feature sets for media content items in association with unique identifiers for the media content items. The feature sets include the text features generated by the text feature engine 308 and the content features generated by the content feature engine 312. The feature sets further include viewing statistics and other viewing information stored in the media content database 330 such as the frequency a media content item is viewed by users and a set of frequencies specifying the number of times other media items are viewed in association with the media content item, herein referred to as "co-watch metrics." These frequencies can be raw or normalized, as determined by the system administrator.
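By way of illustration only, raw co-watch metrics could be derived from viewing records as sketched below in Python. The notion of a viewing "session" (a list of item identifiers viewed together), the function name co_watch_counts and the choice to count each unordered pair once per session are assumptions introduced for this sketch; the description does not define how "viewed in association" is measured.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def co_watch_counts(sessions: Iterable[List[str]]) -> Dict[Tuple[str, str], int]:
    """Count how often each pair of items is viewed within the same session."""
    counts = Counter()
    for session in sessions:
        # Deduplicate and sort so each unordered pair has one canonical key
        # and is counted at most once per session.
        for first, second in combinations(sorted(set(session)), 2):
            counts[(first, second)] += 1
    return dict(counts)


sessions = [["v1", "v2", "v3"], ["v2", "v1"], ["v3"]]
print(co_watch_counts(sessions))
# prints {('v1', 'v2'): 2, ('v1', 'v3'): 1, ('v2', 'v3'): 1}
```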
[0032] The similarity graph module 309 identifies neighbor media content items for media content items based on the feature sets. The similarity graph module 309 first generates a set of distance metrics which specify a measure of similarity between two media content items. Based on the distance metrics, the similarity graph module 309 identifies neighbor media content items.
[0033] The similarity graph module 309 generates a set of distance metrics based on the feature sets stored in association with the media content items in the feature set database 350. For each pair of feature sets in the feature set database 350 associated with a respective first and second content item, the similarity graph module 309 generates a distance metric which indicates the similarity between the pair of feature sets for the items. In one embodiment, the distance metric can be a Euclidean distance metric generated based on corresponding features in the two feature sets. In other embodiments, the distance metric may be a correlation coefficient between the corresponding features. The similarity graph module 309 may generate the distance metrics based on all of the features in the feature sets or a sub-portion of the features in the feature sets. In one embodiment, the similarity graph module 309 may generate the distance metric based on a specific type of feature in the feature sets. For instance, the similarity graph module 309 may generate the distance metric based only on viewing information such as co-watch metrics. The similarity graph module 309 stores the distance metrics in association with the feature sets and media content items in the media content database 330.
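By way of illustration only, the two distance metrics named above could be computed as sketched below in Python. The dictionary representation of a feature set, the function names, the treatment of a feature missing from one set as 0.0, and the return value of 0.0 when the correlation is undefined are assumptions introduced for this sketch. A smaller Euclidean distance indicates greater similarity, whereas a larger correlation coefficient indicates greater similarity.

```python
import math
from typing import Dict

# Hypothetical representation: feature name -> numeric value.
FeatureSet = Dict[str, float]


def euclidean_distance(a: FeatureSet, b: FeatureSet) -> float:
    """Euclidean distance over the union of feature names."""
    # Assumption: a feature missing from one set is treated as 0.0.
    names = set(a) | set(b)
    return math.sqrt(sum((a.get(n, 0.0) - b.get(n, 0.0)) ** 2 for n in names))


def correlation_coefficient(a: FeatureSet, b: FeatureSet) -> float:
    """Pearson correlation coefficient between corresponding features."""
    names = sorted(set(a) | set(b))
    if not names:
        return 0.0
    xs = [a.get(n, 0.0) for n in names]
    ys = [b.get(n, 0.0) for n in names]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        # Assumption: report 0.0 when one feature set is constant
        # and the coefficient is therefore undefined.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


print(euclidean_distance({"x": 3.0}, {"y": 4.0}))  # prints 5.0
```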
[0034] In some embodiments, the similarity graph module 309 filters the media content items in the feature set database 350, for example by removing the items from the feature set database 350 or flagging the media content items in the feature set database 350, prior to generating the set of distance metrics. In these embodiments, the similarity graph module 309 filters the media content items according to a set of specified features that indicate a media content item is to be filtered. In most embodiments, the set of specified features are features that indicate that the media content item is an item of undesirable content. In these embodiments, features that indicate that the media content item is an item of undesirable content are specified by an administrator of the media host server 104 and can include features that indicate that the media content item comprises spam, adult content, or hate speech.
[0035] The similarity graph module 309 identifies neighbor media content items based on the distance metrics associated with the media content items. For each target item, the similarity graph module 309 selects a set of neighbor items based on the distance metrics which have some measure of similarity to the target item. Suitable methods for selecting a set of neighbor media content items may include clustering the distance metrics.
[0036] In one embodiment, the similarity graph module 309 selects a set of neighbor media content items by generating a similarity graph of the content items based on the distance metrics. The similarity graph module 309 generates a similarity graph containing a set of nodes, each of which represents a media content item in the feature set database 350. The similarity graph module 309 selects a node representing a media content item as the target node. The similarity graph module 309 attempts to assign some number N of edges in the graph to connect the target node to N (e.g. 3<=N<=10) nodes representing identified media content items with distance metrics indicating at least a minimum degree of similarity to the media content item represented by the target node. For example, in one specific embodiment, the similarity graph module 309 connects each target node to five (5) of the most similar media content items, based on their respective distance metrics. The similarity graph module 309 repeats this process by selecting each node as the target node and assigning edges between the target node and a set of nodes.
[0037] If the similarity graph module 309 is unable to identify N media content items with distance metrics indicating the minimum degree of similarity to the media content item represented by the target node, the similarity graph module 309 connects the target node to the maximum number of media items with distance metrics indicating at least a minimum degree of similarity to the media content item represented by the target node. If the similarity graph module 309 is unable to find any media content items with distance metrics indicating the minimum degree of similarity to the media content item represented by the target node, the similarity graph module 309 connects the target node representing the media content item to the media item with the distance metric indicating the maximum similarity of all the distance metrics associated with the target node.
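By way of illustration only, the edge assignment described above, including the two fallback cases, could be sketched in Python as follows. The function name build_similarity_graph, the adjacency-list representation, and the expression of the "minimum degree of similarity" as a maximum allowed distance (max_distance) are assumptions introduced for this sketch.

```python
from typing import Callable, Dict, List, Sequence


def build_similarity_graph(
    items: Sequence[str],
    distance: Callable[[str, str], float],
    n_edges: int = 5,
    max_distance: float = 1.0,
) -> Dict[str, List[str]]:
    """Connect each target node to up to n_edges sufficiently similar nodes."""
    graph: Dict[str, List[str]] = {}
    for target in items:
        # Rank all other items from most similar (smallest distance) to least.
        ranked = sorted(
            (other for other in items if other != target),
            key=lambda other: distance(target, other),
        )
        # Keep at most n_edges items that meet the minimum similarity;
        # if fewer than n_edges qualify, all qualifying items are kept.
        close = [o for o in ranked if distance(target, o) <= max_distance][:n_edges]
        if not close and ranked:
            # Fallback: no item qualifies, so link to the single most similar one.
            close = ranked[:1]
        graph[target] = close
    return graph


positions = {"a": 0.0, "b": 0.5, "c": 0.9, "d": 10.0}
graph = build_similarity_graph(
    list(positions),
    lambda x, y: abs(positions[x] - positions[y]),
    n_edges=2,
    max_distance=1.0,
)
print(graph["a"])  # prints ['b', 'c']
print(graph["d"])  # prints ['c']
```

Item "d" has no sufficiently similar item, so it is connected only to "c", its closest item, which corresponds to the second fallback case.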
[0038] In some embodiments, the similarity graph module 309 prunes the similarity graph after the similarity graph is constructed using one or more pruning criteria. In these embodiments, the similarity graph module 309 may remove media content items based on viewing statistics that indicate which media content items in the graph are not actively viewed by content viewers 106. In these embodiments, the viewing statistics that indicate that a media content item is not actively viewed by content viewers 106 are specified by an administrator of the media host service 104 and may include statistics such as the views, ratings, or comments associated with the media content items; these statistics can include raw or normalized counts (e.g., number of views of the media content item), rates (e.g., weekly rate of comment postings), trends (e.g., average weekly percent change in number of views), velocities (e.g., number of unique viewers in the last hour), distributions (e.g., number or percentage of users giving each level of rating value), or the like. In addition, items can be pruned based on their co-watch metrics. For a given target item, the neighbor content items with the least significant (e.g., lowest valued) co-watch metrics can be pruned. The above pruning criteria are applied to the neighbor content items for each target node until the nodes have been examined according to the criteria. These pruning criteria can be applied in any order desired by the system administrator.
[0039] The similarity graph module 309 identifies a set of neighbor media content items for each media content item based on the similarity graph. In most embodiments, the similarity graph module 309 identifies a set of neighbor media content items comprising a specified number M of media content items (e.g., 3<=M<=10 media content items). In most instances, the number M of media content items is equal to the number N of edges connecting each target node to media content items. If a target node is connected to M or more nodes, the similarity graph module 309 selects the M media content items with distance metrics indicating the highest similarity to the target media item as the set of neighbor media items.
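As a non-limiting sketch of the pruning of paragraph [0038] followed by the selection of M neighbors in paragraph [0039], the example below removes items whose weekly view count falls beneath an administrator-specified value and then keeps the M most similar remaining neighbors. The single statistic `weekly_views`, the threshold `min_weekly_views`, and the function name are assumptions chosen for illustration; the specification permits other statistics and other orders of applying the criteria.

```python
from typing import Dict, List


def prune_and_select(
    graph: Dict[str, List[str]],
    similarity: Dict[str, Dict[str, float]],
    weekly_views: Dict[str, int],
    min_weekly_views: int = 10,  # hypothetical administrator-specified value
    m_neighbors: int = 5,
) -> Dict[str, List[str]]:
    """Drop rarely viewed items, then keep the M most similar neighbors."""

    def is_active(item: str) -> bool:
        # Items with no recorded statistic are treated as not actively viewed.
        return weekly_views.get(item, 0) >= min_weekly_views

    neighbors: Dict[str, List[str]] = {}
    for target, linked in graph.items():
        if not is_active(target):
            continue  # the pruned node is removed from the graph entirely
        kept = [item for item in linked if is_active(item)]
        # Order the surviving neighbors from most to least similar.
        kept.sort(key=lambda item: similarity[target][item], reverse=True)
        neighbors[target] = kept[:m_neighbors]
    return neighbors
```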
[0040] The similarity graph module 309 traverses the similarity graph to identify neighbor media content items. The similarity graph module 309 first selects the neighbor node representing the neighbor media content item whose distance metric indicates the highest similarity to the target media content item. The similarity graph module 309 then selects, as a further neighbor media content item, the node connected to that neighbor node whose distance metric indicates the highest similarity to the neighbor node. The similarity graph module 309 continues this process until the number of neighbor media content items equals the specified number of neighbor media content items.
[0041] In summary, the foregoing processes of identifying neighbor content items, generating a similarity graph, and pruning the similarity graph provide a robust set of content items that have a very high likelihood of being meaningfully related to each other, based both on their intrinsic features and on the extrinsic behaviors of the users who view the items. Using the pruning criteria to prune the similarity graph leverages behavioral information from the user community as to which content items are of sufficient interest and are sufficiently related to each other.
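One possible reading of the traversal of paragraph [0040] is a greedy walk that moves from the target node to its most similar connected node, then from that node to its own most similar connected node, and so on. The sketch below follows that reading. Skipping nodes that have already been visited and stopping early when no unvisited node remains are assumptions added so that the walk terminates; they are not stated in the specification, and the function name is hypothetical.

```python
from typing import Dict, List


def traverse_neighbors(
    graph: Dict[str, List[str]],
    similarity: Dict[str, Dict[str, float]],
    target: str,
    m_neighbors: int,
) -> List[str]:
    """Greedily walk the graph from the target, collecting up to M neighbors."""
    selected: List[str] = []
    visited = {target}
    current = target
    while len(selected) < m_neighbors:
        # Assumption: never revisit a node, so the walk cannot loop forever.
        candidates = [node for node in graph.get(current, []) if node not in visited]
        if not candidates:
            break  # assumption: stop early when the walk reaches a dead end
        # Pick the connected node most similar to the node the walk is on.
        best = max(candidates, key=lambda node: similarity[current][node])
        selected.append(best)
        visited.add(best)
        current = best
    return selected
```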
[0042] The data aggregation module 314 generates supplemented feature sets for each media content item based on the set of neighbor media content items associated with the media content item. The data aggregation module 314 combines the feature sets to generate a supplemented feature set. The data aggregation module 314 generates the supplemented feature sets based on all (or a portion of) the data in the feature sets of all (or a selected subset of) the neighbor items. In one embodiment, the supplemented feature sets are generated based only on text features associated with the media content items.
[0043] The data aggregation module 314 combines the feature sets associated with the target media content item and neighbor media content items to generate a supplemented feature set. In one embodiment, the data aggregation module 314 combines the feature sets by simply merging the feature sets to generate a combined, unordered, and un-weighted feature set containing all of the features of all of the neighbor items. Alternatively, the data aggregation module 314 merges features which occur in both data sets by adding, averaging, or otherwise mathematically combining the values associated with the features, such that for each type of feature, there is a single value (or set of values) as appropriate for the feature. For example, for a color feature, the data aggregation module 314 may produce an average color histogram from the color histograms of the neighbor content items (hence a set of frequency counts for color bins), whereas for a ratings feature, the data aggregation module 314 may produce a single average rating from the neighbor content items. In other embodiments, the data aggregation module 314 combines the feature sets associated with the neighbor media content items by weighting the feature sets based on the similarity values of the neighbor media content items.
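The averaging and similarity-weighted alternatives of paragraph [0043] can be illustrated, under simplifying assumptions, by the sketch below. It assumes that every feature is a single numeric value keyed by name, that the target's own feature set carries a weight of 1.0, and that any neighbor weights supplied are positive; the function name `supplement_features` and its parameters are hypothetical.

```python
from typing import Dict, List, Optional


def supplement_features(
    target_features: Dict[str, float],
    neighbor_features: List[Dict[str, float]],
    neighbor_weights: Optional[List[float]] = None,
) -> Dict[str, float]:
    """Merge the target and neighbor feature sets into one averaged feature set."""
    if neighbor_weights is None:
        # Un-weighted embodiment: every neighbor counts equally.
        neighbor_weights = [1.0] * len(neighbor_features)
    # Assumption: the target's own feature set always has weight 1.0.
    feature_sets = [target_features] + list(neighbor_features)
    weights = [1.0] + list(neighbor_weights)

    totals: Dict[str, float] = {}
    weight_sums: Dict[str, float] = {}
    for features, weight in zip(feature_sets, weights):
        for name, value in features.items():
            totals[name] = totals.get(name, 0.0) + weight * value
            weight_sums[name] = weight_sums.get(name, 0.0) + weight
    # A feature missing from some sets is averaged over the sets that have it.
    return {name: totals[name] / weight_sums[name] for name in totals}
```

A feature such as a color histogram could be handled the same way by averaging each bin separately; that extension is omitted here to keep the sketch short.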
[0044] In some embodiments, the data aggregation module 314 uses consensus methods to identify features in the feature sets associated with the neighbor media content items to add to the supplemented feature set associated with the target media content item. In these embodiments, the data aggregation module 314 identifies features which have a narrow range of values in a majority, or in a specified percentage, of the feature sets associated with the neighbor media content items. For example, the data aggregation module 314 can identify that a feature "average volume" has a narrow range of values (e.g., each value is 9 or 10 on a scale of 1-10) in more than 80% of the feature sets associated with neighbor media content items. The data aggregation module 314 may then determine to add an average value of the feature "average volume" (i.e., 9.5 out of 10) to the supplemented feature set.
[0045] The classification engine 312 classifies each media content item based on the supplemented feature set associated with the media content item. The classification engine 312 applies one or more classifiers 322 to the supplemented feature sets to generate a set of classification scores indicating the likelihood that the media content item belongs to a class or category of media content items. The classification engine 312 assigns a set of one or more class labels to the media content item based on the classification score, indicating the likelihood that the media content item belongs to a class or category of media content items, exceeding a defined threshold value. For example, a media content item may be assigned a label of "soccer" responsive to a classification score that the media content item belongs to the class "soccer" being greater than 90%. The classification engine 312 stores the set of class labels in association with the media content item in the classified media corpus 380.
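The consensus method of paragraph [0044] might be sketched as follows. The sketch assumes numeric features and treats a "narrow range" as a window of at most `max_spread` units; that parameter, the function name, and the use of the total number of neighbors as the denominator are assumptions made for illustration and are not taken from the specification.

```python
from typing import Dict, List


def consensus_features(
    neighbor_features: List[Dict[str, float]],
    max_spread: float = 1.0,    # hypothetical width of a "narrow range"
    min_fraction: float = 0.8,  # more than this share of neighbors must agree
) -> Dict[str, float]:
    """Return averaged values for features on which the neighbors largely agree."""
    total = len(neighbor_features)
    values_by_name: Dict[str, List[float]] = {}
    for features in neighbor_features:
        for name, value in features.items():
            values_by_name.setdefault(name, []).append(value)

    consensus: Dict[str, float] = {}
    for name, values in values_by_name.items():
        values.sort()
        best: List[float] = []
        start = 0
        # Slide a window over the sorted values to find the largest group
        # whose values differ by no more than max_spread.
        for end in range(len(values)):
            while values[end] - values[start] > max_spread:
                start += 1
            if end - start + 1 > len(best):
                best = values[start:end + 1]
        # Neighbors that lack the feature count against the consensus.
        if len(best) / total > min_fraction:
            consensus[name] = sum(best) / len(best)
    return consensus
```

With six neighbors reporting an "average volume" of 9 or 10 in equal numbers and a seventh reporting 2, the agreeing share is 6 of 7, so the feature would be added with the value 9.5.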
[0046] Depending on the embodiment, the classifier 322 may be generated by the classification engine 312 or received from another source. In one embodiment, the classifier 322 is a single multi-class classifier which is trained on a corpus of content items which are classified according to a hierarchical system of classification. In a specific embodiment, the single multi-class classifier is trained on a corpus of content items which are classified according to the hierarchical system of classifications used by the Open Directory Project (ODP). In this embodiment, the training set of media content items is manually classified, so that each training media content item has one or more labels from the ODP. The training set of media content items is then processed for their features and viewing statistics and the classifier 322 is constructed and validated using the training set of media content items and a corresponding validation set. In alternate embodiments, the classifiers 322 may be binary classifiers and the classes may be non-hierarchical. In some embodiments, the classifier 322 is re-trained using the classified media corpus 380.
[0047] In one embodiment, after the initial classification of a content item by the classifier 322, a second stage of classification is provided by the class label engine 315. More specifically, the class label engine 315 refines the set of class labels associated with a target media content item based on the class labels of its neighbor media content items. The class label engine 315 obtains the set of class labels associated with the target media content item in the classified media corpus 380. For each class label associated with the media content item, the class label engine 315 determines a class consensus value specifying the number or percentage of the neighbor media content items that are also associated with the class label in the classified media corpus 380. If the class consensus value is below a threshold value, the class label engine 315 removes the class label from the set of labels associated with the target media content item in the classified media corpus 380.
[0048] For example, for a class label "soccer" associated with a target media content item, the class label engine 315 can identify that the class label "soccer" is associated with 5 out of 6 neighbor media content items of the target media content item in the classified media corpus 380. The class label engine 315 can then determine that the resulting class consensus value of 83% associated with the label is greater than a threshold value of, for example, 33% and retain the class label "soccer" in association with the target media content item. Conversely, if the class label engine 315 identifies that 0 out of 6 of the neighbor media content items are associated with a class label "religious", then the class label engine 315 can determine that the corresponding class consensus value of 0% is less than the threshold value of 33% and remove the class label "religious" from the set of class labels associated with the target media content item.
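The second-stage refinement of paragraphs [0047] and [0048] can be sketched as below, using the percentage form of the class consensus value. The function name `refine_labels`, the representation of labels as sets of strings, and the choice to leave the labels untouched when an item has no neighbors are assumptions made for illustration.

```python
from typing import List, Set


def refine_labels(
    target_labels: Set[str],
    neighbor_labels: List[Set[str]],
    threshold: float = 0.33,
) -> Set[str]:
    """Keep only the target labels shared by enough of the neighbor items."""
    if not neighbor_labels:
        # Assumption: with no neighbors there is no consensus to apply.
        return set(target_labels)
    refined: Set[str] = set()
    for label in target_labels:
        # Class consensus value: share of neighbors that carry the same label.
        support = sum(1 for labels in neighbor_labels if label in labels)
        consensus = support / len(neighbor_labels)
        if consensus >= threshold:
            refined.add(label)
    return refined
```

Applied to the example above, "soccer" has a consensus value of 5/6 and is retained, while "religious" has a consensus value of 0/6 and is removed.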
[0049] Depending on the embodiment, the threshold consensus value may be specified, for example by an administrator of the media host service 104, or determined by the class label engine 315. The class label engine 315 may determine the threshold value based on many factors. In some embodiments, the threshold value is dependent on a level of specificity associated with the class label defined by the hierarchical classification scheme. For instance, the threshold value for a label which specifies "soccer" may be a smaller value than the threshold value for a label which specifies "sports". In some embodiments, the threshold value is dependent on the relative frequency with which a label occurs in a corpus. For instance, based on a corpus in which the frequency of the label "soccer" is 5 times greater than the frequency of the label "lawn bowling", the threshold value for a label which specifies "soccer" and the threshold value for a label which specifies "lawn bowling" may be proportional to their frequency in the corpus, with the threshold value for "soccer" being 5 times greater than the threshold value for "lawn bowling".
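The frequency-proportional thresholds of paragraph [0049] can be illustrated with a short sketch. It assumes that the threshold for one reference label is given and that the remaining thresholds scale linearly with label frequency; the cap at 1.0, the function name, and the parameter names are assumptions that do not appear in the specification.

```python
from typing import Dict


def frequency_scaled_thresholds(
    label_counts: Dict[str, int],
    reference_label: str,
    reference_threshold: float,
) -> Dict[str, float]:
    """Scale a reference threshold in proportion to each label's frequency."""
    reference_count = label_counts[reference_label]
    return {
        # Assumption: cap at 1.0, since a consensus value cannot exceed 100%.
        label: min(1.0, reference_threshold * count / reference_count)
        for label, count in label_counts.items()
    }
```

For a corpus in which "soccer" occurs 500 times and "lawn bowling" 100 times, a reference threshold of 0.1 for "lawn bowling" yields a threshold five times greater for "soccer".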
[0050] FIG. 4 is a flowchart illustrating steps performed by the media host service 104 to identify a set of neighbor media content items for a media content item in accordance with an embodiment of the present invention. Other embodiments perform the illustrated steps in different orders, and/or perform different or additional steps. Moreover, some of the steps can be performed by engines or modules other than the media host service 104.
[0051] The media host service 104 identifies 404 a set of feature sets associated with a set of media items. The media host service 104 filters 406 the set of feature sets associated with the set of media items. The media host service 104 generates 408 a set of distance metrics, each distance metric specifying the similarity of the feature sets associated with a pair of media items. The media host service 104 generates 410 a similarity graph based on the set of distance metrics. The media host service 104 prunes 412 the similarity graph to remove media content items. The media host service 104 identifies 414 a set of neighbor media content items for each media content item in the similarity graph.
[0052] FIG. 5 is a flowchart illustrating steps performed by the media host service 104 to classify a media content item in accordance with an embodiment of the present invention. Other embodiments perform the illustrated steps in different orders, and/or perform different or additional steps. Moreover, some of the steps can be performed by engines or modules other than the media host service 104.
[0053] The media host service 104 identifies 512 a set of neighbor media content items for the target media content item. The media host service 104 generates 514 a supplemented feature set for the target media content item based on the sets of feature data associated with the set of neighbor media content items. The media host service 104 generates 516 a set of class labels associated with the target media content item.
[0054] FIG. 6 is a flowchart illustrating steps performed by the media host service 104 to refine a set of class labels associated with a target node 322 in accordance with an embodiment of the present invention. Other embodiments perform the illustrated steps in different orders, and/or perform different or additional steps. Moreover, some of the steps can be performed by engines or modules other than the media host service 104.
[0055] The media host service 104 identifies 610 a set of class labels associated with a target node. The media host service 104 determines 612 a class consensus value for each class label based on the class labels associated with the neighbor media content items for the target node. The media host service 104 removes 614 the class labels having a class consensus value beneath a threshold value from the set of class labels associated with the target node.
[0056] The present invention has been described in particular detail with respect to a limited number of embodiments. Those of skill in the art will appreciate that the invention may additionally be practiced in other embodiments. First, the particular naming of the components, capitalization of terms, the attributes, data structures, or any other programming or structural aspect is not mandatory or significant, and the mechanisms that implement the invention or its features may have different names, formats, or protocols. Further, the system may be implemented via a combination of hardware and software, as described, or entirely in hardware elements. Also, the particular division of functionality between the various system components described herein is merely exemplary, and not mandatory; functions performed by a single system component may instead be performed by multiple components, and functions performed by multiple components may instead be performed by a single component. For example, the particular functions of the media host service may be provided in one module or in many modules.
[0057] Some portions of the above description present the features of the present invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art. These operations, while described functionally or logically, are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or code devices, without loss of generality.
[0058] It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the present discussion, it is appreciated that throughout the description, discussions utilizing terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.
[0059] Certain aspects of the present invention include process steps and instructions described herein in the form of an algorithm. All such process steps, instructions or algorithms are executed by computing devices that include some form of processing unit (e.g., a microprocessor, microcontroller, dedicated logic circuit or the like) as well as a memory (RAM, ROM, or the like), and input/output devices as appropriate for receiving or providing data.
[0060] The present invention also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer, in which event the general-purpose computer is structurally and functionally equivalent to a specific computer dedicated to performing the functions and operations described herein. A computer program that embodies computer executable data (e.g., program code and data) is stored in a tangible computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for persistently storing electronically coded instructions. It should be further noted that such computer programs, by nature of their existence as data stored in a physical medium by alterations of such medium, such as alterations or variations in the physical structure and/or properties (e.g., electrical, optical, mechanical, magnetic, chemical properties) of the medium, are not abstract ideas or concepts or representations per se, but instead are physical artifacts produced by physical processes that transform a physical medium from one state to another state (e.g., a change in the electrical charge, or a change in magnetic polarity) in order to persistently store the computer program in the medium. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
[0061] Finally, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention.

Claims

1. A computer-implemented method of generating a set of class labels for labeling an item, the method comprising: identifying a set of neighbor items associated with a first item based, in part, on a first feature set associated with the first item, wherein each neighbor item of the set of neighbor items is associated with a feature set; generating a supplemented feature set for the first item based on the identified set of neighbor items responsive to combining the first feature set and the feature sets associated with the set of neighbor items; and generating a set of classification scores associated with the first item based on the supplemented feature set, each classification score of the set of classification scores indicating a likelihood that the first item belongs to a class of items.
2. The method of claim 1, wherein the first item is a media content item and the neighbor items are media content items.
3. The method of claim 2, wherein the first feature set and the feature sets associated with the neighbor items include viewing statistics derived from providing the media content item to viewers.
4. The method of claim 2, wherein the first feature set and the feature sets associated with the neighbor items include information describing the media content item specified by a user of a media host system.
5. The method of claim 2, wherein the first feature set and the feature sets associated with the neighbor items include information generated from the media content items.
6. The method of claim 1, wherein identifying the set of neighbor items associated with a first item comprises: determining a set of distance metrics based on the first feature set and the feature sets associated with the set of neighbor items; and identifying the set of neighbor items based, in part, on the set of distance metrics.
7. The method of claim 6, wherein determining a set of distance metrics based on the first feature set and the feature sets associated with the set of neighbor items comprises: identifying a set of items associated with feature sets; identifying at least a first undesirable item of the set of items based on the feature sets associated with the set of items, wherein the feature set associated with the at least a first undesirable item comprises a feature that is specified by an administrator to indicate that the item is an undesirable item; generating a filtered set of items responsive to removing the at least a first undesirable item from the set of items; and determining a set of distance metrics based on the filtered set of items.
8. The method of claim 6, wherein identifying the set of neighbor items based, in part, on the set of distance metrics comprises: generating a similarity graph based on the set of distance metrics, wherein the similarity graph is comprised of nodes representing the items; and identifying the set of neighbor items based on the similarity graph.
9. The method of claim 8, wherein the item is a media content item, the neighbor items are media content items and identifying the set of neighbor items based, in part, on the similarity graph comprises: identifying at least a first node in the similarity graph representing an item of media associated with a feature set indicating that the media content item is associated with one or more viewing statistics beneath a threshold value; generating a pruned similarity graph responsive to removing the at least a first node; and identifying the set of neighbor items based on the pruned similarity graph.
10. The method of claim 1, wherein combining the first feature set with the feature sets associated with the neighbor items comprises: aggregating the first feature set with the feature sets associated with the neighbor items.
11. The method of claim 1, further comprising: generating a first set of class labels associated with the first media content item responsive to one or more classification scores of the set of classification scores exceeding a threshold value.
12. The method of claim 11, further comprising: identifying one or more sets of class labels associated with the set of neighbor items, wherein each neighbor item is associated with a set of class labels; generating a plurality of class consensus scores based on the first set of class labels associated with the first item and the one or more sets of class labels associated with the set of neighbor items, wherein each class consensus score indicates a correspondence between a class label associated with the first item and the neighbor items; generating a refined set of class labels associated with the first item responsive to removing at least one class label from the set of class labels associated with the first item based on a class consensus score associated with the at least one class label; and storing the set of refined class labels.
PCT/US2010/027709 2009-04-08 2010-03-17 Similarity-based feature set supplementation for classification WO2010117581A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP10762077.5A EP2417544A4 (en) 2009-04-08 2010-03-17 Similarity-based feature set supplementation for classification
CN2010800220637A CN102428467A (en) 2009-04-08 2010-03-17 Similarity-Based Feature Set Supplementation For Classification
CA2757771A CA2757771A1 (en) 2009-04-08 2010-03-17 Similarity-based feature set supplementation for classification

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16782509P 2009-04-08 2009-04-08
US61/167,825 2009-04-08

Publications (1)

Publication Number Publication Date
WO2010117581A1 true WO2010117581A1 (en) 2010-10-14

Family

ID=42936489

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2010/027709 WO2010117581A1 (en) 2009-04-08 2010-03-17 Similarity-based feature set supplementation for classification

Country Status (4)

Country Link
EP (1) EP2417544A4 (en)
CN (1) CN102428467A (en)
CA (1) CA2757771A1 (en)
WO (1) WO2010117581A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933191A (en) * 2015-07-09 2015-09-23 广东欧珀移动通信有限公司 Spam comment recognition method and system based on Bayesian algorithm and terminal
US9659014B1 (en) * 2013-05-01 2017-05-23 Google Inc. Audio and video matching using a hybrid of fingerprinting and content based classification
WO2022079482A1 (en) * 2020-10-14 2022-04-21 Coupang Corp. Systems and methods for database reconciliation
CN114896963A (en) * 2022-07-08 2022-08-12 北京百炼智能科技有限公司 Data processing method and device, electronic equipment and storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239351B (en) * 2013-06-20 2017-12-19 阿里巴巴集团控股有限公司 A kind of training method and device of the machine learning model of user behavior
EP2824589A1 (en) * 2013-07-12 2015-01-14 Thomson Licensing Method for enriching a multimedia content, and corresponding device.
EP3049962B1 (en) * 2013-09-27 2019-10-23 Intel Corporation Mechanism for facilitating dynamic and proactive data management for computing devices
CN107430633B (en) * 2015-11-03 2021-05-14 慧与发展有限责任合伙企业 System and method for data storage and computer readable medium
CN105608352B (en) * 2015-12-31 2019-06-25 联想(北京)有限公司 A kind of information processing method and server
CN107038193B (en) * 2016-11-17 2020-11-27 创新先进技术有限公司 Text information processing method and device
US11869055B2 (en) 2021-01-28 2024-01-09 Maplebear Inc. Identifying items offered by an online concierge system for a received query based on a graph identifying relationships between items and attributes of the items

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6990628B1 (en) 1999-06-14 2006-01-24 Yahoo! Inc. Method and apparatus for measuring similarity among electronic documents
US20060111968A1 (en) * 2004-11-19 2006-05-25 Image Impact, Inc. Method and system for valuing advertising content
US20070112709A1 (en) * 2005-10-31 2007-05-17 Huitao Luo Enhanced classification of marginal instances
US20070196013A1 (en) * 2006-02-21 2007-08-23 Microsoft Corporation Automatic classification of photographs and graphics
US20080114564A1 (en) * 2004-11-25 2008-05-15 Masayoshi Ihara Information Classifying Device, Information Classifying Method, Information Classifying Program, Information Classifying System

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1265308C (en) * 1998-09-29 2006-07-19 英业达股份有限公司 Automatic inquiry system and method
CN1196071C (en) * 2000-05-31 2005-04-06 三星电子株式会社 Database structuring method for multimedia contents
GB2393271A (en) * 2002-09-19 2004-03-24 Sony Uk Ltd Information storage and retrieval
JP4972358B2 (en) * 2006-07-19 2012-07-11 株式会社リコー Document search apparatus, document search method, document search program, and recording medium.
CN101196905A (en) * 2007-12-05 2008-06-11 覃征 Intelligent pattern searching method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2417544A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9659014B1 (en) * 2013-05-01 2017-05-23 Google Inc. Audio and video matching using a hybrid of fingerprinting and content based classification
CN104933191A (en) * 2015-07-09 2015-09-23 广东欧珀移动通信有限公司 Spam comment recognition method and system based on Bayesian algorithm and terminal
WO2022079482A1 (en) * 2020-10-14 2022-04-21 Coupang Corp. Systems and methods for database reconciliation
US11775565B2 (en) 2020-10-14 2023-10-03 Coupang Corp. Systems and methods for database reconciliation
CN114896963A (en) * 2022-07-08 2022-08-12 北京百炼智能科技有限公司 Data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CA2757771A1 (en) 2010-10-14
EP2417544A1 (en) 2012-02-15
CN102428467A (en) 2012-04-25
EP2417544A4 (en) 2013-10-02

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201080022063.7

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10762077

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2757771

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

REEP Request for entry into the european phase

Ref document number: 2010762077

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2010762077

Country of ref document: EP