Publication number: CN104317867 A
Publication type: Application
Application number: CN 201410554684
Publication date: 28 Jan 2015
Filing date: 17 Oct 2014
Priority date: 17 Oct 2014
Also published as: 201410554684.X, CN 104317867 A, CN 104317867A, CN 201410554684, CN-A-104317867, CN104317867 A, CN104317867A, CN201410554684, CN201410554684.X
Inventors: 朱其立, 赵凯祺, 蔡智源, 隋清宇, 魏恩勋
Applicant: Shanghai Jiao Tong University (上海交通大学)
System for carrying out entity clustering on web pictures returned by search engine
CN 104317867 A
Abstract
The invention relates to a system for entity clustering of web images returned by a search engine. The system comprises an offline system and an online system. The offline system preprocesses the source web pages on which the images appear. The online system receives a query, submits it to the search engine, and receives multiple pages of image results; for each page of results it looks up the conceptualized metadata and text of the source web pages and extracts the query context and the image context from the conceptualized text. The online system then performs three-layer clustering on the metadata, the context, and the context after concept expansion, and automatically labels each cluster with relevant descriptive concepts so that the entity behind each cluster can be identified. The three-layer clustering algorithm has the same time complexity as an ordinary hierarchical clustering algorithm. Because the features are subdivided, the input to each layer (the output of the previous layer) is more precise, which improves clustering quality and yields accurate descriptive concepts.
Claims (8), translated from Chinese
1. A system for entity clustering of web images returned by a search engine, characterized in that it comprises an offline system and an online system, wherein: the offline system preprocesses the source web pages of all images, which includes extracting web page metadata and conceptualizing the original page text and the metadata into a set of weighted concepts, i.e. a concept vector, the conceptualized metadata and page content being made available for lookup by the online system; and the online system receives a query, submits it to a search engine and receives multiple pages of returned image results, and, for each page of results, finds the conceptualized metadata and text of the source web pages and extracts from the conceptualized text the context of the query keyword and the context of the image, the online system performing three-layer clustering using, respectively, the metadata, the context, and the expanded context obtained by concept expansion of the context, and automatically labeling each cluster with relevant descriptive concepts so as to identify the entity of each cluster.
2. The system for entity clustering of web images returned by a search engine according to claim 1, characterized in that the offline system performs metadata extraction, including extraction of the valid terms in the URL and of the image ALT attribute, wherein the extraction of valid URL terms uses a binary classifier to classify terms as valid or invalid and returns the valid terms.
3. The system for entity clustering of web images returned by a search engine according to claim 1, characterized in that the offline system comprises a conceptualization module used for concept expansion of the context; text passing through the conceptualization module is converted into a set of weighted concepts, the weight of each concept being the importance of that concept to the image, defined as follows:
$$\mathrm{CF\text{-}IDF}(c,d) = \mathrm{CF}(c,d) \times \log\frac{|D|}{\mathrm{DF}(c)}$$
where CF-IDF(c,d) is the importance of concept c to image d and is the product of two parts: the frequency CF(c,d) with which the concept occurs in the image's context, and the inverse context frequency, which is inversely proportional to the number DF(c) of contexts in which the concept has occurred; D is the set of contexts of all images.
4. The system for entity clustering of web images returned by a search engine according to claim 1, characterized in that the online system comprises a text context extraction module which, for the input query keyword, extracts the conceptualized query context and image context.
5. The system for entity clustering of web images returned by a search engine according to claim 4, characterized in that the online system comprises a three-layer clustering algorithm module which, using three kinds of features (the extracted metadata, the context, and the expanded context), performs three levels of clustering, proceeding from the metadata, which has the highest confidence, to the context, and then to the expanded context, wherein: first-layer clustering performs agglomerative hierarchical clustering on the concept vectors of the conceptualized metadata, obtains clusters with high intra-cluster precision, and merges the concept vectors of all images in each cluster into the concept vector of the cluster; second-layer clustering adds the concept vector of the conceptualized context to the concept vector of each image, updates the concept vectors of all clusters obtained from the first layer, and performs further agglomerative hierarchical clustering on these clusters; third-layer clustering replaces the vector of each image with its expanded concept vector, updates the concept vectors of all clusters obtained from the second layer, and performs further agglomerative hierarchical clustering on these concept vectors.
6. The system for entity clustering of web images returned by a search engine according to claim 5, characterized in that the agglomerative hierarchical clustering algorithm computes the similarity between clusters from the conceptualization of each cluster; a cluster is conceptualized by summing the concept vectors of the images in the cluster and removing the concepts with relatively low values from the vector, giving a high-precision cluster concept; the conceptualization of a cluster is defined by the following formula:
Figure CN104317867AC00031
where c is a concept, C is a cluster, d is an image in the cluster, and CF-IDF(c,d) is the importance of the concept to the image.
7. The system for entity clustering of web images returned by a search engine according to claim 5, characterized in that the third-layer clustering expands the context using Wikipedia, replaces the concept vectors of the images with the expanded concept vectors, and updates the concept vector of each cluster, the update being defined by the following formula:
Figure CN104317867AC00032
where CF-IDF(c, d_{c_i}) is the importance of concept c to the Wikipedia description page of concept c_i, V is the set of all concepts in the current cluster concept vector, and c_i is a concept in the current cluster concept vector; the context expansion process filters noisy data by keeping only the k concepts with the largest values.
8. The system for entity clustering of web images returned by a search engine according to claim 1, characterized in that the cluster concept vectors obtained after the three-layer clustering are used to label each image cluster with relevant descriptive concepts, the several highest-valued concepts in each cluster's concept vector being selected to describe the entity that the cluster represents.
Description, translated from Chinese
System for entity clustering of web images returned by a search engine

Technical Field

[0001] The present invention relates to natural language processing and text mining in the field of computer technology, and in particular to a system for entity clustering of web images returned by a search engine.

Background

[0002] With the spread of the Internet and the growing number of web images, web image search has become an everyday application for Internet users. Current image search engines mainly return images related to the query keyword, and these images often cover several entities that share the same name. To find the desired image, a user has to browse through every returned image. To make search results easier to read, separating them by entity has become one direction for improving image search engines.

[0003] Image clustering is a way to distinguish different entities automatically. In earlier work, D. Cai (see Cai, D., He, X., Ma, W.-Y., Wen, J.-R., Zhang, H.: Organizing WWW images based on the analysis of page layout and web link structure. ICME 2004) extracted the context of web images using vision-based page segmentation and clustered the images using that context together with web link information. However, the instability of visual segmentation and the noise in the context severely limit clustering precision. Z. Fu (see Fu, Z., Ip, H.H.S., Lu, H., Lu, Z.: Multi-modal constraint propagation for heterogeneous image clustering. MultiMedia 2011) proposed a framework that combines several modalities, such as image tags and visual features, and clusters images by propagating cluster constraints across multiple graphs. Because visual features currently cannot be extracted accurately enough, the framework propagates the errors they contain. The method also has to propagate constraints across multiple graphs, which makes clustering inefficient and unsuitable for clustering online image search results. Moreover, current image clustering methods cannot provide descriptive concepts with which to label each cluster.

Summary of the Invention

[0004] To address the shortcomings of the prior art, the present invention provides a system for entity clustering of web images returned by a search engine, so that image search results are better organized by entity, each entity cluster has high precision, and different entities are clearly distinguishable. The invention splits the framework into an online part and an offline part, which greatly reduces the time cost of online clustering.

[0005] To achieve this, the invention adopts the following technical solution:

[0006] A system for entity clustering of web images returned by a search engine, comprising an offline system and an online system, wherein:

[0007] The offline system preprocesses the source web pages of all images. This includes extracting web page metadata and conceptualizing the original page text and metadata into a set of weighted concepts (a concept vector). The conceptualized metadata and page content are made available for lookup by the online system.

[0008] The online system receives a query, submits it to the search engine, and receives multiple pages of returned image results. For each page of results it finds the conceptualized metadata and text of the source web pages and extracts from the conceptualized text the context of the query keyword (the query context) and the image context. The online system then performs three-layer clustering using, respectively, the metadata, the context, and the expanded context obtained by concept expansion of the context via Wikipedia, and automatically labels each cluster with relevant descriptive concepts so that the entity of each cluster can be identified.

[0009] The offline system performs metadata extraction, which covers the valid terms in the URL and the image ALT attribute. To extract valid URL terms, a binary classifier classifies terms as valid or invalid, and the valid terms are returned. The image ALT attribute can be obtained directly from the HTML source.
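As an illustration of reading the ALT attribute straight from the HTML source, here is a minimal sketch using Python's standard `html.parser`. The class name `AltCollector` and the helper `image_alts` are hypothetical names chosen for this example; the patent does not specify how the HTML is parsed.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collects the ALT text of every <img> tag, keyed by its src attribute."""

    def __init__(self):
        super().__init__()
        self.alts = {}

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:
            # A missing ALT attribute is stored as an empty string.
            self.alts[src] = attributes.get("alt") or ""


def image_alts(html):
    """Return a dict mapping image src to ALT text for one HTML document."""
    collector = AltCollector()
    collector.feed(html)
    return collector.alts


page = '<p><img src="mr_bean.jpg" alt="Mr. Bean waving"><img src="x.png"></p>'
print(image_alts(page))
```

The ALT text gathered this way would then go through the same conceptualization step as the URL terms.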

[0010] The offline system includes a conceptualization module, which conceptualizes both the metadata and the original page text of an image. Conceptualization maps the words in the metadata and text to Wikipedia concepts, turning them into sets of weighted concepts that can be used to compute similarity for the clustering algorithm. The weight of each concept is the importance of that concept to the image, defined as follows:

[0011]

$$\mathrm{CF\text{-}IDF}(c,d) = \mathrm{CF}(c,d) \times \log\frac{|D|}{\mathrm{DF}(c)}$$

[0012] where CF-IDF(c,d) is the importance of concept c to image d and is the product of two parts: the frequency CF(c,d) with which the concept occurs in the image's context, and the inverse context frequency, which is inversely proportional to the number DF(c) of contexts in which the concept has occurred. As stated in claim 3, D is the set of contexts of all images.
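The weighting can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: it assumes CF(c,d) is a raw occurrence count and uses the natural logarithm, neither of which the text specifies. The function name `cf_idf` and the sample data are hypothetical.

```python
import math
from collections import Counter


def cf_idf(contexts):
    """contexts: dict mapping image id -> list of concepts in that image's context.

    Returns a dict mapping image id -> {concept: CF-IDF weight}.
    """
    n_contexts = len(contexts)  # |D|
    # DF(c): number of contexts in which concept c occurs at least once.
    df = Counter()
    for concepts in contexts.values():
        df.update(set(concepts))
    vectors = {}
    for image_id, concepts in contexts.items():
        cf = Counter(concepts)  # CF(c, d), assumed here to be a raw count
        vectors[image_id] = {
            concept: count * math.log(n_contexts / df[concept])
            for concept, count in cf.items()
        }
    return vectors


sample = {
    "img1": ["Mr. Bean", "Comedy", "Comedy"],
    "img2": ["Mr. Bean", "Legume"],
}
weights = cf_idf(sample)
```

In this sample, "Mr. Bean" occurs in every context and therefore gets weight zero, while "Comedy" occurs twice in one context only and gets 2·log 2.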

[0013] The online system includes a text context extraction module, which extracts context information from the already conceptualized page text. This covers both the image context and the query context. Each is cut out with a fixed-size window, for example the 50 concepts before and after the image or the query keyword. The extracted text context forms a concept vector used to compute image similarity.
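A minimal sketch of the fixed-size window follows, assuming the conceptualized page text is available as an ordered list of concepts and the position of the image or query keyword in that list is known. The function name `window_context` and the choice to exclude the anchor itself are assumptions made for illustration.

```python
from collections import Counter


def window_context(concepts, anchor_index, window=50):
    """Return a bag of the concepts within `window` positions of the anchor.

    The anchor itself (the image or query keyword position) is excluded.
    """
    start = max(0, anchor_index - window)
    end = min(len(concepts), anchor_index + window + 1)
    neighbours = concepts[start:anchor_index] + concepts[anchor_index + 1:end]
    return Counter(neighbours)


sequence = ["a", "b", "QUERY", "c", "d", "e"]
narrow = window_context(sequence, anchor_index=2, window=1)
wide = window_context(sequence, anchor_index=2)
```

The window is clipped at the start and end of the page, so an anchor near the top of a page simply yields a shorter context.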

[0014] The online system includes a three-layer clustering algorithm module made up of three modules: metadata clustering, text context clustering, and expanded-context (concept expansion) clustering, wherein:

[0015] First-layer clustering performs agglomerative hierarchical clustering on the concept vectors of the conceptualized metadata, obtains clusters with high intra-cluster precision, and merges the concept vectors of all images in each cluster into the concept vector of that cluster.

[0016] Here the agglomerative hierarchical clustering algorithm computes the similarity between clusters from the conceptualization of each cluster. A cluster is conceptualized by summing the concept vectors of its images and removing the concepts with relatively low values, giving a high-precision cluster concept. The conceptualization of a cluster is defined by the following formula:

[0017]

Figure CN104317867AD00052

[0018] where c is a concept, C is a cluster, d is an image in the cluster, and CF-IDF(c,d) is the importance of the concept to the image.
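The cluster conceptualization described above (sum the image vectors, then drop low-valued concepts) might look like the sketch below. The formula itself is only available as an image, so the pruning rule here is an assumption: concepts are dropped when their summed weight falls below a fraction `min_fraction` of the largest weight. Both the function name and that parameter are hypothetical.

```python
from collections import Counter


def conceptualize_cluster(image_vectors, min_fraction=0.1):
    """Sum the concept vectors of a cluster's images and prune weak concepts.

    image_vectors: list of {concept: weight} dicts, one per image.
    min_fraction: assumed pruning rule, relative to the largest summed weight.
    """
    total = Counter()
    for vector in image_vectors:
        for concept, weight in vector.items():
            total[concept] += weight
    if not total:
        return {}
    cutoff = min_fraction * max(total.values())
    return {concept: weight for concept, weight in total.items() if weight >= cutoff}


members = [
    {"Mr. Bean": 2.0, "Car": 0.1},
    {"Mr. Bean": 1.0, "Comedy": 1.5},
]
cluster_concepts = conceptualize_cluster(members)
```

Here "Car" is removed because its summed weight (0.1) is below one tenth of the largest weight (3.0).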

[0019] Second-layer clustering adds the concept vector of the conceptualized context to the concept vector of each image, updates the concept vectors of all clusters obtained from the first layer, and performs further agglomerative hierarchical clustering on these clusters.

[0020] Third-layer clustering replaces the vector of each image with its expanded concept vector, updates the concept vectors of all clusters obtained from the second layer, and performs further agglomerative hierarchical clustering on these concept vectors.

[0021] Here the vectors are expanded using the Wikipedia description pages of concepts: related concepts are added to the image's concept vector, and the concept vector of each cluster is updated. The update is defined by the following formula:

[0022]

Figure CN104317867AD00053

[0023] where CF-IDF(c, d_{c_i}) is the importance of concept c to the Wikipedia description page of concept c_i, and c_i is a concept in the current cluster concept vector. The expansion filters noisy data by keeping only the k concepts with the largest values.
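The expansion step can be sketched as follows. Because the update formula is only available as an image, the scoring here is an assumption: the score of a candidate concept c is taken to be the sum of CF-IDF(c, d_{c_i}) over the concepts c_i already in the cluster vector, and only the k highest-scoring candidates are added. The function name `expand_vector`, the precomputed lookup `wiki_vectors`, and the sample weights are all hypothetical.

```python
import heapq


def expand_vector(cluster_vector, wiki_vectors, k=3):
    """Expand a cluster concept vector with the top-k related concepts.

    cluster_vector: {concept: weight} for the current cluster.
    wiki_vectors: {concept: {related concept: CF-IDF weight on that concept's
                   Wikipedia description page}}, assumed to be precomputed offline.
    """
    scores = {}
    for concept in cluster_vector:
        for related, weight in wiki_vectors.get(concept, {}).items():
            scores[related] = scores.get(related, 0.0) + weight
    # Keep only the k best candidates to filter out noisy concepts.
    top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    expanded = dict(cluster_vector)
    for related, weight in top:
        expanded[related] = expanded.get(related, 0.0) + weight
    return expanded


cluster = {"Mr. Bean": 3.0, "Rowan Atkinson": 1.0}
wiki = {
    "Mr. Bean": {"Rowan Atkinson": 2.0, "Sitcom": 0.5},
    "Rowan Atkinson": {"Blackadder": 1.5, "Sitcom": 0.25, "Oxford": 0.1},
}
expanded = expand_vector(cluster, wiki, k=2)
```

With k=2 only "Rowan Atkinson" and "Blackadder" survive the filter; the weaker candidates "Sitcom" and "Oxford" are discarded as noise.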

[0024] The cluster concept vectors produced by the three-layer clustering are used to label each image cluster with relevant descriptive concepts: the several highest-valued concepts in each cluster's concept vector are selected to describe the entity the cluster represents.
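Labeling then reduces to picking the highest-weighted concepts. A minimal sketch, in which the function name `label_cluster`, the default of three labels, and the alphabetical tie-break are assumptions for illustration:

```python
def label_cluster(cluster_vector, n=3):
    """Return the n highest-weighted concepts of a cluster concept vector.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(cluster_vector.items(), key=lambda item: (-item[1], item[0]))
    return [concept for concept, _ in ranked[:n]]


labels = label_cluster(
    {"Mr. Bean": 5.0, "Rowan Atkinson": 4.0, "Comedy": 2.0, "Car": 0.5}
)
```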

[0025] The technical problems addressed by the invention include:

[0026] 1. Extracting image context information and representing it as a vector in concept space, which provides features for computing image similarity.

[0027] 2. Because some images have too little context information, the invention provides a mechanism for expanding it: the context concept vector is expanded using Wikipedia or another knowledge base.

[0028] 3. Different features are related to an image to different degrees, and the more closely related a feature is, the higher its confidence. To exploit features of differing relevance and improve clustering precision, the invention expands the image concept vectors and clusters in successive stages.

[0029] The technical features of the invention are illustrated below by comparison with related prior art found by search.

[0030] Related search result 1:

[0031] Application (patent) No.: 2012101444570. Title: Method and device for image clustering

[0032] This patent document clusters images twice using visual features, including global and local features; the second clustering splits the clusters produced by the first.

[0033] Comparison of technical points:

[0034] 1. That patent clusters images by their content, i.e. visual features, whereas the present invention clusters using features of the image context.

[0035] 2. That patent's second clustering cuts large clusters into small ones, whereas the present invention merges small clusters into large ones and uses each expansion of the concept vectors to select features and filter noisy data.

[0036] 3. The concept vector representation used by the present invention makes it possible to label each cluster with descriptive concepts, whereas content-based clustering cannot provide concept descriptions.

[0037] Related search result 2:

[0038] Application (patent) No.: 2013106111554. Title: Massive image retrieval system based on clustered compact features

[0039] This patent document clusters the images in an image library by their local features. At search time, the query keyword is first used to retrieve image clusters, and the corresponding images are then returned.

[0040] Comparison of technical points:

[0041] 1. That patent generates clustered compact features from the local features of images and clusters images with them, whereas the present invention clusters using features of the image context.

[0042] 2. That patent uses image clustering to speed up retrieval, whereas the present invention clusters and conceptualizes the search results in order to present search results separated by category.

[0043] Related search result 3:

[0044] Application (patent) No.: 201210545637X. Title: Balanced image clustering method based on hierarchical clustering

[0045] This patent document uses image clustering to reduce the number of images that must be traversed during a search. The clustering is based on high-dimensional image feature data.

[0046] Comparison of technical points:

[0047] 1. That patent clusters images by their high-dimensional features, whereas the present invention clusters using features of the image context.

[0048] 2. That patent uses image clustering to reduce the images traversed during retrieval, and the clustering it uses is hierarchical clustering, whereas the present invention uses three different kinds of context features and improves clustering precision through three-layer clustering.

[0049] Related search result 4:

[0050] Application (patent) No.: 201210163641X. Title: Image clustering method

[0051] That patent obtains the time and position data of images from the capture device and clusters using time, position, and speed data as features.

[0052] Comparison of technical points:

[0053] 1. That patent mainly clusters captured photographs, whereas the present invention clusters web images. Captured photographs have no context information, while web images are not necessarily photographs and mostly lack capture time and position. The features of the two differ.

[0054] 2. That patent clusters on event sequences, whereas the present invention is based on concept vectors, which can be used to generate descriptive concepts.

[0055] Related search result 5:

[0056] Application (patent) No.: 2009801523973. Title: Arranging images into pages using content-based filtering and theme-based clustering

[0057] That patent clusters images captured by a device by theme according to their content, i.e. visual features, and maps the clustering results to corresponding albums.

[0058] Comparison of technical points:

[0059] 1. That patent clusters using the visual features of images, whereas the present invention clusters using the context of web images.

[0060] 2. That patent lays images out onto different pages, whereas the present invention provides users with categorized search results and corresponding descriptive concepts.

[0061] Related search result 6:

[0062] Application (patent) No.: 2010105171639. Title: Image clustering method and system

[0063] That patent builds a directed graph of the images by parameter estimation and clusters images by partitioning the directed graph. Partitioning produces several subgraphs, and the images of each subgraph are assigned to one cluster.

[0064] Comparison of technical points:

[0065] 1. That patent clusters using a graph: the image library is represented as a directed graph. The present invention forms image clusters by merging images from small clusters into larger ones, and each clustering layer considers a different image context feature.

[0066] Related search result 7:

[0067] Application (patent) No.: 2005800393866. Title: Image clustering method and system

[0068] That patent clusters images by event using time and place features; its clustering algorithm clusters at different layers according to different time ranges.

[0069] Comparison of technical points:

[0070] 1. The layers in that patent's multi-layer clustering are different time ranges, whereas the layers of the present invention are defined by different features.

[0071] 2. That patent clusters by event sequence, whereas the present invention separates image clusters by entity.

[0072] Compared with the prior art, the present invention uses three different kinds of features and a corresponding three-layer clustering algorithm to cluster images and provides concept labels for each cluster, so that image search results are better organized by entity, each entity cluster has high precision, and different entities are clearly distinguishable. The invention splits the framework into an online part and an offline part, which greatly reduces the time cost of online clustering.

Brief Description of the Drawings

[0073] Other features, objects, and advantages of the invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:

[0074] Fig. 1 shows the system framework of the invention;

[0075] Fig. 2 shows an example of the three-layer clustering algorithm of the invention.

Detailed Description

[0076] Embodiments of the invention are described in detail below with reference to the drawings. The embodiment is implemented on the basis of the technical solution of the invention, and detailed implementations and specific operating procedures are given, but the scope of protection of the invention is not limited to the embodiment below.

[0077] The task in this embodiment is, for the user query keyword "bean", to obtain the search engine's image search results, cluster the different instances of "bean" in the results so as to tell the different entities apart, and provide different concept labels for each distinct "bean".

[0078] As shown in Fig. 1, the metadata extraction module of the offline system extracts the metadata context from all original web pages related to "bean" in this embodiment. Suppose, for example, that the URL of a page is:

[0079] "http://domain.com/53C316-C2oJ5/mr_bean.jpg"

[0080] The metadata extraction module splits the URL into words at delimiters and uses a binary classifier to detect the valid terms, here "mr bean". The conceptualization module of the offline system conceptualizes the metadata and the related web pages for "bean", producing metadata concept vectors and text concept vectors.
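The URL splitting step can be sketched as below. The patent uses a trained binary classifier to separate valid from invalid terms; this sketch substitutes a simple hand-written rule (`looks_valid`, purely alphabetic and at least two letters long) as a stand-in, so it illustrates the data flow rather than the actual classifier. All function names are hypothetical.

```python
import re
from urllib.parse import urlparse


def url_terms(url):
    """Split the path of a URL into lowercase candidate terms at delimiters."""
    path = urlparse(url).path
    # Drop the file extension of the last path segment, e.g. ".jpg".
    path = re.sub(r"\.[A-Za-z0-9]+$", "", path)
    return [term.lower() for term in re.split(r"[/_\-.]+", path) if term]


def looks_valid(term):
    """Stand-in for the binary classifier (assumption, not the patented model)."""
    return term.isalpha() and len(term) >= 2


def valid_url_terms(url):
    """Return only the candidate terms that the stand-in rule accepts."""
    return [term for term in url_terms(url) if looks_valid(term)]


example_url = "http://domain.com/53C316-C2oJ5/mr_bean.jpg"
print(valid_url_terms(example_url))
```

For the example URL, the candidates are "53c316", "c2oj5", "mr", and "bean"; the rule rejects the two identifier-like tokens and keeps "mr" and "bean", matching the result stated in the text.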

[0081] On receiving the user's query keyword "bean", the text context extraction module of the online system locates the image and the query keyword "bean" in the conceptualized text and extracts the 50 concepts before and after each as the text context concept vector. Using the metadata concept vectors and the text context concept vectors, the online system performs three-layer clustering.

[0082] As shown in Fig. 2, the three-layer clustering module of the online system first computes image similarity from the metadata concept vectors and performs agglomerative hierarchical clustering (the concept vectors of image 1 and image 2 both contain the concept "Mr. Bean", while no valid metadata concept was found for image 3 or image 4). In agglomerative hierarchical clustering, the similarity between clusters is computed from the cluster concept vectors. The system computes the cluster concept vectors from the result of the first layer; for example, image 1 and image 2 form a cluster whose concept vector contains the concept "Mr. Bean".

[0083] Second-layer clustering builds on the first layer by expanding the image concept vectors and clustering further. In Fig. 2, the concept "Rowan Atkinson" is added to the concept vector of the cluster formed by image 1 and image 2, "Rowan Atkinson" and "Comedy" are added to the concept vector of image 3, and "Blackadder" is added to image 4. Because the expanded vectors share more concepts, the second round of hierarchical clustering merges some similar clusters into larger ones. In Fig. 2, images 1, 2, and 3 form a new cluster whose concept vector is expanded to "Mr. Bean", "Rowan Atkinson", "Comedy".

[0084] Third-layer clustering first expands the vector of each cluster or image using Wikipedia. In Fig. 2, "Blackadder" is added to the concept vector of the cluster made up of images 1, 2, and 3, and "Rowan Atkinson" is added to image 4. After the Wikipedia-based expansion, the cluster vectors are more similar to one another. The third round of hierarchical clustering then merges clusters that were previously left unmerged for lack of information. In Fig. 2, the expanded vector lets image 4 be merged into the cluster containing images 1, 2, and 3.
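The Fig. 2 walkthrough can be reproduced with a small sketch of the three-layer procedure. Several things here are assumptions made for illustration: cosine similarity as the cluster similarity measure, a merge threshold of 0.4, unit weights for every concept, and the per-image vectors supplied for each layer. The naive pairwise search is written for readability, not efficiency.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {concept: weight} vectors."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_vector(members, image_vectors):
    """Sum the concept vectors of all images in a cluster."""
    total = {}
    for image in members:
        for concept, weight in image_vectors[image].items():
            total[concept] = total.get(concept, 0.0) + weight
    return total


def agglomerate(clusters, image_vectors, threshold):
    """Repeatedly merge the most similar pair until none reaches the threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        vectors = [cluster_vector(c, image_vectors) for c in clusters]
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                similarity = cosine(vectors[i], vectors[j])
                if similarity >= best:
                    best, pair = similarity, (i, j)
        if pair is None:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def three_layer(layers, threshold=0.4):
    """Run one agglomerative pass per layer, carrying clusters forward."""
    clusters = [[image] for image in sorted(layers[0])]
    for image_vectors in layers:
        clusters = agglomerate(clusters, image_vectors, threshold)
    return clusters


bean = {"Mr. Bean": 1.0}
layers = [
    # Layer 1: metadata only (images 3 and 4 have no valid metadata concept).
    {"img1": bean, "img2": bean, "img3": {}, "img4": {}},
    # Layer 2: metadata plus text context.
    {"img1": {**bean, "Rowan Atkinson": 1.0}, "img2": {**bean, "Rowan Atkinson": 1.0},
     "img3": {"Rowan Atkinson": 1.0, "Comedy": 1.0}, "img4": {"Blackadder": 1.0}},
    # Layer 3: vectors after Wikipedia-based expansion.
    {"img1": {**bean, "Rowan Atkinson": 1.0, "Blackadder": 1.0},
     "img2": {**bean, "Rowan Atkinson": 1.0, "Blackadder": 1.0},
     "img3": {"Rowan Atkinson": 1.0, "Comedy": 1.0, "Blackadder": 1.0},
     "img4": {"Blackadder": 1.0, "Rowan Atkinson": 1.0}},
]
final_clusters = three_layer(layers)
```

Each layer starts from the clusters left by the previous one, so the high-confidence metadata merges are never undone by the noisier context features.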

[0085] When the three-layer clustering algorithm finishes, the online system separates the clusters and presents all entities and their images to the user. Each entity is described by the few most representative (highest-valued) concepts in its concept vector. For example, the cluster in Fig. 2 can be described by concepts such as "Mr. Bean", "Rowan Atkinson", "Comedy", and "Blackadder", which together describe images of Mr. Bean, the comedy character played by Rowan Atkinson.

[0086] Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to these particular embodiments; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Patent Citations
Cited patent | Filing date | Publication date | Applicant | Title
CN101751439A * | 17 Dec 2008 | 23 Jun 2010 | Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所) | Image retrieval method based on hierarchical clustering
CN102902821A * | 1 Nov 2012 | 30 Jan 2013 | Beijing University of Posts and Telecommunications (北京邮电大学) | Methods for labeling and searching advanced semantics of images based on network hot topics and device
CN103577537A * | 24 Sep 2013 | 12 Feb 2014 | Shanghai Jiao Tong University (上海交通大学) | Image sharing website picture-oriented multi-pairing similarity determining method
US20090094020 * | 1 Oct 2008 | 9 Apr 2009 | Fujitsu Limited | Recommending Terms To Specify Ontology Space
Classifications
International classification: G06F17/30
Cooperative classification: G06F17/30867
Legal Events
Date | Code | Event | Description
28 Jan 2015 | C06 | Publication |
25 Feb 2015 | C10 | Entry into substantive examination |