CN104317867A

CN104317867A - System for carrying out entity clustering on web pictures returned by search engine

Info

Publication number: CN104317867A
Application number: CN201410554684.XA
Authority: CN
Inventors: 朱其立; 赵凯祺; 蔡智源; 隋清宇; 魏恩勋
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2014-10-17
Filing date: 2014-10-17
Publication date: 2015-01-28
Anticipated expiration: 2034-10-17
Also published as: CN104317867B

Abstract

The invention relates to a system for entity clustering of web pictures returned by a search engine. The system comprises an offline system and an online system. The offline system preprocesses the source web pages in which all pictures are located. The online system receives a query, submits it to the search engine, and receives multiple pages of returned picture results; for each page of results it finds the conceptualized metadata and text of the source web pages and extracts a query context and a picture context from the conceptualized text. After conceptually expanding the context, the online system performs three-layer clustering on the metadata, the context, and the expanded context, and automatically labels each cluster with relevant descriptive concepts so that the entity of each cluster can be understood. The three-layer clustering algorithm has the same time complexity as an ordinary hierarchical clustering algorithm; by subdividing the features, the input to each layer (that is, the output of the previous layer) becomes more precise, which effectively improves the clustering and yields accurate descriptive concepts.

Description

System for entity clustering of web pictures returned by a search engine

Technical field

The present invention relates to natural language processing and text mining in the field of computer technology, and in particular to a system for entity clustering of web pictures returned by a search engine.

Background art

With the spread of the Internet and the growing number of web pictures, web picture search has gradually become a major everyday activity of Internet users. Current picture search engines mainly return pictures related to the query keyword, and these pictures often cover multiple entities that share the same name. To find the desired picture in the search results, the user has to browse through every returned picture. To make search results more readable, distinguishing the results by entity has become one direction for improving image search engines.

Image clustering is a method for automatically distinguishing different entities. In earlier work, D. Cai et al. (see Cai, D., He, X., Ma, W.Y., Wen, J.R., Zhang, H.: Organizing www images based on the analysis of page layout and web link structure. ICME 2004) used vision-based page segmentation to extract the context of web pictures and clustered them using this context together with web link information. However, because vision-based segmentation is unstable and the context contains noisy data, clustering precision is severely limited. Z. Fu et al. (see Fu, Z., Ip, H.H.S., Lu, H., Lu, Z.: Multi-modal constraint propagation for heterogeneous image clustering. MultiMedia 2011) proposed a framework that combines multiple modalities, such as image tags and visual features, and clusters images by propagating class constraints over multiple graphs. Because current visual feature extraction is not sufficiently precise, this framework propagates the errors contained in the visual features. In addition, the method has to propagate constraints over multiple graphs, which makes clustering inefficient and unsuitable for online picture search results. Finally, current image clustering methods cannot provide descriptive concepts to label each class.

Summary of the invention

The present invention addresses the deficiencies of the prior art by providing a system for entity clustering of web pictures returned by a search engine, so that picture search results are better organized by entity, each entity class has high precision, and different entities are clearly distinguished. The invention splits the overall framework into an online part and an offline part, which greatly reduces the time overhead of online clustering.

To achieve the above object, the present invention adopts the following technical solution:

A system for entity clustering of web pictures returned by a search engine, comprising two parts, an offline system and an online system, wherein:

The offline system preprocesses the source web pages in which all pictures are located. This includes extracting web page metadata and conceptualizing the original web page text and the metadata into a set of weighted concepts (a concept vector). The conceptualized metadata and web page content are made available for the online system to query.

The online system receives a query, submits it to the search engine, and receives multiple pages of returned picture results. For each page of results, it finds the conceptualized metadata and text of the source web pages and extracts from the conceptualized text the context of the query keyword (the query context) and the picture context. The online system then performs three-layer clustering using, respectively, the metadata, the context, and the expanded context obtained by expanding the context with concepts from Wikipedia, and automatically labels each cluster with relevant descriptive concepts so that the entity of each cluster can be understood.

The offline system performs metadata extraction, which covers the valid terms in the URL and the picture ALT attribute. To extract valid terms from the URL, a binary classifier classifies terms as valid or invalid, and the valid terms are returned. The picture ALT attribute can be obtained directly from the HTML source code.
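The URL term extraction described above might be sketched as follows. This is only an illustration: the function names `is_valid_term` and `extract_url_terms` are hypothetical, and the simple alphabetic rule is a stand-in assumption for the trained binary classifier, whose features and model the text does not specify.

```python
import re
from urllib.parse import urlparse


def is_valid_term(term):
    """Hypothetical stand-in for the binary classifier (assumption):
    a term is treated as valid if it is purely alphabetic and has
    at least two characters."""
    return term.isalpha() and len(term) >= 2


def extract_url_terms(url):
    """Split the URL path at delimiters and keep the terms judged valid."""
    path = urlparse(url).path
    # Drop a trailing file extension such as ".jpg".
    path = re.sub(r"\.[A-Za-z0-9]{1,5}$", "", path)
    # Any run of non-alphanumeric characters acts as a delimiter.
    tokens = re.split(r"[^A-Za-z0-9]+", path)
    return [t.lower() for t in tokens if t and is_valid_term(t)]


terms = extract_url_terms("http://domain.com/53C316-C2oJ5/mr_bean.jpg")
```

With this stand-in rule, the hash-like path segments are rejected and only the word-like terms survive; a real system would replace `is_valid_term` with the trained classifier.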

The offline system comprises a conceptualization module, which conceptualizes the picture metadata and the original web page text. Conceptualization maps the words in the metadata and text to Wikipedia concepts, turning the metadata and text into sets of weighted concepts from which similarity can be computed for the clustering algorithm. The weight of each concept is the importance of that concept to the picture, defined as follows:

\mathrm{CF\text{-}IDF}(c, d) = \mathrm{CF}(c, d) \times \log \frac{|D|}{\mathrm{DF}(c)}

Here, CF-IDF(c, d) is the importance of concept c to picture d. It is the product of two parts: the frequency CF(c, d) with which the concept occurs in the picture's context, and the inverse context frequency, which decreases as DF(c), the number of contexts in which the concept occurs, increases; D is the set of the contexts of all pictures.
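A minimal sketch of this weighting, assuming that CF(c, d) is a raw occurrence count and that each picture context is given as a list of concept names (both are assumptions; the text does not fix the representation). The function name `cf_idf_vectors` is hypothetical.

```python
import math
from collections import Counter


def cf_idf_vectors(contexts):
    """contexts: dict mapping a picture id to the list of concepts
    found in its context (repeats allowed).
    Returns a dict mapping each picture id to {concept: CF-IDF weight}."""
    n_contexts = len(contexts)  # |D|
    df = Counter()              # DF(c): number of contexts containing c
    for concepts in contexts.values():
        df.update(set(concepts))
    vectors = {}
    for pic, concepts in contexts.items():
        cf = Counter(concepts)  # CF(c, d): raw count (assumption)
        vectors[pic] = {
            c: cf[c] * math.log(n_contexts / df[c]) for c in cf
        }
    return vectors


vecs = cf_idf_vectors({
    "p1": ["Mr. Bean", "Mr. Bean", "Comedy"],
    "p2": ["Comedy"],
    "p3": ["Coffee bean"],
})
```

Note that a concept occurring in every context receives weight zero under this formula, which is the usual behaviour of an inverse-frequency term.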

The online system comprises a text context extraction module, which extracts context information from the conceptualized original web page text. This covers extraction of the picture context and extraction of the query context. Both are cut out with a window of fixed size, for example the 50 concepts before and after the picture or the query keyword. The extracted text context forms a concept vector, which is used to compute picture similarity.
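The fixed-size window can be illustrated with a short sketch. It assumes the conceptualized page is a flat list of concepts and that the position of the picture or query keyword is known as an index into that list; the function name `context_window` and the choice to exclude the anchor itself are assumptions.

```python
def context_window(concepts, anchor_index, size=50):
    """Return up to `size` concepts before and `size` concepts after
    the anchor position (picture or query keyword), excluding the
    anchor itself. Windows are clipped at the ends of the page."""
    start = max(0, anchor_index - size)
    end = anchor_index + size + 1
    return concepts[start:anchor_index] + concepts[anchor_index + 1:end]


page = list(range(10))  # placeholder concepts 0..9
middle = context_window(page, 5, size=2)
at_start = context_window(page, 0, size=2)
at_end = context_window(page, 9, size=3)
```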

The online system comprises a three-layer clustering module made up of three sub-modules: metadata clustering, text context clustering, and clustering on the concept-expanded context, wherein:

First-layer clustering performs agglomerative hierarchical clustering on the concept vectors of the conceptualized metadata, yielding clusters with high intra-class precision, and merges the concept vectors of all pictures in each class into the concept vector of that class.

Here, the agglomerative hierarchical clustering algorithm uses the conceptualization of a class to compute similarity between classes. The conceptualization of a class is obtained by adding up the concept vectors of the pictures in the class and removing the concepts with low values, which gives a high-precision class concept vector. The conceptualization of a class is defined by the following formula:

V(C)\{c\} = \sum_{d \in C} \mathrm{CF\text{-}IDF}(c, d)

Wherein, c is concept, and C is class, and d is picture in class, and CF-IDF (c, d) is for concept is to the importance of picture.

Second-layer clustering adds the concept vector of the conceptualized context to the concept vector of each picture, updates the concept vectors of all classes obtained from first-layer clustering, and then performs agglomerative hierarchical clustering on those classes.

Third-layer clustering replaces the vector of each picture with its expanded concept vector, updates the concept vectors of all classes obtained from second-layer clustering, and then performs agglomerative hierarchical clustering on those concept vectors.

Here, vector expansion uses the Wikipedia description pages of concepts: related concepts are added to the concept vectors of the pictures, and the concept vector of each class is updated. The update is defined by the following formula:

V'(C)\{c\} = \sum_{c_i \in V_C} \left( V(C)\{c_i\} \times \mathrm{CF\text{-}IDF}(c, d_{c_i}) \right)

Here, CF-IDF(c, d_{c_i}) is the importance of concept c to the Wikipedia description page of concept c_i, and c_i is a concept in the current class concept vector. The context expansion process filters out noisy data by keeping only the k concepts with the largest values.
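The update formula and the top-k filter can be sketched as follows. The sketch assumes the Wikipedia description pages have already been conceptualized offline into a lookup table `wiki_pages`, mapping each concept c_i to the CF-IDF weights of the concepts on its page; that table, the function name `expand_class_vector`, and the default k are all hypothetical.

```python
def expand_class_vector(class_vec, wiki_pages, k=10):
    """class_vec: {c_i: V(C){c_i}} for the current class.
    wiki_pages: {c_i: {c: CF-IDF(c, d_{c_i})}} for description pages.
    Returns the expanded vector, keeping only the k largest values."""
    expanded = {}
    for ci, weight in class_vec.items():
        for c, importance in wiki_pages.get(ci, {}).items():
            expanded[c] = expanded.get(c, 0.0) + weight * importance
    # Noise filter: keep the k concepts with the largest values.
    top = sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)


expanded = expand_class_vector(
    {"Mr. Bean": 2.0, "Comedy": 1.0},
    {
        "Mr. Bean": {"Rowan Atkinson": 0.5, "Mr. Bean": 1.0},
        "Comedy": {"Rowan Atkinson": 0.5, "Humour": 0.25},
    },
    k=2,
)
```

Read literally, the formula replaces the class vector, so an original concept survives only if some description page gives it weight; whether the original weights should also be carried over is not settled by the text.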

The class concept vectors obtained after three-layer clustering are used to label each picture class with relevant descriptive concepts: the several concepts with the highest values in each class's concept vector are chosen to describe the entity that the class represents.
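Choosing the labels is a simple ranking step. In the sketch below, the function name `describe_class`, the number of labels `n`, and the alphabetical tie-break are assumptions; the weights in the example are made up for illustration.

```python
def describe_class(class_vec, n=3):
    """Return the n concepts with the highest weights in the class
    concept vector; ties are broken alphabetically (assumption)."""
    ranked = sorted(class_vec.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _ in ranked[:n]]


labels = describe_class(
    {"Mr. Bean": 3.0, "Rowan Atkinson": 2.0, "Comedy": 1.5, "Blackadder": 0.4},
    n=3,
)
```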

The technical problems solved by the present invention include:

1. Extracting image context information and representing it as a vector in concept space, which provides features for computing image similarity.

2. Because some images have too little context information, the invention provides a mechanism for expanding context information: the concept vector of the context is expanded through Wikipedia or other knowledge bases.

3. Different features differ in how relevant they are to a picture, and features with higher relevance have higher confidence. To make effective use of features of differing relevance and improve clustering precision, the invention expands the concept vectors of the pictures layer by layer and clusters at each step.

The technical features of the present invention are explained below through a comparison between the invention and related prior art found by search.

Related search result 1:

Application number (patent): 2012101444570; title: Method and device for picture clustering

This patent document clusters pictures twice using visual features, comprising global features and local features; the second clustering splits the clusters produced by the first.

Comparison of technical points:

1. That patent clusters pictures by their content, that is, by visual features, whereas the present invention clusters using the contextual features of the pictures.

2. The second clustering of that patent cuts large classes into small ones, whereas the present invention aggregates small classes into large ones and uses each concept-vector expansion to screen features and filter out noisy data.

3. The concept vector representation adopted by the present invention can label each class with descriptive concepts, whereas clustering based on image content cannot provide conceptual descriptions.

Related search result 2:

Application number (patent): 2013106111554; title: Massive image retrieval system based on clustered compact features

This patent document clusters the images in an image library by their local features. At search time, picture clusters are first retrieved by the query keyword and the corresponding images are then returned.

Comparison of technical points:

1. That patent generates clustered compact features from the local features of pictures and clusters pictures on that basis, whereas the present invention clusters using the contextual features of the pictures.

2. That patent uses image clustering to speed up retrieval, whereas the present invention clusters and conceptualizes the search results so that the results of each category can be distinguished.

Related search result 3:

Application number (patent): 201210545637X; title: Balanced image clustering method based on hierarchical clustering

This patent document uses picture clustering to reduce the number of pictures that must be traversed during search. The picture clustering is based on high-dimensional image feature data.

Comparison of technical points:

1. That patent clusters pictures by their high-dimensional features, whereas the present invention clusters using the contextual features of the pictures.

2. That patent uses image clustering to reduce the pictures traversed during retrieval, and the clustering it adopts is hierarchical clustering; the present invention is based on three different contextual features and raises clustering precision through three-layer clustering.

Related search result 4:

Application number (patent): 201210163641X; title: Image clustering method

This patent obtains the time data and position data of images from the capture device and clusters using time, position, and speed data as features.

Comparison of technical points:

1. That patent mainly clusters captured photographs, whereas the present invention clusters web pictures. Captured photographs have no context information, while web pictures are not necessarily photographs and most of them have no capture time or position. The features of the two are different.

2. That patent clusters on the basis of event sequences, whereas the present invention is based on concept vectors, which can be used to generate descriptive concepts.

Related search result 5:

Application number (patent): 2009801523973; title: Arranging images into pages using content-based filtering and theme-based clustering

This patent clusters pictures captured by a device by theme on the basis of their content, that is, visual features, and maps the clustering results into corresponding photo albums.

Comparison of technical points:

1. That patent clusters using the visual features of pictures, whereas the present invention clusters using the context of web pictures.

2. That patent lays pictures out on different pages, whereas the present invention provides the user with categorized search results and corresponding descriptive concepts.

Related search result 6:

Application number (patent): 2010105171639; title: Image clustering method and system

This patent builds a directed graph of the images by parameter estimation and clusters the images by partitioning the directed graph. Partitioning yields multiple subgraphs, and the images of each subgraph are assigned to one class.

Comparison of technical points:

1. That patent clusters by means of a graph, representing the image library as a directed graph. The present invention aggregates pictures from small to large to form picture classes, and each clustering layer considers a different contextual feature of the images.

Related search result 7:

Application number (patent): 2005800393866; title: Image clustering method and system

This patent clusters images by event using time-point features, and its clustering algorithm performs clustering at different layers according to different time ranges.

Comparison of technical points:

1. The layers in that patent's multi-level clustering are different time ranges, whereas the layers of the present invention are defined by different features.

2. That patent clusters by event sequence, whereas the present invention distinguishes picture classes by entity.

Compared with the prior art, the present invention inventively uses three different features with a corresponding three-layer clustering algorithm to cluster pictures and provides concept labels for each class, so that picture search results are better organized by entity, each entity class has high precision, and different entities are clearly distinguished. The invention splits the overall framework into an online part and an offline part, which greatly reduces the time overhead of online clustering.

Brief description of the drawings

Other features, objects, and advantages of the present invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:

Fig. 1 shows the system framework of the present invention;

Fig. 2 shows an example of the three-layer clustering algorithm of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the drawings. The present embodiment is implemented on the basis of the technical solution of the invention and gives a detailed implementation and a concrete operating procedure, but the scope of protection of the invention is not limited to the following embodiment.

The task of this embodiment is, for the query keyword "bean" entered by a user, to obtain the picture search results from the search engine, cluster the different instances of "bean" in the results so as to distinguish the different entities, and provide each distinct "bean" with its own top-ranked concept labels.

As shown in Fig. 1, the metadata extraction module of the offline system extracts metadata from all original web pages related to "bean" in this embodiment. For example, given the URL:

“http://domain.com/53C316-C2oJ5/mr_bean.jpg”

The metadata extraction module splits the URL into words at the delimiters and uses the binary classifier to detect the valid terms, for example "mr bean". The conceptualization module of the offline system conceptualizes the metadata and the related web pages for "bean", producing metadata concept vectors and text concept vectors.

After the user's query keyword "bean" is received, the text context extraction module of the online system finds the positions of the picture and of the query keyword "bean" in the conceptualized text and extracts the 50 concepts before and after each position as the text context concept vector. Using the metadata concept vectors and text context concept vectors, the online system performs three-layer clustering.

As shown in Fig. 2, the three-layer clustering module of the online system first computes picture similarity from the metadata concept vectors and performs agglomerative hierarchical clustering (the concept vectors of picture 1 and picture 2 both contain the concept "Mr. Bean", while no valid metadata concept was found for picture 3 or picture 4). In agglomerative hierarchical clustering, the similarity between classes is computed from the concept vectors of the classes. The system computes the class concept vectors from the result of first-layer clustering; for example, picture 1 and picture 2 form one class, whose concept vector contains the concept "Mr. Bean".

Second-layer clustering clusters further on top of the first-layer result by expanding the concept vectors of the pictures. For example, in Fig. 2 the concept vector of the class formed by picture 1 and picture 2 gains the concept "Rowan Atkinson", the concept vector of picture 3 gains "Rowan Atkinson" and "Comedy", and picture 4 gains "Blackadder". Because the expanded vectors share more concepts, the second round of hierarchical clustering merges some similar classes into larger ones. In Fig. 2, pictures 1, 2, and 3 form a new class whose concept vector is expanded to "Mr. Bean", "Rowan Atkinson", and "Comedy".

Third-layer clustering first expands the vector of each class or picture using Wikipedia. For example, in Fig. 2 "Blackadder" is added to the concept vector of the class formed by pictures 1, 2, and 3, and "Rowan Atkinson" is added to picture 4. After the Wikipedia-based expansion, the class vectors are more similar to one another. Through the third round of hierarchical clustering, the online system merges classes that could not previously be merged for lack of information. For example, picture 4 in Fig. 2 can, thanks to the expanded vectors, be merged into the class containing pictures 1, 2, and 3.
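The walk-through of Fig. 2 can be reproduced with a toy sketch of the three layers. Everything numeric here is an assumption chosen for illustration: the unit weights, the `related` table standing in for Wikipedia description pages, the cosine similarity, the merge threshold of 0.4, and the additive form of the expansion (which follows the wording of this example rather than the replacement formula).

```python
import math


def add(u, v):
    """Element-wise sum of two sparse concept vectors."""
    out = dict(u)
    for c, w in v.items():
        out[c] = out.get(c, 0.0) + w
    return out


def cosine(u, v):
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def class_vec(members, pics):
    """Class concept vector: sum of the member pictures' vectors."""
    vec = {}
    for m in members:
        vec = add(vec, pics[m])
    return vec


def agglomerate(clusters, pics, threshold):
    """Repeatedly merge the most similar pair of classes until no
    pair reaches the threshold."""
    clusters = list(clusters)
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(class_vec(clusters[i], pics),
                           class_vec(clusters[j], pics))
                if s >= best:
                    best, pair = s, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        rest = [c for k, c in enumerate(clusters) if k not in pair]
        clusters = rest + [clusters[i] | clusters[j]]


def expand(vec, related):
    """Add related concepts, weighted by the source concept's weight."""
    out = dict(vec)
    for ci, w in vec.items():
        for c, imp in related.get(ci, {}).items():
            out[c] = out.get(c, 0.0) + w * imp
    return out


# Toy data mirroring pictures 1 to 4 of Fig. 2 (weights are made up).
metadata = {1: {"Mr. Bean": 1.0}, 2: {"Mr. Bean": 1.0}, 3: {}, 4: {}}
context = {1: {"Rowan Atkinson": 1.0}, 2: {"Rowan Atkinson": 1.0},
           3: {"Rowan Atkinson": 1.0, "Comedy": 1.0},
           4: {"Blackadder": 1.0}}
related = {"Rowan Atkinson": {"Blackadder": 0.5},
           "Blackadder": {"Rowan Atkinson": 0.5}}
T = 0.4  # merge threshold (assumption)

# Layer 1: metadata only.
layer1 = agglomerate([frozenset([p]) for p in metadata], metadata, T)
# Layer 2: metadata plus text context.
pics2 = {p: add(metadata[p], context[p]) for p in metadata}
layer2 = agglomerate(layer1, pics2, T)
# Layer 3: vectors expanded with related concepts.
pics3 = {p: expand(v, related) for p, v in pics2.items()}
layer3 = agglomerate(layer2, pics3, T)
```

Under these assumptions, layer 1 groups pictures 1 and 2, layer 2 adds picture 3, and layer 3 absorbs picture 4, matching the sequence described for Fig. 2. Each layer starts from the classes of the previous one, so later, noisier features can only merge classes and never split them.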

After the three-layer clustering algorithm finishes, the online system separates the different classes and presents all entities and their pictures to the user. Each entity is described by the top several most representative concepts (those with the largest values) in its concept vector. For example, the class in Fig. 2 can use concepts such as "Mr. Bean", "Rowan Atkinson", "Comedy", and "Blackadder" to describe pictures of the comedian known as Mr. Bean.

Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to these particular embodiments; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A system for entity clustering of web pictures returned by a search engine, characterized in that it comprises an offline system and an online system, wherein:

the offline system is configured to preprocess the source web pages in which all pictures are located, including extracting web page metadata and conceptualizing the original web page text and the metadata into a set of weighted concepts, that is, a concept vector, the conceptualized metadata and web page content being made available for the online system to query;

the online system is configured to receive a query, submit it to the search engine, and receive multiple pages of returned picture results; for each page of results, to find the conceptualized metadata and text of the source web pages and extract from the conceptualized text the context of the query keyword and the picture context; to perform three-layer clustering using, respectively, the metadata, the context, and the expanded context obtained by concept expansion of the context; and to automatically label each cluster with relevant descriptive concepts so that the entity of each cluster can be understood.

2. The system for entity clustering of web pictures returned by a search engine according to claim 1, characterized in that the offline system performs metadata extraction, including extraction of the valid terms in the URL and of the picture ALT attribute, wherein, to extract the valid terms of the URL, a binary classifier classifies terms as valid or invalid and the valid terms are returned.

3. The system for entity clustering of web pictures returned by a search engine according to claim 1, characterized in that the offline system comprises a conceptualization module for performing concept expansion on the context; text is converted by the conceptualization module into a set of weighted concepts, the weight of each concept being the importance of that concept to the picture, defined as follows:

\mathrm{CF\text{-}IDF}(c, d) = \mathrm{CF}(c, d) \times \log \frac{|D|}{\mathrm{DF}(c)}

wherein CF-IDF(c, d) is the importance of concept c to picture d and is the product of two parts: the frequency CF(c, d) with which the concept occurs in the picture's context, and the inverse context frequency, which decreases as DF(c), the number of contexts in which the concept occurs, increases; D is the set of the contexts of all pictures.

4. The system for entity clustering of web pictures returned by a search engine according to claim 1, characterized in that the online system comprises a text context extraction module which, for the query keyword entered, extracts its conceptualized query context and picture context.

5. The system for entity clustering of web pictures returned by a search engine according to claim 4, characterized in that the online system comprises a three-layer clustering module which, based on three kinds of extracted features (the metadata, the context, and the expanded context), performs clustering at three levels, starting from the metadata, which has the highest confidence, then the context, then the expanded context, wherein:

first-layer clustering performs agglomerative hierarchical clustering on the concept vectors of the conceptualized metadata, yielding clusters with high intra-class precision, and merges the concept vectors of all pictures in each class into the concept vector of that class;

second-layer clustering adds the concept vector of the conceptualized context to the concept vector of each picture, updates the concept vectors of all classes obtained from first-layer clustering, and then performs agglomerative hierarchical clustering on those classes;

6. The system for entity clustering of web pictures returned by a search engine according to claim 5, characterized in that the agglomerative hierarchical clustering algorithm used computes similarity between classes from the conceptualization of each class; the conceptualization of a class is obtained by adding up the concept vectors of the pictures in the class and removing the concepts with low values, which gives a high-precision class concept vector, and is defined by the following formula:

V(C)\{c\} = \sum_{d \in C} \mathrm{CF\text{-}IDF}(c, d)

7. The system for entity clustering of web pictures returned by a search engine according to claim 5, characterized in that third-layer clustering expands the context through Wikipedia, replaces the concept vectors of the pictures with the expanded concept vectors, and updates the concept vector of each class, the update being defined by the following formula:

V'(C)\{c\} = \sum_{c_i \in V_C} \left( V(C)\{c_i\} \times \mathrm{CF\text{-}IDF}(c, d_{c_i}) \right)

wherein CF-IDF(c, d_{c_i}) is the importance of concept c to the Wikipedia description page of concept c_i, V_C is the set of all concepts in the current class concept vector, and c_i is a concept in the current class concept vector; the context expansion process filters out noisy data by keeping only the k concepts with the largest values.

8. The system for entity clustering of web pictures returned by a search engine according to claim 1, characterized in that the class concept vectors obtained after the three-layer clustering are used to label each picture class with relevant descriptive concepts, the several concepts with the highest values in each class's concept vector being chosen to describe the entity that the class represents.