CN102081668A

CN102081668A - Information retrieval optimizing method based on domain ontology

Info

Publication number: CN102081668A
Application number: CN 201110025219
Authority: CN
Inventors: 熊晶; 王爱民; 徐建良; 王继鹏; 张长青; 郭涛; 梁燕军; 孙华
Original assignee: Individual
Current assignee: Individual
Priority date: 2011-01-24
Filing date: 2011-01-24
Publication date: 2011-06-01
Anticipated expiration: 2031-01-24
Also published as: CN102081668B

Abstract

The invention provides an information retrieval optimization method based on domain ontology, comprising the steps of: obtaining the query keywords submitted by the user through the search interface of a retrieval system; within the domain expected by the user, and according to an established domain ontology, performing semantic expansion of the query keywords by domain ontology inference to obtain one or more groups of new query strings; submitting the expanded query strings to one or more search engines for retrieval; deduplicating, re-ranking and integrating the results returned by the search engines; and displaying the final result to the user through the search interface. The invention uses the semantic advantages of the domain ontology to improve the efficiency of domain-related information retrieval.

Description

Information retrieval optimization method based on domain ontology

Technical field

The present invention relates to network technology, and specifically to an information retrieval method based on search engines.

Background technology

The main way people obtain information from the network is through search tools such as Google, Baidu and Yahoo. A search engine works in three basic stages. (1) Gathering information from the Internet: web spiders periodically crawl the websites and web pages on the Internet. (2) Organizing the information and building an index database: an indexing program analyzes the collected pages and extracts, for each page, the link of its website, the encoding type, the keywords contained in the page content, the keyword positions, the creation time, the size, the link relationships with other pages, and similar information. A relevance algorithm then computes the relevance (or importance) of each page to each keyword appearing in the page content and in hyperlinks, and this information is used to build the web page index database. (3) Searching and ranking in the index database: after the user enters search keywords in the search engine's interface, the search program finds all pages in the index database that match the keywords and ranks them by the precomputed relevance values; the higher the relevance, the higher the rank. Finally, the page generation system assembles the link addresses, page summaries and other content of the search results and returns them to the user.

Most current search engines are based on keyword matching and have little semantic reasoning ability. Although Google has adopted some natural language processing techniques, such as synonym expansion, it cannot resolve the semantic relations between concepts. This lowers precision to some extent, so the returned results are not the information the user wants. Moreover, user queries often depend heavily on a particular professional domain, such as the marine domain. For example, suppose the user wants to search for information on "DIP (dissolved inorganic phosphorus)" in the marine domain. As shown in Fig. 4, the query typically returns a large amount of "DIP" information from other domains, such as "Dual Inline Package", a packaging technology in microelectronics. Because this information is irrelevant to the user's purpose, the user is clearly dissatisfied with such a result.

An "ontology", defined as "an explicit formal specification of a shared conceptualization", is a model obtained by abstracting the relevant concepts of some phenomena in the objective world; the meaning expressed by the conceptual model is independent of any concrete environmental state. An ontology embodies commonly accepted knowledge and reflects the set of concepts generally recognized in the relevant domain. It therefore provides a common understanding and description of domain knowledge that can be shared, exchanged and reused. Because the concepts that make up an ontology and the relations between them are precisely defined, an ontology can eliminate polysemy, synonymy and ambiguity of word meaning, yielding a clear, definite and complete definition and description of domain knowledge. The goal of ontology research is a knowledge representation method that lets machines share and process information as humans do. At present, ontologies are widely used in fields such as knowledge representation and information retrieval.

Summary of the invention

To overcome the deficiency of existing search engines in semantic retrieval, the invention provides an information retrieval optimization method based on domain ontology.

The technical scheme of the invention is an information retrieval optimization method based on domain ontology, with the following steps:

(1) obtaining the query keyword submitted by the user through the search interface of the retrieval system;

(2) within the domain expected by the user, and according to an established domain ontology, performing semantic expansion of the user's query keyword by ontology inference to obtain one or more groups of new query strings;

(3) submitting the expanded query strings to one or more search engines for retrieval;

(4) deduplicating, ranking and integrating the results returned by each search engine;

(5) displaying the final result to the user through the search interface.
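
The five steps can be read as a small pipeline. The following is a minimal sketch rather than the patented implementation: `optimized_search` and its parameters `expand`, `engines` and `integrate` are hypothetical names, each stage is passed in as a plain callable, and step (5), display, is left to the caller.

```python
from typing import Callable, Iterable, List, Sequence


def optimized_search(
    keyword: str,
    expand: Callable[[str], List[str]],
    engines: Sequence[Callable[[str], Iterable[str]]],
    integrate: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    """Run steps (1)-(4); the caller displays the returned list (step 5)."""
    # Step (1): `keyword` is what the user typed into the search interface.
    # Step (2): semantic expansion through the domain ontology.
    queries = expand(keyword)
    # Step (3): submit every expanded query string to every search engine.
    hits = [hit for query in queries for engine in engines for hit in engine(query)]
    # Step (4): deduplicate, rank and integrate the combined hits.
    return integrate(queries, hits)
```

Passing the stages as callables keeps the sketch independent of any particular ontology library or search engine interface.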

The semantic expansion based on domain ontology in step (2) above uses one, two or all of the following modes:

1. Optimization method based on the is-a relation

The is-a relation (inheritance) expresses the classification of concepts: the instances of a parent concept equal the union of the instances of its child concepts. Because a child concept adds constraints to its parent, it is also called a specialization of the parent concept. A concept has a relatively high probability of appearing in the same document as its direct parent or child concept. Therefore, when searching for documents about a concept A, the parent concept P or a child concept C of A can be used as a constraint to improve precision. A concept can thus be optimized into a query pair consisting of the concept itself and its parent or child concept.

2. Optimization method based on the part-of relation

The part-of relation expresses the whole-part relationship and describes the relation between a concept and its part concepts. The parts of a concept are also closely related to the domain to which the concept belongs. Documents that match a part concept are therefore usually also related to its whole concept. A concept can thus be optimized into a query pair consisting of the concept itself and one of its part concepts.

3. Optimization method based on the equivalent-class relation

The equivalent-class relation handles synonymy in domain knowledge. Using it, a concept in the user's query can be mapped to its equivalent synonyms, which improves the precision of information retrieval. The equivalent-class relation also commonly serves as an auxiliary method for the first two optimization methods.

The concepts within a query pair are joined by a logical "AND" or "OR": "AND" improves query precision, while "OR" improves recall.
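
A minimal sketch of this expansion, under stated assumptions: the ontology is modeled as a plain dictionary from a concept name to its related concepts per relation, and `expand_query` is a hypothetical helper, not part of any ontology library. The sample entries for "DIP" are taken from the marine ecology embodiment.

```python
from typing import Dict, List, Sequence

# Assumed toy representation: concept -> relation name -> related concepts.
Ontology = Dict[str, Dict[str, List[str]]]


def expand_query(
    concept: str,
    ontology: Ontology,
    relations: Sequence[str] = ("is-a", "part-of", "equivalent-class"),
    operator: str = "AND",
) -> List[str]:
    """Build one query pair per related concept, joined by AND or OR."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    related = ontology.get(concept, {})
    queries = []
    for relation in relations:
        for other in related.get(relation, []):
            queries.append(f"{concept} {operator} {other}")
    return queries


marine = {
    "DIP": {
        "is-a": ["InorganicNutrient"],
        "part-of": ["Phytoplankton", "Seawater"],
    }
}
print(expand_query("DIP", marine))
```

Choosing "AND" narrows each query to documents that mention both concepts, while "OR" widens it, matching the precision and recall trade-off described above.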

In step (4) above, the results returned by each search engine are deduplicated, ranked and integrated; the following algorithm can be used:

(1) Process the URLs of the search results: take the URL string before "#" as the final link address. If MD5(URL_A) = MD5(URL_B), the pages corresponding to URL_A and URL_B are regarded as duplicates, and one of them is removed.
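
The deduplication step can be sketched as follows. This is an illustrative sketch: `strip_fragment` and `deduplicate` are hypothetical names, and keeping the first occurrence of each duplicate is an assumption, since the text only says that duplicates are removed.

```python
import hashlib
from typing import Iterable, List


def strip_fragment(url: str) -> str:
    """Keep only the part of the URL before the first '#'."""
    return url.split("#", 1)[0]


def deduplicate(urls: Iterable[str]) -> List[str]:
    """Drop URLs whose fragment-free form has an MD5 digest already seen."""
    seen = set()
    unique = []
    for url in urls:
        final = strip_fragment(url)
        digest = hashlib.md5(final.encode("utf-8")).hexdigest()
        if digest not in seen:  # assumption: the first occurrence is kept
            seen.add(digest)
            unique.append(final)
    return unique
```

Comparing digests rather than raw strings follows the text; the MD5 value serves only as a fixed-length key for the comparison.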

(2) The ranking algorithm considers two aspects:

1. The semantic distance Dist(C_i, C_j) between the concepts in the query string, where C_i and C_j are two concepts in the query string:

Dist(C_i, C_j) = \sum_{k=1}^{n} \omega_{e_k} + \frac{N_{C_i} + N_{C_j}}{N_{C_i} + N_{C_j} + 2 \times N_{LCA}} \times \varepsilon

Formula 1

In Formula 1, \sum_{k=1}^{n} \omega_{e_k} is the sum of the weighted distances of the edges on the shortest path linking nodes C_i and C_j in the ontology tree; N_{C_i} and N_{C_j} are the weighted distances from nodes C_i and C_j, respectively, to their lowest common ancestor node; N_{LCA} is the weighted distance from the lowest common ancestor to the root node; ε is a constant determined according to the weighting coefficients.

The semantic weights of the different relations between concepts are given in Table 1.

Table 1. Semantic distance weights

In Table 1, the null-operation symbol denotes a null operation, and its combination with a row or column denotes a single operation; E denotes the equivalent-class relation; G denotes the is-a relation directed from child concept to parent concept; S denotes the is-a relation directed from parent concept to child concept; P denotes the part-of relation.

The semantic similarity of two concepts decreases as their semantic distance increases, and the similarity is 1 when the distance is 0. The similarity between C_i and C_j can therefore be simplified to:

Sim(C_i, C_j) = \frac{1}{Dist(C_i, C_j) + 1}

Formula 2
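
Formulas 1 and 2 can be sketched in a few lines. The function names are hypothetical, the edge weights are supplied by the caller rather than read from Table 1, and returning a zero depth term when the denominator is zero is an assumption for the degenerate case that the text does not cover.

```python
from typing import Iterable


def semantic_distance(
    edge_weights: Iterable[float],
    n_ci: float,
    n_cj: float,
    n_lca: float,
    epsilon: float = 1.0,
) -> float:
    """Formula 1: weighted path length plus a depth-dependent term."""
    path_cost = sum(edge_weights)  # sum of the edge weights on the shortest path
    denominator = n_ci + n_cj + 2 * n_lca
    # Assumption: treat the depth term as 0 when all distances are 0.
    depth_term = (n_ci + n_cj) / denominator if denominator else 0.0
    return path_cost + depth_term * epsilon


def semantic_similarity(distance: float) -> float:
    """Formula 2: similarity is 1 at distance 0 and falls as distance grows."""
    return 1.0 / (distance + 1.0)
```

For example, a single edge of weight 1.0 with N_{C_i} = 1, N_{C_j} = 0, N_{LCA} = 2 and ε = 1 gives a distance of 1 + 1/5 = 1.2. These input values are illustrative and are not taken from Table 1.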

2. The relevance Rank(Query, Abstract) between the query string and the search result record:

Rank(Query, Abstract) = \sum_{i=1}^{n} Rank(C_i, Abstract)

Formula 3

In Formula 3, Rank(C_i, Abstract) is the relevance between each concept in the query string Query and the search result abstract Abstract, and n is the number of concepts in Query.

Rank(C_i, Abstract) = m \times \sum_{j=1}^{m} \ln \frac{len(Abstract)}{Index(C_i, j, Abstract)}

Formula 4

In Formula 4, m = Time(C_i, Abstract) is the number of times concept C_i occurs in Abstract; len(Abstract) is the length of Abstract; Index(C_i, j, Abstract) is the position of the j-th occurrence of C_i in Abstract.
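
A minimal sketch of Formulas 3 and 4 follows. The text does not say how length and position are measured, so the sketch assumes character counts, 1-based positions (which keeps the logarithm's argument finite) and case-insensitive, non-overlapping matching; all function names are hypothetical.

```python
import math
from typing import Iterable, List


def occurrences(concept: str, abstract: str) -> List[int]:
    """1-based character positions of each non-overlapping occurrence."""
    text, needle = abstract.lower(), concept.lower()
    positions = []
    if not needle:
        return positions
    start = text.find(needle)
    while start != -1:
        positions.append(start + 1)  # assumption: positions count from 1
        start = text.find(needle, start + len(needle))
    return positions


def concept_rank(concept: str, abstract: str) -> float:
    """Formula 4: m times the sum of ln(len(Abstract) / Index)."""
    positions = occurrences(concept, abstract)
    m = len(positions)
    return m * sum(math.log(len(abstract) / p) for p in positions)


def query_rank(concepts: Iterable[str], abstract: str) -> float:
    """Formula 3: sum of the per-concept relevance values."""
    return sum(concept_rank(c, abstract) for c in concepts)
```

Under these assumptions a concept scores higher when it occurs more often and closer to the start of the abstract, and a concept that does not occur contributes 0.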

3. For the original query keyword K_i and the expanded query string Query, obtain the semantic similarity Sim(K_i, C_j) between K_i and each concept C_j in Query. The matching degree R of a search result can then be calculated:

R = \alpha \, Sim(K_i, C_j) + \beta \, Rank(Query, Abstract)

Formula 5

In Formula 5, α and β are constants that weight the semantic relevance of the expanded keyword and the abstract relevance, respectively, where α ∈ (0,1), β ∈ (0,1), and α + β = 1.

4. Sort the search results in descending order of R.
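
Formula 5 and the final sort can be sketched as follows. The function names and the tuple layout `(url, sim, rank)` are hypothetical; the default weights 0.6 and 0.4 are the values used in the embodiment, and the similarity and relevance values are assumed to have been computed already.

```python
from typing import List, Tuple

Result = Tuple[str, float, float]  # assumed layout: (url, sim, rank)


def matching_degree(sim: float, rank: float, alpha: float = 0.6, beta: float = 0.4) -> float:
    """Formula 5: R = alpha * Sim + beta * Rank, with alpha + beta = 1."""
    if not (0 < alpha < 1 and 0 < beta < 1) or abs(alpha + beta - 1) > 1e-9:
        raise ValueError("alpha and beta must lie in (0, 1) and sum to 1")
    return alpha * sim + beta * rank


def sort_results(results: List[Result], alpha: float = 0.6, beta: float = 0.4) -> List[Result]:
    """Order results by descending matching degree R."""
    return sorted(
        results,
        key=lambda r: matching_degree(r[1], r[2], alpha, beta),
        reverse=True,
    )
```

Because `sorted` is stable, results with equal R keep their incoming order; how ties are broken is not specified in the text.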

The invention uses the semantic advantages of ontologies to improve the recall and precision of domain-related information retrieval. With this method, the user's query keyword is semantically expanded using the domain ontology to obtain one or more groups of new query strings, which are submitted to Web search engines; the search results are then ranked and integrated and finally displayed to the user. Because the new query strings take into account the relations between domain concepts, such as hypernyms, hyponyms and synonyms, the recall of retrieval improves. At the same time, because the ontology is domain-specific, the results are confined to the scope of that domain and a large amount of domain-irrelevant information is filtered out, which improves precision.

Description of drawings

Fig. 1 is a fragment of the marine ecology domain ontology;

Fig. 2 is the workflow diagram of OASIS, the optimized information retrieval system based on domain ontology according to the invention;

Fig. 3 is the search interface of OASIS according to the invention;

Fig. 4 is a screenshot of the first page of results obtained by searching "DIP" in Google;

Fig. 5 shows the abstract relevance calculated using "InorganicNutrient+DIP" as an example;

Fig. 6 is a screenshot of the search results obtained by searching "DIP" in OASIS according to the invention.

Embodiment

The invention is described in further detail below through a specific embodiment in the marine ecology domain.

The invention proposes an information retrieval optimization method based on domain ontology. Taking the marine ecology domain as an example and referring to the drawings, it is described in detail as follows.

The workflow of the key steps of the invention is shown in Fig. 2. Taking the marine ecology domain as an example, when the query "DIP" is submitted, the concrete implementation steps are:

1. The server builds a marine ecology ontology (Ontology) and stores it in the ocean.ont format; a fragment of the ontology is shown in Fig. 1;

2. At the user side, the query keyword "DIP" is submitted through the search interface (Portal) shown in Fig. 3;

3. The server obtains the query keyword submitted by the user and uses HozoAPI to perform semantic reasoning on the ocean.ont ontology to optimize the query (Query Optimizer). For the concept "DIP", the related concepts obtained are InorganicNutrient, via the is-a relation, and Phytoplankton and Seawater, via the part-of relation. From these concepts and their relations, three groups of new query strings are obtained: "InorganicNutrient+DIP", "DIP+Phytoplankton" and "DIP+Seawater";

4. The three query strings are sent separately to the Web search engine (Web Search Engine), and three result sets are retrieved from the World Wide Web. The first 30 records of each are taken, giving the result sets Result_1, Result_2 and Result_3;

5. The server merges Result_1, Result_2 and Result_3, removes duplicates, and re-ranks the remainder to obtain the final result set Result. The main algorithm is as follows:

(1) Process the URLs of the search results: take the URL string before "#" as the final link address. If MD5(URL_A) = MD5(URL_B), the pages corresponding to URL_A and URL_B are regarded as duplicates.

(2) The ranking algorithm considers two aspects:

1. The semantic distance Dist(C_i, C_j) between the concepts in the query string, where C_i and C_j are two concepts in the query string.

Using Formula 1:

Dist(C_i, C_j) = \sum_{k=1}^{n} \omega_{e_k} + \frac{N_{C_i} + N_{C_j}}{N_{C_i} + N_{C_j} + 2 \times N_{LCA}} \times \varepsilon

calculate the semantic distance between C_i and C_j, and using Formula 2:

Sim(C_i, C_j) = \frac{1}{Dist(C_i, C_j) + 1}

calculate the semantic similarity between C_i and C_j.

2. Using Formula 3:

Rank(Query, Abstract) = \sum_{i=1}^{n} Rank(C_i, Abstract)

calculate the relevance between the query string and the search result record.

Then, using Formula 5:

R = \alpha \, Sim(K_i, C_j) + \beta \, Rank(Query, Abstract)

calculate the matching degree, and sort the search results in descending order of R.

Take the query string "InorganicNutrient+DIP" as an example. The two concepts are denoted C_{IN} and C_{DIP}.

From Fig. 1 together with Table 1, the quantities in Formula 1 are obtained, with N_{LCA} = 2; take ε = 1. Dist(C_{IN}, C_{DIP}) is then calculated by Formula 1.

Formula 2 then gives

Sim(C_{IN}, C_{DIP}) = \frac{1}{Dist(C_{IN}, C_{DIP}) + 1} = 0.27

The parameters for calculating Rank(Query, Abstract) are shown in Fig. 5.

Using Formula 5 with α = 0.6 and β = 0.4:

R_{URL_1} = 0.6 \times 0.27 + 0.4 \times 4.192 = 1.839

R_{URL_2} = 0.6 \times 0.27 + 0.4 \times 1.253 = 0.663

Therefore URL_1 is ranked ahead of URL_2.
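
The arithmetic of this example can be checked directly. The short sketch below plugs the stated values (Sim = 0.27, Rank values 4.192 and 1.253, α = 0.6, β = 0.4) into Formula 5; the variable names are illustrative only.

```python
alpha, beta = 0.6, 0.4
sim = 0.27  # Sim(C_IN, C_DIP) from Formula 2
ranks = {"URL_1": 4.192, "URL_2": 1.253}  # Rank(Query, Abstract) per result

# Formula 5 for each result, then sort by descending matching degree.
scores = {url: alpha * sim + beta * rank for url, rank in ranks.items()}
ordering = sorted(scores, key=scores.get, reverse=True)

for url in ordering:
    print(url, round(scores[url], 3))
```

Both results share the same similarity term, so the ordering here is decided entirely by the abstract relevance.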

6. Result is displayed to the user through the search interface, as shown in Fig. 6.

The process above is a specialized retrieval optimization method that defaults to OASIS, the retrieval system dedicated to the marine ecology domain, and the interface of Fig. 3. Such a specialized retrieval system can also be used for other domains, provided the ontology of the relevant domain is used. For a general-purpose search engine, a domain keyword field can be added to the search interface, and the domain the user wishes to search is determined from the domain keyword entered. For users unfamiliar with how domains are divided, relevant domains can be preset on the search interface for the user to select when searching, so as to determine the domain ontology and perform semantic expansion in that domain. If no domain is selected and no domain keyword is entered, all domain ontologies are used when determining the domain ontology.
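
The domain selection rule can be sketched as a lookup with a fallback. `select_ontologies` and the registry layout are hypothetical, and falling back to all ontologies for an unrecognized domain keyword is an assumption; the text only specifies the fallback when no domain is selected or entered.

```python
from typing import Dict, List, Optional


def select_ontologies(registry: Dict[str, str], domain: Optional[str] = None) -> List[str]:
    """Return the ontology for the chosen domain, or all ontologies if none is chosen."""
    if domain and domain in registry:
        return [registry[domain]]
    # No domain given (or, by assumption, an unknown one): use every ontology.
    return list(registry.values())
```

With a registry such as `{"marine ecology": "ocean.ont"}`, a user who selects "marine ecology" gets only that ontology, while a user who selects nothing gets every registered ontology.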

Claims

1. An information retrieval optimization method based on domain ontology, with the following steps:

(1) obtaining the query keyword submitted by the user through the search interface of the retrieval system;

(2) within the domain expected by the user, and according to an established domain ontology, performing semantic expansion of the user's query keyword by domain ontology inference to obtain one or more groups of new query strings;

(3) submitting the expanded query strings to one or more search engines for retrieval;

(4) deduplicating, ranking and integrating the results returned by each search engine;

(5) displaying the final result to the user through the search interface.

2. the method for claim 1 is characterized in that describedly carrying out semantic extension by ontology inference, be adopt in the following method one or both or all:

1. based on is-a optimized relation method

Father's notion P or the sub-notion C of the notion A that obtains based on described key word of the inquiry, the inquiry that is optimized to notion A itself and its father's notion P is right, or the inquiry of notion A itself and its sub-notion C is right;

2. based on the optimization method of part-of relation

The inquiry that will be optimized to this notion itself and its part notion formation based on the notion that key word of the inquiry obtains is right;

3. the optimization that concerns based on equivalent-class

It is right to be optimized to the inquiry that this notion and the synonym of equal value with it constitute based on the notion that key word of the inquiry obtains.

3. The method of claim 2, characterized in that the concepts within a query pair are joined by a logical "AND" or "OR".

4. The method of any one of claims 1 to 3, characterized in that the deduplication processes the URLs of the search results by taking the URL string before "#" as the final link address; for URL_A and URL_B, if MD5(URL_A) = MD5(URL_B), the pages corresponding to URL_A and URL_B are regarded as duplicates, and one of the link addresses is removed.

5. The method of claim 4, characterized in that the ranking combines the semantic similarity of concepts with an abstract-based ranking algorithm to rank the deduplicated results.

6. The method of claim 5, characterized in that the ranking method comprises:

1. Calculating, by Formula 1, the semantic distance Dist(C_i, C_j) between the concepts in the query string:

Dist(C_i, C_j) = \sum_{k=1}^{n} \omega_{e_k} + \frac{N_{C_i} + N_{C_j}}{N_{C_i} + N_{C_j} + 2 \times N_{LCA}} \times \varepsilon

Formula 1

where C_i and C_j are two concepts in the query string; \sum_{k=1}^{n} \omega_{e_k} is the sum of the weighted distances of the edges on the shortest path linking nodes C_i and C_j in the ontology tree; N_{C_i} and N_{C_j} are the weighted distances from nodes C_i and C_j, respectively, to their lowest common ancestor node; N_{LCA} is the weighted distance from the lowest common ancestor to the root node; ε is a constant determined according to the weighting coefficients.

When the semantic distance is 0, the semantic similarity is 1, and the similarity between C_i and C_j is simplified to Formula 2:

Sim(C_i, C_j) = \frac{1}{Dist(C_i, C_j) + 1}

Formula 2

2. Determining, by Formula 3, the relevance Rank(Query, Abstract) between the query string and the search result record:

Rank(Query, Abstract) = \sum_{i=1}^{n} Rank(C_i, Abstract)

Formula 3

In Formula 3, Rank(C_i, Abstract) is the relevance between each concept in the query string Query and the search result abstract Abstract, and n is the number of concepts in Query.

Rank(C_i, Abstract) = m \times \sum_{j=1}^{m} \ln \frac{len(Abstract)}{Index(C_i, j, Abstract)}

Formula 4

In Formula 4, m = Time(C_i, Abstract) is the number of times concept C_i occurs in Abstract; len(Abstract) is the length of Abstract; Index(C_i, j, Abstract) is the position of the j-th occurrence of C_i in Abstract.

3. For the original query keyword K_i and the expanded query string Query, obtaining the semantic similarity Sim(K_i, C_j) between K_i and each concept C_j in Query, and calculating the matching degree R of the search result by Formula 5:

R = \alpha \, Sim(K_i, C_j) + \beta \, Rank(Query, Abstract)

Formula 5

In Formula 5, α and β are constants that weight the semantic relevance of the expanded keyword and the abstract relevance, respectively, where α ∈ (0,1), β ∈ (0,1), and α + β = 1.

7. The method of any one of claims 1 to 3, characterized in that the search interface is a dedicated interface for a particular domain.

8. The method of any one of claims 1 to 3, characterized in that the search interface has a domain option or a domain keyword input area, and in step (2) the corresponding domain ontology is loaded for semantic expansion according to the domain option selected or the domain keyword entered by the user.