CN104090966A

CN104090966A - Semi-structured data retrieval method based on graph model

Info

Publication number: CN104090966A
Application number: CN201410338837.7A
Authority: CN
Inventors: 康积华; 张奇; 黄萱菁
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2014-07-16
Filing date: 2014-07-16
Publication date: 2014-10-08

Abstract

The invention belongs to the technical field of information retrieval and in particular relates to a semi-structured data retrieval method based on a graph model. The method comprises three main parts: dynamic weight setting for the terms produced by word segmentation, attribute matching probability, and string similarity calculation. It is built on Indri, a retrieval framework based on language models with Dirichlet smoothing, which performs well on complex queries and is readily extensible. As navigation systems and Location Based Services (LBS) become increasingly widespread, the method, by taking the user's search intent into account, can improve map information retrieval performance and provide users with an accurate and efficient experience. The scheme is fully disclosed: from the description of the method, combined with existing technologies and resources in the field, a person skilled in the art can carry out the technical scheme and achieve its technical effect.

Description

Semi-structured data search method based on graph model

Technical field

The invention belongs to the technical field of information retrieval and specifically relates to a semi-structured data retrieval model.

Background technology

Traditional work on semi-structured data retrieval falls broadly into two directions: keyword retrieval and user intent analysis. On the one hand, keywords can be mined from the user's free-text input only with limited accuracy, and a query containing many keywords can obscure the core keywords. On the other hand, a large body of research applies unsupervised, semi-supervised, or supervised methods to analyze the semantics of user queries. Because the accuracy of semantic analysis is itself limited, using its output directly as the query can substantially degrade retrieval effectiveness. In summary, with the development of the Internet, the performance of semi-structured data retrieval needs to be improved in order to improve the user experience.

Summary of the invention

To overcome the deficiencies of the prior art, the object of the present invention is to provide a retrieval method based on a graph model. By taking the user's query intent into account, the method can improve map information retrieval performance and give users a more accurate and efficient experience.

The invention provides a retrieval method for semi-structured map information data. Retrieval is based on three factors: term weight setting, attribute matching probability, and string similarity. The concrete steps are as follows:

(1) Dynamic weight assignment for different terms

The terms produced by word segmentation are assigned weights dynamically by a weighted linear combination of term features, yielding a weight for each term. The features are Google N-gram word frequency statistics, user query log information, database information, and named entity information. The more often a term occurs in the Google N-gram records, in the user query log, or in the database, or if the term is a named entity, the higher its weight. For example, for the query "Fudan University in Shanghai" (上海复旦大学), the weight of the term "Shanghai" (上海) should be far higher than that of the spurious fragment "海复", which straddles the boundary between "Shanghai" and "Fudan";

(2) Dynamic attribute weight matching based on attribute matching probability

The probability with which each term occurs in the different attributes of the semi-structured data is collected, and a naive-Bayes-based method assigns each term a weight for every attribute. The term's own weight is then multiplied by its attribute weights to give the term's weight for each attribute, and retrieval is performed with Indri, a retrieval framework based on language models;

(3) Global factor intervention based on string similarity matching

After Indri returns the initial results, the string edit distance between the original user query and the value of each attribute in the database is computed, the initial ranking is reordered using this edit distance information, and the final ranking is returned as the result.
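Step (2) multiplies each term's weight by its attribute probabilities before querying Indri. As a rough sketch of how such weights might be passed to the engine, the function below builds a query string using Indri's `#weight` operator and its `term.(field)` field-context syntax. The function name, the example field names, and the choice of a single flat `#weight` clause are assumptions made for illustration; the patent does not show the query it actually issues.

```python
def build_indri_query(term_weights, attr_probs):
    """Build one flat #weight clause with a weight per (term, attribute) pair.

    term_weights: term -> omega(t), the term's own weight from step (1)
    attr_probs:   term -> {attribute: P(att_i | t)} from step (2)
    """
    parts = []
    for term, omega in term_weights.items():
        for att, prob in attr_probs[term].items():
            # Per-attribute weight = term weight * attribute probability.
            parts.append(f"{omega * prob:.4f} {term}.({att})")
    return "#weight( " + " ".join(parts) + " )"


# Made-up weights and field names, for illustration only.
query = build_indri_query(
    {"shanghai": 0.5},
    {"shanghai": {"city": 0.8, "name": 0.2}},
)
print(query)
```

Keeping the product of the two weights in one flat clause mirrors the description in step (2); a nested encoding would be an equally plausible design.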

The beneficial effect of the present invention is that the method completes retrieval tasks efficiently and improves retrieval performance.

Brief description of the drawings

Fig. 1 is the basic flowchart of the method of the invention.

Fig. 2 is the factor graph of the graph model.

Embodiment

In the present invention, the basic flow of graph-model-based retrieval of semi-structured map information data is shown in Fig. 1. The query is first segmented into terms; weights are then set and attributes matched for the segmented terms to obtain preliminary query results; the preliminary results are then reordered according to string similarity; and the final results are returned.

The graph model combines three factors, namely term weight setting, attribute matching probability, and string similarity, to produce the final retrieval result, as shown in Fig. 2. Given a query q, word segmentation produces n terms {t1, t2, ..., tn}; semi-structured retrieval then finds the information in the database most relevant to the query. A semi-structured database contains different attributes. A job posting database, for example, has attributes such as position, recruiting company, company industry, number of openings, age requirement, employment form (full-time or part-time), salary, and work location. Here {att1, att2, ..., attm} denotes the attributes of the semi-structured data set.

1. Term weight setting

A set of weight features is used to assign a weight to each term. For "Fudan University in Shanghai" (上海复旦大学), for example, the weight of the cross-boundary fragment "海复" should clearly be lower than that of "Fudan University". A linear weighting of the weight features is therefore used to generate a weight ω(t) for each term, by the following formula:

where t denotes a term, Φ denotes the set of weight features, σ denotes the weight of the feature itself, and λ_φ denotes the weight of feature φ.

Table 1 shows the feature set used to assign weights to terms. All of these features are cheap to compute, even for very large data sets:

Feature	Description
GF	Frequency of term t in the Google n-grams
QF	Frequency of term t in the user query log
CF	Frequency of term t in the most important attribute of the semi-structured data set
EF	1 if term t is an entity name, otherwise 0
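Formula (1) itself is not reproduced in this text, so the following is only a minimal sketch of a weighted linear combination over the four features in the table. The weights in `LAMBDA`, the feature values, and the assumption that every feature is already normalized to [0, 1] are illustrative and do not come from the patent.

```python
# Hypothetical feature weights (lambda_phi); the patent gives no values.
LAMBDA = {"GF": 0.3, "QF": 0.3, "CF": 0.2, "EF": 0.2}


def term_weight(features, lambdas=LAMBDA):
    """Weighted linear combination of a term's feature values.

    features: feature name -> value, assumed normalized to [0, 1].
    Missing features count as 0.
    """
    return sum(lambdas[name] * features.get(name, 0.0) for name in lambdas)


# Made-up feature values for a real word and a cross-boundary fragment.
shanghai = {"GF": 0.9, "QF": 0.8, "CF": 0.7, "EF": 1.0}
fragment = {"GF": 0.1, "QF": 0.0, "CF": 0.1, "EF": 0.0}

assert term_weight(shanghai) > term_weight(fragment)
```

Under these assumed numbers the real word receives a much higher weight than the fragment, which is the behavior the description asks for.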

2. Attribute matching probability

A user query contains content belonging to different attributes, but its structure is not shown explicitly. For example, the query may be ".NET Shanghai internship", where ".NET" belongs to the position, "Shanghai" to the work location, and "internship" to the job type. Since searching for ".NET" in the work location attribute is pointless, the first task is to match the query content to the data attributes. Following PRMS (the Probabilistic Retrieval Model for Semistructured data), the probability that a term belongs to a given attribute is computed. By Bayes' theorem, this probability can be inferred from the prior probability of the attribute and the probability that term t occurs in that attribute, as in the following formula:

P(att_i \mid t) = \frac{P(t \mid att_i)\, P(att_i)}{P(t)} = \frac{P(t \mid att_i)\, P(att_i)}{\sum_{k=1}^{m} P(t \mid att_k)\, P(att_k)} \qquad (2)

where P(t | att_i) can be computed and stored while the index is built, and P(att_i) is the prior probability that a term belongs to att_i, inferred from analysis of the user query log.
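Formula (2) can be sketched in a few lines. The attribute names and all probability values below are made up for illustration, and the fallback to the priors when a term is unseen in every attribute is an assumption, not something the patent specifies.

```python
def attribute_posteriors(likelihoods, priors):
    """Return P(att_i | t) for every attribute, following formula (2).

    likelihoods: attribute -> P(t | att_i), e.g. estimated at indexing time
    priors:      attribute -> P(att_i), e.g. estimated from query logs
    """
    joint = {att: likelihoods.get(att, 0.0) * priors[att] for att in priors}
    evidence = sum(joint.values())  # denominator: sum_k P(t|att_k) P(att_k)
    if evidence == 0.0:
        # Term unseen in every attribute: fall back to the priors (assumption).
        return dict(priors)
    return {att: value / evidence for att, value in joint.items()}


# Illustrative numbers for the term "Shanghai" in a job posting database.
likelihoods = {"position": 0.001, "work_location": 0.05, "job_type": 0.0005}
priors = {"position": 0.5, "work_location": 0.3, "job_type": 0.2}
post = attribute_posteriors(likelihoods, priors)
best = max(post, key=post.get)
```

With these numbers almost all of the probability mass falls on `work_location`, so a query term such as "Shanghai" would be matched mainly against that attribute.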

3. String similarity

Analysis of the query log shows that a user query is likely to cover more than one attribute. In map retrieval, for example, the user may enter a street name, a city name, or the name of a business. Because such entities may consist of many terms, term order is also an important factor. A global factor is therefore used to compute the common term sequence between the query and each attribute of a document, as in the following formula:

\phi_e(q, TAR) \overset{\Delta}{=} \exp\left( \lambda \sum_{i=1}^{m} \mathrm{EditDist}_{word}(q, att_i) \right) \qquad (3)

where q is the original query entered by the user; EditDist_word denotes the string edit distance used as the string similarity measure, an algorithm based on the number of identical characters shared by the strings and their relative order; λ is the weight of the string similarity; TAR denotes the result, with its score, obtained from the query of steps 1 and 2; and φ_e(q, TAR) is the weight factor used when reordering the retrieval results.
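A minimal sketch of the global factor in formula (3) follows. It substitutes a standard Levenshtein distance over word tokens for EditDist_word, which may differ from the patent's exact measure, and it tokenizes on whitespace purely for illustration (Chinese text would need a word segmenter). Because an edit distance grows as strings become less similar, the example assumes a negative λ so that closer matches receive a larger factor; the patent does not state the sign or value of λ.

```python
import math


def word_edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def global_factor(query_tokens, record, lam):
    """phi_e(q, TAR) = exp(lam * sum_i EditDist_word(q, att_i))."""
    total = sum(word_edit_distance(query_tokens, value.split())
                for value in record.values())
    return math.exp(lam * total)


# Hypothetical record with two attributes.
record = {"name": "fudan university", "city": "shanghai"}
factor = global_factor(["fudan", "university"], record, lam=-0.1)
```

Here the query matches the `name` attribute exactly (distance 0) and differs from `city` by two token edits, so the factor is exp(-0.2).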

With the three features above, the final query formula is:

\begin{aligned}
Score(q, TAR) &= \prod_{e \in \Pi_e} \phi_e(q, TAR) \overset{\Delta}{=} \sum_{e \in \Pi_e} \log\left( \phi_e(q, TAR) \right) \\
&\overset{\Delta}{=} \sum_{t \in \Pi_T} \sum_{i=1}^{m} \left( \omega(t)\, P(att_i \mid t)\, f(t, att_i) \right) + \lambda \sum_{i=1}^{m} \mathrm{EditDist}_{word}(q, att_i)
\end{aligned} \qquad (4)

where t denotes a query term, ω(t) is the weight of that term, Π is the feature set, P(att_i | t) is the probability that the term matches attribute i, and f(t, att_i) is the retrieval similarity score between term t and attribute att_i given by the language model.
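The last line of formula (4) can be evaluated directly once its inputs are known. The sketch below assumes all inputs have already been computed and are passed in as dictionaries; the function name, the data layout, and the numbers are illustrative only.

```python
def final_score(term_weights, attr_probs, lm_scores, edit_distances, lam):
    """Formula (4): attribute factors plus the global string factor.

    term_weights:   term -> omega(t)
    attr_probs:     (term, attribute) -> P(att_i | t)
    lm_scores:      (term, attribute) -> f(t, att_i)
    edit_distances: attribute -> EditDist_word(q, att_i)
    lam:            weight of the string similarity term
    """
    attribute_part = sum(
        term_weights[t] * attr_probs[(t, att)] * lm_scores[(t, att)]
        for (t, att) in lm_scores
    )
    global_part = lam * sum(edit_distances.values())
    return attribute_part + global_part


# Made-up inputs: one term, two attributes.
score = final_score(
    term_weights={"shanghai": 0.5},
    attr_probs={("shanghai", "city"): 0.8, ("shanghai", "name"): 0.2},
    lm_scores={("shanghai", "city"): -2.0, ("shanghai", "name"): -5.0},
    edit_distances={"city": 1, "name": 3},
    lam=-0.1,
)
```

Candidate results would then be sorted by this score in descending order.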

In this formula, the first feature gives the term's own weight ω(t), and the attribute matching probability gives the term's weight P(att_i | t) for each attribute. Multiplying the two yields the term's final weight, which is combined with the language model for retrieval. The retrieved results are then reordered by adding the third feature, the string similarity information, to give the final result. To make the final score easy to compute, and because a product and its logarithm have the same monotonicity, the logarithm of the score is taken. The final score can therefore be regarded as a weighted sum of the attribute factors and the global factor, which also avoids the possibility of floating-point overflow.
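As a worked check of the logarithm step, assume (the text does not state this explicitly) that each attribute factor has the exponential form exp(ω(t) P(att_i | t) f(t, att_i)), matching the form of the global factor in formula (3). Taking the logarithm of the product of factors then gives the sum in formula (4):

```latex
\begin{aligned}
\log \prod_{e \in \Pi_e} \phi_e(q, TAR)
  &= \sum_{e \in \Pi_e} \log \phi_e(q, TAR) \\
  &= \sum_{t \in \Pi_T} \sum_{i=1}^{m}
       \log \exp\bigl( \omega(t)\, P(att_i \mid t)\, f(t, att_i) \bigr)
     + \log \exp\Bigl( \lambda \sum_{i=1}^{m} \mathrm{EditDist}_{word}(q, att_i) \Bigr) \\
  &= \sum_{t \in \Pi_T} \sum_{i=1}^{m} \omega(t)\, P(att_i \mid t)\, f(t, att_i)
     + \lambda \sum_{i=1}^{m} \mathrm{EditDist}_{word}(q, att_i)
\end{aligned}
```

Because the logarithm is strictly increasing, ranking by this sum gives the same order as ranking by the product.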

Once steps 1 and 2 have produced each term's weights for the different attributes, a log-linear retrieval model is used. This model measures how well term t matches text X by the logarithm of a language model probability, and introduces Dirichlet smoothing to handle the case in which the denominator of the formula would be 0, as shown below:

f(t, X) \overset{\Delta}{=} \log \frac{tf(t, X) + \mu \frac{tf(t, C)}{|C|}}{\mu + |X|} \qquad (5)

where tf(t, X) is the number of times term t occurs in text X, tf(t, C) is the corresponding count in the document collection C, μ is the smoothing parameter, and |X| and |C| are the total numbers of terms in X and in C, respectively. At query time each term carries the weight given by the product from steps 1 and 2. Once the preliminary results are obtained, the global factor of step 3 intervenes, adding a different score to each result according to its similarity and thereby reordering the results. This framework yields the final retrieval results.
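Formula (5) can be sketched as below. The default `mu=2500.0` is only a commonly used setting for Dirichlet smoothing and is not taken from the patent; returning negative infinity for a term that never occurs in the collection is likewise an assumption about how to handle that edge case.

```python
import math
from collections import Counter


def dirichlet_score(term, text_tokens, collection_tf, collection_len, mu=2500.0):
    """f(t, X) = log((tf(t,X) + mu * tf(t,C)/|C|) / (mu + |X|))."""
    tf_x = Counter(text_tokens)[term]                 # tf(t, X)
    p_c = collection_tf.get(term, 0) / collection_len  # tf(t, C) / |C|
    numerator = tf_x + mu * p_c
    if numerator == 0.0:
        return float("-inf")  # term unseen in the whole collection (assumption)
    return math.log(numerator / (mu + len(text_tokens)))


# Made-up collection statistics.
collection_tf = {"shanghai": 100}
present = dirichlet_score("shanghai", ["fudan", "university", "shanghai"],
                          collection_tf, collection_len=10000, mu=10.0)
absent = dirichlet_score("shanghai", ["fudan", "university"],
                         collection_tf, collection_len=10000, mu=10.0)
empty = dirichlet_score("shanghai", [], collection_tf,
                        collection_len=10000, mu=10.0)
```

Thanks to the smoothing term, the score stays finite both when the term is absent from the text and when the text (for example, an empty attribute value) contains no terms at all.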

Experiments on different semi-structured data sets (address information, film information, and job information) show that, compared with existing methods, the model performs well on all of them, indicating that it is of real value for improving semi-structured data retrieval performance. The address data has 10 attributes: name, address, province, city, district, category, postcode, telephone, alias, and latitude/longitude. The job data has 16 attributes: position, recruiting company, company size, company type, company industry, gender requirement, number of openings, age requirement, employment form, closing date, education requirement, salary, work experience, work location, job description, and benefits. The film data has 11 attributes: film title, time, release date, language, genre, country, cast, crew, synopsis, awards, and distributor.

Table 1. Retrieval performance of the graph-model-based semi-structured data retrieval method

Data set	Normalized discounted cumulative gain (NDCG)	Average precision	Precision of top 10 results
Address information	0.5942	46.2%	70.1%
Job information	0.7175	60.7%	73.2%
Film information	0.6909	64.4%	73.8%

Claims

1. A semi-structured data retrieval method based on a graph model, characterized in that retrieval is based on three factors, namely term weight setting, attribute matching probability, and string similarity, with the following concrete steps:

(1) Dynamic weight assignment for different terms

The terms produced by word segmentation are assigned weights dynamically by a weighted linear combination of term features, yielding a weight for each term; the features comprise Google N-gram word frequency statistics, user query log information, database information, and named entity information;

(2) Dynamic attribute weight matching based on attribute matching probability

The probability with which each segmented term occurs in the different attributes of the semi-structured data is collected, and a naive-Bayes-based method assigns each term a weight for every attribute; the term weight obtained in step (1) is then multiplied by the term's attribute weights to obtain the term's weight for each attribute, and retrieval is performed with Indri, a retrieval framework based on language models;

(3) Global factor intervention based on string similarity matching

After Indri returns the initial results, the string edit distance between the original user query and the value of each attribute in the database is computed, the initial ranking is reordered using this edit distance information to obtain the final ranking, and the final result is returned.

2. The retrieval method according to claim 1, characterized in that in step (1) the weight ω(t) of each term is obtained by the following formula,

where t denotes a term, Φ denotes the set of weight features, σ denotes the weight of the feature itself, and λ_φ denotes the weight of feature φ.

3. The retrieval method according to claim 1, characterized in that in step (2) the probability P(att_i | t) that a term belongs to a given attribute is given by the following formula:

P(att_i \mid t) = \frac{P(t \mid att_i)\, P(att_i)}{P(t)} = \frac{P(t \mid att_i)\, P(att_i)}{\sum_{k=1}^{m} P(t \mid att_k)\, P(att_k)} \qquad (2)

where P(t | att_i) is computed and stored while the index is built, and P(att_i) is the prior probability that term t belongs to attribute att_i, obtained from analysis of the user query log.

4. The retrieval method according to claim 1, characterized in that in step (3) a global factor is used to compute the common term sequence between the query and each attribute of a document, as in the following formula:

\phi_e(q, TAR) \overset{\Delta}{=} \exp\left( \lambda \sum_{i=1}^{m} \mathrm{EditDist}_{word}(q, att_i) \right) \qquad (3)

where q is the original query entered by the user, EditDist_word denotes the string edit distance used as the string similarity measure, λ is the weight of the string similarity, TAR denotes the result, with its score, obtained from the query of steps 1 and 2, and φ_e(q, TAR) is the weight factor used when reordering the retrieval results.

5. The retrieval method according to claim 1, characterized in that retrieval based on the graph model uses the following formula for scoring:

\begin{aligned}
Score(q, TAR) &= \prod_{e \in \Pi_e} \phi_e(q, TAR) \overset{\Delta}{=} \sum_{e \in \Pi_e} \log\left( \phi_e(q, TAR) \right) \\
&\overset{\Delta}{=} \sum_{t \in \Pi_T} \sum_{i=1}^{m} \left( \omega(t)\, P(att_i \mid t)\, f(t, att_i) \right) + \lambda \sum_{i=1}^{m} \mathrm{EditDist}_{word}(q, att_i)
\end{aligned} \qquad (4)