CN104090966A - Semi-structured data retrieval method based on graph model - Google Patents

Info

Publication number
CN104090966A
Authority
CN
China
Prior art keywords
att
entry
weight
attribute
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410338837.7A
Other languages
Chinese (zh)
Inventor
康积华
张奇
黄萱菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201410338837.7A priority Critical patent/CN104090966A/en
Publication of CN104090966A publication Critical patent/CN104090966A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Abstract

The invention belongs to the technical field of information retrieval and specifically relates to a semi-structured data retrieval method based on a graph model. The method comprises three main parts: dynamic setting of term weights after word segmentation, attribute matching probability, and string similarity calculation. It builds on Indri, a retrieval framework based on language models with Dirichlet smoothing, which handles complex queries well and is readily extensible. As navigation systems and location-based services (LBS) become increasingly widespread, the method takes the user's search intent into account to improve map information retrieval performance and give users a more accurate and efficient experience. The scheme is fully disclosed: from this description, combined with existing techniques and resources in the field, a person skilled in the art can carry out the technical scheme and achieve its technical effect.

Description

Semi-structured data retrieval method based on graph model
Technical field
The invention belongs to the technical field of information retrieval and specifically relates to a semi-structured data retrieval model.
Background
Traditional work on semi-structured data retrieval falls broadly into two directions: keyword retrieval and user intent analysis. On the one hand, keywords can be mined from a user's free-text input with only limited accuracy, and a query containing many keywords can obscure the core keywords. On the other hand, a large body of research applies unsupervised, semi-supervised, or supervised methods to the semantic analysis of user queries. Because the accuracy of semantic analysis is itself limited, using its output directly as the query can substantially degrade retrieval quality. In short, as the Internet develops, the performance of semi-structured data retrieval needs to improve in order to improve the user experience.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the invention is to provide a retrieval method based on a graph model. By taking the user's query intent into account, the method can improve map information retrieval performance and give users a more accurate and efficient experience.
The invention provides a retrieval method for semi-structured map information data. Retrieval is based on three factors: term weight setting, attribute matching probability, and string similarity. The specific steps are as follows:
(1) Dynamically assign weights to the different terms
For the terms produced by word segmentation, weights are assigned dynamically by a weighted linear combination of term features, yielding a weight for each term. The features comprise Google N-gram word frequency statistics, user query log information, database information, and named entity information. The more often a term occurs in the Google N-gram records, in the user query logs, or in the database, or if the term is a named entity, the higher its weight. For example, for the query "Fudan University in Shanghai", the weight of the term "Shanghai" should be far higher than that of the mis-segmented fragment "海复" (the last character of "Shanghai" followed by the first character of "Fudan");
(2) Match dynamic attribute weights based on attribute matching probability
Statistics are collected on the probability with which each term occurs in the different attributes of the semi-structured data, and a naive Bayes based method assigns each term a weight for each attribute. The term's own weight is then multiplied by its attribute weight to give the weight of the term with respect to each attribute, and retrieval is carried out with Indri, a retrieval framework based on language models;
(3) Apply a global factor based on string similarity matching
After Indri returns the initial results, the edit distance between the original user query and the value of each attribute in the database is computed, the initial ranking is reordered using this edit distance information, and the final ranking is returned.
Beneficial effect of the invention: the method completes retrieval tasks efficiently and improves retrieval performance.
Brief description of the drawings
Fig. 1 is the basic flowchart of the method of the invention.
Fig. 2 is the factor graph of the graph model.
Detailed description of the embodiments
Fig. 1 shows the basic flow of graph-model-based retrieval of semi-structured map information data in the invention. The query is first segmented into terms. The terms are then given weights and matched to attributes to produce preliminary query results, the preliminary results are reordered by string similarity, and the final result is returned.
The graph model combines three factors (term weight setting, attribute matching probability, and string similarity) to obtain the final retrieval result, as shown in Fig. 2. Given a query q, word segmentation produces n terms {t1, t2, ..., tn}, and semi-structured retrieval finds the information in the database most relevant to the query. A semi-structured database contains different attributes. In a job-listing database, for example, the attributes include position, hiring company, company industry, number of openings, age requirement, employment type (full-time or part-time), salary, and work location. Here {att1, att2, ..., attm} denotes the attributes of the semi-structured data set.
1. Term weight setting
Each term is assigned a weight using a set of weight features. For example, in "Fudan University in Shanghai", the weight of the mis-segmented fragment "海复" (rendered literally as "sea is multiple") should clearly be lower than that of "Fudan University". We therefore use a linear weighting of the weight features to generate a weight ω(t) for each term, with the following formula:
where t denotes a term, Φ denotes the set of weight features, σ denotes the weight of the feature itself, and λ_φ denotes the weight of feature φ.
Table 1 lists the feature set used to assign weights to terms. All of these features are very efficient to compute, even on very large data sets.
Feature   Description
GF        Frequency with which term t occurs in Google n-grams
QF        Frequency with which term t occurs in the user query logs
CF        Frequency with which term t occurs in the most important attribute of the semi-structured data set
EF        1 if term t is an entity name, otherwise 0
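The text describes ω(t) as a weighted linear combination of the features in Table 1, but the formula itself does not appear in this text. The following is a minimal sketch under the assumption that ω(t) is the sum of λ_φ · φ(t) over the features, with feature values already normalized to [0, 1]. The λ values, the feature values, and the name `term_weight` are illustrative assumptions, not values from the patent.

```python
from typing import Dict

# Hypothetical feature weights (lambda_phi); the patent gives no values.
FEATURE_WEIGHTS: Dict[str, float] = {"GF": 0.3, "QF": 0.3, "CF": 0.2, "EF": 0.2}


def term_weight(features: Dict[str, float],
                lambdas: Dict[str, float] = FEATURE_WEIGHTS) -> float:
    """Weighted linear combination of the feature values of one term.

    Assumes every feature value is already normalized to [0, 1].
    A feature missing from `features` contributes 0.
    """
    return sum(lambdas[name] * features.get(name, 0.0) for name in lambdas)


# Made-up feature values for a real word and a mis-segmented fragment.
shanghai = {"GF": 0.9, "QF": 0.8, "CF": 0.7, "EF": 1.0}
fragment = {"GF": 0.1, "QF": 0.0, "CF": 0.1, "EF": 0.0}

print(term_weight(shanghai), term_weight(fragment))
```

With these made-up numbers the real word receives a much larger weight than the fragment, which is the behavior the "Shanghai" example in the text calls for.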
2. Attribute matching probability
A user query contains content belonging to different attributes but carries no explicit structural information. For example, a query might be ".NET Shanghai internship", where ".NET" belongs to the position, "Shanghai" to the work location, and "internship" to the job type. Looking up ".NET" in the work location attribute is pointless, so the first task is to match query content to data attributes. Following PRMS (the Probabilistic Retrieval Model for Semistructured data), we compute the probability that a term belongs to a given attribute. Using Bayes' theorem, this probability can be inferred from the prior probability of the attribute and the probability that term t occurs in that attribute, as shown in the following formula:
$$P(\mathrm{att}_i \mid t) = \frac{P(t \mid \mathrm{att}_i)\, P(\mathrm{att}_i)}{P(t)} = \frac{P(t \mid \mathrm{att}_i)\, P(\mathrm{att}_i)}{\sum_{k=1}^{m} P(t \mid \mathrm{att}_k)\, P(\mathrm{att}_k)} \tag{2}$$
where P(t | att_i) can be stored and learned at index-building time, and P(att_i) is the prior probability, inferred from analysis of the user query logs, that a term belongs to att_i.
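Formula (2) can be sketched in a few lines. The example below is only an illustration: the attribute names, the likelihood and prior values, and the fallback to the prior when a term has never been seen in any attribute are all assumptions that the source does not specify.

```python
from typing import Dict


def attribute_posterior(likelihood: Dict[str, float],
                        prior: Dict[str, float]) -> Dict[str, float]:
    """Return P(att_i | t) for every attribute via Bayes' theorem.

    likelihood[att] is P(t | att), e.g. estimated at indexing time.
    prior[att] is P(att), e.g. estimated from query logs.
    """
    joint = {att: likelihood.get(att, 0.0) * prior[att] for att in prior}
    evidence = sum(joint.values())  # sum_k P(t | att_k) P(att_k)
    if evidence == 0.0:
        # Term unseen in every attribute: fall back to the prior (assumption).
        return dict(prior)
    return {att: value / evidence for att, value in joint.items()}


# Made-up statistics for the term "Shanghai" in a job-listing database.
likelihood = {"position": 0.001, "work_location": 0.05, "job_type": 0.0005}
prior = {"position": 0.5, "work_location": 0.3, "job_type": 0.2}

posterior = attribute_posterior(likelihood, prior)
print(max(posterior, key=posterior.get))
```

Even though "position" has the largest prior here, the likelihood term dominates and the posterior mass concentrates on the work location attribute.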
3. String similarity
Analysis of the query logs shows that a user query is likely to involve more than one attribute. In map retrieval, for example, the user may enter a street name, a city name, or the name of a business. Because these entities may consist of many terms, term order is also an important factor. We therefore use a global factor to compute the common term sequence between the query and the information in each attribute of a document, as shown in the following formula:
$$\phi_e(q, \mathrm{TAR}) \triangleq \exp\left(\lambda \sum_{i=1}^{m} \mathrm{EditDist}_{\mathrm{word}}(q, \mathrm{att}_i)\right) \tag{3}$$
where q is the original query entered by the user and EditDist_word denotes the string edit distance used as the string similarity measure. The algorithm is based on the number of characters the strings share and their relative order. λ denotes the weight of string similarity. TAR denotes the scored results obtained from the query of steps 1 and 2, and φ_e(q, TAR) is the weighting factor used when reordering the retrieval results.
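The global factor of formula (3) can be sketched as follows. The source describes EditDist_word both as a similarity and as an edit distance. This sketch treats it as a word-level Levenshtein distance and uses a negative λ so that larger distances lower the factor. Both choices, and the function names, are assumptions rather than details given in the patent.

```python
import math
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (or two strings)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def global_factor(query: List[str], attributes: List[List[str]],
                  lam: float) -> float:
    """exp(lam * sum_i EditDist(query, att_i)), following formula (3)."""
    total = sum(edit_distance(query, att) for att in attributes)
    return math.exp(lam * total)


query = ["shanghai", "fudan", "university"]
name_attr = ["fudan", "university"]
city_attr = ["shanghai"]
print(global_factor(query, [name_attr, city_attr], lam=-0.1))
```

Working on tokens rather than characters keeps term order in play, which is the property the text singles out for multi-term entity names.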
With the three features above, the final query formula is:
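$$\mathrm{Score}(q, \mathrm{TAR}) = \prod_{e \in \Pi_e} \phi_e(q, \mathrm{TAR}) \triangleq \sum_{e \in \Pi_e} \log\bigl(\phi_e(q, \mathrm{TAR})\bigr) \triangleq \sum_{t \in \Pi_T} \sum_{i=1}^{m} \bigl(\omega(t)\, P(\mathrm{att}_i \mid t)\, f(t, \mathrm{att}_i)\bigr) + \lambda \sum_{i=1}^{m} \mathrm{EditDist}_{\mathrm{word}}(q, \mathrm{att}_i) \tag{4}$$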
where t denotes a query term, ω(t) is the weight of that term, Π is the feature set, P(att_i | t) is the matching probability between the term and attribute i, and f(t, att_i) is the retrieval similarity score between term t and attribute att_i obtained from the language model.
In this formula, the first feature gives the term's own weight ω(t), and the attribute matching probability gives the term's weight P(att_i | t) for each attribute. Multiplying the two gives the final weight of the term, which is combined with the language model for retrieval. The retrieval results are then reordered by adding the third feature, the string similarity information, to obtain the final result. To make the final score easy to compute, and because a product has the same monotonicity as its logarithm, we take the logarithm of the score. The final score can therefore be viewed as a weighted sum of the attribute factors and the global factor, which avoids the risk of floating-point overflow.
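The log-domain form of formula (4) is a plain sum, which the following sketch makes concrete. The data layout (nested dictionaries keyed by term and attribute), the name `final_score`, and all numbers are assumptions made for illustration. The language-model scores f(t, att_i) are taken as given inputs.

```python
from typing import Dict, List


def final_score(term_weights: Dict[str, float],
                posteriors: Dict[str, Dict[str, float]],
                lm_scores: Dict[str, Dict[str, float]],
                edit_distances: List[float],
                lam: float) -> float:
    """Log-domain score: attribute factors plus the global factor.

    term_weights[t]    -> omega(t)
    posteriors[t][att] -> P(att | t)
    lm_scores[t][att]  -> f(t, att), a log language-model score
    edit_distances[i]  -> EditDist_word(q, att_i)
    """
    attribute_part = 0.0
    for term, omega in term_weights.items():
        for att, prob in posteriors[term].items():
            attribute_part += omega * prob * lm_scores[term][att]
    global_part = lam * sum(edit_distances)
    return attribute_part + global_part


# Made-up inputs for a one-term query against one candidate record.
score = final_score(
    term_weights={"shanghai": 0.85},
    posteriors={"shanghai": {"city": 0.9, "name": 0.1}},
    lm_scores={"shanghai": {"city": -2.0, "name": -6.0}},
    edit_distances=[1, 2],
    lam=-0.1,
)
print(score)
```

Because every factor enters as an added log term, no intermediate product of small probabilities is ever formed.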
Once steps 1 and 2 have produced the weight of each term for each attribute, we use a log-linear retrieval model. The model measures the degree of match between term t and text X by the logarithm of the language model score, and introduces Dirichlet smoothing to handle the case where the denominator in the formula is 0, as shown in the following formula:
$$f(t, X) \triangleq \log \frac{\mathrm{tf}(t, X) + \mu\, \dfrac{\mathrm{tf}(t, C)}{|C|}}{\mu + |X|} \tag{5}$$
where tf(t, X) is the number of times term t occurs in text X, μ is the smoothing parameter, and |X| and |C| are the total number of terms in X and in the document collection C, respectively. At query time, the weight of each term is the product of the weights obtained in steps 1 and 2. After the preliminary results are obtained, the global factor of step 3 adds a different score to each result according to its similarity, which reorders the results. This framework yields the final retrieval results.
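A minimal sketch of formula (5), computed directly from token lists rather than through Indri. The default μ of 2500 is a conventional choice for Dirichlet smoothing and is an assumption here, as is returning negative infinity for a term that never occurs in the collection. The source does not say how that case is handled.

```python
import math
from collections import Counter
from typing import List


def dirichlet_log_score(term: str, text: List[str], collection: List[str],
                        mu: float = 2500.0) -> float:
    """log of the Dirichlet-smoothed probability of `term` in `text`."""
    tf_x = Counter(text)[term]        # tf(t, X)
    tf_c = Counter(collection)[term]  # tf(t, C)
    if tf_c == 0:
        # Term absent from the whole collection (assumed handling).
        return float("-inf")
    smoothed = (tf_x + mu * tf_c / len(collection)) / (mu + len(text))
    return math.log(smoothed)


collection = ["shanghai", "fudan", "university", "shanghai", "road"]
text = ["fudan", "university"]

# A term missing from the text still gets a finite score from the collection.
print(dirichlet_log_score("fudan", text, collection, mu=2.0))
print(dirichlet_log_score("shanghai", text, collection, mu=2.0))
```

The second call shows the point of the smoothing: "shanghai" does not occur in the text, yet its score stays finite because the collection frequency fills in.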
Experiments on different semi-structured data sets (address information, film information, and job information) show that, compared with existing methods, the model performs very well on all of them, which indicates that it is of real value for improving semi-structured data retrieval performance. The address data has 10 attributes: name, address, province, city, district, category, postal code, telephone, alias, and latitude/longitude. The job data has 16 attributes: position, hiring company, company size, company type, company industry, gender requirement, number of openings, age requirement, employment type, closing date, education requirement, salary, work experience, work location, job description, and benefits. The film data has 11 attributes: title, time, release date, language, genre, country, cast, crew, synopsis, awards, and distributor.
Table 1. Retrieval performance of the graph-model-based semi-structured data retrieval method
Data set              Normalized discounted cumulative gain   Average precision   Precision of top 10 results
Address information   0.5942                                  46.2%               70.1%
Job information       0.7175                                  60.7%               73.2%
Film information      0.6909                                  64.4%               73.8%

Claims (5)

1. A semi-structured data retrieval method based on a graph model, characterized in that retrieval is based on three factors, namely term weight setting, attribute matching probability, and string similarity, with the following specific steps:
(1) Dynamically assign weights to the different terms
For the terms produced by word segmentation, weights are assigned dynamically by a weighted linear combination of term features, yielding a weight for each term; the features comprise Google N-gram word frequency statistics, user query log information, database information, and named entity information;
(2) Match dynamic attribute weights based on attribute matching probability
Statistics are collected on the probability with which each segmented term occurs in the different attributes of the semi-structured data, and a naive Bayes based method assigns each term a weight for each attribute; the term weight obtained in step (1) is then multiplied by the term's attribute weights to give the weight of the term with respect to each attribute, and retrieval is carried out with Indri, a retrieval framework based on language models;
(3) Apply a global factor based on string similarity matching
After Indri returns the initial results, the edit distance between the original user query and the value of each attribute in the database is computed, the initial ranking is reordered using this edit distance information to obtain the final ranking, and the final result is returned.
2. The retrieval method according to claim 1, characterized in that in step (1) the weight ω(t) of each term is obtained by the following formula,
where t denotes a term, Φ denotes the set of weight features, σ denotes the weight of the feature itself, and λ_φ denotes the weight of feature φ.
3. The retrieval method according to claim 1, characterized in that in step (2) the probability P(att_i | t) that a term belongs to a given attribute is given by the following formula:
$$P(\mathrm{att}_i \mid t) = \frac{P(t \mid \mathrm{att}_i)\, P(\mathrm{att}_i)}{P(t)} = \frac{P(t \mid \mathrm{att}_i)\, P(\mathrm{att}_i)}{\sum_{k=1}^{m} P(t \mid \mathrm{att}_k)\, P(\mathrm{att}_k)} \tag{2}$$
where P(t | att_i) is stored and learned at index-building time, and P(att_i) is the prior probability, obtained from analysis of the user query logs, that term t belongs to attribute att_i.
4. The retrieval method according to claim 1, characterized in that in step (3) a global factor computes the common term sequence between the query and the information in each attribute of a document, as shown in the following formula:
$$\phi_e(q, \mathrm{TAR}) \triangleq \exp\left(\lambda \sum_{i=1}^{m} \mathrm{EditDist}_{\mathrm{word}}(q, \mathrm{att}_i)\right) \tag{3}$$
where q is the original query entered by the user, EditDist_word denotes the string edit distance used as the string similarity measure, λ denotes the weight of string similarity, TAR denotes the scored results obtained from the query of steps 1 and 2, and φ_e(q, TAR) is the weighting factor used when reordering the retrieval results.
5. The retrieval method according to claim 1, characterized in that retrieval based on the graph model uses the following scoring formula:
$$\mathrm{Score}(q, \mathrm{TAR}) = \prod_{e \in \Pi_e} \phi_e(q, \mathrm{TAR}) \triangleq \sum_{e \in \Pi_e} \log\bigl(\phi_e(q, \mathrm{TAR})\bigr) \triangleq \sum_{t \in \Pi_T} \sum_{i=1}^{m} \bigl(\omega(t)\, P(\mathrm{att}_i \mid t)\, f(t, \mathrm{att}_i)\bigr) + \lambda \sum_{i=1}^{m} \mathrm{EditDist}_{\mathrm{word}}(q, \mathrm{att}_i) \tag{4}$$
where t denotes a query term, ω(t) is the weight of that term, Π is the feature set, P(att_i | t) is the matching probability between the term and attribute i, and f(t, att_i) is the retrieval similarity score between term t and attribute att_i obtained from the language model.
CN201410338837.7A 2014-07-16 2014-07-16 Semi-structured data retrieval method based on graph model Pending CN104090966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410338837.7A CN104090966A (en) 2014-07-16 2014-07-16 Semi-structured data retrieval method based on graph model

Publications (1)

Publication Number Publication Date
CN104090966A true CN104090966A (en) 2014-10-08

Family

ID=51638682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410338837.7A Pending CN104090966A (en) 2014-07-16 2014-07-16 Semi-structured data retrieval method based on graph model

Country Status (1)

Country Link
CN (1) CN104090966A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120096042A1 (en) * 2010-10-19 2012-04-19 Microsoft Corporation User query reformulation using random walks
CN102298606A (en) * 2011-06-01 2011-12-28 清华大学 Random walking image automatic annotation method and device based on label graph model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QI ZHANG: "Map Search via A Factor Graph Model", 《PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462501A (en) * 2014-12-19 2015-03-25 北京奇虎科技有限公司 Knowledge graph construction method and device based on structural data
CN109829500A (en) * 2019-01-31 2019-05-31 华南理工大学 A kind of position composition and automatic clustering method
CN109829500B (en) * 2019-01-31 2023-05-02 华南理工大学 Position composition and automatic clustering method

Similar Documents

Publication Publication Date Title
CN107122413B (en) Keyword extraction method and device based on graph model
CN106156272A (en) A kind of information retrieval method based on multi-source semantic analysis
CN103268348B (en) A kind of user's query intention recognition methods
CN111104794A (en) Text similarity matching method based on subject words
CN101853272B (en) Search engine technology based on relevance feedback and clustering
CN102253982B (en) Query suggestion method based on query semantics and click-through data
US9104733B2 (en) Web search ranking
US20080114750A1 (en) Retrieval and ranking of items utilizing similarity
CN104268142B (en) Based on the Meta Search Engine result ordering method for being rejected by strategy
CN103425687A (en) Retrieval method and system based on queries
CN104484380A (en) Personalized search method and personalized search device
CN104036051B (en) A kind of database schema abstraction generating method propagated based on label
CN104298776A (en) LDA model-based search engine result optimization system
CN103198136B (en) A kind of PC file polling method based on sequential correlation
US10678820B2 (en) System and method for computerized semantic indexing and searching
CN102156728B (en) Improved personalized summary system based on user interest model
CN103246644A (en) Method and device for processing Internet public opinion information
CN103218373A (en) System, method and device for relevant searching
Minkov et al. Improving graph-walk-based similarity with reranking: Case studies for personal information management
CN101937433A (en) Real-time searching method of product
CN101840438B (en) Retrieval system oriented to meta keywords of source document
CN103324707A (en) Query expansion method based on semi-supervised clustering
Azzam et al. A question routing technique using deep neural network for communities of question answering
Abdulhayoglu et al. Using character n-grams to match a list of publications to references in bibliographic databases
CN111651675B (en) UCL-based user interest topic mining method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141008