CN104090966A - Semi-structured data retrieval method based on graph model - Google Patents
Semi-structured data retrieval method based on graph model
- Publication number
- CN104090966A (application number CN201410338837.7A)
- Authority
- CN
- China
- Prior art keywords
- att
- entry
- weight
- attribute
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/83—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Abstract
The invention belongs to the technical field of information retrieval and relates in particular to a semi-structured data retrieval method based on a graph model. The method combines three components: dynamic weighting of the segmented query terms, attribute matching probability, and string similarity calculation. It is built on Indri, a retrieval framework based on language models with Dirichlet smoothing, which handles complex queries well and is readily extensible. As navigation systems and location-based services (LBS) come into increasingly wide use, the method takes the user's search intent into account to improve map information retrieval performance and give users a more accurate and efficient experience. The scheme is fully disclosed: from the description of the method, combined with existing techniques and resources in the field, a person skilled in the art can implement the technical scheme and achieve its technical effect.
Description
Technical field
The invention belongs to the technical field of information retrieval and relates in particular to a semi-structured data retrieval model.
Background art
Traditional work on semi-structured data retrieval falls broadly into two directions: keyword retrieval and user intent analysis. On the one hand, the accuracy with which keywords can be mined from the free text a user enters is limited, and a user who enters many keywords may obscure the core ones. On the other hand, much research applies unsupervised, semi-supervised, or supervised methods to the semantic analysis of user queries; because the accuracy of semantic analysis is itself limited, retrieval quality can suffer considerably if the analysis result is used directly as the query. In summary, as the internet develops, the performance of semi-structured data retrieval needs to improve in order to improve the user experience.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the invention is to provide a retrieval method based on a graph model. By taking the user's query intent into account, the method can improve map information retrieval performance and give users a more accurate and efficient experience.
The invention provides a retrieval method for semi-structured map information data. Retrieval is based on three factors: term weight setting, attribute matching probability, and string similarity. The concrete steps are as follows:
(1) Dynamically assign weights to the different terms
The terms produced by word segmentation are weighted dynamically by a weighted linear combination of term features, giving a weight for each term. The features comprise Google n-gram frequency statistics, user query log information, database information, and named-entity information. The more often a term occurs in the Google n-gram records, in the user query log, or in the database, or if the term is a named entity, the higher its weight. For example, for the query "Shanghai Fudan University" (上海复旦大学), the weight of the term "Shanghai" (上海) should be far higher than that of "hai fu" (海复), a character pair that straddles the boundary between "Shanghai" and "Fudan";
(2) Match attribute weights dynamically based on attribute matching probability
Count the probability with which each term occurs in the different attributes of the semi-structured data, and use a naive-Bayes-based method to assign the term a weight for each attribute. Then multiply the term's own weight by each attribute weight to obtain the weight of the term for each attribute, and retrieve with Indri, the retrieval framework based on language models;
(3) Apply a global factor based on string similarity matching
After Indri returns the initial results, compute the string edit distance between the original user query and the value of each attribute in the database, re-rank the initial results using this edit distance information to obtain the final ranking, and return the final results.
The beneficial effect of the invention is that the method completes retrieval tasks efficiently and improves retrieval performance.
Brief description of the drawings
Fig. 1 is the basic flowchart of the method of the invention.
Fig. 2 is the factor graph of the graph model.
Detailed description of the embodiments
Fig. 1 shows the basic flow of semi-structured map information retrieval based on the graph model. The query is first segmented into terms; the segmented terms are then weighted and matched to attributes to obtain preliminary result entries; the preliminary entries are re-ranked by string similarity; and the final results are returned.
The invention retrieves semi-structured map information data with a graph model that combines three factors (term weight setting, attribute matching probability, and string similarity) to produce the final retrieval result, as shown in Fig. 2. Given a query q, word segmentation produces n terms {t1, t2, ..., tn}; semi-structured retrieval then finds the information in the database most relevant to the query. A semi-structured database contains different attributes. In a job-posting database, for example, the attributes include position, recruiting company, company industry, number of openings, age requirement, employment type (full-time or part-time), salary, and work location. Here {att1, att2, ..., attm} denotes the attributes of the semi-structured data set.
1. Term weight setting
We assign each term a weight using a set of weight features. For the query "Shanghai Fudan University" (上海复旦大学), for example, the weight of "hai fu" (海复, a character pair straddling the boundary between "Shanghai" and "Fudan") should clearly be lower than that of "Fudan University". We therefore generate a weight ω(t) for each term as a weighted linear combination of the weight features, using the following formula:
where t denotes the term, Φ the set of weight features, σ the weight of the feature itself, and λ_φ the weight of feature φ.
Table 1 lists the feature set used to weight terms. All of these features are cheap to compute, even on very large data sets. The feature set is as follows:
Feature | Description |
---|---|
GF | Frequency of term t in the Google n-grams |
QF | Frequency of term t in the user query log |
CF | Frequency of term t in the most important attribute of the semi-structured data set |
EF | 1 if term t is an entity name, otherwise 0 |
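The formula for ω(t) did not survive in this text. As a minimal sketch, assuming the "weighted linear combination" means ω(t) = Σ_{φ∈Φ} λ_φ · φ(t) over the four features in the table, the weight of a term (entry) could be computed as below. The feature weights and the toy feature values are hypothetical and are not taken from the patent; the role of σ is not modeled.

```python
# Hypothetical feature weights (the lambda_phi values); not taken from the patent.
FEATURE_WEIGHTS = {"GF": 0.3, "QF": 0.3, "CF": 0.3, "EF": 0.1}


def term_weight(features, feature_weights=FEATURE_WEIGHTS):
    """Return omega(t) as a weighted linear combination of feature values.

    `features` maps feature names (GF, QF, CF, EF) to values for one term.
    Missing features count as 0.
    """
    return sum(w * features.get(name, 0.0) for name, w in feature_weights.items())


# Toy feature values, assumed to be normalized to [0, 1] beforehand.
shanghai = {"GF": 0.9, "QF": 0.8, "CF": 0.7, "EF": 1.0}
boundary_fragment = {"GF": 0.01, "QF": 0.0, "CF": 0.0, "EF": 0.0}

if __name__ == "__main__":
    print(term_weight(shanghai), term_weight(boundary_fragment))
```

With these toy values a frequent named entity such as "Shanghai" receives a much higher weight than a fragment that straddles a word boundary, which is the behaviour the description asks for.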
2. Content attribute matching probability
A user query contains content that belongs to different attributes but carries no explicit structural information. A user might query ".NET Shanghai internship", for example, where ".NET" belongs to the action attribute, "Shanghai" to the work location, and "internship" to the job type. Since searching for ".NET" in the work-location attribute is pointless, the first task is to match the query content to the data attributes. Following PRMS, we compute the probability that a term belongs to a given attribute. By Bayes' theorem, this probability can be inferred from the prior probability of the attribute and the probability that term t occurs in that attribute, as in the following formula:
$$P(att_i \mid t) = \frac{P(t \mid att_i)\,P(att_i)}{\sum_{j=1}^{m} P(t \mid att_j)\,P(att_j)}$$
where P(t|att_i) can be recorded during index construction, and P(att_i) is the prior probability, inferred from analysis of the user query log, that a term belongs to att_i.
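The Bayes step can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the function name, the dictionary layout, and the fallback to the prior for a term (entry) never seen in any attribute are all hypothetical choices.

```python
def attribute_posteriors(term, likelihood, prior):
    """Return P(att_i | t) for every attribute via Bayes' theorem.

    likelihood[att][term] holds P(t | att), e.g. collected at indexing time.
    prior[att] holds P(att), e.g. estimated from a query log.
    """
    joint = {att: likelihood.get(att, {}).get(term, 0.0) * p
             for att, p in prior.items()}
    total = sum(joint.values())
    if total == 0.0:
        # Assumption: a term unseen in every attribute falls back to the prior.
        return dict(prior)
    return {att: value / total for att, value in joint.items()}
```

For example, if ".NET" occurs only in the position and job-description attributes, its posterior mass is split between those two and the work-location attribute receives probability 0, so the term is not searched for there.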
3. String similarity
Analysis of the query log shows that a user query is likely to involve more than one attribute; in map retrieval, for example, a user may enter a street name, a city name, or the name of the business itself. Because these entities may consist of many terms, term order is also an important factor. We therefore use a global factor to measure the term sequence shared between the query and the information in each attribute of a document, as in the following formula:
where q is the original query entered by the user; EditDist_word is the string edit distance used as the similarity measure, an algorithm based on the number of identical characters between the strings and their relative order; λ is the weight of the string similarity; TAR denotes a result obtained from the query of steps 1 and 2, whose score is to be adjusted; and φ_e(q, TAR) is the weighting factor applied when re-ranking the retrieval results.
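The formula for the global factor is not preserved in this text, so the following is only a sketch of one plausible reading: a Levenshtein edit distance computed over token sequences, turned into a similarity in [0, 1] and scaled by λ. The normalization by the longer sequence and the names `word_edit_distance` and `global_factor` are assumptions made for illustration.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two token sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def global_factor(query_tokens, attribute_tokens, lam=1.0):
    """Hypothetical phi_e: lam times a normalized edit-distance similarity."""
    longest = max(len(query_tokens), len(attribute_tokens))
    if longest == 0:
        return lam  # two empty sequences are treated as identical
    distance = word_edit_distance(query_tokens, attribute_tokens)
    return lam * (1.0 - distance / longest)
```

Because the distance is sensitive to order, a query whose tokens appear in the same sequence as an attribute value scores higher than one that contains the same tokens shuffled.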
With the three features above, we obtain the final query formula:
where t denotes a query term, ω(t) is the weight of the term, π is the feature set, P(att_i|t) is the matching probability between the term and attribute i, and f(t, att_i) is the retrieval similarity score between term t and attribute att_i obtained from the language model.
In this formula, the first feature gives the term's own weight ω(t), and the attribute matching probability gives the term's weight P(att_i|t) for each attribute; their product is the final weight of the term, which is combined with the language model for retrieval. The retrieved results are then re-ranked by adding the third feature, the string similarity information, to obtain the final results. To make the final score easy to compute, and because a product and its logarithm have the same monotonicity, we take the logarithm of the score. The final score can therefore be viewed as a weighted sum of the attribute factors and the global factor, which also avoids the possibility of floating-point overflow.
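The final formula is missing from this text. One reading consistent with the paragraph above, offered only as an assumption, is that the log-domain score of a record is Σ_t Σ_i ω(t) · P(att_i|t) · f(t, att_i) plus the global factor. The sketch below implements that reading; `attribute_factor`, `rank`, and the callable parameters `f` and `global_factor` are hypothetical names, and the per-term scorer `f` is supplied by the caller.

```python
def attribute_factor(terms, omega, posterior, f, record):
    """Sum omega(t) * P(att | t) * f(t, attribute text) over terms and attributes."""
    total = 0.0
    for t in terms:
        for att, text in record.items():
            total += omega[t] * posterior[t].get(att, 0.0) * f(t, text)
    return total


def rank(records, terms, omega, posterior, f, global_factor):
    """Order records by attribute factor plus global factor, best first."""
    scored = [
        (attribute_factor(terms, omega, posterior, f, r) + global_factor(r), r)
        for r in records
    ]
    # Sort on the score only, so tied records are never compared as dicts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]
```

Keeping the attribute factor and the global factor as separate additive terms mirrors the two-stage flow in the description: a first ranking from the weighted language-model scores, then an adjustment from string similarity.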
Once steps 1 and 2 have produced the weights of each term for the different attributes, we use a log-linear retrieval model. The model measures the match between term t and text X by the logarithm of the language model probability, and introduces Dirichlet smoothing to handle the case where the denominator in the formula would be zero; see the following formula:
where tf(t, X) is the number of times term t occurs in text X, μ is the smoothing parameter, and |X| and |C| are the total number of terms in X and in the document collection C, respectively. The weight of each term at query time is the product obtained from steps 1 and 2. After the preliminary results are obtained, the global factor of step 3 adds a different score to each result according to its similarity, which implements the re-ranking. This framework yields the final retrieval results.
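The smoothing formula itself is missing here. A minimal sketch, assuming the standard Dirichlet-smoothed query-likelihood form f(t, X) = log((tf(t, X) + μ · cf(t) / |C|) / (|X| + μ)), is shown below. The collection frequency cf(t) is not defined in the text and is an assumption, as are the function name and the default value of μ.

```python
import math


def dirichlet_log_score(tf, doc_len, cf, coll_len, mu=2500.0):
    """Log of the Dirichlet-smoothed probability of a term in a text X.

    tf: occurrences of the term in X; doc_len: |X|;
    cf: occurrences of the term in the whole collection (assumed quantity);
    coll_len: |C|; mu: smoothing parameter.
    """
    if cf <= 0 or coll_len <= 0:
        raise ValueError("term must occur in a non-empty collection")
    p_collection = cf / coll_len
    return math.log((tf + mu * p_collection) / (doc_len + mu))
```

Because μ is added to the denominator and μ · cf(t) / |C| to the numerator, a short or even empty attribute value still yields a finite score, and a term that is absent from the text is penalized rather than assigned probability zero.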
In tests on different semi-structured data sets (address information, film information, job information) and in comparison with existing methods, the model achieved strong performance on all of them, which indicates that it is of significant value for improving semi-structured data retrieval performance. The address data has 10 attributes: name, address, province, city, district, category, postal code, telephone, alias, and latitude/longitude. The job data has 16 attributes: vacant position, recruiting company, company size, company type, company industry, gender requirement, number of openings, age requirement, employment type, closing date, education requirement, salary, work experience, work location, job description, and benefits. The film data has 11 attributes: title, time, release date, language, genre, country, cast, crew, synopsis, awards, and distributor.
Table 1. Retrieval performance of the graph-model-based semi-structured data retrieval method
Data set | Normalized discounted cumulative gain (NDCG) | Average precision | Precision of the top ten results |
---|---|---|---|
Address information | 0.5942 | 46.2% | 70.1% |
Job information | 0.7175 | 60.7% | 73.2% |
Film information | 0.6909 | 64.4% | 73.8% |
Claims (5)
1. A semi-structured data retrieval method based on a graph model, characterized in that retrieval is based on three factors, namely term weight setting, attribute matching probability, and string similarity, with the following concrete steps:
(1) Dynamically assign weights to the different terms
The terms produced by word segmentation are weighted dynamically by a weighted linear combination of term features, giving a weight for each term; the features comprise Google n-gram frequency statistics, user query log information, database information, and named-entity information;
(2) Match attribute weights dynamically based on attribute matching probability
Count the probability with which each segmented term occurs in the different attributes of the semi-structured data, and use a naive-Bayes-based method to assign the term a weight for each attribute; then multiply the term weight obtained in step (1) by each of the term's attribute weights to obtain the weight of the term for each attribute, and retrieve with Indri, the retrieval framework based on language models;
(3) Apply a global factor based on string similarity matching
After Indri returns the initial results, compute the string edit distance between the original user query and the value of each attribute in the database, re-rank the initial results using this edit distance information to obtain the final ranking, and return the final results.
2. The retrieval method according to claim 1, characterized in that in step (1) the weight ω(t) of each term is obtained by the following formula,
where t denotes the term, Φ the set of weight features, σ the weight of the feature itself, and λ_φ the weight of feature φ.
3. The retrieval method according to claim 1, characterized in that in step (2) the probability P(att_i|t) that a term belongs to a given attribute is given by the following formula:
$$P(att_i \mid t) = \frac{P(t \mid att_i)\,P(att_i)}{\sum_{j=1}^{m} P(t \mid att_j)\,P(att_j)}$$
where P(t|att_i) is recorded during index construction, and P(att_i) is the prior probability, obtained from analysis of the user query log, that term t belongs to attribute att_i.
4. The retrieval method according to claim 1, characterized in that in step (3) a global factor is used to measure the term sequence shared between the query and the information in each attribute of a document, as in the following formula:
where q is the original query entered by the user, EditDist_word is the string edit distance used as the similarity measure, λ is the weight of the string similarity, TAR denotes a result obtained from the query of steps 1 and 2 whose score is to be adjusted, and φ_e(q, TAR) is the weighting factor applied when re-ranking the retrieval results.
5. The retrieval method according to claim 1, characterized in that retrieval based on the graph model uses the following formula:
where t denotes a query term, ω(t) is the weight of the term, π is the feature set, P(att_i|t) is the matching probability between the term and attribute i, and f(t, att_i) is the retrieval similarity score between term t and attribute att_i obtained from the language model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410338837.7A CN104090966A (en) | 2014-07-16 | 2014-07-16 | Semi-structured data retrieval method based on graph model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104090966A true CN104090966A (en) | 2014-10-08 |
Family
ID=51638682
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410338837.7A Pending CN104090966A (en) | 2014-07-16 | 2014-07-16 | Semi-structured data retrieval method based on graph model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104090966A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120096042A1 (en) * | 2010-10-19 | 2012-04-19 | Microsoft Corporation | User query reformulation using random walks |
CN102298606A (en) * | 2011-06-01 | 2011-12-28 | 清华大学 | Random walking image automatic annotation method and device based on label graph model |
Non-Patent Citations (1)
Title |
---|
QI ZHANG: "Map Search via A Factor Graph Model", 《PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462501A (en) * | 2014-12-19 | 2015-03-25 | 北京奇虎科技有限公司 | Knowledge graph construction method and device based on structural data |
CN109829500A (en) * | 2019-01-31 | 2019-05-31 | 华南理工大学 | A kind of position composition and automatic clustering method |
CN109829500B (en) * | 2019-01-31 | 2023-05-02 | 华南理工大学 | Position composition and automatic clustering method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20141008 |