CN102567420B - Document retrieval method and device - Google Patents

Publication number
CN102567420B
Authority
CN
China
Prior art keywords
retrieval
document
participle
search key
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010621819.1A
Other languages
Chinese (zh)
Other versions
CN102567420A (en
Inventor
童征宇
徐剑波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Founder Apabi Technology Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd, Beijing Founder Apabi Technology Co Ltd filed Critical Peking University Founder Group Co Ltd
Priority to CN201010621819.1A priority Critical patent/CN102567420B/en
Publication of CN102567420A publication Critical patent/CN102567420A/en
Application granted granted Critical
Publication of CN102567420B publication Critical patent/CN102567420B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

An embodiment of the invention discloses a document retrieval method and device in the field of computer information processing. The method and device address the problem that retrieval results cannot be sorted according to the positions of the retrieval terms (segmented words) in the documents and the data length of the documents. The method comprises: after retrieving a plurality of documents that contain the retrieval terms of the search keyword, sorting the retrieved documents according to the positions of those retrieval terms in the retrieved documents and the data length of the retrieved documents; and returning the retrieved documents as the retrieval result in the sorted order. The method and device thus allow retrieval results to be sorted according to the positions of the retrieval terms in the documents and the data length of the documents.

Description

Document retrieval method and device
Technical field
The present invention relates to the field of computer information processing, and in particular to a document retrieval method and device.
Background art
In full-text retrieval, a full-text retrieval system scans every word in each document and creates an index entry for each word, recording how many times and at which positions the word occurs in the document. When a user submits a retrieval request, the system looks up the index file built in advance and returns the results to the user sorted in a certain order. In practice, a document handled by a full-text retrieval system may comprise several fields, such as title, author and body text.
Specifically, after a user submits a retrieval request, the full-text retrieval system analyzes the search keyword in the request to determine the retrieval terms it contains. A retrieval term is a segment obtained by splitting the search keyword into characters. How the keyword is split depends on the algorithm: for example, each character of the search keyword may form one retrieval term, or every two characters may form one retrieval term. The system then searches the index file for documents that contain all the retrieval terms and provides information on those documents to the user as the retrieval result. In phrase retrieval, that is, when the search keyword contains several retrieval terms, after a document containing all the retrieval terms is found, the positional relationship of the terms within the document must also be matched, to determine whether it is consistent with the positional relationship of the terms in the retrieval request. If it is, information on the document is provided to the user as a retrieval result; otherwise the document is not treated as a retrieval result. For example, suppose the search keyword contains the retrieval terms "word segmentation" and "rule", and the two terms are adjacent, that is, no other character lies between them. After a document containing "word segmentation" and "rule" is found, the positions of the two terms in the document are matched. If "word segmentation" and "rule" are adjacent in the document, that is, the document contains "word segmentation rule", information on the document is provided to the user as a retrieval result; otherwise the document is not treated as a retrieval result.
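The segmentation and adjacency check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular retrieval system: the function names `segment`, `term_positions` and `terms_adjacent` are hypothetical, and splitting the keyword into fixed-size, non-overlapping chunks is only one possible reading of "every two characters form one retrieval term".

```python
def segment(keyword, size=1):
    """Split a search keyword into retrieval terms of `size` characters each.

    Assumption: chunks do not overlap; a real segmenter may behave differently.
    """
    return [keyword[i:i + size] for i in range(0, len(keyword), size)]


def term_positions(document, term):
    """Return every start offset at which `term` occurs in `document`."""
    positions = []
    start = document.find(term)
    while start != -1:
        positions.append(start)
        start = document.find(term, start + 1)
    return positions


def terms_adjacent(document, first, second):
    """True if some occurrence of `second` starts right where `first` ends."""
    second_starts = set(term_positions(document, second))
    return any(p + len(first) in second_starts
               for p in term_positions(document, first))
```

A document passes the phrase check only when the second term begins immediately after the first, with no other character in between.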
Once several retrieval results have been found, they must be sorted according to some rule and finally presented to the user in sorted order. For any full-text retrieval system, whether the ordering of retrieval results meets the user's needs is a key factor in judging its quality. At present, full-text retrieval systems generally sort retrieval results with the vector space model. Specifically, this model computes a quantized weight for the retrieval terms in each document from the term frequency (Term Frequency, TF) and the inverse document frequency (Inverse Document Frequency, IDF), and sorts the documents by the weights computed. TF is the frequency with which a retrieval term occurs in a document and describes the importance of the term within one particular document. IDF is derived from the frequency with which the retrieval term occurs across all documents and describes the general importance of the term: words such as "I" and "what" occur in almost every document, so even if they occur very often in one particular document they are not very important. In short, the priority of a particular document is directly proportional to the TF of the retrieval term and inversely proportional to how frequently the term occurs across all documents.
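As a rough illustration of the TF/IDF weighting described above, the sketch below uses one common textbook form of the weight. The function name `tf_idf` and the exact formula (raw relative frequency multiplied by the logarithm of the inverse document frequency) are assumptions; real systems use various smoothed variants.

```python
import math


def tf_idf(term, document, corpus):
    """TF-IDF weight of `term` in `document`.

    `document` is a list of terms and `corpus` is a list of such documents.
    """
    if not document:
        return 0.0
    # TF: how often the term occurs in this one document.
    tf = document.count(term) / len(document)
    # IDF: terms found in fewer documents get a larger weight.
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf
```

With this form, a term that occurs in every document (such as "I") gets a weight of zero regardless of how often it occurs in one document.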
In the course of realizing the present invention, the inventors found the following technical problem in the prior art:
Existing sorting methods rank retrieval results according to the TF and IDF of the retrieval terms. There is currently no concrete scheme for sorting retrieval results according to the positions at which the retrieval terms appear in a document and the data length of the document.
Summary of the invention
Embodiments of the present invention provide a document retrieval method and device, to solve the problem that retrieval results cannot be sorted according to the positions at which retrieval terms appear in a document and the data length of the document.
A document retrieval method, comprising:
after retrieving a plurality of documents that contain all the retrieval terms of a search keyword, sorting the retrieved documents according to the positions of the retrieval terms of the search keyword in the retrieved documents and the data length of the retrieved documents;
returning the retrieved documents as the retrieval result according to the result of the sorting.
A document retrieval device, comprising:
a retrieval unit, configured to retrieve a plurality of documents that contain all the retrieval terms of a search keyword;
a sorting unit, configured to sort the retrieved documents according to the positions of the retrieval terms of the search keyword in the retrieved documents and the data length of the retrieved documents;
a result returning unit, configured to return the retrieved documents as the retrieval result according to the result of the sorting.
In the scheme provided by the embodiments of the present invention, after a plurality of documents containing all the retrieval terms of the search keyword are retrieved, the retrieved documents are sorted according to the positions of the retrieval terms of the search keyword in the retrieved documents and the data length of the retrieved documents, and are returned as the retrieval result according to the result of the sorting. The present invention thus allows retrieval results to be sorted according to the positions at which the retrieval terms appear in a document and the data length of the document, which makes the ordering of the retrieved documents more accurate and better meets user needs.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a document retrieval device provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another document retrieval device provided by an embodiment of the present invention.
Detailed description
In order to retrieve documents in which a retrieval term occurs at a desired position, an embodiment of the present invention provides a document retrieval method. In this method, after a retrieval request is received, the condition that the position of one or more retrieval terms of the search keyword (the terms to be retrieved) must satisfy in a target document is determined; documents that contain all the retrieval terms and in which the positions of the terms to be retrieved satisfy this condition are then searched for; finally, information on the documents found is returned as the retrieval result.
Referring to Fig. 1, the document retrieval method provided by an embodiment of the present invention comprises the following steps:
Step 10: Receive a retrieval request containing a search keyword;
Here, the search keyword is the keyword entered from outside (for example, by a user) for the purpose of retrieval;
Step 11: Determine the retrieval terms contained in the search keyword and the condition that the position of each term to be retrieved among those retrieval terms must satisfy in the target document;
Step 12: Search for documents that contain all the retrieval terms determined and in which the positions of the terms to be retrieved satisfy the condition;
Step 13: Return information on the documents found as the retrieval result.
In step 11, when the terms to be retrieved include the first retrieval term of the search keyword, the condition that the position of this first retrieval term must satisfy in the target document may be determined as follows:
According to the form of the search keyword, determine the first position relation that must hold between the position of the first retrieval term in the target document and the start position of the target document, and take this first position relation as the condition that the position of the first retrieval term must satisfy in the target document.
Specifically, the first position relation may be determined as follows:
First, determine whether a wildcard precedes the first retrieval term in the search keyword. If so, determine the distance value range corresponding to that wildcard according to a predefined correspondence between wildcard types and distance value ranges, and determine the first position relation to be: in the target document, the distance between the first retrieval term and the first character of the target document lies within that distance value range. If not, determine the first position relation to be: the first retrieval term is located at the start position of the target document.
Of course, determining the first position relation from the form of the search keyword is not limited to the wildcard approach above; any other way of determining the first position relation from the form of the search keyword falls within the scope of protection of the present invention.
In step 11, when the terms to be retrieved include the last retrieval term of the search keyword, the condition that the position of this last retrieval term must satisfy in the target document may be determined as follows:
According to the form of the search keyword, determine the second position relation that must hold between the position of the last retrieval term in the target document and the end position of the target document, and take this second position relation as the condition that the position of the last retrieval term must satisfy in the target document.
Specifically, the second position relation may be determined as follows:
Determine whether a wildcard follows the last retrieval term in the search keyword. If so, determine the distance value range corresponding to that wildcard according to the predefined correspondence between wildcard types and distance value ranges, and determine the second position relation to be: in the target document, the distance between the last retrieval term and the final character of the target document lies within that distance value range. If not, determine the second position relation to be: the last retrieval term is located at the end position of the target document.
Of course, determining the second position relation from the form of the search keyword is not limited to the wildcard approach above; any other way of determining the second position relation from the form of the search keyword falls within the scope of protection of the present invention.
For example, when the wildcard is an asterisk, the corresponding distance value range is the integers not less than 0; when the wildcard is a question mark, the corresponding distance value range is 0 or 1.
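The correspondence between wildcard types and distance value ranges can be sketched as a small lookup table. The names `WILDCARD_RANGES` and `distance_allowed` are hypothetical, and representing an unbounded upper limit with `None` is an assumption made for this illustration.

```python
# (min, max) distance allowed between a boundary term and the document edge.
# None as the upper bound means "no upper limit" (assumption for this sketch).
WILDCARD_RANGES = {
    "*": (0, None),  # asterisk: any integer not less than 0
    "?": (0, 1),     # question mark: 0 or 1
}


def distance_allowed(distance, wildcard=None):
    """Check a distance against the range implied by an optional wildcard.

    With no wildcard the term must sit exactly at the document edge.
    """
    low, high = WILDCARD_RANGES.get(wildcard, (0, 0))
    return distance >= low and (high is None or distance <= high)
```

Without a wildcard the only accepted distance is 0, which corresponds to the term being located exactly at the start or end position of the target document.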
Before the retrieval request containing the search keyword is received in step 10, an index file may be built for one or more documents. The index file contains the retrieval terms of each document and the position information of each retrieval term in the corresponding document;
Correspondingly, step 12 may be implemented in the following three cases:
First case: the terms to be retrieved include the first retrieval term of the search keyword. Specifically:
First, determine from the index file the documents that contain all the retrieval terms of the search keyword. Then read from the index file the position information of the first retrieval term of the search keyword in the document, and determine from it whether the position of the first retrieval term in the document and the start position of the document satisfy the first position relation. If so, the document is determined to be a document found that contains all the retrieval terms and in which the positions of the terms to be retrieved satisfy the condition; otherwise it is not.
Second case: the terms to be retrieved include the last retrieval term of the search keyword. Specifically:
First, determine from the index file the documents that contain all the retrieval terms of the search keyword. Then read from the index file the position information of the last retrieval term of the search keyword in the document, and determine from it whether the position of the last retrieval term in the document and the end position of the document satisfy the second position relation. If so, the document is determined to be a document found that contains all the retrieval terms and in which the positions of the terms to be retrieved satisfy the condition; otherwise it is not.
Third case: the terms to be retrieved include both the first and the last retrieval term of the search keyword. Specifically:
First, determine from the index file the documents that contain all the retrieval terms of the search keyword. Then read from the index file the position information of the first and the last retrieval term of the search keyword in the document, and determine from it whether the position of the first retrieval term in the document and the start position of the document satisfy the first position relation, and whether the position of the last retrieval term in the document and the end position of the document satisfy the second position relation. If both hold, the document is determined to be a document found that contains all the retrieval terms and in which the positions of the terms to be retrieved satisfy the condition; otherwise it is not.
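The three cases above can be sketched with a single check in which either boundary condition is optional. This is an illustration under assumptions, not the patented implementation: the function names are hypothetical, retrieval terms are single characters, positions are 0-based offsets read from the index, and a position relation is a (min, max) distance pair.

```python
def in_range(value, bounds):
    """True if `value` lies within the inclusive (min, max) pair `bounds`."""
    low, high = bounds
    return low <= value <= high


def meets_boundary_conditions(first_positions, last_positions, doc_length,
                              start_range=None, end_range=None):
    """Check the first and/or second position relation for one document.

    first_positions / last_positions: offsets of the first / last retrieval
    term in the document. Pass only start_range (first case), only end_range
    (second case) or both (third case).
    """
    start_ok = start_range is None or any(
        in_range(p, start_range) for p in first_positions)
    # Distance to the end is measured from the last character of the document.
    end_ok = end_range is None or any(
        in_range(doc_length - 1 - p, end_range) for p in last_positions)
    return start_ok and end_ok
```

Note that this sketch tests the two boundary conditions independently; it does not verify that the occurrences chosen at each end belong to one consistent match of the whole keyword.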
Of course, besides the first and last retrieval terms of the search keyword, the terms to be retrieved may include any other retrieval term of the search keyword. In that case, the condition that the position of such a term must satisfy in the target document may be determined in step 11 as follows: according to the form of the search keyword, determine the position relation that must hold between the position of the term in the target document and the start position and/or end position of the target document, and take this position relation as the condition that the position of the term must satisfy in the target document. Correspondingly, step 12 may be implemented as follows: first, determine from the index file the documents that contain all the retrieval terms of the search keyword; then read from the index file the position information of the term to be retrieved in the document, and determine from it whether the position of the term in the document and the start position and/or end position of the document satisfy the corresponding position relation. If so, the document is determined to be a document found that contains all the retrieval terms and in which the positions of the terms to be retrieved satisfy the condition; otherwise it is not.
Preferably, between step 12 and step 13, the method further comprises:
Determine whether the positional relationship of the retrieval terms of the search keyword in the document found is consistent with their positional relationship in the search keyword;
Correspondingly, in step 13, when the positional relationship of the retrieval terms in the document found is determined to be consistent with their positional relationship in the search keyword, information on the document found is returned as the retrieval result.
Preferably, between step 12 and step 13, the documents found may be sorted according to the positions of the retrieval terms of the search keyword in each document found and the data length of each document found; correspondingly, in step 13, the documents found are returned as the retrieval result according to the result of the sorting.
Sorting the documents found according to the positions of the retrieval terms of the search keyword in each document found and the data length of each document found may be implemented as follows:
First, compute the relevance value of each document found according to a predefined document relevance value formula. The formula satisfies the following conditions: the earlier the retrieval terms of the search keyword appear in a document, the larger the relevance value computed by the formula; and the smaller the data length of a document, the larger the relevance value computed by the formula;
Then, sort the documents by the magnitude of the relevance values computed.
The document relevance value formula may comprise:
Formula one: $score(d) = \sum_{t=1}^{N} \frac{1}{2 \cdot pos(t)} \cdot \frac{1.0 + ExactNorm(len)}{2}$;
Formula two: $ExactNorm(len) = \frac{1.0}{len + 1}$;
where score(d) is the relevance value of document d; len is the data length of the document; pos(t) is the position value in the document of the t-th retrieval term of the search keyword; and N is the number of retrieval terms in the search keyword.
Of course, the document relevance value formula is not limited to formula one and formula two above; any formula with the following properties falls within the scope of protection of the present invention: the earlier the retrieval terms of the search keyword appear in a document, the larger the computed value; and the smaller the data length of the document, the larger the computed value.
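The sketch below shows one reading of formula one and formula two, in which each term contributes 1/(2·pos(t)) and the sum is scaled by (1.0 + ExactNorm(len))/2 with ExactNorm(len) = 1.0/(len + 1). The function names are hypothetical, and position values are assumed to start at 1 so that the division is always defined; the source does not state how a position value of 0 would be handled.

```python
def exact_norm(length):
    """Length normalization: shorter documents yield larger values."""
    return 1.0 / (length + 1)


def score(positions, length):
    """Relevance value of one document.

    positions: position value of each retrieval term (assumed 1-based).
    length: data length of the document.
    """
    length_factor = (1.0 + exact_norm(length)) / 2
    return sum(1.0 / (2 * pos) for pos in positions) * length_factor
```

Under this reading the two required properties hold: moving the terms toward the front of the document raises the score, and so does shortening the document.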
In order to sort retrieval results according to the positions at which retrieval terms appear in a document and the data length of the document, an embodiment of the present invention provides a document retrieval method. In this method, after a plurality of documents containing all the retrieval terms of the search keyword are retrieved, the retrieved documents are sorted according to the positions of the retrieval terms of the search keyword in the retrieved documents and the data length of the retrieved documents; finally, the retrieved documents are returned as the retrieval result according to the result of the sorting.
Referring to Fig. 2, the document retrieval method provided by an embodiment of the present invention comprises the following steps:
Step 20: Retrieve a plurality of documents that contain all the retrieval terms of the search keyword;
Here, the documents containing all the retrieval terms of the search keyword may be retrieved according to steps 10 to 12 above, or according to the prior art. In the prior art, after a retrieval request is received, the search keyword in the request is analyzed to determine the retrieval terms it contains, the index file is searched for documents that contain all the retrieval terms, and those documents are taken as the retrieval result.
Step 21: Sort the retrieved documents according to the positions of the retrieval terms of the search keyword in the retrieved documents and the data length of the retrieved documents;
Step 22: Return the retrieved documents as the retrieval result according to the result of the sorting.
Step 21 may be implemented as follows:
Compute the relevance value of each retrieved document according to a predefined document relevance value formula. The formula satisfies the following conditions: the earlier the retrieval terms of the search keyword appear in a document, the larger the relevance value computed by the formula; and the smaller the data length of a document, the larger the relevance value computed by the formula;
Sort the documents by the magnitude of the relevance values computed.
The document relevance value formula may be formula one and formula two above. Of course, it is not limited to formula one and formula two; any formula with the following properties falls within the scope of protection of the present invention: the earlier the retrieval terms of the search keyword appear in a document, the larger the computed value; and the smaller the data length of the document, the larger the computed value.
In practice, documents with the same relevance value may be further sorted by a predefined rule, for example by pinyin or by internal character code.
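Sorting by relevance value with a secondary rule for ties can be sketched as follows. The function name `rank` and the (title, relevance) tuple layout are hypothetical; comparing titles by code point is used here as a stand-in for sorting by internal character code, which is an assumption.

```python
def rank(documents):
    """Sort (title, relevance) pairs by relevance, highest first.

    Documents with equal relevance are ordered by title; Python compares
    strings by code point, used here to approximate internal-code order.
    """
    return sorted(documents, key=lambda doc: (-doc[1], doc[0]))
```

For example, `rank([("b", 0.5), ("a", 0.5), ("c", 0.9)])` places "c" first and then orders the two tied documents as "a", "b".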
The present invention is described in detail below with reference to embodiments:
Embodiment one:
This embodiment is the index building process, as follows:
Step 01: For each field in the document that requires exact retrieval, segment the field character by character to obtain one or more retrieval terms, and create an index for each retrieval term;
Step 02: Add an additional marker term (Term) to the index to mark the end of the field. The text of this Term is a predefined character, END. END is an illegal character in the character encoding set, which ensures that it never coincides with normal text;
Step 03: Record and store, for each document, the length of this field, that is, the number of retrieval terms the field contains. Length values greater than 255 are treated as 255 to simplify storage and computation.
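Steps 01 to 03 can be sketched as a small in-memory positional index. This is an illustration under assumptions: `build_index` is a hypothetical name, the index is a plain dictionary rather than an index file, and the Unicode noncharacter U+FFFF stands in for the illegal END character.

```python
END = "\uffff"    # assumption: a Unicode noncharacter plays the role of END
MAX_LENGTH = 255  # lengths above this are stored as 255


def build_index(fields):
    """Build a positional index over one field of each document.

    fields: dict mapping document id to the text of the field.
    Returns (index, lengths), where index[term][doc_id] lists positions.
    """
    index = {}
    lengths = {}
    for doc_id, text in fields.items():
        # Step 01: one retrieval term per character, with its position.
        for position, char in enumerate(text):
            index.setdefault(char, {}).setdefault(doc_id, []).append(position)
        # Step 02: an extra END term marks the end of the field.
        index.setdefault(END, {}).setdefault(doc_id, []).append(len(text))
        # Step 03: store the field length, capped so it fits in one byte.
        lengths[doc_id] = min(len(text), MAX_LENGTH)
    return index, lengths
```

Placing END at position `len(text)` means the last real character always sits at distance 1 from END, which makes end-of-field conditions expressible as ordinary distance checks.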
Embodiment two:
This embodiment is the document retrieval process, as follows:
Step 11: Segment the search keyword in the retrieval request character by character to obtain N retrieval terms. If a positional relationship with the end of the field is involved, additionally add END as the (N+1)-th retrieval term;
Step 12: Parse the search keyword and the wildcards in it, and obtain and record the positional relationships among the retrieval terms, including:
the positional relationship between the first retrieval term and the beginning of the document, the positional relationship between the second retrieval term and the first retrieval term, ..., and the positional relationship between the N-th retrieval term and the end of the document;
A positional relationship can be expressed as a pair consisting of a minimum and a maximum position distance, written (min, max). The smallest value of min is 0, meaning the same position; the largest value of max is MAX, which can be set to 255 here.
Step 13: Search the index created for documents that satisfy the conditions;
Specifically, find the documents that contain all N+1 retrieval terms above, then read from the index file the position values of the retrieval terms in each document and match the positional relationships, requiring the relative positions of the N+1 retrieval terms in the document to satisfy the distance requirements above.
Step 14: Compute the relevance value of each document found according to the relevance value formula, and sort the documents found in descending order of relevance value. The relevance value formula is formula one and formula two above.
Embodiment three:
This embodiment illustrates the implementation through an enterprise retrieval application: retrieval over the entry field of the Cihai dictionary.
Retrieval over the Cihai entry field is required to find documents that contain the retrieval terms at specific positions, and to sort them by the rules above according to the hit positions and the lengths of the hit documents.
Retrieval requests support the wildcards "?" and "*": "?" stands for 0 or 1 character, and "*" stands for 0, 1 or more characters. Several wildcards may appear in one search keyword.
The use of the various wildcards is explained in detail below:
During retrieval, the positional relationships among the retrieval terms must be matched, and so must the positional relationships between the retrieval terms and the beginning and end of the document.
Before retrieval, the index must be built, as follows: segment the entry field character by character and create an inverted index. Add an additional marker term (Term) to the index to mark the end of the field. The text of this Term is a predefined character, END. END is an illegal character in the character encoding set, which ensures that it never coincides with normal text. Record and store, for each document, the length of this field, that is, the number of retrieval terms the field contains; length values greater than 255 are treated as 255, so that the field length can be stored in one byte.
The retrieval " AB*C " of take is below example, describes retrieving:
Step 21: Consistent with index building, segment the search keyword in the retrieval request character by character to obtain the three retrieval terms "A", "B", and "C", and, where required, append an END marker as a fourth retrieval term.
Step 22: Calculate the positional relationships among the retrieval terms, and between the retrieval terms and the beginning and end of the document.
A positional relationship is expressed as a pair consisting of a minimum distance min and a maximum distance max, written (min, max). The smallest value of min is 0, meaning that the two positions coincide; the largest value of max is MAX. In the retrieval request, the wildcard "?" stands for 0 or 1 character, and "*" stands for 0, 1, or more characters.
Parsing the retrieval request yields the following distance relationships:
Dis(BEGIN,A)=(0,1)
Dis(A,B)=(1,1)
Dis(B,C)=(1,MAX)
Dis(C,FINALITY)=(1,1)
BEGIN denotes the beginning of the document, and FINALITY denotes the end of the document. Dis(X, Y) denotes the distance relationship between X and Y, expressed as a minimum distance and a maximum distance. MAX is a predefined maximum distance, set to 255 here.
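The distance relationships listed above can be derived mechanically from the search keyword. The sketch below is one way to do it, under two assumptions that are not stated in the source: the example keyword carries a leading "?" (which is what Dis(BEGIN, A) = (0, 1) implies), and several wildcards in the same gap simply widen the maximum distance. The function name `parse_query` is hypothetical.

```python
MAX = 255  # predefined maximum distance


def parse_query(query):
    """Turn a keyword such as '?AB*C' into a chain of distance relations.

    Returns a list of (left, right, (min, max)) tuples, starting at BEGIN
    and ending at FINALITY.
    """
    relations = []
    prev = "BEGIN"
    lo, hi = 0, 0              # with no wildcard, the first term sits at position 0
    for ch in query:
        if ch == "?":          # 0 or 1 extra character
            hi = min(hi + 1, MAX)
        elif ch == "*":        # 0 or more extra characters
            hi = MAX
        else:
            relations.append((prev, ch, (lo, hi)))
            prev, lo, hi = ch, 1, 1   # adjacent terms are exactly 1 apart
    relations.append((prev, "FINALITY", (lo, hi)))
    return relations


for left, right, (lo, hi) in parse_query("?AB*C"):
    print(f"Dis({left},{right})=({lo},{hi})")
```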
Step 23: Look up documents in the inverted index.
First find the documents that contain all three retrieval terms "A", "B", and "C"; then read the position values of these retrieval terms and of END in each document, and match the positional relationships in detail.
The relative positions of the three retrieval terms in the document must satisfy the distance requirements above. The position value of "A" must be 0 or 1 in order to satisfy its positional relationship with the beginning of the document, and the mutual positions of "A", "B", "C", and END must likewise satisfy the distance requirements above.
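The position check in Step 23 can be sketched as a forward pass over the chain of distance relations. This is an illustrative approach rather than the method the source prescribes: the function name `satisfies` and the dictionary of per-document positions are assumptions, and BEGIN is treated as position 0 so that Dis(BEGIN, A) = (0, 1) means "A" is at position 0 or 1.

```python
def satisfies(relations, positions):
    """Check one document against a chain of distance relations.

    relations: [(left, right, (min, max)), ...] starting at BEGIN.
    positions: term -> list of positions of that term in the document,
               with 'FINALITY' mapped to the position of the END marker.
    """
    candidates = {0}  # BEGIN is treated as position 0
    for _left, right, (lo, hi) in relations:
        # Keep every position of the next term that lies within the
        # allowed distance of some surviving position of the previous one.
        candidates = {
            p for p in positions.get(right, ())
            if any(lo <= p - q <= hi for q in candidates)
        }
        if not candidates:
            return False
    return True


RELATIONS = [
    ("BEGIN", "A", (0, 1)),
    ("A", "B", (1, 1)),
    ("B", "C", (1, 255)),
    ("C", "FINALITY", (1, 1)),
]

# Field "XABYYC": A at 1, B at 2, C at 5, END marker at 6.
print(satisfies(RELATIONS, {"A": [1], "B": [2], "C": [5], "FINALITY": [6]}))
```

Because each relation only links neighbouring items in the chain, carrying forward the set of still-feasible positions is enough; no backtracking over earlier terms is needed.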
Step 24: Calculate the relevance value of each qualifying document retrieved using the relevance formula, sort the retrieved documents in descending order of relevance value, and return the sorted result.
Here the relevance value is calculated from the position values of "A", "B", and "C" and the length of the document field.
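The ranking in Step 24 can be sketched as below. The sketch takes the relevance formula to be scord(d) = Σ 1/(2·pos(t)) · (1.0 + ExactNorm(len))/2 with ExactNorm(len) = 1.0/(len + 1), which is an assumed reading of the formula given later in this description. It also assumes position values counted from 1, because that reading is undefined at pos(t) = 0 and the source does not say how a zero position is handled. The names `exact_norm`, `score`, and `rank` are hypothetical.

```python
def exact_norm(length):
    """Length normalisation: shorter fields give a larger value."""
    return 1.0 / (length + 1)


def score(term_positions, length):
    """Relevance of one document under the assumed reading of the formula.

    term_positions: position values (counted from 1) of the retrieval terms.
    length: data length of the document field.
    """
    norm = (1.0 + exact_norm(length)) / 2
    return sum(1.0 / (2 * pos) * norm for pos in term_positions)


def rank(hits):
    """Sort hits, given as (doc_id, term_positions, length), by descending score."""
    return sorted(hits, key=lambda hit: score(hit[1], hit[2]), reverse=True)


hits = [
    ("long", [1, 2, 3], 9),    # early hit, longer field
    ("short", [1, 2, 3], 3),   # early hit, shorter field
    ("late", [2, 3, 4], 3),    # later hit, shorter field
]
print([doc_id for doc_id, _, _ in rank(hits)])
```

Both stated properties hold in this sketch: moving the hits later lowers each 1/(2·pos(t)) term, and a longer field lowers ExactNorm(len).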
The present invention can also be applied to database systems by treating each record in the database as a document: after a retrieval request containing a search keyword is received, the records are found whose fields contain the retrieval terms of the search keyword and in which the position of the designated retrieval term satisfies a given condition.
Referring to Fig. 3, an embodiment of the present invention provides a document retrieval device, which comprises:
Request receiving unit 30, configured to receive a retrieval request containing a search keyword;
Condition analysis unit 31, configured to determine the retrieval terms contained in the search keyword and the condition that the position of a designated retrieval term among them must satisfy in a target document;
Document search unit 32, configured to find documents that contain all of the determined retrieval terms and in which the position of the designated retrieval term satisfies the condition;
Result returning unit 33, configured to return information on the documents found as the retrieval result.
Further, the condition analysis unit 31 may specifically be configured to:
When the designated retrieval term includes the first retrieval term of the search keyword, determine, according to the form of the search keyword, the first positional relationship that must hold between the position of the first retrieval term in the target document and the start position of the target document, and take this first positional relationship as the condition that the position of the first retrieval term must satisfy in the target document;
Correspondingly, the document search unit 32 may specifically be configured to:
Find documents that contain the determined retrieval terms and in which the first retrieval term satisfies the first positional relationship.
Further, the condition analysis unit 31 may specifically be configured to:
When the designated retrieval term includes the last retrieval term of the search keyword, determine, according to the form of the search keyword, the second positional relationship that must hold between the position of the last retrieval term in the target document and the end position of the target document, and take this second positional relationship as the condition that the position of the last retrieval term must satisfy in the target document;
Correspondingly, the document search unit 32 may specifically be configured to:
Find documents that contain the determined retrieval terms and in which the last retrieval term satisfies the second positional relationship.
Further, the condition analysis unit 31 may specifically be configured to:
Determine, according to the form of the search keyword, the first positional relationship that must hold between the position of the first retrieval term in the target document and the start position of the target document, and the second positional relationship that must hold between the position of the last retrieval term in the target document and the end position of the target document;
Correspondingly, the document search unit 32 may specifically be configured to:
Find documents that contain the determined retrieval terms and in which the first retrieval term satisfies the first positional relationship and the last retrieval term satisfies the second positional relationship.
Further, the condition analysis unit 31 may specifically be configured to:
Determine whether a wildcard precedes the first retrieval term in the search keyword;
If so, determine the distance range corresponding to the wildcard according to a predefined correspondence between wildcard types and distance ranges, and determine the first positional relationship to be: in the target document, the distance between the first retrieval term and the starting character of the target document lies within that distance range;
If not, determine the first positional relationship between the position of the first retrieval term in the target document and the start position of the target document to be: the first retrieval term is located at the start position of the target document.
Further, the condition analysis unit 31 may specifically be configured to:
Determine whether a wildcard follows the last retrieval term in the search keyword;
If so, determine the distance range corresponding to the wildcard according to a predefined correspondence between wildcard types and distance ranges, and determine the second positional relationship to be: in the target document, the distance between the last retrieval term and the ending character of the target document lies within that distance range;
If not, determine the second positional relationship between the position of the last retrieval term in the target document and the end position of the target document to be: the last retrieval term is located at the end position of the target document.
When the wildcard is an asterisk, the distance range is the integers not less than 0; when the wildcard is a question mark, the corresponding distance range is 0 or 1.
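The wildcard-to-range correspondence used by the condition analysis unit can be sketched as a small lookup table. The names `WILDCARD_RANGE` and `boundary_relations` are hypothetical, `None` is used here to mean "no upper bound", and the sketch assumes at most one wildcard at each end of the search keyword.

```python
# Predefined correspondence between wildcard type and distance range.
# (0, None) means any integer not less than 0; (0, 1) means 0 or 1.
WILDCARD_RANGE = {"*": (0, None), "?": (0, 1)}

# With no wildcard, the term must sit exactly at the boundary.
AT_BOUNDARY = (0, 0)


def boundary_relations(keyword):
    """Return (first_relation, second_relation) for a search keyword.

    first_relation bounds the distance between the first retrieval term
    and the start of the document; second_relation bounds the distance
    between the last retrieval term and the end of the document.
    """
    first = WILDCARD_RANGE.get(keyword[:1], AT_BOUNDARY)
    second = WILDCARD_RANGE.get(keyword[-1:], AT_BOUNDARY)
    return first, second


print(boundary_relations("?AB*C"))
print(boundary_relations("AB*"))
```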
Further, the document search unit 32 may also be configured to:
Determine whether the positional relationships among the retrieval terms of the search keyword in a document found are consistent with their positional relationships in the search keyword;
Correspondingly, the result returning unit 33 may specifically be configured to:
When it is determined that the positional relationships among the retrieval terms of the search keyword in a document found are consistent with their positional relationships in the search keyword, return information on the document found as the retrieval result.
Further, the device also comprises:
Result sorting unit 34, configured to sort the documents found according to the positions of the retrieval terms of the search keyword in each document found and the data length of each document found;
Correspondingly, the result returning unit 33 may specifically be configured to:
Return the documents found as the retrieval result in the order given by the sorting.
Further, the result sorting unit 34 may specifically be configured to:
Calculate the relevance value of each document found according to a predefined document relevance formula. The formula satisfies the following conditions: the earlier the retrieval terms of the search keyword appear in a document, the larger the relevance value calculated by the formula; and the smaller the data length of the document, the larger the relevance value calculated by the formula;
Sort the documents by the calculated relevance values.
The document relevance formula may be:
$$\mathrm{scord}(d) = \sum_{t=1}^{N} \frac{1}{2 \cdot pos(t)} \cdot \frac{1.0 + \mathrm{ExactNorm}(len)}{2};$$
$$\mathrm{ExactNorm}(len) = \frac{1.0}{len + 1};$$
Where scord(d) is the relevance value of document d; len is the data length of the document; pos(t) is the position value in the document of the t-th retrieval term of the search keyword; and N is the number of retrieval terms contained in the search keyword.
Referring to Fig. 4, an embodiment of the present invention also provides a document retrieval device, which comprises:
Document retrieval unit 40, configured to retrieve a plurality of documents that contain all of the retrieval terms of a search keyword;
Result sorting unit 41, configured to sort the plurality of retrieved documents according to the positions of the retrieval terms of the search keyword in the plurality of retrieved documents and the data lengths of the plurality of retrieved documents;
Result returning unit 42, configured to return the plurality of retrieved documents as the retrieval result in the order given by the sorting.
Further, the result sorting unit 41 may specifically be configured to:
Calculate the relevance value of each of the plurality of retrieved documents according to a predefined document relevance formula. The formula satisfies the following conditions: the earlier the retrieval terms of the search keyword appear in a document, the larger the relevance value calculated by the formula; and the smaller the data length of the document, the larger the relevance value calculated by the formula;
Sort the plurality of documents by the calculated relevance values.
The document relevance formula may be:
$$\mathrm{scord}(d) = \sum_{t=1}^{N} \frac{1}{2 \cdot pos(t)} \cdot \frac{1.0 + \mathrm{ExactNorm}(len)}{2};$$
$$\mathrm{ExactNorm}(len) = \frac{1.0}{len + 1};$$
Where scord(d) is the relevance value of document d; len is the data length of the document; pos(t) is the position value in the document of the t-th retrieval term of the search keyword; and N is the number of retrieval terms contained in the search keyword.
Further, the document retrieval unit 40 comprises:
A request receiving unit, configured to receive a retrieval request containing a search keyword;
A condition analysis unit, configured to determine the retrieval terms contained in the search keyword and the condition that the position of a designated retrieval term among them must satisfy in a target document;
A document search unit, configured to find documents that contain all of the determined retrieval terms and in which the position of the designated retrieval term satisfies the condition;
A result returning unit, configured to return information on the plurality of documents found as the retrieval result.
Further, the condition analysis unit is configured to:
When the designated retrieval term includes the first retrieval term of the search keyword, determine, according to the form of the search keyword, the first positional relationship that must hold between the position of the first retrieval term in the target document and the start position of the target document;
Correspondingly, the document search unit is configured to:
Find documents that contain the determined retrieval terms and in which the first retrieval term satisfies the first positional relationship.
Further, the condition analysis unit is configured to:
When the designated retrieval term includes the last retrieval term of the search keyword, determine, according to the form of the search keyword, the second positional relationship that must hold between the position of the last retrieval term in the target document and the end position of the target document;
Correspondingly, the document search unit is configured to:
Find documents that contain the determined retrieval terms and in which the last retrieval term satisfies the second positional relationship.
Further, the condition analysis unit may specifically be configured to:
Determine, according to the form of the search keyword, the first positional relationship that must hold between the position of the first retrieval term in the target document and the start position of the target document, and the second positional relationship that must hold between the position of the last retrieval term in the target document and the end position of the target document;
Correspondingly, the document search unit may specifically be configured to:
Find documents that contain the determined retrieval terms and in which the first retrieval term satisfies the first positional relationship and the last retrieval term satisfies the second positional relationship.
Further, the condition analysis unit may specifically be configured to:
Determine whether a wildcard precedes the first retrieval term in the search keyword;
If so, determine the distance range corresponding to the wildcard according to a predefined correspondence between wildcard types and distance ranges, and determine the first positional relationship to be: in the target document, the distance between the first retrieval term and the starting character of the target document lies within that distance range;
If not, determine the first positional relationship between the position of the first retrieval term in the target document and the start position of the target document to be: the first retrieval term is located at the start position of the target document.
Further, the condition analysis unit may specifically be configured to:
Determine whether a wildcard follows the last retrieval term in the search keyword;
If so, determine the distance range corresponding to the wildcard according to a predefined correspondence between wildcard types and distance ranges, and determine the second positional relationship to be: in the target document, the distance between the last retrieval term and the ending character of the target document lies within that distance range;
If not, determine the second positional relationship between the position of the last retrieval term in the target document and the end position of the target document to be: the last retrieval term is located at the end position of the target document.
When the wildcard is an asterisk, the distance range is the integers not less than 0; when the wildcard is a question mark, the corresponding distance range is 0 or 1.
Further, the document search unit may also be configured to:
Determine whether the positional relationships among the retrieval terms of the search keyword in a document found are consistent with their positional relationships in the search keyword;
Correspondingly, the result returning unit may specifically be configured to:
When it is determined that the positional relationships among the retrieval terms of the search keyword in a document found are consistent with their positional relationships in the search keyword, return information on the document found as the retrieval result.
In summary, the beneficial effects of the present invention include the following:
In the solution provided by the embodiments of the present invention, after a retrieval request containing a search keyword is received, the retrieval terms contained in the search keyword and the condition that the position of a designated retrieval term among them must satisfy in a target document are determined; documents that contain all of the determined retrieval terms and in which the position of the designated retrieval term satisfies the condition are found; and information on the documents found is returned as the retrieval result. The invention can thus retrieve documents in which the position of a retrieval term satisfies a given condition, including documents in which a retrieval term appears at a specific position, which makes the retrieval result more accurate and better meets user needs.
In the solution provided by the embodiments of the present invention, after a plurality of documents containing all of the retrieval terms of the search keyword are retrieved, the plurality of retrieved documents are sorted according to the positions of the retrieval terms of the search keyword in those documents and the data lengths of those documents, and the plurality of retrieved documents are returned as the retrieval result in the order given by the sorting. The invention can thus sort the retrieval result according to where the retrieval terms appear in each document and the data length of each document, which makes the ordering of the retrieved documents more accurate and better meets user needs.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (8)

1. A document retrieval method, characterized in that the method comprises:
Receiving a retrieval request containing a search keyword;
Determining the retrieval terms contained in the search keyword, and determining the condition that the position of a designated retrieval term among them must satisfy in a target document;
Finding documents that contain the determined retrieval terms and in which the position of the designated retrieval term satisfies the condition;
Returning information on the plurality of documents found as the retrieval result;
Sorting the plurality of retrieved documents according to the positions of the retrieval terms of the search keyword in the plurality of retrieved documents and the data lengths of the plurality of retrieved documents;
Returning the plurality of retrieved documents as the retrieval result in the order given by the sorting.
2. The method of claim 1, characterized in that sorting the plurality of retrieved documents according to the positions of the retrieval terms of the search keyword in the plurality of retrieved documents and the data lengths of the plurality of retrieved documents comprises:
Calculating the relevance value of each of the plurality of retrieved documents according to a predefined document relevance formula, wherein the formula satisfies the following conditions: the earlier the retrieval terms of the search keyword appear in a document, the larger the relevance value calculated by the formula; and the smaller the data length of the document, the larger the relevance value calculated by the formula;
Sorting the plurality of documents by the calculated relevance values.
3. The method of claim 2, characterized in that the document relevance formula is:
$$\mathrm{scord}(d) = \sum_{t=1}^{N} \frac{1}{2 \cdot pos(t)} \cdot \frac{1.0 + \mathrm{ExactNorm}(len)}{2};$$
$$\mathrm{ExactNorm}(len) = \frac{1.0}{len + 1};$$
Where scord(d) is the relevance value of document d; len is the data length of the document; pos(t) is the position value in the document of the t-th retrieval term of the search keyword; and N is the number of retrieval terms contained in the search keyword.
4. The method of claim 1, characterized in that, when the designated retrieval term includes the first retrieval term of the search keyword, determining the condition that the position of the designated retrieval term must satisfy in the target document comprises:
Determining, according to the form of the search keyword, the first positional relationship that must hold between the position of the first retrieval term in the target document and the start position of the target document;
And finding documents that contain the determined retrieval terms and in which the position of the designated retrieval term satisfies the condition comprises:
Finding documents that contain the determined retrieval terms and in which the first retrieval term satisfies the first positional relationship.
5. The method of claim 1, characterized in that, when the designated retrieval term includes the last retrieval term of the search keyword, determining the condition that the position of the designated retrieval term must satisfy in the target document comprises:
Determining, according to the form of the search keyword, the second positional relationship that must hold between the position of the last retrieval term in the target document and the end position of the target document;
And finding documents that contain the determined retrieval terms and in which the position of the designated retrieval term satisfies the condition comprises:
Finding documents that contain the determined retrieval terms and in which the last retrieval term satisfies the second positional relationship.
6. A document retrieval device, characterized in that the device comprises:
A document retrieval unit, configured to receive a retrieval request containing a search keyword; determine the retrieval terms contained in the search keyword, and determine the condition that the position of a designated retrieval term among them must satisfy in a target document; find documents that contain the determined retrieval terms and in which the position of the designated retrieval term satisfies the condition; and return information on the plurality of documents found as the retrieval result;
A result sorting unit, configured to sort the plurality of retrieved documents according to the positions of the retrieval terms of the search keyword in the plurality of retrieved documents and the data lengths of the plurality of retrieved documents;
A result returning unit, configured to return the plurality of retrieved documents as the retrieval result in the order given by the sorting.
7. The device of claim 6, characterized in that the result sorting unit is configured to:
Calculate the relevance value of each of the plurality of retrieved documents according to a predefined document relevance formula, wherein the formula satisfies the following conditions: the earlier the retrieval terms of the search keyword appear in a document, the larger the relevance value calculated by the formula; and the smaller the data length of the document, the larger the relevance value calculated by the formula;
Sort the plurality of documents by the calculated relevance values.
8. The device of claim 7, characterized in that the document relevance formula is:
$$\mathrm{scord}(d) = \sum_{t=1}^{N} \frac{1}{2 \cdot pos(t)} \cdot \frac{1.0 + \mathrm{ExactNorm}(len)}{2};$$
$$\mathrm{ExactNorm}(len) = \frac{1.0}{len + 1};$$
Where scord(d) is the relevance value of document d; len is the data length of the document; pos(t) is the position value in the document of the t-th retrieval term of the search keyword; and N is the number of retrieval terms contained in the search keyword.
CN201010621819.1A 2010-12-27 2010-12-27 Document retrieval method and device Expired - Fee Related CN102567420B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010621819.1A CN102567420B (en) 2010-12-27 2010-12-27 Document retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010621819.1A CN102567420B (en) 2010-12-27 2010-12-27 Document retrieval method and device

Publications (2)

Publication Number Publication Date
CN102567420A CN102567420A (en) 2012-07-11
CN102567420B true CN102567420B (en) 2014-03-12

Family

ID=46412849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010621819.1A Expired - Fee Related CN102567420B (en) 2010-12-27 2010-12-27 Document retrieval method and device

Country Status (1)

Country Link
CN (1) CN102567420B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022794A (en) * 2015-06-26 2015-11-04 广州时韵信息科技有限公司 Method and apparatus for fast searching for required article contents
CN107346325A (en) * 2016-05-04 2017-11-14 中国石油集团长城钻探工程有限公司 Information query method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6012053A (en) * 1997-06-23 2000-01-04 Lycos, Inc. Computer system with user-controlled relevance ranking of search results
CN101030217A (en) * 2007-03-22 2007-09-05 华中科技大学 Method for indexing and acquiring semantic net information
CN101206672A (en) * 2007-12-25 2008-06-25 北京科文书业信息技术有限公司 Commercial articles searching non result intelligent processing system and method
CN101344890A (en) * 2008-08-22 2009-01-14 清华大学 Grading method for information retrieval document based on viewpoint searching

Also Published As

Publication number Publication date
CN102567420A (en) 2012-07-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220620

Address after: 3007, Hengqin International Financial Center Building, No. 58 Huajin Street, Hengqin New Area, Zhuhai, Guangdong 519031

Patentee after: New Founder Holdings Development Co., Ltd.

Patentee after: FOUNDER APABI TECHNOLOGY Ltd.

Address before: 9th Floor, Founder Building, No. 298 Chengfu Road, Haidian District, Beijing 100871

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: FOUNDER APABI TECHNOLOGY Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140312