CN102567420A - Document retrieval method and device - Google Patents

Info

Publication number
CN102567420A
Authority
CN
China
Prior art keywords
retrieval
document
participle
documents
search key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010106218191A
Other languages
Chinese (zh)
Other versions
CN102567420B (en)
Inventor
童征宇
徐剑波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Founder Apabi Technology Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Apabi Technology Co Ltd
Priority to CN201010621819.1A
Publication of CN102567420A
Application granted
Publication of CN102567420B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

An embodiment of the invention discloses a document retrieval method and device, relating to the field of computer information processing. The method and device address the problem that retrieval results cannot be sorted according to the positions at which retrieval terms (segmented words) occur in the documents and the data length of the documents. The method comprises: after retrieving a plurality of documents that contain the retrieval terms of the search keyword, sorting the retrieved documents according to the positions of those retrieval terms in the retrieved documents and the data length of the retrieved documents; and returning the retrieved documents as retrieval results in the resulting order. With this method and device, retrieval results can therefore be sorted according to the positions of the retrieval terms in the documents and the data length of the documents.

Description

Document retrieval method and device
Technical field
The present invention relates to the field of computer information processing, and in particular to a document retrieval method and device.
Background art
In full-text search, a text retrieval system scans every word in a document and creates an index entry for each word, recording how many times and at which positions the word occurs in the document. When a user submits a retrieval request, the system looks up the pre-built index file and returns the results to the user in a particular sort order. In practice, the documents handled by a text retrieval system may contain multiple fields, such as title, author, and body text.
Specifically, after the user submits a retrieval request, the text retrieval system analyzes the search keyword in the request to determine the retrieval terms it contains. A retrieval term (a "retrieval participle" in word-segmentation terminology) is a token produced by dividing the search keyword into characters or character groups. How the keyword is divided depends on the algorithm: for example, each character of the keyword may form one retrieval term, or every two characters may form one retrieval term, and so on. The system then searches the index file for documents that contain all the retrieval terms and provides information about those documents to the user as retrieval results. In phrase retrieval, that is, when the search keyword contains several retrieval terms, finding the documents that contain all the terms is not enough: the positional relationship of the terms within each document must also be matched against the positional relationship of the terms in the retrieval request. If the two are consistent, information about the document is provided to the user as a retrieval result; otherwise the document is not a retrieval result. For example, suppose the search keyword contains the retrieval terms "word segmentation" and "rule", and the two terms are adjacent, with no other character between them. After a document containing both "word segmentation" and "rule" is found, the positional relationship of the two terms in that document is checked. If they are adjacent in the document, that is, the document contains "word segmentation rule", information about the document is provided to the user as a retrieval result; otherwise it is not.
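The adjacency check described above can be sketched in Python as follows. This is a minimal illustration, assuming one retrieval term per character and an in-memory positional index; the names `build_positional_index` and `phrase_search` are hypothetical and do not come from the source.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each character (one term per character) to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            index[ch][doc_id].append(pos)
    return index


def phrase_search(index, terms):
    """Return ids of documents in which the terms occur at consecutive positions."""
    if not terms:
        return []
    # Step 1: keep only documents that contain every term.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    # Step 2: check that the terms are adjacent, in keyword order.
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        for start in index[terms[0]][doc_id]:
            if all(start + k + 1 in later[k] for k in range(len(later))):
                hits.append(doc_id)
                break
    return hits


docs = {1: "abcd", 2: "acbd", 3: "xxab"}
index = build_positional_index(docs)
result = phrase_search(index, ["a", "b"])
```

Document 2 contains both terms but not next to each other, so only the position check in the second step excludes it.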
After the search yields multiple retrieval results, the results must be sorted according to some rule and then provided to the user in that order. For any text retrieval system, whether the ordering of results meets the user's needs is a key factor in judging its quality. At present, text retrieval systems generally sort results with the vector space model. Specifically, the model uses term frequency (TF) and inverse document frequency (IDF) to compute a quantized weight for each retrieval term in a document, and the documents are sorted by the weights computed for them. TF is the frequency with which a retrieval term occurs in a document, and describes the importance of the term within that particular document. IDF reflects how frequently the retrieval term occurs across all documents, that is, the general importance of the term. Words such as "I" and "what" occur in almost every document, so even if such a word occurs very often in one particular document, it is not very important. In short, the priority of a particular document increases with the TF of the retrieval term and decreases as the term becomes more common across all documents.
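As a rough illustration of TF/IDF weighting, the sketch below uses one common formulation, `tf * log(N / df)`. This is an assumption made for illustration; the source does not state which variant a given retrieval system uses, and the function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Weight of `term` in one document, using tf * log(N / df)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus if term in tokens)  # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


corpus = [
    ["i", "like", "search"],
    ["i", "like", "index"],
    ["i", "rank"],
]
common = tf_idf("i", corpus[0], corpus)       # occurs in every document
rare = tf_idf("search", corpus[0], corpus)    # occurs in one document
```

A term that occurs in every document gets weight zero here, which matches the remark about words such as "I" and "what".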
In the course of realizing the present invention, the inventors found the following technical problem in the prior art:
Existing sorting methods order retrieval results by the TF and IDF of the retrieval terms. There is currently no concrete implementation for sorting retrieval results according to the positions at which the retrieval terms occur in a document and the data length of the document.
Summary of the invention
Embodiments of the invention provide a document retrieval method and device that solve the problem of sorting retrieval results according to the positions at which retrieval terms occur in a document and the data length of the document.
A document retrieval method comprises:
after retrieving a plurality of documents that contain all the retrieval terms of a search keyword, sorting the retrieved documents according to the positions of the retrieval terms in the retrieved documents and the data length of the retrieved documents; and
returning the retrieved documents as retrieval results according to the result of the sorting.
A document retrieval device comprises:
a retrieval unit, configured to retrieve a plurality of documents that contain all the retrieval terms of a search keyword;
a sorting unit, configured to sort the retrieved documents according to the positions of the retrieval terms of the search keyword in the retrieved documents and the data length of the retrieved documents; and
a result returning unit, configured to return the retrieved documents as retrieval results according to the result of the sorting.
In the scheme provided by the embodiments, after a plurality of documents containing all the retrieval terms of the search keyword are retrieved, the retrieved documents are sorted according to the positions of the retrieval terms in those documents and the data length of those documents, and are returned as retrieval results in the resulting order. The invention thus allows retrieval results to be sorted by where the retrieval terms occur in a document and by the document's data length, which makes the ordering of the retrieved documents more accurate and better suited to user needs.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a method provided by an embodiment of the invention;
Fig. 2 is a schematic flowchart of another method provided by an embodiment of the invention;
Fig. 3 is a schematic structural diagram of a document retrieval device provided by an embodiment of the invention;
Fig. 4 is a schematic structural diagram of another document retrieval device provided by an embodiment of the invention.
Detailed description of the embodiments
To retrieve documents in which a retrieval term occurs at a desired position, an embodiment of the invention provides a document retrieval method. In this method, after a retrieval request is received, the condition that the position of one or more of the retrieval terms in the search keyword (the specific retrieval terms) must satisfy in a target document is determined. Documents that contain all the determined retrieval terms, and in which the positions of the specific retrieval terms satisfy the condition, are then searched for, and information about the documents found is returned as the retrieval result.
Referring to Fig. 1, the document retrieval method provided by the embodiment comprises the following steps:
Step 10: receive a retrieval request containing a search keyword.
Here, the search keyword is the keyword entered from outside (for example, by a user) for retrieval.
Step 11: determine the retrieval terms contained in the search keyword, and the condition that the position of a specific retrieval term among them must satisfy in a target document.
Step 12: search for documents that contain all the determined retrieval terms and in which the position of the specific retrieval term satisfies the condition.
Step 13: return information about the documents found as the retrieval result.
In step 11, when the specific retrieval terms include the first retrieval term of the search keyword, the condition that the position of this first term must satisfy in a target document can be determined as follows:
According to the form of the search keyword, determine the first positional relationship that must hold between the position of the first retrieval term in the target document and the start position of the target document, and take this first positional relationship as the condition that the position of the first retrieval term must satisfy.
Specifically, the first positional relationship can be determined as follows:
First, determine whether a wildcard precedes the first retrieval term in the search keyword. If so, determine the distance value range corresponding to that wildcard from a preset correspondence between wildcard types and distance value ranges, and set the first positional relationship to: the distance between the first retrieval term and the starting character of the target document lies within that distance value range. If not, set the first positional relationship to: the first retrieval term is located at the start position of the target document.
Of course, determining the first positional relationship from the form of the search keyword is not limited to the wildcard approach above; any other way of determining the first positional relationship from the form of the search keyword falls within the protection scope of the invention.
In step 11, when the specific retrieval terms include the last retrieval term of the search keyword, the condition that the position of this last term must satisfy in a target document can be determined as follows:
According to the form of the search keyword, determine the second positional relationship that must hold between the position of the last retrieval term in the target document and the end position of the target document, and take this second positional relationship as the condition that the position of the last retrieval term must satisfy.
Specifically, the second positional relationship can be determined as follows:
Determine whether a wildcard follows the last retrieval term in the search keyword. If so, determine the distance value range corresponding to that wildcard from the preset correspondence between wildcard types and distance value ranges, and set the second positional relationship to: the distance between the last retrieval term and the ending character of the target document lies within that distance value range. If not, set the second positional relationship to: the last retrieval term is located at the end position of the target document.
Of course, determining the second positional relationship from the form of the search keyword is not limited to the wildcard approach above; any other way of determining the second positional relationship from the form of the search keyword falls within the protection scope of the invention.
For example, when the wildcard is an asterisk, the distance value range is the integers not less than 0; when the wildcard is a question mark, the distance value range is 0 or 1.
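The wildcard-to-range correspondence can be pictured with a small lookup table. The sketch below is only an illustration: it handles a single wildcard at each edge of the keyword, measures distance as the number of characters between the term and the document edge, and uses the hypothetical names `WILDCARD_RANGES` and `edge_relations`.

```python
import math

# Preset correspondence between wildcard types and (min, max) distance ranges.
# "*" allows any distance >= 0, "?" allows a distance of 0 or 1.
WILDCARD_RANGES = {"*": (0, math.inf), "?": (0, 1)}


def edge_relations(keyword):
    """Return the (min, max) ranges for the first and second positional relationships.

    Without a wildcard at an edge the range is (0, 0): the term must sit
    exactly at the start (or end) of the document.
    """
    first = WILDCARD_RANGES.get(keyword[0], (0, 0))
    second = WILDCARD_RANGES.get(keyword[-1], (0, 0))
    return first, second


with_wildcards = edge_relations("?AB*")
without_wildcards = edge_relations("AB")
```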
Before the retrieval request containing the search keyword is received in step 10, an index file can be built for one or more documents. The index file contains the retrieval terms of each document and the position information of each retrieval term in the corresponding document.
Accordingly, step 12 can be implemented in one of the following three ways:
Case 1, in which the specific retrieval terms include the first retrieval term of the search keyword:
First, determine from the index file the documents that contain all the retrieval terms of the search keyword. Then, for each such document, read from the index file the position information of the first retrieval term of the search keyword in the document, and use it to determine whether the position of the first retrieval term and the start position of the document satisfy the first positional relationship. If so, the document is taken as a document found that contains all the determined retrieval terms and in which the position of the specific retrieval term satisfies the condition; otherwise it is not.
Case 2, in which the specific retrieval terms include the last retrieval term of the search keyword:
First, determine from the index file the documents that contain all the retrieval terms of the search keyword. Then, for each such document, read from the index file the position information of the last retrieval term of the search keyword in the document, and use it to determine whether the position of the last retrieval term and the end position of the document satisfy the second positional relationship. If so, the document is taken as a document found that contains all the determined retrieval terms and in which the position of the specific retrieval term satisfies the condition; otherwise it is not.
Case 3, in which the specific retrieval terms include both the first and the last retrieval terms of the search keyword:
First, determine from the index file the documents that contain all the retrieval terms of the search keyword. Then, for each such document, read from the index file the position information of the first and the last retrieval terms of the search keyword in the document, and use it to determine whether the position of the first retrieval term and the start position of the document satisfy the first positional relationship, and whether the position of the last retrieval term and the end position of the document satisfy the second positional relationship. If both hold, the document is taken as a document found that contains all the determined retrieval terms and in which the positions of the specific retrieval terms satisfy the condition; otherwise it is not.
Of course, besides the first and last retrieval terms of the search keyword, the specific retrieval terms may include any other retrieval term of the search keyword. In that case, step 11 can determine the condition for such a specific retrieval term as follows: according to the form of the search keyword, determine the positional relationship that must hold between the position of the specific retrieval term in the target document and the start position and/or end position of the target document, and take that positional relationship as the condition that the position of the specific retrieval term must satisfy. Accordingly, step 12 can be implemented as follows: first, determine from the index file the documents that contain all the retrieval terms of the search keyword; then, for each such document, read from the index file the position information of the specific retrieval term in the document, and use it to determine whether the position of the term and the start and/or end position of the document satisfy the corresponding positional relationship. If so, the document is taken as a document found that contains all the determined retrieval terms and in which the position of the specific retrieval term satisfies the condition; otherwise it is not.
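The start and end checks of step 12 can be sketched as follows. This is a simplified illustration, assuming 0-based positions read from the index and distances counted as the number of characters between the term and the document edge; the function name `satisfies_edges` is hypothetical.

```python
def satisfies_edges(first_positions, last_positions, doc_length,
                    first_range, last_range):
    """Check the first and second positional relationships for one document.

    first_positions / last_positions: 0-based positions of the first and last
    retrieval terms in the document, as read from the index.
    first_range / last_range: (min, max) allowed distance to the document
    start and to the document end respectively.
    """
    lo, hi = first_range
    # Distance from the document start is simply the 0-based position.
    first_ok = any(lo <= pos <= hi for pos in first_positions)
    lo, hi = last_range
    # Distance to the document end: characters remaining after the term.
    last_ok = any(lo <= doc_length - 1 - pos <= hi for pos in last_positions)
    return first_ok and last_ok


# Document "xABy": "A" at position 1, "B" at position 2, length 4.
loose = satisfies_edges([1], [2], 4, (0, 1), (0, 1))
strict = satisfies_edges([1], [2], 4, (0, 0), (0, 0))
```

With one optional character allowed at each edge the document qualifies; with no wildcard at either edge it does not, because "A" is not at the start and "B" is not at the end.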
Preferably, between step 12 and step 13, the method further comprises:
determining whether the positional relationship among the retrieval terms of the search keyword in a document found is consistent with their positional relationship in the search keyword.
Accordingly, in step 13, information about a document found is returned as the retrieval result when the positional relationship among the retrieval terms in that document is determined to be consistent with their positional relationship in the search keyword.
Preferably, between step 12 and step 13, the documents found can be sorted according to the positions of the retrieval terms of the search keyword in each document found and the data length of each document found. Accordingly, in step 13, the documents found are returned as the retrieval result according to the result of the sorting.
Sorting the documents found according to the positions of the retrieval terms in each document and the data length of each document can be implemented as follows:
First, compute the relevance value of each document found using a preset document relevance formula. The formula satisfies the following conditions: the earlier the retrieval terms of the search keyword occur in a document, the larger the relevance value computed by the formula; and the smaller the data length of the document, the larger the relevance value computed by the formula.
Then, sort the documents by the magnitude of their computed relevance values.
The document relevance formula can comprise:
Formula one: $Score(d) = \sum_{t=1}^{N} \frac{1}{2 \cdot Pos(t)} \cdot \frac{1.0 + ExactNorm(Len)}{2}$;
Formula two: $ExactNorm(Len) = \frac{1.0}{Len + 1}$;
where Score(d) is the relevance value of document d; Len is the data length of the document; Pos(t) is the position value in the document of the t-th retrieval term of the search keyword; and N is the number of retrieval terms in the search keyword.
Of course, the document relevance formula is not limited to formulas one and two above. Any formula with the following properties falls within the protection scope of the invention: the earlier the retrieval terms of the search keyword occur in a document, the larger the computed value; and the smaller the data length of the document, the larger the computed value.
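The sketch below evaluates formulas one and two under one reading of them: the sum of 1 / (2 * Pos(t)) over the retrieval terms, scaled by (1.0 + ExactNorm(Len)) / 2, with ExactNorm(Len) = 1.0 / (Len + 1). It assumes position values are counted from 1 so that the division is defined; that assumption and the function names are not taken from the source.

```python
def exact_norm(length):
    """Formula two (as read here): shorter documents give a larger value."""
    return 1.0 / (length + 1)


def relevance(positions, length):
    """Formula one (as read here).

    positions: position value of each retrieval term in the document,
               assumed to be counted from 1.
    length:    data length of the document.
    """
    length_factor = (1.0 + exact_norm(length)) / 2
    return sum(1.0 / (2 * pos) * length_factor for pos in positions)


early_short = relevance([1, 2], 3)
late_short = relevance([2, 3], 3)
early_long = relevance([1, 2], 10)
```

Both stated properties hold for this reading: moving the terms later in the document lowers the value, and so does increasing the document length.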
To allow retrieval results to be sorted according to the positions at which retrieval terms occur in a document and the data length of the document, an embodiment of the invention provides a document retrieval method. In this method, after a plurality of documents containing all the retrieval terms of the search keyword are retrieved, the retrieved documents are sorted according to the positions of the retrieval terms in those documents and the data length of those documents, and are then returned as retrieval results in the resulting order.
Referring to Fig. 2, the document retrieval method provided by the embodiment comprises the following steps:
Step 20: retrieve a plurality of documents that contain all the retrieval terms of the search keyword.
Here, the documents may be retrieved according to steps 10 to 12 above, or according to the prior art. In the prior art, after a retrieval request is received, the search keyword in the request is analyzed to determine its retrieval terms, the index file is searched for documents containing all the retrieval terms, and those documents are taken as the retrieval results.
Step 21: sort the retrieved documents according to the positions of the retrieval terms of the search keyword in the retrieved documents and the data length of the retrieved documents.
Step 22: return the retrieved documents as retrieval results according to the result of the sorting.
Step 21 can be implemented as follows:
Compute the relevance value of each retrieved document using a preset document relevance formula. The formula satisfies the following conditions: the earlier the retrieval terms of the search keyword occur in a document, the larger the relevance value computed by the formula; and the smaller the data length of the document, the larger the relevance value computed by the formula.
Sort the retrieved documents by the magnitude of their computed relevance values.
The document relevance formula can be formulas one and two above. Of course, the formula is not limited to these two: any formula with the following properties falls within the protection scope of the invention: the earlier the retrieval terms of the search keyword occur in a document, the larger the computed value; and the smaller the data length of the document, the larger the computed value.
In practice, documents with the same relevance value can be further sorted by a preset rule, such as pinyin order or internal character code order.
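Sorting with a secondary rule for ties can be sketched with a composite sort key. The example below is only an illustration: it uses plain string comparison of a title as a stand-in for the preset rule (roughly, character code order), and the name `rank_documents` is hypothetical.

```python
def rank_documents(scored_docs):
    """Sort (title, relevance) pairs by relevance, highest first.

    Documents with equal relevance are ordered by title, using Python's
    default string comparison (code point order) as the tie-breaking rule.
    """
    return sorted(scored_docs, key=lambda doc: (-doc[1], doc[0]))


ranked = rank_documents([("b", 0.5), ("a", 0.5), ("c", 0.9)])
```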
The invention is described in detail below with reference to specific embodiments:
Embodiment one:
This embodiment describes the index building process, as follows:
Step 01: segment each document field that requires exact search character by character to obtain one or more retrieval terms, and create an index entry for each retrieval term.
Step 02: add an extra marker term (Term) to the index to mark the end of the field. The text of this Term is a predefined character END. END is an illegal character in the character encoding set, which guarantees that it cannot collide with normal text.
Step 03: record and store the length of the field for each document, that is, the number of retrieval terms the field contains. A length greater than 255 is treated as 255, which simplifies storage and computation.
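Steps 01 to 03 can be sketched as follows. This is a minimal in-memory illustration, not the index file format of the source. The choice of U+FFFF (a Unicode noncharacter) for END and the names `add_field`, `index`, and `lengths` are assumptions.

```python
from collections import defaultdict

END = "\uffff"   # assumed end-of-field marker; must never occur in normal text
MAX_LEN = 255    # lengths above this are stored as 255


def add_field(index, lengths, doc_id, text):
    """Index one field of one document, character by character."""
    # Step 01: one retrieval term per character, with its position.
    for pos, ch in enumerate(text):
        index[ch][doc_id].append(pos)
    # Step 02: the END marker sits just past the last character.
    index[END][doc_id].append(len(text))
    # Step 03: store the field length, capped so it fits in one byte.
    lengths[doc_id] = min(len(text), MAX_LEN)


index = defaultdict(lambda: defaultdict(list))
lengths = {}
add_field(index, lengths, 1, "ABC")
add_field(index, lengths, 2, "x" * 300)
```

Storing END as an ordinary term lets the end-of-field condition be checked with the same position arithmetic as any other pair of terms.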
Embodiment two:
This embodiment describes the document retrieval process, as follows:
Step 11: segment the search keyword in the retrieval request character by character to obtain N retrieval terms. If the query involves a positional relationship with the end of the field, additionally append END as the (N+1)-th retrieval term.
Step 12: parse the search keyword and the wildcards in it, and obtain and record the positional relationships between the retrieval terms, including:
the positional relationship between the first retrieval term and the start of the document, the positional relationship between the second retrieval term and the first retrieval term, ..., and the positional relationship between the N-th retrieval term and the end of the document.
A positional relationship can be expressed as a pair consisting of a minimum and a maximum position value, written (min, max). The smallest value of min is 0, meaning the positions coincide; the largest value of max is MAX, which can be set to 255 here.
Step 13: search the created index for documents that satisfy the conditions.
Specifically, find the documents that contain all N+1 retrieval terms, then read from the index file the position values of these retrieval terms in each document and match the positional relationships in detail; the relative positions of the N+1 retrieval terms in the document must meet the distance requirements above.
Step 14: compute the relevance value of each document found using the relevance formula, and sort the documents found in descending order of relevance value. The relevance formula is given by formulas one and two above.
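The keyword parsing of steps 11 and 12 in this embodiment can be sketched as follows. The sketch assumes one retrieval term per character, always appends END, measures the first term from the document start (distance = number of skipped characters), and measures every later term from the previous one (distance = skipped characters + 1). The names `parse_keyword` and `WILDCARD_GAPS`, and the U+FFFF marker, are hypothetical.

```python
MAX = 255        # predefined maximum distance value
END = "\uffff"   # assumed end-of-field marker
WILDCARD_GAPS = {"?": (0, 1), "*": (0, MAX)}  # characters a wildcard may stand for


def parse_keyword(keyword):
    """Return (terms, relations), where relations[i] is the (min, max) distance
    between term i and its predecessor (the document start for term 0)."""
    terms, relations = [], []
    gap_min, gap_max = 0, 0
    for ch in keyword:
        if ch in WILDCARD_GAPS:
            lo, hi = WILDCARD_GAPS[ch]
            gap_min = min(gap_min + lo, MAX)
            gap_max = min(gap_max + hi, MAX)
            continue
        step = 1 if terms else 0  # adjacent terms are one position apart
        relations.append((min(gap_min + step, MAX), min(gap_max + step, MAX)))
        terms.append(ch)
        gap_min, gap_max = 0, 0
    # Relation between the last term and the END marker.
    step = 1 if terms else 0
    relations.append((min(gap_min + step, MAX), min(gap_max + step, MAX)))
    terms.append(END)
    return terms, relations


terms, relations = parse_keyword("?AB*C")
```

Under these assumptions a keyword without wildcards produces (0, 0) for the first term and (1, 1) for every later relation, which corresponds to an exact match of the whole field.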
Embodiment three:
This embodiment illustrates the method through an enterprise retrieval application that searches the entry field of "Ci hai".
Retrieval on the "Ci hai" entry field requires finding documents that contain the retrieval terms at specific positions and sorting them, by the rules above, according to the hit positions and the length of the hit documents.
Retrieval requests support the wildcards "?" and "*", where "?" stands for 0 or 1 character and "*" stands for 0, 1, or more characters. A search keyword may contain several wildcards at once.
The use of the various wildcard types is explained in detail below:
During retrieval, not only the positional relationships between the retrieval terms must be matched, but also the positional relationships between the retrieval terms and the start and end of the document.
Before retrieval, the index must be built, as follows: segment the entry field character by character and create an inverted index. Add an extra marker term (Term) to the index to mark the end of the field. The text of this Term is a predefined character END. END is an illegal character in the character encoding set, which guarantees that it cannot collide with normal text. Record and store the length of the field for each document, that is, the number of retrieval terms the field contains; a length greater than 255 is treated as 255, so that the field length fits in one byte.
The retrieval process is described below, taking the retrieval of "?AB*C" as an example:
Step 21: corresponding to the index-building process, segment the retrieval keyword in the retrieval request character by character to obtain the three retrieval segmented words "A", "B" and "C", and add an end marker END as the fourth retrieval segmented word as required.
Step 22: calculate the positional relations among the retrieval segmented words, and between the retrieval segmented words and the beginning and end of the document.
A positional relation is expressed as a pair consisting of a minimum distance value min and a maximum distance value max, written (min, max). The minimum value of min is 0, meaning that the positions are identical; the maximum value of max is MAX. In the retrieval request, the wildcard "?" represents 0 or 1 character and "*" represents 0, 1, or more characters.
Parsing the retrieval request yields the following distance relations:
Dis(BEGIN,A)=(0,1)
Dis(A,B)=(1,1)
Dis(B,C)=(1,MAX)
Dis(C,FINALITY)=(1,1)
BEGIN denotes the beginning of the document and FINALITY denotes the end of the document. Dis(X, Y) denotes the distance relation between X and Y, expressed as a minimum distance and a maximum distance. MAX is the predefined maximum distance value, set to 255 here.
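The derivation of these distance relations can be sketched as follows. The function name `parse_query` is hypothetical, the label "END" plays the role of the end-of-document marker, and treating BEGIN as sharing position 0 with the first character is an assumption chosen so that the output reproduces the four relations listed above.

```python
MAX = 255  # predefined maximum distance value, as stated in the text


def parse_query(query):
    """Split a wildcard query into (left, right, (min, max)) distance relations.

    The relations run from BEGIN through each token to END.
    "?" widens a gap by 0 or 1 position, "*" widens it up to MAX.
    """
    relations = []
    left = "BEGIN"
    lo, hi = 0, 0  # base gap from BEGIN to the first token is 0
    for ch in query:
        if ch == "?":
            hi = min(hi + 1, MAX)
        elif ch == "*":
            hi = MAX
        else:
            relations.append((left, ch, (lo, hi)))
            left = ch
            lo, hi = 1, 1  # adjacent tokens are exactly 1 apart by default
    relations.append((left, "END", (lo, hi)))
    return relations


for left, right, dist in parse_query("?AB*C"):
    print(f"Dis({left},{right})={dist}")
```

Each wildcard only loosens the upper bound of the gap that follows it, so a keyword without wildcards degenerates into exact adjacency constraints.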
Step 23: look up documents in the inverted index.
First find the documents that contain all three retrieval segmented words "A", "B" and "C"; then read the position values of these retrieval segmented words and of END in each document and match the positional relations in detail.
The relative positions of the three retrieval segmented words in the document must satisfy the above distance requirements. The position value of "A" must be 0 or 1 so as to satisfy the positional relation with the beginning of the document, and the positional relations among "A", "B", "C" and END must also satisfy the above distance requirements.
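A sketch of the position-matching check for a single document, assuming 0-based positions, BEGIN fixed at position 0, and an END marker indexed one position past the last character. The function name `matches` and the sample documents are hypothetical.

```python
def matches(positions, relations):
    """Check whether one document satisfies a chain of distance relations.

    positions maps each token (and "END") to the positions at which it
    occurs in the document. relations is a list of (left, right, (min, max))
    triples that starts at "BEGIN" and links each token to the next.
    """
    feasible = {0}  # BEGIN is treated as position 0
    for _left, right, (lo, hi) in relations:
        # Keep the occurrences of the next token that lie at an allowed
        # distance from some still-feasible occurrence of the previous one.
        feasible = {
            p for p in positions.get(right, [])
            if any(lo <= p - q <= hi for q in feasible)
        }
        if not feasible:
            return False
    return True


# Distance relations for "?AB*C" as listed in the text, with MAX = 255.
relations = [
    ("BEGIN", "A", (0, 1)),
    ("A", "B", (1, 1)),
    ("B", "C", (1, 255)),
    ("C", "END", (1, 1)),
]

doc1 = {"X": [0], "A": [1], "B": [2], "Y": [3], "C": [4], "END": [5]}  # "XABYC"
doc2 = {"A": [0], "B": [1], "C": [2], "D": [3], "END": [4]}            # "ABCD"
print(matches(doc1, relations))
print(matches(doc2, relations))
```

Because the constraints form a chain between consecutive tokens, carrying forward the set of feasible positions is enough to handle tokens that occur several times in a document. Here `doc2` fails because "C" is not immediately followed by END.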
Step 24: calculate the relevance value of each qualifying document retrieved according to the relevance value calculation formula, sort the retrieved documents in descending order of relevance value, and return the sorted results.
Here the relevance value is calculated from the position values of "A", "B" and "C" and the length of the document field.
The present invention can also be applied in a database system, with each record in the database treated as a document: after a retrieval request containing a retrieval keyword is received, the records are searched for those whose record field contains the retrieval segmented words of the retrieval keyword and in which the position of the specific retrieval segmented word satisfies a certain condition.
Referring to Fig. 3, an embodiment of the present invention provides a document retrieval device, comprising:
A request receiving unit 30, configured to receive a retrieval request containing a retrieval keyword;
A condition analysis unit 31, configured to determine the retrieval segmented words contained in the retrieval keyword and the condition that the position of a specific retrieval segmented word among them must satisfy in a target document;
A document search unit 32, configured to search for documents that contain all of the determined retrieval segmented words and in which the position of the specific retrieval segmented word satisfies the condition;
A result returning unit 33, configured to return information on the documents found as the retrieval result.
Further, the condition analysis unit 31 may specifically be configured to:
When the specific retrieval segmented word includes the first retrieval segmented word contained in the retrieval keyword, determine, according to the form of the retrieval keyword, the first positional relation that must hold between the position of that first retrieval segmented word in the target document and the start position of the target document, and take this first positional relation as the condition that the position of the first retrieval segmented word must satisfy in the target document;
Accordingly, the document search unit 32 may specifically be configured to:
Search for documents that contain the determined retrieval segmented words and in which the first retrieval segmented word satisfies the first positional relation.
Further, the condition analysis unit 31 may specifically be configured to:
When the specific retrieval segmented word includes the last retrieval segmented word contained in the retrieval keyword, determine, according to the form of the retrieval keyword, the second positional relation that must hold between the position of that last retrieval segmented word in the target document and the end position of the target document, and take this second positional relation as the condition that the position of the last retrieval segmented word must satisfy in the target document;
Accordingly, the document search unit 32 may specifically be configured to:
Search for documents that contain the determined retrieval segmented words and in which the last retrieval segmented word satisfies the second positional relation.
Further, the condition analysis unit 31 may specifically be configured to:
Determine, according to the form of the retrieval keyword, the first positional relation that must hold between the position in the target document of the first retrieval segmented word contained in the retrieval keyword and the start position of the target document, and the second positional relation that must hold between the position in the target document of the last retrieval segmented word contained in the retrieval keyword and the end position of the target document;
Accordingly, the document search unit 32 may specifically be configured to:
Search for documents that contain the determined retrieval segmented words and in which the first retrieval segmented word satisfies the first positional relation and the last retrieval segmented word satisfies the second positional relation.
Further, the condition analysis unit 31 may specifically be configured to:
Determine whether a wildcard precedes the first retrieval segmented word in the retrieval keyword;
If so, determine the distance value range corresponding to the wildcard according to the predefined correspondence between wildcard types and distance value ranges, and determine the first positional relation to be: the distance value between the first retrieval segmented word in the target document and the starting character of the target document lies within the distance value range;
If not, determine the first positional relation that must hold between the position of the first retrieval segmented word in the target document and the start position of the target document to be: the first retrieval segmented word is located at the start position of the target document.
Further, the condition analysis unit 31 may specifically be configured to:
Determine whether a wildcard follows the last retrieval segmented word in the retrieval keyword;
If so, determine the distance value range corresponding to the wildcard according to the predefined correspondence between wildcard types and distance value ranges, and determine the second positional relation to be: the distance value between the last retrieval segmented word in the target document and the ending character of the target document lies within the distance value range;
If not, determine the second positional relation that must hold between the position of the last retrieval segmented word in the target document and the end position of the target document to be: the last retrieval segmented word is located at the end position of the target document.
When the wildcard is an asterisk, the distance value range is the integers not less than 0; when the wildcard is a question mark, the corresponding distance value range is 0 or 1.
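The boundary analysis described above can be sketched as a lookup from wildcard type to distance range. The names `WILDCARD_RANGE`, `combine` and `boundary_relations` are hypothetical, `None` is used here to mean "no upper bound", and adding up the ranges of several consecutive wildcards is an assumption that the text does not spell out.

```python
# Predefined wildcard type -> allowed distance range; None means unbounded.
WILDCARD_RANGE = {"?": (0, 1), "*": (0, None)}


def combine(wildcards):
    """Combine the ranges of a run of wildcards into one (min, max) range."""
    lo, hi = 0, 0
    for w in wildcards:
        w_lo, w_hi = WILDCARD_RANGE[w]
        lo += w_lo
        hi = None if hi is None or w_hi is None else hi + w_hi
    return lo, hi


def boundary_relations(keyword):
    """Return (first_relation, second_relation) for a retrieval keyword.

    first_relation bounds the distance from the document start to the first
    token; second_relation bounds the distance from the last token to the
    document end. (0, 0) means the token sits exactly at the boundary.
    """
    head = keyword[: len(keyword) - len(keyword.lstrip("?*"))]  # leading wildcards
    tail = keyword[len(keyword.rstrip("?*")):]                  # trailing wildcards
    return combine(head), combine(tail)


print(boundary_relations("?AB*C"))
print(boundary_relations("AB*"))
```

With no leading or trailing wildcard the range collapses to (0, 0), which corresponds to the "if not" branches above.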
Further, the document search unit 32 may also be configured to:
Determine whether the positional relations, in a document found, among the retrieval segmented words contained in the retrieval keyword are consistent with their positional relations in the retrieval keyword;
Accordingly, the result returning unit 33 may specifically be configured to:
Return information on the document found as the retrieval result when it is determined that the positional relations, in the document found, among the retrieval segmented words contained in the retrieval keyword are consistent with their positional relations in the retrieval keyword.
Further, the device also comprises:
A result sorting unit 34, configured to sort the documents found according to the positions in each document found of the retrieval segmented words contained in the retrieval keyword and the data length of each document found;
Accordingly, the result returning unit 33 may specifically be configured to:
Return the documents found as the retrieval result according to the result of sorting them.
Further, the result sorting unit 34 may specifically be configured to:
Calculate the relevance value of each document found according to a predefined document relevance value calculation formula, the formula satisfying the following conditions: the nearer the front of a document the retrieval segmented words contained in the retrieval keyword are located, the larger the relevance value calculated by the formula; and the smaller the data length of the document, the larger the relevance value calculated by the formula;
Sort the documents in order of the magnitude of the calculated relevance values.
The document relevance value calculation formula may be:
$$\mathrm{scord}(d) = \sum_{t=1}^{N} \frac{1}{2 \cdot \mathrm{pos}(t)} \cdot \frac{1.0 + \mathrm{ExactNorm}(len)}{2};$$
$$\mathrm{ExactNorm}(len) = \frac{1.0}{len + 1};$$
Wherein scord(d) is the document relevance value of document d; len is the data length of the document; pos(t) is the position value in the document of the t-th retrieval segmented word contained in the retrieval keyword; and N is the number of retrieval segmented words contained in the retrieval keyword.
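A sketch of the relevance calculation and descending sort, under loudly stated assumptions. The flattened formula is read here as scord(d) = Σ 1/(2·pos(t)) · (1.0 + ExactNorm(len))/2 with ExactNorm(len) = 1.0/(len + 1); the grouping of the terms is one plausible reading, not a confirmed one. Positions are taken to count from 1 so that the division is always defined, and the sample hits are invented for illustration.

```python
def exact_norm(length):
    """Length normalisation: shorter fields give values closer to 1."""
    return 1.0 / (length + 1)


def scord(positions, length):
    """Relevance value of one document under the assumed reading of the formula.

    positions holds pos(t) for each of the N retrieval tokens, counted from 1
    (an assumption); length is the data length of the document field.
    """
    norm_factor = (1.0 + exact_norm(length)) / 2
    return sum(1.0 / (2 * pos) * norm_factor for pos in positions)


# Hypothetical hits: document id -> (token positions, field length).
hits = {
    "d1": ([1, 2, 3], 3),
    "d2": ([2, 3, 6], 6),
    "d3": ([1, 2, 3], 10),
}
ranked = sorted(hits, key=lambda d: scord(*hits[d]), reverse=True)
print(ranked)
```

Under this reading both stated conditions hold: moving a hit toward the front of the document raises the value, and so does shortening the document.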
Referring to Fig. 4, an embodiment of the present invention also provides a document retrieval device, comprising:
A document retrieval unit 40, configured to retrieve a plurality of documents containing all of the retrieval segmented words in a retrieval keyword;
A result sorting unit 41, configured to sort the retrieved documents according to the positions of the retrieval segmented words of the retrieval keyword in the retrieved documents and the data lengths of the retrieved documents;
A result returning unit 42, configured to return the retrieved documents as the retrieval result according to the result of sorting them.
Further, the result sorting unit 41 may specifically be configured to:
Calculate the relevance value of each retrieved document according to a predefined document relevance value calculation formula, the formula satisfying the following conditions: the nearer the front of a document the retrieval segmented words contained in the retrieval keyword are located, the larger the relevance value calculated by the formula; and the smaller the data length of the document, the larger the relevance value calculated by the formula;
Sort the documents in order of the magnitude of the calculated relevance values.
The document relevance value calculation formula may be:
$$\mathrm{scord}(d) = \sum_{t=1}^{N} \frac{1}{2 \cdot \mathrm{pos}(t)} \cdot \frac{1.0 + \mathrm{ExactNorm}(len)}{2};$$
$$\mathrm{ExactNorm}(len) = \frac{1.0}{len + 1};$$
Wherein scord(d) is the document relevance value of document d; len is the data length of the document; pos(t) is the position value in the document of the t-th retrieval segmented word contained in the retrieval keyword; and N is the number of retrieval segmented words contained in the retrieval keyword.
Further, the document retrieval unit 40 comprises:
A request receiving unit, configured to receive a retrieval request containing a retrieval keyword;
A condition analysis unit, configured to determine the retrieval segmented words contained in the retrieval keyword and the condition that the position of a specific retrieval segmented word among them must satisfy in a target document;
A document search unit, configured to search for documents that contain all of the determined retrieval segmented words and in which the position of the specific retrieval segmented word satisfies the condition;
A result returning unit, configured to return information on the plurality of documents found as the retrieval result.
Further, the condition analysis unit is configured to:
When the specific retrieval segmented word includes the first retrieval segmented word contained in the retrieval keyword, determine, according to the form of the retrieval keyword, the first positional relation that must hold between the position of that first retrieval segmented word in the target document and the start position of the target document;
Accordingly, the document search unit is configured to:
Search for documents that contain the determined retrieval segmented words and in which the first retrieval segmented word satisfies the first positional relation.
Further, the condition analysis unit is configured to:
When the specific retrieval segmented word includes the last retrieval segmented word contained in the retrieval keyword, determine, according to the form of the retrieval keyword, the second positional relation that must hold between the position of that last retrieval segmented word in the target document and the end position of the target document;
Accordingly, the document search unit is configured to:
Search for documents that contain the determined retrieval segmented words and in which the last retrieval segmented word satisfies the second positional relation.
Further, the condition analysis unit may specifically be configured to:
Determine, according to the form of the retrieval keyword, the first positional relation that must hold between the position in the target document of the first retrieval segmented word contained in the retrieval keyword and the start position of the target document, and the second positional relation that must hold between the position in the target document of the last retrieval segmented word contained in the retrieval keyword and the end position of the target document;
Accordingly, the document search unit may specifically be configured to:
Search for documents that contain the determined retrieval segmented words and in which the first retrieval segmented word satisfies the first positional relation and the last retrieval segmented word satisfies the second positional relation.
Further, the condition analysis unit may specifically be configured to:
Determine whether a wildcard precedes the first retrieval segmented word in the retrieval keyword;
If so, determine the distance value range corresponding to the wildcard according to the predefined correspondence between wildcard types and distance value ranges, and determine the first positional relation to be: the distance value between the first retrieval segmented word in the target document and the starting character of the target document lies within the distance value range;
If not, determine the first positional relation that must hold between the position of the first retrieval segmented word in the target document and the start position of the target document to be: the first retrieval segmented word is located at the start position of the target document.
Further, the condition analysis unit may specifically be configured to:
Determine whether a wildcard follows the last retrieval segmented word in the retrieval keyword;
If so, determine the distance value range corresponding to the wildcard according to the predefined correspondence between wildcard types and distance value ranges, and determine the second positional relation to be: the distance value between the last retrieval segmented word in the target document and the ending character of the target document lies within the distance value range;
If not, determine the second positional relation that must hold between the position of the last retrieval segmented word in the target document and the end position of the target document to be: the last retrieval segmented word is located at the end position of the target document.
When the wildcard is an asterisk, the distance value range is the integers not less than 0; when the wildcard is a question mark, the corresponding distance value range is 0 or 1.
Further, the document search unit may also be configured to:
Determine whether the positional relations, in a document found, among the retrieval segmented words contained in the retrieval keyword are consistent with their positional relations in the retrieval keyword;
Accordingly, the result returning unit may specifically be configured to:
Return information on the document found as the retrieval result when it is determined that the positional relations, in the document found, among the retrieval segmented words contained in the retrieval keyword are consistent with their positional relations in the retrieval keyword.
In summary, the beneficial effects of the present invention include:
In the scheme provided by the embodiments of the present invention, after a retrieval request containing a retrieval keyword is received, the retrieval segmented words contained in the retrieval keyword and the condition that the position of a specific retrieval segmented word among them must satisfy in a target document are determined; documents that contain all of the determined retrieval segmented words and in which the position of the specific retrieval segmented word satisfies the condition are searched for; and information on the documents found is returned as the retrieval result. It can be seen that the present invention can retrieve documents in which the position of a retrieval segmented word satisfies a certain condition, that is, documents in which the retrieval segmented word appears at a certain position, which makes the retrieval result more accurate and better meets user needs.
In the scheme provided by the embodiments of the present invention, after a plurality of documents containing all of the retrieval segmented words in the retrieval keyword are retrieved, the retrieved documents are sorted according to the positions of the retrieval segmented words of the retrieval keyword in the retrieved documents and the data lengths of the retrieved documents, and the retrieved documents are returned as the retrieval result according to the result of sorting them. It can be seen that the present invention can sort the retrieval results according to the positions at which the retrieval segmented words appear in the documents and the data lengths of the documents, which makes the ordering of the retrieved documents more accurate and better meets user needs.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (10)

1. A document retrieval method, characterized in that the method comprises:
After retrieving a plurality of documents containing the retrieval segmented words in a retrieval keyword, sorting the retrieved documents according to the positions of the retrieval segmented words of the retrieval keyword in the retrieved documents and the data lengths of the retrieved documents;
Returning the retrieved documents as the retrieval result according to the result of sorting them.
2. The method of claim 1, characterized in that sorting the retrieved documents according to the positions of the retrieval segmented words of the retrieval keyword in the retrieved documents and the data lengths of the retrieved documents comprises:
Calculating the relevance value of each retrieved document according to a predefined document relevance value calculation formula, the formula satisfying the following conditions: the nearer the front of a document the retrieval segmented words contained in the retrieval keyword are located, the larger the relevance value calculated by the formula; and the smaller the data length of the document, the larger the relevance value calculated by the formula;
Sorting the documents in order of the magnitude of the calculated relevance values.
3. The method of claim 2, characterized in that the document relevance value calculation formula is:
$$\mathrm{scord}(d) = \sum_{t=1}^{N} \frac{1}{2 \cdot \mathrm{pos}(t)} \cdot \frac{1.0 + \mathrm{ExactNorm}(len)}{2};$$
$$\mathrm{ExactNorm}(len) = \frac{1.0}{len + 1};$$
Wherein scord(d) is the document relevance value of document d; len is the data length of the document; pos(t) is the position value in the document of the t-th retrieval segmented word contained in the retrieval keyword; and N is the number of retrieval segmented words contained in the retrieval keyword.
4. The method of any one of claims 1-3, characterized in that retrieving a plurality of documents containing the retrieval segmented words in the retrieval keyword comprises:
Receiving a retrieval request containing the retrieval keyword;
Determining the retrieval segmented words contained in the retrieval keyword, and determining the condition that the position of a specific retrieval segmented word among them must satisfy in a target document;
Searching for documents that contain the determined retrieval segmented words and in which the position of the specific retrieval segmented word satisfies the condition;
Returning information on the plurality of documents found as the retrieval result.
5. The method of claim 4, characterized in that, when the specific retrieval segmented word includes the first retrieval segmented word contained in the retrieval keyword, determining the condition that the position of the specific retrieval segmented word must satisfy in the target document comprises:
Determining, according to the form of the retrieval keyword, the first positional relation that must hold between the position of that first retrieval segmented word in the target document and the start position of the target document;
Searching for documents that contain the determined retrieval segmented words and in which the position of the specific retrieval segmented word satisfies the condition comprises:
Searching for documents that contain the determined retrieval segmented words and in which the first retrieval segmented word satisfies the first positional relation.
6. The method of claim 4, characterized in that, when the specific retrieval segmented word includes the last retrieval segmented word contained in the retrieval keyword, determining the condition that the position of the specific retrieval segmented word must satisfy in the target document comprises:
Determining, according to the form of the retrieval keyword, the second positional relation that must hold between the position of that last retrieval segmented word in the target document and the end position of the target document;
Searching for documents that contain the determined retrieval segmented words and in which the position of the specific retrieval segmented word satisfies the condition comprises:
Searching for documents that contain the determined retrieval segmented words and in which the last retrieval segmented word satisfies the second positional relation.
7. A document retrieval device, characterized in that the device comprises:
A document retrieval unit, configured to retrieve a plurality of documents containing all of the retrieval segmented words in a retrieval keyword;
A result sorting unit, configured to sort the retrieved documents according to the positions of the retrieval segmented words of the retrieval keyword in the retrieved documents and the data lengths of the retrieved documents;
A result returning unit, configured to return the retrieved documents as the retrieval result according to the result of sorting them.
8. The device of claim 7, characterized in that the result sorting unit is configured to:
Calculate the relevance value of each retrieved document according to a predefined document relevance value calculation formula, the formula satisfying the following conditions: the nearer the front of a document the retrieval segmented words contained in the retrieval keyword are located, the larger the relevance value calculated by the formula; and the smaller the data length of the document, the larger the relevance value calculated by the formula;
Sort the documents in order of the magnitude of the calculated relevance values.
9. The device of claim 8, characterized in that the document relevance value calculation formula is:
$$\mathrm{scord}(d) = \sum_{t=1}^{N} \frac{1}{2 \cdot \mathrm{pos}(t)} \cdot \frac{1.0 + \mathrm{ExactNorm}(len)}{2};$$
$$\mathrm{ExactNorm}(len) = \frac{1.0}{len + 1};$$
Wherein scord(d) is the document relevance value of document d; len is the data length of the document; pos(t) is the position value in the document of the t-th retrieval segmented word contained in the retrieval keyword; and N is the number of retrieval segmented words contained in the retrieval keyword.
10. The device of any one of claims 7-9, characterized in that the document retrieval unit comprises:
A request receiving unit, configured to receive a retrieval request containing a retrieval keyword;
A condition analysis unit, configured to determine the retrieval segmented words contained in the retrieval keyword and the condition that the position of a specific retrieval segmented word among them must satisfy in a target document;
A document search unit, configured to search for documents that contain all of the determined retrieval segmented words and in which the position of the specific retrieval segmented word satisfies the condition;
A result returning unit, configured to return information on the plurality of documents found as the retrieval result.
CN201010621819.1A 2010-12-27 2010-12-27 Document retrieval method and device Expired - Fee Related CN102567420B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010621819.1A CN102567420B (en) 2010-12-27 2010-12-27 Document retrieval method and device

Publications (2)

Publication Number Publication Date
CN102567420A true CN102567420A (en) 2012-07-11
CN102567420B CN102567420B (en) 2014-03-12

Family

ID=46412849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010621819.1A Expired - Fee Related CN102567420B (en) 2010-12-27 2010-12-27 Document retrieval method and device

Country Status (1)

Country Link
CN (1) CN102567420B (en)


Also Published As

Publication number Publication date
CN102567420B (en) 2014-03-12

Similar Documents

Publication Publication Date Title
CN102567421B (en) Document retrieval method and device
US8838650B2 (en) Method and apparatus for preprocessing a plurality of documents for search and for presenting search result
CN110852097B (en) Feature word extraction method, text similarity calculation method, device and equipment
US8321409B1 (en) Document ranking using word relationships
CN108520002A (en) Data processing method, server and computer storage media
CN103425687A (en) Retrieval method and system based on queries
CN102156711B (en) Cloud storage based power full text retrieval method and system
EP2631815A1 (en) Method and device for ordering search results, method and device for providing information
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN109492150B (en) Reverse nearest neighbor query method and device based on semantic track big data
CN107844493B (en) File association method and system
CN109446410A (en) Knowledge point method for pushing, device and computer readable storage medium
CN107679208A (en) A kind of searching method of picture, terminal device and storage medium
JP6722615B2 (en) Query clustering device, method, and program
CN101320382A (en) Method and system for rearranging search result based on context
CN111078835A (en) Resume evaluation method and device, computer equipment and storage medium
US20170185671A1 (en) Method and apparatus for determining similar document set to target document from a plurality of documents
US20080065618A1 (en) Indexing for rapid database searching
EP3301603A1 (en) Improved search for data loss prevention
CN101840438B (en) Retrieval system oriented to meta keywords of source document
CN102270201A (en) Multi-dimensional indexing method and device for network files
CN102567420B (en) Document retrieval method and device
CN109814923A (en) Data processing method, device, computer equipment and storage medium
CN110008407B (en) Information retrieval method and device
Yadav et al. Efficient methods to generate inverted indexes for ir

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220620

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New Founder Holdings Development Co., Ltd.

Patentee after: FOUNDER APABI TECHNOLOGY Ltd.

Address before: 9th Floor, Founder Building, No. 298 Chengfu Road, Haidian District, Beijing 100871

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: FOUNDER APABI TECHNOLOGY Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140312
